
What is Product Variant Scraping?

Product variant scraping is the extraction of distinct SKUs (differing by size, color, material, or packaging) that share a single parent product page. Because modern e-commerce platforms load variant data dynamically via JavaScript or hidden JSON blobs rather than serving each variant at its own URL, naive scrapers often capture only the default selection. Missing variants can mean missing up to 80% of a catalog's actual inventory, which skews pricing models and competitive intelligence pipelines.

E-commerce · JSON Extraction · Dynamic DOM · SKU Mapping · Data Completeness
// 02 — definitions

Beyond the
default SKU.

The mechanics of extracting the full matrix of product options when the URL doesn't change and the DOM only shows what's currently selected.


TL;DR

Product variant scraping requires parsing embedded JSON state or intercepting XHR requests rather than clicking through DOM elements. Platforms like Shopify and Magento typically inject the entire variant matrix into the page source on initial load. Extracting this directly is roughly 50x faster and far more reliable than simulating user clicks with Playwright, because it depends on neither render timing nor page layout.

01 Definition & structure
A product variant is a specific, purchasable version of a product (a SKU) that exists alongside other versions on a single parent product page. Variants are defined by a matrix of options—typically size, color, material, or pack quantity. In modern web architecture, selecting a variant rarely triggers a full page reload; instead, JavaScript updates the DOM with the new price, image, and stock status.
02 How it works in practice
To render a seamless user experience, e-commerce platforms inject the entire dataset for all variants into the initial HTML response, usually as a JSON object inside a <script> tag. When a user clicks "Large" and "Blue", the frontend framework queries this local JSON object and updates the UI. Efficient scraping bypasses the UI entirely, locates the JSON object, parses it, and iterates over the variant array to yield individual records.
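A minimal sketch of that flow, using only the Python standard library. The sample HTML, the `product-json` script id, and the field names (`variants`, `sku`, `price`, `available`) are assumptions for illustration; real pages name and nest these differently per platform.

```python
import json
from html.parser import HTMLParser


class ScriptJSONParser(HTMLParser):
    """Collects the text of <script type="application/json"> tags, keyed by id."""

    def __init__(self):
        super().__init__()
        self.blobs = {}
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/json":
            self._current_id = attrs.get("id", "")
            self.blobs[self._current_id] = ""

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._current_id is not None:
            self.blobs[self._current_id] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._current_id = None


def extract_variants(html, script_id):
    """Parse the embedded JSON blob and return its variant array."""
    parser = ScriptJSONParser()
    parser.feed(html)
    parser.close()
    product = json.loads(parser.blobs[script_id])
    return product["variants"]


# Hypothetical page: the UI shows one option, the JSON holds both variants.
SAMPLE_HTML = """
<html><body>
<select><option>Blue / Large</option></select>
<script type="application/json" id="product-json">
{"id": 1, "variants": [
  {"sku": "TS-BLU-L", "price": 19.0, "available": true},
  {"sku": "TS-BLU-M", "price": 19.0, "available": false}
]}
</script>
</body></html>
"""

variants = extract_variants(SAMPLE_HTML, "product-json")
for v in variants:
    print(v["sku"], v["price"], v["available"])
```

Note that the second variant never appears in the `<select>` element, yet it is present in the parsed array.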
03 The click-and-scrape anti-pattern
A common mistake among junior engineers is using headless browsers to physically click every combination of swatches and dropdowns, scraping the DOM after each click. This approach is computationally expensive, highly susceptible to layout changes, and prone to race conditions where the scraper reads the DOM before the JavaScript has finished updating the price. It also completely misses out-of-stock variants that the UI hides.
04 How DataFlirt handles it
We treat product pages as data containers, not visual interfaces. Our extraction layer uses AST (Abstract Syntax Tree) parsing and regex to isolate the frontend state objects (like Next.js __NEXT_DATA__ or Shopify's meta.product). We extract the raw JSON, map the variant nodes to our standardized schema, and yield the full SKU matrix in milliseconds. If pricing requires a separate API call, we intercept the XHR request directly.
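The schema-mapping step might look something like the sketch below. It is not DataFlirt's actual schema: the `VariantRecord` fields, the `option1`/`option2` key layout, and the price-in-cents convention are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class VariantRecord:
    """Hypothetical standardized record for one SKU."""
    sku: str
    options: dict
    price: Decimal
    in_stock: bool


def to_record(option_names, node):
    """Map one raw variant node onto the standardized record.

    Assumes option values live under "option1", "option2", ... and that
    "price" is an integer number of cents.
    """
    values = [node.get(f"option{i}") for i in range(1, len(option_names) + 1)]
    return VariantRecord(
        sku=node["sku"],
        options=dict(zip(option_names, values)),
        price=Decimal(node["price"]) / 100,
        in_stock=bool(node["available"]),
    )


raw_node = {
    "sku": "MW-SW-NVY-L",
    "option1": "Navy",
    "option2": "L",
    "price": 12900,
    "available": False,
}
record = to_record(["Color", "Size"], raw_node)
print(record)
```

Using `Decimal` rather than `float` for the price avoids rounding drift when records are later aggregated.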
05 Did you know?
Even if a website doesn't update the URL in the browser address bar when you select a variant, many major e-commerce platforms (Shopify among them) support direct variant linking via query parameters (e.g., ?variant=123456). Appending these IDs during the extraction phase gives your downstream database a unique, verifiable URL for every single SKU.
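As a rough sketch, appending the variant ID can be done with `urllib.parse`. The helper name `variant_url` and the default parameter name `variant` are assumptions; the parameter a given platform expects should be checked before relying on it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def variant_url(product_url, variant_id, param="variant"):
    """Return product_url with ?<param>=<variant_id> set exactly once."""
    parts = urlsplit(product_url)
    # Keep unrelated query parameters, drop any existing variant parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(variant_id)))
    # Fragments are dropped so the result is a stable key for the SKU.
    return urlunsplit(parts._replace(query=urlencode(query), fragment=""))


print(variant_url("https://target-store.com/products/merino-wool-sweater", 123456))
```

Replacing an existing `variant` parameter, rather than appending a second one, keeps the generated URL deterministic for a given SKU.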
// 03 — the extraction math

Calculating the
variant matrix.

A product page with 4 colors and 5 sizes has a theoretical matrix of 20 variants. DataFlirt's extraction engine maps this theoretical matrix against the actual valid SKUs exposed in the frontend state object.

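Theoretical Matrix Size: $V_{\max} = O_1 \times O_2 \times \cdots \times O_n$
Total possible combinations of all available option types, where $O_i$ is the number of values for option $i$. (Combinatorics)
Extraction Latency (DOM vs JSON): $T_{\text{diff}} = (V_{\text{actual}} \times t_{\text{click}}) - t_{\text{parse}}$
Clicking through variants takes $O(N)$ page interactions, one per variant. Parsing the JSON state is a single fetch-and-parse step, so it is effectively $O(1)$ in interactions. (DataFlirt performance baseline)
Variant Completeness Score: $C = \dfrac{\text{SKUs}_{\text{extracted}}}{\text{SKUs}_{\text{in state object}}}$
Must equal 1.0. Anything less indicates a silent extraction failure. (DataFlirt extraction SLO)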
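The three formulas translate directly into a few lines of Python. The function names are illustrative, and the per-click time used in the example is an assumption derived from the "24 combinations in roughly 15 seconds" figure quoted in the FAQ on this page.

```python
from math import prod


def theoretical_matrix_size(option_counts):
    """V_max: product of the number of values for each option type."""
    return prod(option_counts)


def latency_difference(v_actual, t_click, t_parse):
    """T_diff: time spent clicking every variant minus one JSON parse (seconds)."""
    return v_actual * t_click - t_parse


def completeness(skus_extracted, skus_in_state_object):
    """C: share of SKUs in the state object that were actually extracted."""
    if skus_in_state_object == 0:
        raise ValueError("state object contains no SKUs")
    return skus_extracted / skus_in_state_object


print(theoretical_matrix_size([4, 5]))   # 4 colors x 5 sizes -> 20
print(completeness(12, 12))              # -> 1.0
# Assumed timings: 15 s / 24 clicks = 0.625 s per click, 15 ms per parse.
print(latency_difference(24, 0.625, 0.015))
```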
// 04 — state object extraction

Bypassing the DOM
for the raw data.

A trace of DataFlirt's extraction worker hitting a Shopify product page. Instead of rendering the page and clicking swatches, we locate the embedded JSON state and extract all 12 variants in a single pass.

AST parsing · JSON extraction · Zero-click
edge.dataflirt.io — live
CAPTURED
// fetch initial HTML
GET https://target-store.com/products/merino-wool-sweater
status: 200 OK bytes: 142,048

// locate frontend state object
regex.match: "var meta = \{product: (.*?)\};"
json.parse: success

// extract variant matrix
product.id: 892341102
product.options: ["Color", "Size"]
variants.found: 12

// validate specific SKU
sku[4].id: "MW-SW-NVY-L"
sku[4].price: 129.00
sku[4].available: false // out of stock, hidden in DOM

// output generation
records.yielded: 12
extraction.time: 14ms // 0 browser clicks required
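The "locate frontend state object" step in the trace uses a non-greedy regex, which can cut the capture short if a string value happens to contain `};`. One alternative, sketched here under the assumption that the assigned object is strict JSON (quoted keys), is to use the regex only to find the assignment and let `json.JSONDecoder.raw_decode` find where the object ends. The page snippet and the helper name are hypothetical.

```python
import json
import re


def extract_assigned_json(source, marker_pattern):
    """Decode the JSON value that starts right after marker_pattern.

    The pattern must consume everything up to the opening brace, because
    raw_decode does not skip leading whitespace.
    """
    match = re.search(marker_pattern, source)
    if match is None:
        return None
    obj, _end = json.JSONDecoder().raw_decode(source, match.end())
    return obj


# Hypothetical page fragment: the title contains "};", which would end a
# non-greedy regex capture too early.
PAGE = """<script>
var meta = {"product": {"id": 892341102, "variants": [
  {"sku": "MW-SW-NVY-L", "title": "Navy / L {limited};", "available": false}
]}};
var page = {"type": "product"};
</script>"""

meta = extract_assigned_json(PAGE, r"var\s+meta\s*=\s*")
print(meta["product"]["id"], len(meta["product"]["variants"]))
```

If the blob is a JavaScript object literal rather than JSON (unquoted keys, trailing commas), this approach fails with a `JSONDecodeError`, and a JavaScript-aware parser is the safer choice.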
// 05 — failure modes

Where variant
extraction breaks.

Ranked by frequency across DataFlirt's e-commerce pipelines. Relying on the visual DOM instead of the underlying data model is the root cause of almost all variant scraping failures.

PIPELINES MONITORED · 180+ retail
SKUS PROCESSED · 45M/day
UPDATED · 2026-05-19
01 Out-of-stock variants omitted
DOM hides unavailable sizes; JSON keeps them
02 Dynamic pricing via XHR
Base HTML shows range, exact price needs API call
03 Invalid option combinations
Matrix math assumes SKUs that don't exist
04 Image gallery mismatch
Failing to map specific images to specific SKUs
05 URL canonicalization
?variant=123 stripped by routing logic
// 06 — our architecture

Parse the state,
ignore the DOM.

DataFlirt's extraction engine doesn't click buttons. When we hit a product page, we look for the underlying state object: Next.js props, Shopify variant arrays, or Magento configuration blobs. By extracting the raw data model that the frontend framework uses to render the page, we capture the entire SKU matrix in a single pass. Zero clicks, zero render overhead, and every SKU in the state object accounted for.

variant-extractor.log

Live extraction metrics for a complex apparel product page.

target.url /products/tech-jacket
extraction.method AST JSON parse
options.detected Color (4), Size (6)
matrix.theoretical 24 combinations
skus.actual 22 valid SKUs
skus.out_of_stock 3 SKUs
completeness 1.0

// 07 — FAQ

Common
questions.

Common questions about extracting complex product catalogs, handling dynamic pricing, and ensuring data completeness.

Why not just use Playwright to click each color and size?
It's too slow and incredibly brittle. Clicking 24 combinations takes roughly 15 seconds of network idle waiting and UI rendering. If a pop-up intercepts a click, the job fails. Parsing the embedded JSON state object takes 15 milliseconds and yields the exact same data with zero UI dependency.
How do you handle variants that require a separate network request for pricing?
Some enterprise platforms (like Salesforce Commerce Cloud) load base product data in HTML but fetch variant pricing via a separate XHR request to an inventory API. We intercept the API endpoint directly, extract the bearer token from the initial load, and request the pricing JSON for all variants concurrently.
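The concurrent part of that answer can be sketched with a thread pool. The actual HTTP call is deliberately left out: `fetch_price` is an injected callable (in practice it would call the pricing endpoint with the extracted token), and the stub below, along with the variant IDs and prices, is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_prices(variant_ids, fetch_price, max_workers=8):
    """Call fetch_price for every variant concurrently; return {id: price}."""
    ids = list(variant_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip lines ids up with results.
        prices = list(pool.map(fetch_price, ids))
    return dict(zip(ids, prices))


def fake_fetch_price(variant_id):
    """Stand-in for an authenticated request to a pricing API."""
    canned = {"1001": 129.0, "1002": 129.0, "1003": 139.0}
    return canned[variant_id]


prices = fetch_all_prices(["1001", "1002", "1003"], fake_fetch_price)
print(prices)
```

Threads suit this workload because each call spends most of its time waiting on the network; `max_workers` should be kept low enough to respect the target's rate limits.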
What if a variant combination doesn't actually exist?
A theoretical matrix of 4 colors and 6 sizes implies 24 SKUs, but the brand might not manufacture the 'XXL' in 'Neon Yellow'. If you generate URLs blindly, you'll scrape 404s. We extract the explicit list of valid SKUs from the frontend state object, ensuring we only yield records for products that actually exist in the catalog.
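To make the gap between the theoretical matrix and the real catalog concrete, the sketch below lists the option combinations that never appear in the state object. The `options` mapping and the per-variant `options` dictionary are assumed shapes, not a specific platform's format.

```python
from itertools import product


def missing_combinations(options, variants):
    """Return option combinations that have no corresponding SKU.

    options:  {"Color": [...], "Size": [...]} in display order
    variants: list of dicts, each with an "options" dict keyed by option name
    """
    theoretical = set(product(*options.values()))
    actual = {tuple(v["options"][name] for name in options) for v in variants}
    return sorted(theoretical - actual)


options = {"Color": ["Black", "Neon Yellow"], "Size": ["M", "XL", "XXL"]}
variants = [
    {"sku": "J-BLK-M", "options": {"Color": "Black", "Size": "M"}},
    {"sku": "J-BLK-XL", "options": {"Color": "Black", "Size": "XL"}},
    {"sku": "J-BLK-XXL", "options": {"Color": "Black", "Size": "XXL"}},
    {"sku": "J-NYL-M", "options": {"Color": "Neon Yellow", "Size": "M"}},
    {"sku": "J-NYL-XL", "options": {"Color": "Neon Yellow", "Size": "XL"}},
]

print(missing_combinations(options, variants))
```

Only the combinations present in `variants` should ever be turned into records or URLs; the missing ones are useful mainly as a sanity check.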
How does DataFlirt ensure variant images are mapped correctly?
Image mapping is notoriously difficult if you rely on the DOM, as galleries often lazy-load. We map the image_id or media_reference found inside the variant's JSON node directly to the master media array in the same state object. This guarantees the red jacket record gets the red jacket image URL.
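A minimal sketch of that join, assuming the state object exposes a `media` array with `id` and `src` fields and that each variant carries an `image_id`. These field names are illustrative and vary by platform.

```python
def attach_images(product):
    """Resolve each variant's image_id against the product's media array."""
    media_by_id = {m["id"]: m["src"] for m in product["media"]}
    records = []
    for variant in product["variants"]:
        records.append({
            "sku": variant["sku"],
            # None when the variant has no image or the reference is dangling.
            "image_url": media_by_id.get(variant.get("image_id")),
        })
    return records


product = {
    "media": [
        {"id": 10, "src": "https://cdn.example.com/jacket-red.jpg"},
        {"id": 11, "src": "https://cdn.example.com/jacket-blue.jpg"},
    ],
    "variants": [
        {"sku": "JK-RED-M", "image_id": 10},
        {"sku": "JK-BLU-M", "image_id": 11},
        {"sku": "JK-GRN-M"},
    ],
}

for record in attach_images(product):
    print(record)
```

Returning `None` for unresolved references, instead of falling back to the first gallery image, keeps a mapping failure visible downstream.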
How do you handle URLs that don't update when a variant is selected?
Many SPAs update the DOM without pushing a new URL to the history API. We generate synthetic URLs for our delivery payload by appending the platform's standard variant query parameter (e.g., ?variant=892341102). This ensures your downstream systems have a unique, addressable primary key for every SKU.
Can you extract variants that are completely hidden because they are out of stock?
Yes. Frontend code typically filters out-of-stock variants from the UI dropdowns to improve user experience. However, the data for those variants is almost always still present in the initial JSON payload sent to the browser. By extracting the JSON, we capture every variant the platform still lists, not just what's purchasable today.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h