
What is Array Field Expansion?

Array field expansion is the extraction step where a single DOM node or JSON string containing multiple values—like product sizes, tags, or nested categories—is parsed and split into a structured list of distinct elements. Leaving these as concatenated strings breaks downstream analytics. Proper expansion ensures that a product available in five colors becomes queryable as five distinct attributes, not one opaque text block.

Data Transformation · Parsing · Schema Design · ETL · Normalization
// 02 — definitions

One string,
many values.

The mechanics of turning concatenated text blobs into queryable, typed arrays before they hit your data warehouse.


TL;DR

Array field expansion converts raw, multi-value strings (like "S, M, L, XL") or repeating DOM elements into structured arrays. It is a critical normalization step: without it, downstream consumers can't filter, group, or join on individual attributes because the data is trapped in a single scalar field.

01 Definition & structure
Array field expansion is the process of taking a single extracted value that contains multiple logical items and splitting it into a structured list. This typically applies to fields like categories, tags, available sizes, or image URLs. The source might be a single comma-separated string, or a series of repeating HTML elements (like multiple <li> tags within a list). The output is a typed array that matches the data contract.
02 HTML vs JSON sources
When extracting from JSON APIs, arrays are usually already structured—you just pass them through. When extracting from HTML, you must build the array yourself. This means either selecting a parent node and iterating over its children, or selecting a single text node and applying a string split operation based on a known delimiter.
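The two HTML strategies can be sketched as follows. This is a minimal illustration using only Python's standard library, not a description of any production extractor; the class and function names (ListItemCollector, expand_from_children, expand_from_text) and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of each <li> element as one array element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")  # start a new element

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def expand_from_children(html):
    """Strategy 1: iterate over repeating child elements."""
    collector = ListItemCollector()
    collector.feed(html)
    return [item.strip() for item in collector.items if item.strip()]


def expand_from_text(text, delimiter=","):
    """Strategy 2: split a single text node on a known delimiter."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]


sizes_from_dom = expand_from_children("<ul><li>S</li><li>M</li><li>L</li></ul>")
sizes_from_text = expand_from_text("S, M, L, XL")
```

The first strategy never needs to know the delimiter, which is why it tends to be the more robust of the two when the markup allows it.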
03 The delimiter problem
String splitting is fragile. If you split on commas, a category named "Home, Garden & Tools" incorrectly expands into two elements. Robust array expansion requires a stricter delimiter pattern (e.g., splitting only on commas followed by a specific padding that never occurs inside a value) or, preferably, extracting the individual elements directly from the DOM structure rather than relying on text parsing.
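A small sketch of the difference between a naive split and a stricter delimiter pattern. The input string and the "comma followed by at least two spaces" convention are assumptions invented for illustration; a real site's padding would have to be observed, not guessed.

```python
import re

# Hypothetical source text: items are separated by a comma plus two spaces,
# while commas inside a value are followed by a single space.
raw = "Home, Garden & Tools,  Electronics,  Toys"

# Naive split: breaks the first category into two elements.
naive = [part.strip() for part in raw.split(",")]

# Stricter split: only a comma followed by two or more whitespace characters
# counts as a delimiter.
strict = [part.strip() for part in re.split(r",\s{2,}", raw)]
```

Here the naive list has four elements and the strict one has three. The strict pattern is still only as reliable as the site's formatting, which is the argument for reading elements from the DOM where possible.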
04 How DataFlirt handles it
We enforce array expansion at the extraction layer. Our schema engine validates not just that the field is an array, but that every element inside the array matches the expected type and format. If an array expansion fails (e.g., returning a single un-split string due to a delimiter change), our completeness monitors flag the anomaly and quarantine the record before it reaches the client's data warehouse.
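As a rough illustration of per-element validation with quarantine, the sketch below checks both the container and every element. It is not DataFlirt's schema engine; validate_array, route_records, and the record layout are hypothetical names chosen for the example.

```python
def validate_array(value, element_type):
    """Return (ok, reason) for a field that should be a list of element_type."""
    if not isinstance(value, list):
        # Covers the un-split case, where the field is still a single string.
        return False, "not an array"
    for index, element in enumerate(value):
        if not isinstance(element, element_type):
            return False, f"element {index} is {type(element).__name__}"
    return True, ""


def route_records(records, field, element_type):
    """Split records into accepted and quarantined based on one array field."""
    accepted, quarantined = [], []
    for record in records:
        ok, reason = validate_array(record.get(field), element_type)
        if ok:
            accepted.append(record)
        else:
            quarantined.append({"record": record, "reason": reason})
    return accepted, quarantined


records = [
    {"sku": "A", "sizes": [10, 12]},
    {"sku": "B", "sizes": "10, 12"},     # expansion failed: still a string
    {"sku": "C", "sizes": [10, "N/A"]},  # one element has the wrong type
]
accepted, quarantined = route_records(records, "sizes", int)
```

Only record A is accepted; B and C are held back with a reason attached, so the failure is visible before delivery rather than after.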
05 The cardinality explosion risk
If you choose to flatten arrays into separate rows during extraction (denormalization), a single page scrape can suddenly yield hundreds of records. A product with 10 sizes and 5 colors becomes 50 rows. This inflates pipeline throughput metrics artificially and complicates deduplication. It is almost always better to deliver nested arrays and let the consumer unnest them at query time.
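The arithmetic is easy to reproduce. In this sketch the record shape and the flatten helper are assumptions for illustration; the point is only that one nested record turns into the cross product of its arrays once flattened.

```python
from itertools import product

# One scraped product, delivered as a single record with nested arrays.
nested = {
    "sku": "TSHIRT-1",
    "sizes": ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"],
    "colors": ["red", "green", "blue", "black", "white"],
}


def flatten(record):
    """Explode one nested record into one row per (size, color) pair."""
    return [
        {"sku": record["sku"], "size": size, "color": color}
        for size, color in product(record["sizes"], record["colors"])
    ]


rows = flatten(nested)
row_count = len(rows)  # 10 sizes x 5 colors
```

One page scrape, one product, fifty rows. Every row repeats the same sku, which is what makes deduplication harder after flattening.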
// 03 — expansion metrics

Measuring array
completeness.

Array expansion introduces variable record sizes. DataFlirt tracks the expansion ratio to detect when a site changes its delimiter or hides array elements behind a 'Show More' button.

Expansion Ratio: E = expanded_elements / source_records
Tracks average array size per record. Sudden drops indicate hidden elements or broken selectors. (Extraction monitoring)

Delimiter Failure Rate: F = unsplit_strings / total_arrays
Detects when a site switches from commas to pipes, causing the split function to fail silently. (DataFlirt schema validation)

Element Type Validity: V = valid_types / total_elements
Ensures every expanded element matches the expected schema type (e.g., all integers). (DataFlirt extraction SLO)
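The three ratios can be computed over a batch of expanded arrays roughly as follows. The function names, the sample batch, and in particular the heuristic for spotting an "unsplit" array (a single string element that still contains a likely delimiter) are assumptions for this sketch, not a documented DataFlirt rule.

```python
SUSPECT_DELIMITERS = (",", "|", ";")  # assumed candidates, adjust per source


def expansion_ratio(arrays):
    """E = expanded_elements / source_records."""
    return sum(len(a) for a in arrays) / len(arrays) if arrays else 0.0


def looks_unsplit(array):
    """Heuristic: one string element that still contains a delimiter."""
    return (
        len(array) == 1
        and isinstance(array[0], str)
        and any(d in array[0] for d in SUSPECT_DELIMITERS)
    )


def delimiter_failure_rate(arrays):
    """F = unsplit_strings / total_arrays."""
    if not arrays:
        return 0.0
    return sum(1 for a in arrays if looks_unsplit(a)) / len(arrays)


def element_type_validity(arrays, element_type):
    """V = valid_types / total_elements."""
    elements = [e for a in arrays for e in a]
    if not elements:
        return 1.0
    return sum(1 for e in elements if isinstance(e, element_type)) / len(elements)


batch = [[10, 12, 15], ["10 | 12 | 15"], [20, "N/A"]]
E = expansion_ratio(batch)             # 6 elements over 3 records
F = delimiter_failure_rate(batch)      # 1 unsplit array out of 3
V = element_type_validity(batch, int)  # 4 valid ints out of 6 elements
```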
// 04 — extraction trace

Splitting strings
at the edge.

A live trace of an extraction worker parsing a B2B hardware catalog, expanding a single specifications string into a typed array of integers.

JSON output · regex split · type coercion
// raw input from DOM
dom.specs: "Available Sizes: 10mm | 12mm | 15mm | 20mm"

// parsing & extraction
regex.match: "Sizes: (.*)"
string.split: " | "

// array expansion & coercion
array[0]: "10mm" -> coerce(int) -> 10
array[1]: "12mm" -> coerce(int) -> 12
array[2]: "15mm" -> coerce(int) -> 15
array[3]: "20mm" -> coerce(int) -> 20

// validation
schema.check: array[int] pass
output.write: [10, 12, 15, 20]
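
The steps in the trace can be approximated in a few lines. This is a sketch, not the worker's actual code: expand_sizes is a hypothetical name, and it assumes that coercion means stripping a literal "mm" suffix and that any element not matching that shape should fail loudly.

```python
import re


def expand_sizes(raw):
    """Parse 'Available Sizes: 10mm | 12mm' into a list of integers."""
    match = re.search(r"Sizes: (.*)", raw)
    if match is None:
        raise ValueError(f"no sizes found in {raw!r}")

    sizes = []
    for part in match.group(1).split(" | "):
        # Assumed coercion rule: digits followed by the unit suffix 'mm'.
        number = re.fullmatch(r"(\d+)mm", part.strip())
        if number is None:
            raise ValueError(f"cannot coerce {part!r} to int")
        sizes.append(int(number.group(1)))
    return sizes


sizes = expand_sizes("Available Sizes: 10mm | 12mm | 15mm | 20mm")
```

Raising on a bad element, instead of skipping it, is what lets the caller quarantine the record rather than deliver a silently shortened array.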
// 05 — failure modes

Why arrays
fail to expand.

Ranked by frequency across DataFlirt's extraction pipelines. Delimiter changes are the most common cause of silent data degradation, leaving arrays trapped as single strings.

PIPELINES MONITORED · 300+ active
ARRAY FIELDS · ~18% of schema
UPDATED · 2026-05-19
01 Delimiter mutation: site changes from comma to pipe or semicolon
02 Hidden elements: elements truncated behind a 'Show More' UI
03 Inconsistent spacing: padding variations break strict split logic
04 Mixed data types: strings mixed into integer arrays break coercion
05 Empty arrays parsed as null: breaks downstream schema expectations
// 06 — our architecture

Expand early,
validate every element.

DataFlirt performs array field expansion at the extraction layer, not in post-processing. By expanding early, we can apply schema validation to every individual element. If a site accidentally includes a string in an array of prices, the record is quarantined immediately. Late expansion hides type errors until they break a downstream dashboard.

Array Expansion Job

Live status of an array expansion step on an e-commerce pipeline.

job.id: expand-arr-042
field.target: product.variants
delimiter: regex(/,\s*/) (active)
elements.extracted: 48,192
type.coercion: string -> uuid (strict)
validation.failed: 14 elements
pipeline.status: running nominally

// 07 — FAQ

Common
questions.

About array expansion, delimiter handling, schema validation, and how DataFlirt ensures clean multi-value data.

Why not just store the string and split it in SQL later?
Because late expansion hides data quality issues. If you store "10, 12, N/A, 20" as a string, your pipeline reports success. When the data engineer tries to cast that split array to integers in Snowflake, the query fails. Expanding at the extraction layer catches type coercion errors immediately, allowing the scraper to quarantine the bad record or trigger a fallback.
How do you handle arrays where elements have different types?
We don't. A well-designed schema enforces uniform types within an array. If a site mixes types (e.g., an array of sizes where one element is "Out of Stock"), the extraction layer must filter the invalid element, map it to a separate status field, or quarantine the record. Mixed-type arrays are an anti-pattern in structured data delivery.
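One of the options above, mapping the invalid element to a separate status field, might look like this. The function name and the rule that "anything that is not a plain decimal number is a status" are assumptions for the sketch.

```python
def split_sizes_and_status(elements):
    """Separate numeric sizes from non-numeric status text."""
    sizes, statuses = [], []
    for element in elements:
        text = element.strip()
        if text.isdecimal():
            sizes.append(int(text))
        else:
            statuses.append(text)
    return sizes, statuses


sizes, statuses = split_sizes_and_status(["10", "12", "Out of Stock", "20"])
```

The sizes array stays uniformly typed, and the stray text is preserved in its own field instead of being dropped.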
What happens if the array is empty?
It should be returned as an empty array [], not as null or an empty string "". Maintaining the array type contract is critical for downstream consumers. If a field is defined as an array in the schema, it must always be an array, even if it contains zero elements.
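A minimal sketch of that contract, assuming a hypothetical normalize_array_field helper applied just before a record is written. Rejecting any other non-list value, rather than wrapping it in a list, is a design choice made here so that un-split strings are not disguised as one-element arrays.

```python
def normalize_array_field(value):
    """Return a list for every input, mapping 'no values' to an empty list."""
    if value is None or value == "":
        return []
    if isinstance(value, list):
        return value
    # Anything else (e.g. a non-empty string) means expansion did not happen.
    raise TypeError(f"expected a list, got {type(value).__name__}")


empty_from_null = normalize_array_field(None)
empty_from_string = normalize_array_field("")
unchanged = normalize_array_field(["S", "M"])
```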
How does DataFlirt handle 'Show More' buttons that hide array elements?
If the hidden elements are present in the initial DOM (just hidden via CSS), we extract them normally. If they require a network request to fetch, our extraction workers intercept the underlying API call or trigger the click event in a headless browser to ensure the array is fully populated before the record is written.
Should I flatten arrays into separate rows (denormalization)?
Usually, no. Keep the array intact during extraction and delivery. Denormalizing (exploding one product with five colors into five separate rows) inflates record counts and makes deduplication harder. Deliver the data as a nested JSON or Parquet array, and let the data warehouse handle the unnesting at query time with functions such as FLATTEN (Snowflake), EXPLODE (Spark SQL), or UNNEST (BigQuery).
How do you detect if a delimiter changes?
Through expansion ratio monitoring. If a field historically yields an average of 4.2 elements per record and suddenly drops to 1.0, the split function has failed, usually because the site changed its delimiter from a comma to a pipe. DataFlirt's schema validation catches this anomaly and alerts the pipeline operator before the data is delivered.
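The alert decision itself can be as simple as a relative-drop check. The function name and the 50% threshold below are assumptions for illustration; an actual threshold would depend on how much a field's array size normally fluctuates.

```python
def ratio_alert(baseline_ratio, current_ratio, max_drop=0.5):
    """True when the ratio fell by more than max_drop relative to baseline."""
    if baseline_ratio <= 0:
        return False  # no meaningful baseline to compare against
    drop = (baseline_ratio - current_ratio) / baseline_ratio
    return drop > max_drop


delimiter_break = ratio_alert(4.2, 1.0)   # large drop, typical of a failed split
normal_variation = ratio_alert(4.2, 4.0)  # small drop, within tolerance
```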