
What is a Custom Extraction Schema?

A custom extraction schema is a strict, client-defined data contract that dictates exactly how raw scraped content is parsed, typed, and structured before delivery. Instead of generic key-value dumps, the pipeline enforces specific field names, nested arrays, and data types (e.g., coercing a price string into a float). Without it, you're just dumping raw strings into a data lake and praying the downstream analytics team can parse them.

Data Contract · ETL · Validation · JSON Schema · Data Delivery
// 02 — definitions

Define the contract.

How we map chaotic, unstructured web data into rigid, typed records that your data warehouse can ingest without breaking.


TL;DR

A custom extraction schema is a strict definition of the output format for a scraping pipeline. It enforces field names, data types, nullability, and nested structures at the point of extraction, ensuring that schema drift on the target website doesn't poison your downstream database with malformed data.

01 Definition & structure
A custom extraction schema is a formal definition of the exact data structure a scraping pipeline must produce. It specifies the keys, the expected data types (string, integer, boolean, array), required vs. optional fields, and formatting rules (like regex patterns for dates). It acts as a strict boundary between the messy reality of the web and the clean requirements of a database.
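As a rough illustration, such a definition could be written as a JSON Schema document. The sketch below is an assumption for illustration only: the field names (`title`, `price`, `scraped_on`, and so on) are hypothetical and are not a DataFlirt default, and `optional_fields` is a helper invented here to show the required vs. optional split.

```python
import json

# Hypothetical product schema, written as a JSON Schema (Draft 2020-12) document.
# Every field name here is an illustrative assumption.
PRODUCT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "in_stock": {"type": "boolean"},
        "variants": {"type": "array", "items": {"type": "string"}},
        # Formatting rule: ISO-style date enforced with a regex pattern.
        "scraped_on": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        # Optional and nullable: may be a number or an explicit null.
        "rating": {"type": ["number", "null"]},
    },
    "required": ["title", "price", "in_stock", "scraped_on"],
    "additionalProperties": False,
}


def optional_fields(schema: dict) -> list[str]:
    """Return the schema properties that are not listed as required."""
    required = set(schema.get("required", []))
    return sorted(name for name in schema["properties"] if name not in required)


if __name__ == "__main__":
    print(json.dumps(PRODUCT_SCHEMA, indent=2))
    print("optional:", optional_fields(PRODUCT_SCHEMA))
```

Keeping the contract as plain data like this means the same document can be handed to a validator, versioned, and reviewed by the team that consumes the output.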
02 How it works in practice
During extraction, the scraper pulls raw text from the DOM. Before that data is yielded, a transformation layer coerces the raw strings into the types defined by the schema. A validation engine (like AJV for JSON Schema) then checks the record. If it passes, it's queued for delivery. If it fails, it's dropped into a dead-letter queue or quarantine for human review.
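The flow above (coerce, validate, then deliver or quarantine) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production worker: the hand-rolled `validate` stands in for a real validation engine, and the `SCHEMA` fields, the `route` helper, and the status strings are all hypothetical.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"title": str, "price": float, "in_stock": bool}


def coerce(raw: dict[str, str]) -> dict[str, Any]:
    """Turn raw DOM strings into typed values; uncoercible values become None."""
    record: dict[str, Any] = {"title": raw.get("title", "").strip() or None}
    try:
        record["price"] = float(raw.get("price", ""))
    except ValueError:
        record["price"] = None
    stock_text = raw.get("in_stock", "").strip().lower()
    record["in_stock"] = {"yes": True, "no": False}.get(stock_text)
    return record


def validate(record: dict[str, Any]) -> list[str]:
    """Stand-in for a real validator: report every field with the wrong type."""
    errors = []
    for field, expected in SCHEMA.items():
        value = record.get(field)
        if not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


def route(raw: dict[str, str], delivery: list, quarantine: list) -> str:
    """Queue valid records for delivery; park failures with their errors."""
    record = coerce(raw)
    errors = validate(record)
    if errors:
        quarantine.append({"raw": raw, "errors": errors})
        return "QUARANTINED"
    delivery.append(record)
    return "DELIVERED"
```

Note that the quarantine entry keeps the raw strings alongside the errors, so a human reviewer can see what the selector actually pulled.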
03 Type coercion and validation
Extraction schemas do more than just check types; they actively clean data. A good schema definition includes coercion rules: stripping currency symbols before casting to a float, converting "In Stock" to a boolean true, or splitting comma-separated strings into proper JSON arrays. This pushes the ETL workload to the edge.
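The three coercion rules mentioned above might look something like this. It is a sketch, assuming prices use a decimal point with comma thousands separators; the function names and the accepted stock phrases are hypothetical.

```python
import re

# Matches the first number in a string, e.g. "1,299.00" inside "₹ 1,299.00".
# Assumes "." is the decimal separator and "," groups thousands.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

# Hypothetical phrase table; a real schema would define its own.
_STOCK_PHRASES = {
    "in stock": True,
    "yes": True,
    "available": True,
    "out of stock": False,
    "no": False,
}


def coerce_price(text: str) -> float:
    """Strip currency symbols and separators, then cast to float."""
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no numeric price in {text!r}")
    return float(match.group().replace(",", ""))


def coerce_stock(text: str) -> bool:
    """Map a known availability phrase to a boolean."""
    key = text.strip().lower()
    if key not in _STOCK_PHRASES:
        raise ValueError(f"unknown stock phrase {text!r}")
    return _STOCK_PHRASES[key]


def coerce_list(text: str) -> list[str]:
    """Split a comma-separated string into a clean list of items."""
    return [part.strip() for part in text.split(",") if part.strip()]
```

Raising on anything unrecognized, rather than guessing, is the design choice that lets a later validation step quarantine the record instead of delivering a wrong value.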
04 How DataFlirt handles it
We build the schema first, before writing a single line of extraction code. Our clients provide a sample JSON or Parquet schema, and we map our selectors directly to it. Every record is validated in-memory by the worker. If a target site updates and breaks a selector, the schema validation catches the resulting null or type mismatch instantly, preventing bad data from ever reaching the client's S3 bucket.
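One way to picture a schema-first workflow is a check that every schema field has a selector before any scraping runs. The sketch below is purely illustrative and is not DataFlirt's actual tooling: the field set, the CSS selectors, and `check_selector_coverage` are all assumptions.

```python
# Hypothetical schema fields and selector map; every name is illustrative.
SCHEMA_FIELDS = {"title", "price", "in_stock", "variants"}
SELECTORS = {
    "title": "h1.product-title",
    "price": "span.price",
    "in_stock": "div.availability",
}


def check_selector_coverage(schema_fields: set[str], selectors: dict[str, str]) -> dict:
    """Compare schema fields with mapped selectors before extraction starts."""
    mapped = set(selectors)
    return {
        # Fields the contract demands but no selector feeds.
        "unmapped_fields": sorted(schema_fields - mapped),
        # Selectors that would produce keys the contract does not allow.
        "orphan_selectors": sorted(mapped - schema_fields),
    }


report = check_selector_coverage(SCHEMA_FIELDS, SELECTORS)
print(report)
```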
05 The cost of generic schemas
Many off-the-shelf scraping tools output generic schemas (e.g., {"url": "...", "data": "..."}). This creates massive technical debt. The data engineering team must write complex, brittle SQL transformations to parse the generic payload into usable tables. A custom extraction schema eliminates this step entirely, delivering analytics-ready data from day one.
// 03 — schema validation

How strict is the contract?

A schema is only useful if it's enforced. DataFlirt measures schema compliance on every record before it hits the delivery queue, quarantining anything that fails the contract.

Schema compliance rate: C = valid_records / total_extracted
A drop below 99.9% triggers an automated pipeline halt and selector review. (DataFlirt extraction SLO)
Field density: D = populated_fields / (total_fields × records)
Measures how often optional fields are actually present in the payload. (Data Quality Metrics)
Type error rate: E = coercion_failures / total_fields
High E indicates selector rot or a target site redesign. (Validation Layer)
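The three metrics are simple ratios, and a sketch makes the arithmetic concrete. One assumption to flag: for E, the denominator is read here as the total number of field values attempted in a batch (schema fields × records); the function names and the sample numbers are hypothetical.

```python
def compliance_rate(valid_records: int, total_extracted: int) -> float:
    """C = valid_records / total_extracted (raises ZeroDivisionError on an empty batch)."""
    return valid_records / total_extracted


def field_density(populated_fields: int, total_fields: int, records: int) -> float:
    """D = populated_fields / (total_fields x records)."""
    return populated_fields / (total_fields * records)


def type_error_rate(coercion_failures: int, total_field_values: int) -> float:
    """E = coercion_failures / total fields; denominator interpretation is an assumption."""
    return coercion_failures / total_field_values


def should_halt(c: float, threshold: float = 0.999) -> bool:
    """A compliance rate below the 99.9% threshold triggers a halt."""
    return c < threshold


# Hypothetical batch: 10,000 records extracted, 9,985 of them valid.
c = compliance_rate(9_985, 10_000)
print(f"C = {c:.4f}, halt = {should_halt(c)}")
```

With these sample numbers C is 0.9985, which falls below the 99.9% threshold, so the batch would trigger a halt.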
// 04 — validation trace

Enforcing the schema at runtime.

A live trace of a product record being extracted, validated against a custom JSON schema, and routed based on compliance.

JSON Schema · Type Coercion · Quarantine
// extraction output (raw)
raw.price: "₹ 1,299.00"
raw.in_stock: "Yes"
raw.variants: "Red, Blue"

// apply custom schema coercion
coerce.price: 1299.00 // float
coerce.in_stock: true // boolean
coerce.variants: ["Red", "Blue"] // array

// schema validation (ajv strict mode)
check.required_fields: PASS
check.type_match: PASS
check.enum_match: FAIL // "Blue" not in allowed variant list

// routing
status: QUARANTINED
action: "route to manual review queue"
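The validation and routing steps of this trace can be reproduced with a small sketch. The allowed variant list is an assumption (the trace does not show it), as are the check implementations and the "DELIVERED" status name.

```python
REQUIRED = ("price", "in_stock", "variants")
EXPECTED_TYPES = {"price": float, "in_stock": bool, "variants": list}
# Hypothetical enum: the trace only tells us that "Blue" is not in it.
ALLOWED_VARIANTS = {"Red", "Green", "Black"}


def run_checks(record: dict) -> dict[str, bool]:
    """Run the three checks from the trace; later checks depend on earlier ones."""
    required_ok = all(record.get(name) is not None for name in REQUIRED)
    type_ok = required_ok and all(
        isinstance(record[name], expected) for name, expected in EXPECTED_TYPES.items()
    )
    enum_ok = type_ok and all(v in ALLOWED_VARIANTS for v in record["variants"])
    return {"required_fields": required_ok, "type_match": type_ok, "enum_match": enum_ok}


def route(checks: dict[str, bool]) -> str:
    """Any failed check sends the record to quarantine."""
    return "DELIVERED" if all(checks.values()) else "QUARANTINED"


# The coerced record from the trace above.
record = {"price": 1299.00, "in_stock": True, "variants": ["Red", "Blue"]}
checks = run_checks(record)
for name, passed in checks.items():
    print(f"check.{name}: {'PASS' if passed else 'FAIL'}")
print("status:", route(checks))
```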
// 05 — schema drift

Why schemas fail validation.

The most common reasons a scraped record violates its custom extraction schema, based on DataFlirt's quarantine logs across 300+ active pipelines.

Records validated: 850M / month
Quarantine rate: 0.42%
Updated: 2026-05-19
01 Type mismatch: 41% of failures · String extracted where float expected
02 Missing required field: 28% of failures · Selector rot or DOM change
03 Format regex failure: 16% of failures · Date or currency format changed
04 Enum violation: 10% of failures · Unexpected category or status value
05 Array bounds exceeded: 5% of failures · Too many/few nested elements
// 06 — our architecture

Extract to contract, never to generic key-value pairs.

DataFlirt treats your custom extraction schema as a hard dependency. We compile your JSON Schema or Avro definition directly into our extraction workers. If a target site changes its pricing format from a number to an image, the record fails validation at the edge and enters quarantine. We never silently pass nulls or malformed strings to your S3 bucket. You get exactly the data shape you asked for, or you get an alert.

schema.config.json

Worker configuration for custom schema enforcement.

schema.engine: JSON Schema Draft 2020-12
strict_mode: true
coerce_types: true
remove_additional: true
on_validation_fail: quarantine
alert_threshold: > 0.1% failure rate
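To make two of these settings concrete, here is a sketch of how `remove_additional` and `alert_threshold` might behave inside a worker. The semantics shown are an interpretation, not documented behavior: the `CONFIG` dict layout, `SCHEMA_FIELDS`, and both helper functions are hypothetical.

```python
# Hypothetical in-memory form of the configuration above.
CONFIG = {
    "strict_mode": True,
    "coerce_types": True,
    "remove_additional": True,
    "on_validation_fail": "quarantine",
    "alert_threshold": 0.001,  # 0.1% failure rate, expressed as a fraction
}

# Hypothetical set of fields the contract allows.
SCHEMA_FIELDS = {"title", "price"}


def apply_remove_additional(record: dict, config: dict = CONFIG) -> dict:
    """Drop keys the schema does not define when remove_additional is on."""
    if not config["remove_additional"]:
        return dict(record)
    return {key: value for key, value in record.items() if key in SCHEMA_FIELDS}


def should_alert(failed: int, total: int, config: dict = CONFIG) -> bool:
    """Alert once the failure rate rises above the configured threshold."""
    return total > 0 and failed / total > config["alert_threshold"]
```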


// 07 — FAQ

Common questions.

Common questions about defining, enforcing, and updating custom extraction schemas in production scraping pipelines.

Why not just extract everything as strings and parse it later?
Because downstream parsing creates silent failures. If a site changes "₹1,299" to "Contact for Price", a string-based pipeline happily delivers it. Your analytics dashboard then breaks when it tries to sum the column. Enforcing types at extraction catches the error immediately and quarantines the record.
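The difference is easy to see side by side. In this sketch (the sample rows and function names are made up for illustration), parsing later fails only when someone sums the column, while enforcing the type at extraction isolates the bad row up front.

```python
raw_prices = ["₹1,299", "₹899", "Contact for Price"]


def to_float(text: str) -> float:
    """Strip the rupee sign and thousands separators, then cast."""
    return float(text.replace("₹", "").replace(",", "").strip())


def sum_later(values: list[str]) -> float:
    """Parse-later approach: the bad row only surfaces here, downstream."""
    return sum(to_float(value) for value in values)  # raises ValueError on bad rows


def split_at_extraction(values: list[str]) -> tuple[list[float], list[str]]:
    """Enforce-at-extraction approach: bad rows are quarantined immediately."""
    clean, quarantined = [], []
    for value in values:
        try:
            clean.append(to_float(value))
        except ValueError:
            quarantined.append(value)
    return clean, quarantined


clean, quarantined = split_at_extraction(raw_prices)
print(sum(clean), quarantined)
```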
What formats can DataFlirt output to?
We support custom schemas mapped to JSON, NDJSON, CSV, Apache Parquet, and Avro. For Parquet and Avro, the schema is also enforced by the serialization library itself: a record that does not match the declared types cannot be written, so the files your data warehouse loads always carry the agreed schema.
How do you handle optional fields that aren't always present?
We define them as nullable or optional in the schema. If the field is missing, the worker explicitly outputs a null (not an empty string or a missing key), preserving the structural integrity of the record for downstream ingestion.
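A minimal sketch of that normalization, assuming a hypothetical three-field schema in which `rating` is optional: every schema key is always emitted, and a missing or empty value becomes an explicit JSON `null`.

```python
import json

# Hypothetical schema field order; "rating" is the optional field.
SCHEMA_FIELDS = ["title", "price", "rating"]


def normalize(record: dict) -> dict:
    """Emit every schema key; missing or empty-string values become None."""
    normalized = {}
    for field in SCHEMA_FIELDS:
        value = record.get(field)
        normalized[field] = None if value == "" else value
    return normalized


line = json.dumps(normalize({"title": "Mug", "price": 12.5}))
print(line)  # {"title": "Mug", "price": 12.5, "rating": null}
```

Because the key set never changes, every row has the same shape, which is what columnar loaders and strict downstream parsers generally expect.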
What happens when a target website redesigns its layout?
The selectors will likely pull incorrect data, which will fail the schema validation (e.g., pulling a description string into a price float field). The records enter quarantine, our monitoring alerts the engineering team, and we patch the selectors. Your downstream database never sees the bad data.
Can I update my custom schema after the pipeline is live?
Yes, but it requires a version bump. We treat schema changes as breaking API changes. We deploy the new schema version, migrate the extraction logic, and can optionally backfill historical data to match the new schema shape.
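As an illustration of what a version bump plus backfill could involve, the sketch below migrates a record from a hypothetical v1 shape (flat `price` float) to a hypothetical v2 shape (`price` as an amount and currency object). The versions, field names, and default currency are all assumptions made for the example.

```python
def migrate_v1_to_v2(record: dict, default_currency: str = "INR") -> dict:
    """Backfill helper: reshape a v1 record to the (hypothetical) v2 contract."""
    if record.get("schema_version") != 1:
        raise ValueError("expected a schema_version 1 record")
    # Copy everything except the fields whose shape changes.
    migrated = {
        key: value
        for key, value in record.items()
        if key not in ("price", "schema_version")
    }
    # v2 replaces the flat float with a structured price object.
    migrated["price"] = {"amount": record["price"], "currency": default_currency}
    migrated["schema_version"] = 2
    return migrated


old = {"schema_version": 1, "title": "Mug", "price": 1299.0}
print(migrate_v1_to_v2(old))
```

Stamping each record with its schema version lets consumers tell old and new shapes apart while a backfill is still in progress.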
Do custom schemas cost more to process?
No. Schema validation adds negligible CPU overhead (microseconds per record). The real cost is the engineering time needed to define the contract upfront, and that time is recovered by eliminating downstream data cleaning tasks.