What is a Data Contract?

A data contract is a formal, versioned agreement between a data producer (like a scraping pipeline) and a downstream consumer regarding the schema, semantics, and quality of the delivered dataset. It moves validation upstream, so that structural drift (for example, a price field changing from an integer to a string) is caught and quarantined before it pollutes the data warehouse. Without a contract, your pipeline is just throwing JSON over the wall and hoping the analytics team doesn't notice the breakage.

Data Engineering · Schema Validation · Data Quality · Governance · ETL
// 02 — definitions

Promises made,
promises kept.

The shift from implicit assumptions about scraped data to explicit, enforceable rules that govern pipeline output.

TL;DR

A data contract defines the exact shape, type, and completeness thresholds of a dataset. Instead of downstream dbt models failing silently when a target site changes its DOM, the extraction layer validates every record against the contract. If the contract is breached, the pipeline halts or quarantines the data, preventing bad data from entering the lakehouse.

01 Definition & structure
A data contract is an API-like agreement between data producers and consumers. In the context of web scraping, it defines exactly what a valid extracted record must look like before it is allowed to enter the downstream data warehouse. A robust contract specifies:
  • schema — exact field names, nesting structures, and data types
  • constraints — value ranges, regex patterns, and allowed enums
  • completeness — which fields are strictly required vs. nullable
  • SLAs — delivery frequency, volume expectations, and freshness
It replaces implicit trust with explicit, automated verification.
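
The four parts of a contract listed above can be written down as plain data. What follows is a minimal sketch, not a description of any particular tool: the names `FieldRule`, `Contract`, `PRODUCT_CONTRACT`, and `check` are hypothetical, and the example fields and the 24-hour freshness value are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """One field in the contract (hypothetical structure)."""
    type_: type                      # schema: expected Python type
    required: bool = True            # completeness: required vs. nullable
    enum: Optional[tuple] = None     # constraint: allowed values
    pattern: Optional[str] = None    # constraint: regex for strings
    minimum: Optional[float] = None  # constraint: lower bound for numbers


@dataclass(frozen=True)
class Contract:
    name: str
    version: str
    fields: dict                     # field name -> FieldRule
    max_staleness_hours: int         # SLA: freshness (assumed unit)


# Illustrative contract; field names and limits are assumptions.
PRODUCT_CONTRACT = Contract(
    name="retail_product",
    version="1.0",
    fields={
        "sku": FieldRule(str, pattern=r"sku_\d+"),
        "price": FieldRule(float, minimum=0.0),
        "currency": FieldRule(str, enum=("USD", "EUR", "GBP")),
        "description": FieldRule(str, required=False),
    },
    max_staleness_hours=24,
)


def check(record: dict, contract: Contract) -> list:
    """Return a list of violations; an empty list means the record is valid."""
    problems = []
    for name, rule in contract.fields.items():
        value = record.get(name)
        if value is None:
            if rule.required:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rule.type_):
            problems.append(f"{name}: expected {rule.type_.__name__}")
            continue
        if rule.enum is not None and value not in rule.enum:
            problems.append(f"{name}: not in allowed enum")
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            problems.append(f"{name}: does not match pattern")
        if rule.minimum is not None and value < rule.minimum:
            problems.append(f"{name}: below minimum")
    return problems
```

Calling `check(record, PRODUCT_CONTRACT)` on each extracted record gives the producer one explicit list of reasons a record was rejected, instead of leaving the consumer to discover them later.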
02 How it works in practice
Instead of extracting data and writing it directly to a database, the extraction worker passes the JSON payload through a validation layer. This layer checks the payload against the contract. If it passes, it proceeds to the delivery sink. If it fails, the record is flagged with a specific violation code (e.g., type_error, missing_required_field) and routed to a dead-letter queue for inspection. This ensures downstream pipelines never crash due to unexpected data shapes.
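
The routing step described above could look roughly like the following sketch. The contract layout, the field names, and the functions `first_violation` and `route` are assumptions made for illustration; in-memory lists stand in for a real delivery sink and dead-letter queue.

```python
# Hypothetical contract: field name -> (expected type, required?)
CONTRACT = {
    "sku": (str, True),
    "price": (float, True),
    "in_stock": (bool, True),
    "description": (str, False),
}


def first_violation(record):
    """Return a violation code for the first broken rule, or None if valid."""
    for name, (expected_type, required) in CONTRACT.items():
        value = record.get(name)
        if value is None:
            if required:
                return "missing_required_field"
            continue
        if not isinstance(value, expected_type):
            return "type_error"
    return None


def route(records):
    """Split records into (delivered, dead_letter) according to the contract."""
    delivered, dead_letter = [], []
    for record in records:
        code = first_violation(record)
        if code is None:
            delivered.append(record)
        else:
            # Keep the original payload next to the code so it can be inspected.
            dead_letter.append({"violation": code, "record": record})
    return delivered, dead_letter
```

Storing the violation code alongside the untouched record is a deliberate choice in this sketch: it lets an engineer group failures by cause and replay the records once the extraction logic is fixed.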
03 Schema evolution and versioning
Target websites change constantly, meaning the data you extract will inevitably change. Data contracts handle this through strict versioning. If a target site adds a new, valuable field, you don't just silently append it to the output. You draft a new version of the contract (e.g., v2.0), update the downstream consumers to expect the new field, and then switch the pipeline to enforce the new version. This prevents "schema drift" from breaking downstream dashboards.
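
One way to picture the versioning workflow is a small registry plus a switch-over check. This is a sketch under assumptions: `REGISTRY`, `diff_versions`, and `can_switch` are hypothetical names, and the `rating` field is an invented example of a newly added field.

```python
# Hypothetical registry: contract version -> {field name: type name}
REGISTRY = {
    "1.0": {"sku": "string", "price": "number"},
    "2.0": {"sku": "string", "price": "number", "rating": "number"},
}


def diff_versions(old, new):
    """Summarize what changed between two contract versions."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(k for k in shared if old[k] != new[k]),
    }


def can_switch(target_version, consumer_versions):
    """Allow enforcement of a version only once every consumer supports it.

    consumer_versions maps a consumer name to the set of versions it accepts.
    """
    return all(target_version in supported
               for supported in consumer_versions.values())
```

For example, `can_switch("2.0", {"dashboard": {"1.0", "2.0"}, "pricing_model": {"1.0"}})` returns `False` until the second consumer also declares support for `"2.0"`, which mirrors the "update consumers first, then switch" order described above.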
04 How DataFlirt handles it
We treat data contracts as first-class citizens. Every pipeline we build is bound to a JSON Schema contract stored in our central registry. Validation happens in-memory at the edge, immediately after extraction. If a target site pushes a redesign that breaks our selectors, the contract validation fails instantly, quarantining the bad records and alerting our engineers. We fix the selector, re-run the extraction, and deliver clean data. Your warehouse never sees the breakage.
05 The silent failure it prevents
Without a contract, the most dangerous failures are silent. If a site changes its price field from 49.99 to "Contact for Price", a naive scraper will extract the string. If your database column accepts strings, the write succeeds. Downstream, an automated pricing algorithm tries to calculate an average, hits the string, and crashes the entire nightly ETL job. A data contract catches the string at the source, preventing the cascading failure.
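
The failure mode is easy to reproduce in a few lines. The sketch below uses invented sample prices and two hypothetical functions, `naive_average` and `contracted_average`, to contrast discovering the bad value downstream with catching it at the source.

```python
# Sample scraped values; the third one is the "redesigned" price field.
scraped_prices = [49.99, 52.50, "Contact for Price"]


def naive_average(prices):
    # No contract: the string is only discovered here, far downstream,
    # where sum() raises TypeError on float + str.
    return sum(prices) / len(prices)


def contracted_average(prices):
    # Contract rule (assumed): price must be a float.
    # Anything else is set aside before the calculation runs.
    valid = [p for p in prices if isinstance(p, float)]
    quarantined = [p for p in prices if not isinstance(p, float)]
    return sum(valid) / len(valid), quarantined
```

`naive_average(scraped_prices)` raises a `TypeError`, while `contracted_average(scraped_prices)` returns the average of the two valid prices along with the quarantined string for inspection.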
// 03 — the contract model

How strict is
the agreement?

Data contracts are evaluated on coverage, strictness, and breach rate. DataFlirt monitors these metrics per pipeline to ensure downstream consumers never ingest malformed records.

Contract Coverage: C = fields_validated / fields_extracted
A score of 1.0 means every extracted field is bound by a strict rule. (Data Engineering SLOs)
Breach Rate: B = records_quarantined / total_records
High breach rates indicate target site drift or overly strict bounds. (Pipeline Observability)
DataFlirt Quality Score: Q = 1 − (B × severity_weight)
Q > 0.99 is required for automated S3 delivery to proceed. (Internal Delivery Gate)
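
The three formulas translate directly into code. In this sketch the function names are hypothetical, and the `severity_weight` values used in the usage note are assumptions, since the text does not say how the weight is chosen.

```python
def contract_coverage(fields_validated, fields_extracted):
    """C = fields_validated / fields_extracted."""
    if fields_extracted <= 0:
        raise ValueError("fields_extracted must be positive")
    return fields_validated / fields_extracted


def breach_rate(records_quarantined, total_records):
    """B = records_quarantined / total_records."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return records_quarantined / total_records


def quality_score(b, severity_weight):
    """Q = 1 - (B * severity_weight); the weight is an assumed input."""
    return 1 - (b * severity_weight)


def delivery_allowed(q, gate=0.99):
    """Delivery proceeds only when Q is strictly above the gate."""
    return q > gate
```

As a worked case, 18 quarantined records out of 150,000 give B = 0.00012. With an assumed severity weight of 10, Q = 1 − 0.0012 = 0.9988, which clears the 0.99 gate. A breach rate of 2% with a weight of 1 gives Q = 0.98, which does not.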
// 04 — validation trace

Enforcing the contract
at the edge.

A live trace of an extraction worker validating scraped e-commerce records against a versioned data contract before writing to the delivery sink.

JSON Schema · Strict Mode · Quarantine
edge.dataflirt.io — live
CAPTURED
// contract initialization
schema.load: "s3://df-contracts/retail/v4.1.json"
schema.status: active // strict mode enabled

// record validation stream
record.id: "sku_88412"
field.price: 49.99 // type: float -> pass
field.currency: "USD" // enum: [USD, EUR, GBP] -> pass
field.stock_status: "In Stock" // type: boolean -> FAIL

// breach handling
contract.violation: type_mismatch // expected boolean, got string
action: "quarantine_record"
alert.route: "slack_data_eng"

// batch summary
records.processed: 150,000
records.passed: 149,982
records.quarantined: 18
batch.status: cleared // breach rate 0.012% < threshold
// 05 — breach vectors

Why contracts
actually fail.

The most common reasons scraped data breaches its contract, ranked by frequency across DataFlirt's managed pipelines. Target site UI updates drive the vast majority of schema drift.

PIPELINES MONITORED · 300+ active
VALIDATION LAYER · in-memory
UPDATED · 2026-05-19
01 Type coercion failures · String vs Int, usually due to new UI text
02 Missing required fields · DOM changes breaking CSS selectors
03 Enum violations · Target site added a new category
04 Semantic drift · Price now includes tax, breaking ranges
05 Format changes · Date string altered from US to EU format
// 06 — enforcement layer

Validate at the edge,
quarantine before the warehouse.

A contract is useless if it's only checked after the data lands in Snowflake. DataFlirt enforces data contracts directly at the extraction worker level. Every JSON record is validated in memory against its versioned schema. If a target site pushes a redesign that changes how prices are formatted, the worker catches the type mismatch instantly, quarantines the affected records, and alerts our engineering team—all before a single byte reaches your S3 bucket.

Contract Enforcement Status

Live state of the validation layer on a high-volume retail pipeline.

contract.id: retail_pricing_v4
enforcement.mode: strict
schema.registry: active
type_checks: enforced
null_constraints: enforced
quarantine.queue: 14 records
delivery.status: blocked_until_resolved

// 07 — FAQ

Common
questions.

About schema validation, contract ownership, handling drift, and how DataFlirt guarantees data quality at scale.

What is the difference between a data contract and a schema?
A schema just defines the shape of the data (e.g., "price is a float"). A data contract includes the schema, but adds semantics, SLAs, completeness thresholds, and enforcement mechanisms. It is an agreement on what happens when the schema is violated, not just a description of the fields.
Who owns the data contract in a scraping pipeline?
The data consumer (your analytics or engineering team) defines the contract based on their downstream requirements. The data producer (DataFlirt) owns the enforcement of that contract. We ensure that nothing leaves our infrastructure unless it strictly adheres to your defined rules.
How does DataFlirt handle a contract breach?
When a record fails validation, it is immediately routed to a quarantine queue. If the breach rate exceeds a predefined threshold (e.g., 1%), the delivery batch is halted, and an on-call engineer is paged to patch the extraction logic. Once fixed, the quarantined records are re-processed and backfilled.
Can a contract handle optional fields?
Yes. Fields that are expected to be occasionally missing are explicitly defined as nullable in the contract. However, we still track the null-rate. If an optional field suddenly goes from 5% null to 95% null, it triggers a semantic drift alert even if it technically passes the type check.
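
Null-rate tracking for an optional field can be sketched as follows. The functions `null_rate` and `null_drift_alert` and the 20-percentage-point alert threshold are assumptions for illustration; the text does not state the actual threshold used.

```python
def null_rate(records, field_name):
    """Fraction of records where the field is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field_name) is None)
    return missing / len(records)


def null_drift_alert(baseline_rate, current_rate, max_increase=0.20):
    """Flag a jump in null rate larger than max_increase (assumed threshold)."""
    return (current_rate - baseline_rate) > max_increase
```

Every record in such a batch still passes the type check, because a nullable field is allowed to be empty. The alert looks at the batch as a whole, which is why a jump from a 5% baseline to 95% is flagged even though no single record is invalid.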
Is it better to drop bad records or halt the pipeline?
It depends on your use case and the breach volume. For massive datasets where 99.9% completeness is acceptable, dropping or quarantining the 0.1% of malformed records is standard. If the breach rate spikes to 5%, it indicates a systemic selector failure, and halting the pipeline is safer than delivering heavily skewed data.
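
The choice between the two behaviours can be reduced to a threshold on the batch breach rate. This sketch assumes a 1% halt threshold, and the function name `batch_action` and its return labels are hypothetical.

```python
def batch_action(total_records, quarantined, halt_threshold=0.01):
    """Decide what to do with a batch based on its breach rate.

    Below the threshold, valid records are delivered and the rest stay
    quarantined. Above it, the whole batch is held back for investigation.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    rate = quarantined / total_records
    return "halt" if rate > halt_threshold else "deliver_valid_records"
```

A batch with 18 breaches in 150,000 records (0.012%) is delivered without its quarantined rows, while a batch with 50 breaches in 1,000 records (5%) is halted.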
Does enforcing contracts slow down extraction?
Negligibly. We use highly optimized, in-memory JSON Schema validators at the worker level. Validating a complex record against a strict contract adds less than 1 millisecond of overhead per record. The cost of validation is far lower than the cost of untangling corrupted data in your warehouse.
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h