
What is Schema Evolution?

Schema evolution is the controlled process of adapting data structures over time without breaking downstream consumers or losing historical context. In scraping pipelines, target websites change their DOM and API payloads constantly. Evolution dictates how those upstream changes—new fields, renamed keys, altered types—are safely propagated into your data warehouse. Without a formal evolution strategy, schema drift causes silent data loss and pipeline failures.

Data Engineering · Data Contracts · ETL · Schema Registry · Backward Compatibility
// 02 — definitions

Change is
inevitable.

How data engineering teams handle the reality that upstream sources will alter their formats, and downstream consumers still expect unbroken pipelines.


TL;DR

Schema evolution manages structural changes to data over time. It defines rules for backward, forward, and full compatibility. In web scraping, where you don't control the source schema, evolution is defensive: it ensures that when a target site adds a new pricing tier or changes a date format, your pipeline adapts gracefully rather than crashing or writing nulls.

01 Definition & structure
Schema evolution is the formal management of changes to data structures over time. In a scraping context, it defines how your pipeline reacts when a target website alters its DOM or API response. A robust evolution strategy relies on a schema registry and strict compatibility rules to ensure that changes (like adding a field, removing a field, or changing a type) do not break downstream ETL processes or data warehouse ingestion.
02 Compatibility modes
Evolution is governed by compatibility modes. Backward compatibility means consumers using the new schema can read data written with the old schema (you can delete fields or add optional fields with defaults, but you can't add required ones). Forward compatibility means consumers using the old schema can read data written with the new schema (you can add fields or delete optional ones, but you can't delete required ones). Full compatibility guarantees both, so only optional fields may be added or removed.
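A minimal sketch of how these checks could be expressed, assuming a toy schema model in which a field is either required or carries a default. The `Field` type and the `can_read` helper are hypothetical illustrations, not part of Avro, Protobuf, or any registry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    has_default: bool = False  # an optional field carries a default


def can_read(reader: dict[str, Field], writer: dict[str, Field]) -> list[str]:
    """List the reasons `reader` cannot decode data written with `writer`."""
    problems = []
    for name, field in reader.items():
        if name not in writer:
            # The reader can only fill a gap if it has a default to fall back on.
            if not field.has_default:
                problems.append(f"missing field with no default: {name}")
        elif writer[name].type != field.type:
            problems.append(f"type changed: {name}")
    return problems


def is_backward_compatible(old: dict[str, Field], new: dict[str, Field]) -> bool:
    # The new schema must be able to read data written with the old one.
    return not can_read(reader=new, writer=old)


def is_forward_compatible(old: dict[str, Field], new: dict[str, Field]) -> bool:
    # The old schema must be able to read data written with the new one.
    return not can_read(reader=old, writer=new)


def is_fully_compatible(old: dict[str, Field], new: dict[str, Field]) -> bool:
    return is_backward_compatible(old, new) and is_forward_compatible(old, new)


# Hypothetical change: drop a required field, add an optional one.
old = {"title": Field("title", "string"), "price_raw": Field("price_raw", "string")}
new = {"title": Field("title", "string"), "currency": Field("currency", "string", has_default=True)}
print(is_backward_compatible(old, new), is_forward_compatible(old, new))
```

In this toy model the change is backward compatible but not forward compatible, because a consumer pinned to the old schema still expects `price_raw`.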
03 Scraping vs internal databases
In internal software engineering, you control both the database schema and the application code, so you can migrate them together. In web scraping, you do not control the source. The target website will change its schema without warning. Therefore, schema evolution in scraping is inherently defensive—it is about mapping an unpredictable upstream reality into a stable, predictable downstream contract.
04 How DataFlirt handles it
We enforce schema-on-write at the edge. Every extracted record is validated against a versioned schema contract in our registry. If a target site drifts, records failing validation are routed to a quarantine queue. Our engineers review the drift, update the schema using backward-compatible aliases or default values, bump the version, and replay the quarantined records. Your data warehouse never sees a broken row.
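The validate, quarantine, and replay loop described here can be sketched roughly as follows. This is an illustration under assumptions, not DataFlirt's implementation: the contract layout, the `route` and `replay` helpers, and the sample records are all hypothetical.

```python
from typing import Any, Callable

Record = dict[str, Any]

# Hypothetical versioned contract: field name -> expected Python type.
CONTRACT = {"version": "1.0.0", "fields": {"title": str, "price": float}}


def validate(record: Record, contract: dict) -> list[str]:
    """Return validation errors; an empty list means the record conforms."""
    errors = []
    for name, expected in contract["fields"].items():
        if name not in record:
            errors.append(f"missing: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type: {name}")
    return errors


def route(records: list[Record], contract: dict) -> tuple[list[Record], list[dict]]:
    """Split records into accepted rows and quarantined items with their errors."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record, contract)
        if errors:
            quarantined.append(
                {"record": record, "errors": errors, "contract": contract["version"]}
            )
        else:
            accepted.append(record)
    return accepted, quarantined


def replay(quarantined: list[dict], contract: dict,
           upgrade: Callable[[Record], Record]) -> tuple[list[Record], list[dict]]:
    """Re-run quarantined records through an updated contract after a fix."""
    return route([upgrade(item["record"]) for item in quarantined], contract)


def alias_price(record: Record) -> Record:
    # Hypothetical fix: the source renamed "price" to "price_value".
    fixed = dict(record)
    if "price" not in fixed and "price_value" in fixed:
        fixed["price"] = fixed.pop("price_value")
    return fixed


records = [{"title": "Desk", "price": 120.0}, {"title": "Lamp", "price_value": 35.0}]
accepted, quarantined = route(records, CONTRACT)
contract_v1_1 = {"version": "1.1.0", "fields": CONTRACT["fields"]}
replayed, still_bad = replay(quarantined, contract_v1_1, alias_price)
```

Only `accepted` and `replayed` would ever be written downstream; anything left in `still_bad` stays in quarantine for another review pass.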
05 The silent failure of type coercion
The most dangerous schema changes aren't missing fields; they are silent type changes. If a target site changes a numeric price field to a string like "Call for pricing", a naive scraper will extract it without complaint, and the failure surfaces only later, when the data warehouse tries to cast the value to a number and either rejects the load or writes a null. Proper schema evolution catches this at the extraction layer, flagging the type mismatch before it poisons the dataset.
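One way to catch this at extraction time is to parse the price strictly and raise on anything non-numeric instead of letting it through. The sketch below is an assumption about how such a check might look; `parse_price` and `TypeMismatch` are hypothetical names, and stripping thousands separators is a simplification that would not suit every locale.

```python
from decimal import Decimal, InvalidOperation


class TypeMismatch(ValueError):
    """Raised when an extracted value cannot satisfy the contract's type."""


def parse_price(raw: object) -> Decimal:
    """Return the price as a Decimal, or raise TypeMismatch."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; never treat it as a price.
        raise TypeMismatch(f"price is not numeric: {raw!r}")
    if isinstance(raw, (int, float, Decimal)):
        value = Decimal(str(raw))
    elif isinstance(raw, str):
        try:
            value = Decimal(raw.strip().replace(",", ""))
        except InvalidOperation:
            raise TypeMismatch(f"price is not numeric: {raw!r}") from None
    else:
        raise TypeMismatch(f"price has unsupported type: {type(raw).__name__}")
    if not value.is_finite():
        # Decimal accepts "NaN" and "Infinity"; a price contract should not.
        raise TypeMismatch(f"price is not finite: {raw!r}")
    return value


for candidate in ["1,299.00", 19.99, "Call for pricing"]:
    try:
        print(candidate, "->", parse_price(candidate))
    except TypeMismatch as exc:
        print(candidate, "-> quarantine:", exc)
```

Raising a dedicated exception keeps the mismatch visible at the extraction layer, where the record can be quarantined, rather than leaving it for a warehouse cast to discover.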
// 03 — the compatibility matrix

How breaking
is a change?

Schema evolution relies on strict compatibility rules. DataFlirt's registry evaluates every incoming schema change against these constraints before allowing it to merge into the production pipeline.

Backward Compatibility: C_back = V_new can read D_old
Consumers using the new schema can read data written by the old schema. Adding a required field (one with no default) breaks this. Source: Avro/Protobuf standards
Forward Compatibility: C_fwd = V_old can read D_new
Consumers using the old schema can read data written by the new schema. Deleting a required field breaks this. Source: Avro/Protobuf standards
Schema Drift Rate: ΔS = (fields_added + fields_dropped) / total_fields
Monitored per target. High drift triggers manual review and contract renegotiation. Source: DataFlirt pipeline metrics
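As a worked example of the drift rate, the sketch below compares two sets of field names. It assumes `total_fields` means the field count of the previous schema version, which the formula does not specify; the `drift_rate` function and the sample field names are hypothetical.

```python
def drift_rate(old_fields: set[str], new_fields: set[str]) -> float:
    """(fields_added + fields_dropped) / total_fields, with total taken from the old schema."""
    if not old_fields:
        raise ValueError("old schema has no fields; drift rate is undefined")
    added = new_fields - old_fields
    dropped = old_fields - new_fields
    return (len(added) + len(dropped)) / len(old_fields)


# One field dropped and two added against a four-field schema.
old = {"sku", "title", "stock", "price_raw"}
new = {"sku", "title", "stock", "price_value", "currency"}
print(drift_rate(old, new))  # (2 + 1) / 4 = 0.75
```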
// 04 — schema registry trace

Merging a breaking
upstream change.

A target e-commerce site splits its single price string into a numeric value and a currency code. Here is how the schema registry handles the transition without breaking downstream consumers.

Avro · Schema Registry · CI/CD
edge.dataflirt.io — live
CAPTURED
// detect upstream change
target: "b2b_catalog_v4"
field_dropped: "price_raw" (string)
field_added: "price_value" (float)
field_added: "currency" (string)

// validate compatibility
check: backward_compatibility
status: FAIL - dropped field "price_raw"

// apply evolution rules
action: create alias "price_raw" -> concat(price_value, currency)
action: set default for "currency" -> "USD"

// re-validate
check: backward_compatibility
status: PASS

// deploy
schema_version: v4.1.0
pipeline_status: RUNNING
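The two evolution rules in the trace (a default for "currency" and an alias that rebuilds "price_raw") could be applied to each record along these lines. This is a sketch under assumptions: the `apply_evolution_rules` function is hypothetical, and the trace does not say how `concat` formats its output, so a single space separator is assumed.

```python
def apply_evolution_rules(record: dict) -> dict:
    """Map a record from the changed source onto the contract consumers expect."""
    out = dict(record)
    # Rule from the trace: default "currency" to "USD" when the source omits it.
    out.setdefault("currency", "USD")
    # Rule from the trace: alias "price_raw" to concat(price_value, currency).
    # The space separator is an assumption, not something the trace specifies.
    if "price_raw" not in out and "price_value" in out:
        out["price_raw"] = f"{out['price_value']} {out['currency']}"
    return out


print(apply_evolution_rules({"sku": "A1", "price_value": 19.99}))
```

Consumers that still read "price_raw" keep working, while newer consumers can switch to the numeric "price_value" and "currency" pair.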
// 05 — failure modes

Where schema
changes break.

Ranked by frequency of pipeline failures across unmanaged scraping setups. Type changes are the most destructive because they often pass initial parsing but crash downstream aggregations.

PIPELINES MONITORED: 300+ active
SCHEMA CHECKS: per record
UPDATED: 2026-05-19
01 Implicit type coercion: 92% of failures · String to int conversion fails on 'N/A'
02 Dropped required fields: 78% of failures · Target removes a DOM element entirely
03 Renamed keys without aliases: 64% of failures · API changes 'product_id' to 'sku_id'
04 Nested object flattening: 45% of failures · Array of objects becomes a single object
05 Enum value additions: 32% of failures · New status code breaks strict validation
// 06 — our architecture

Version everything,

trust nothing.

DataFlirt treats schema evolution as a first-class citizen. We don't just extract data; we extract against a versioned data contract. When a target site changes its structure, our extraction layer quarantines the anomalous records, infers the new schema, and alerts our engineering team. We map the new fields to the existing contract using aliases and default values, ensuring your downstream ingestion never sees a breaking change.

Schema Registry Status

Live status of a schema contract for a real estate pipeline.

contract.id: re-listings-v7
compatibility.mode: BACKWARD_TRANSITIVE
fields.total: 42
fields.deprecated: 3 (aliased)
validation.rate: 99.98% (healthy)
quarantine.queue: 14 records
downstream.status: ingesting normally


// 07 — FAQ

Common
questions.

About schema management, compatibility rules, handling upstream drift, and how DataFlirt protects your data warehouse.

What is the difference between schema drift and schema evolution?
Schema drift is the problem; schema evolution is the solution. Drift happens when a target website changes its structure unexpectedly. Evolution is the deliberate, versioned process of updating your extraction logic and data contracts to accommodate that drift without breaking downstream systems.
How do you handle a target site deleting a field we rely on?
We cannot invent data that no longer exists, but we can prevent pipeline crashes. We soft-deprecate the field in the schema registry, provide a default value (like null or "NOT_PROVIDED"), and alert your team. The schema remains backward compatible, so your ETL jobs won't fail, giving you time to adjust your analytics.
Is it better to use schema-on-read or schema-on-write for scraping?
Schema-on-write is essential for data quality. If you use schema-on-read (dumping raw JSON into a data lake), you defer the pain of schema drift to the analytics team, who will eventually face a swamp of inconsistent formats. We enforce schema-on-write at the extraction layer, quarantining records that don't match the contract.
How does DataFlirt prevent downstream breakages when schemas evolve?
Through strict compatibility checks and data contracts. Before a new schema version is deployed, it must pass automated backward compatibility tests. If a target renames a field, we use aliases to map the new name to the old contract. Your data warehouse continues to receive the format it expects.
What happens to historical data when a schema evolves?
It depends on the compatibility mode. With backward compatibility, your new code can read the old data, so historical records remain valid. If a fundamentally new field is added, historical records will simply return the defined default value for that field when queried.
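To illustrate how a historical record picks up a default for a field it never had, here is a minimal sketch. The schema layout, the `_REQUIRED` sentinel, and the "energy_rating" field are hypothetical and chosen only for the example.

```python
_REQUIRED = object()  # sentinel: the field has no default and must be present

# Hypothetical v2 schema: "energy_rating" was added with a default.
SCHEMA_V2 = {"listing_id": _REQUIRED, "price": _REQUIRED, "energy_rating": "UNKNOWN"}


def read_with_schema(stored: dict, schema: dict) -> dict:
    """Read a stored record through a newer schema, filling defaults for new fields."""
    out = {}
    for name, default in schema.items():
        if name in stored:
            out[name] = stored[name]
        elif default is _REQUIRED:
            raise KeyError(f"required field missing: {name}")
        else:
            out[name] = default
    return out


historical = {"listing_id": "re-1001", "price": 450000}  # written before the field existed
print(read_with_schema(historical, SCHEMA_V2))
```

The stored record is never rewritten; the default is applied only at read time.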
Are there legal implications to automatic schema evolution?
Yes. If a target site suddenly starts exposing Personally Identifiable Information (PII) in a previously benign API endpoint, an automatic schema evolution system might ingest it, violating GDPR or CCPA. DataFlirt uses explicit field whitelisting—we only extract the fields defined in the contract. New fields are ignored until explicitly approved.
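Explicit field whitelisting of this kind amounts to projecting every raw payload onto the contract's field list. The sketch below shows the idea under assumptions; the field names and the `project_to_contract` helper are hypothetical.

```python
# Hypothetical contract: the only fields approved for extraction.
CONTRACT_FIELDS = frozenset({"listing_id", "price", "city"})


def project_to_contract(raw: dict, allowed: frozenset = CONTRACT_FIELDS) -> tuple[dict, list[str]]:
    """Keep only approved fields; report the names of anything else for review."""
    kept = {key: value for key, value in raw.items() if key in allowed}
    ignored = sorted(set(raw) - allowed)
    return kept, ignored


payload = {"listing_id": "re-1001", "price": 450000, "city": "Pune", "agent_email": "a@example.com"}
kept, ignored = project_to_contract(payload)
print(kept)
print(ignored)
```

Only the names of unapproved fields are reported, never their values, so a newly exposed PII field can be flagged for approval without being ingested.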

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h