
What is Data Versioning?

Data versioning is the practice of treating datasets like source code, capturing immutable snapshots of data at specific points in time. In scraping pipelines, where target schemas drift and extraction logic evolves, it allows engineers to audit historical states, roll back corrupted partitions, and reproduce machine learning training sets. Without versioning, a bad extraction deployment permanently overwrites good data, turning a temporary bug into an unrecoverable pipeline failure.

Data Engineering · Immutability · Audit Trail · Delta Lake · Time Travel
// 02 — definitions

Git for
your datasets.

Why overwriting yesterday's scrape with today's run is a catastrophic anti-pattern in production data engineering.


TL;DR

Data versioning creates immutable snapshots of your data at every write operation. Using formats like Apache Iceberg or Delta Lake, it enables time-travel queries, zero-copy cloning, and instant rollbacks when a scraped schema changes unexpectedly and poisons your downstream analytics.

01 Definition & structure

Data versioning is the architectural pattern of maintaining historical states of a dataset alongside its current state. Instead of `UPDATE` or `DELETE` operations physically overwriting data on disk, new data files are written, and a metadata manifest is updated to point to the new files. The old files remain intact.

This creates a linear, immutable history of the dataset. Every transaction generates a unique snapshot ID, allowing engineers to query the table exactly as it existed at any point in the past.
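
The pattern can be illustrated with a toy in-memory model. This is a minimal sketch, not how Iceberg or Delta Lake are implemented: `VersionedTable`, `Snapshot`, and the file names are hypothetical, and a real table format keeps this metadata in manifest files on object storage rather than in Python lists.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # names of the immutable data files visible in this version


@dataclass
class VersionedTable:
    """Toy model: data files are never modified, only the snapshot list grows."""
    storage: dict = field(default_factory=dict)    # file name -> rows
    snapshots: list = field(default_factory=list)  # append-only history

    def commit(self, add=None, remove=()):
        """Write new files, then record a snapshot that references them."""
        current = self.snapshots[-1].files if self.snapshots else ()
        for name, rows in (add or {}).items():
            if name in self.storage:
                raise ValueError(f"{name} already exists; files are immutable")
            self.storage[name] = list(rows)
        visible = tuple(f for f in current if f not in remove) + tuple(add or {})
        snap = Snapshot(len(self.snapshots) + 1, visible)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def read(self, snapshot_id=None):
        """Return rows as of a snapshot (latest when snapshot_id is None)."""
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        return [row for name in snap.files for row in self.storage[name]]


table = VersionedTable()
v1 = table.commit(add={"part-0001": [{"sku": "A", "price": 10}]})
# An "update" writes a replacement file and drops the old one from the manifest only.
v2 = table.commit(add={"part-0002": [{"sku": "A", "price": 12}]}, remove=("part-0001",))

print(table.read())    # current state: price 12
print(table.read(v1))  # state as of the first snapshot: price 10
```

Note that `part-0001` is still present in `storage` after the second commit, which is what makes the older snapshot readable.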

02 How it works in practice

When a scraping job finishes, it doesn't dump a massive CSV into a bucket and overwrite the old one. Instead, it writes a set of Parquet files containing only the new or changed records. The table format (like Iceberg) then generates a new metadata file that lists all valid data files for the new version.

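If a downstream consumer queries the table, they read the latest metadata and get the current state. If they add a time-travel clause such as `TIMESTAMP AS OF` followed by yesterday's timestamp (the exact syntax varies by query engine), the engine resolves the snapshot that was current at that time, reads its metadata file, and ignores anything committed afterwards.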
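
Resolving a timestamp to a snapshot amounts to finding the latest commit at or before the requested time. The sketch below assumes a sorted in-memory snapshot log; `SNAPSHOT_LOG`, its ids, and its commit times are hypothetical values chosen for illustration, and a real engine reads the equivalent history from table metadata.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
SNAPSHOT_LOG = [
    (datetime(2026, 5, 17, 8, 0, tzinfo=timezone.utc), 101),
    (datetime(2026, 5, 18, 8, 0, tzinfo=timezone.utc), 102),
    (datetime(2026, 5, 19, 8, 0, tzinfo=timezone.utc), 103),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before ts."""
    commit_times = [committed_at for committed_at, _ in log]
    idx = bisect_right(commit_times, ts)
    if idx == 0:
        raise LookupError(f"no snapshot exists at or before {ts.isoformat()}")
    return log[idx - 1][1]


yesterday_evening = datetime(2026, 5, 18, 23, 59, tzinfo=timezone.utc)
print(snapshot_as_of(SNAPSHOT_LOG, yesterday_evening))  # 102
```

A query for a time before the first commit raises `LookupError` instead of silently returning an empty table, which is one reasonable design choice for a sketch like this.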

03 The role of open table formats

Versioning at scale requires specialized formats. Apache Iceberg, Delta Lake, and Apache Hudi are the industry standards. They sit on top of raw object storage (S3/GCS) and provide the transactional guarantees (ACID) and metadata management necessary to make versioning performant. Without them, finding the right version of a file across a petabyte-scale data lake requires expensive, slow directory listings.

04 How DataFlirt handles it

We treat all scraped data as an append-only log. When we deliver datasets to clients, we write directly to their Iceberg or Delta tables. If a target website changes its layout and our extraction logic temporarily fails, the bad data is committed as a new snapshot. We simply issue a rollback command to revert the table to the previous snapshot, patch the scraper, and backfill the missing hours. The client's analytics dashboards never see the corrupted state.

05 Did you know: The cost of overwrites

In traditional data warehouses, an `UPDATE` statement is incredibly expensive. It requires locking the row, writing to a transaction log, updating the row in place, and updating all associated indexes. In a versioned data lake, an update is just a fast, lock-free sequential write of a new file, followed by a metadata pointer swap. Versioning isn't just safer—for large analytical workloads, it's actually faster.

// 03 — storage math

The cost of
immutability.

Versioning doesn't mean duplicating the entire dataset on every run. Modern table formats use pointer-based snapshots and delta files to keep storage costs proportional to the change rate, not the total volume.

Storage Cost = Base_Size + Σ_t Delta_Size_t
Storage scales with churn (updated/deleted rows), not total records. Source: Delta Lake architecture

Time Travel Query = SELECT * FROM table TIMESTAMP AS OF t
Retrieves the exact state at time t without restoring backups. This is the Spark SQL / Delta Lake spelling; ANSI SQL:2011 writes the equivalent clause as FOR SYSTEM_TIME AS OF.

DataFlirt Retention Ratio = Active_Bytes / Historical_Bytes = 1:4
30-day version retention across our managed data lakes. Source: Internal SLO
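
As a worked example of the storage formula, the sketch below compares delta-based versioning with keeping one full copy per version. The 1 TB base size, the 30 daily versions, and the 1% daily churn are assumptions picked for the arithmetic, not measured figures.

```python
def storage_cost_bytes(base_size, delta_sizes):
    """Storage Cost = Base_Size + sum of Delta_Size_t over retained versions."""
    return base_size + sum(delta_sizes)


TB = 10**12
GB = 10**9

base = 1 * TB
# Assumption: 30 daily versions, each rewriting files that hold ~1% of the table.
deltas = [base // 100] * 30

versioned = storage_cost_bytes(base, deltas)
full_copies = base * (1 + len(deltas))  # naive alternative: one full copy per version

print(deltas[0] / GB)    # 10.0 GB added per version
print(versioned / TB)    # 1.3 TB with delta-based versioning
print(full_copies / TB)  # 31.0 TB with a full copy per version
```
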
// 04 — pipeline execution

Committing a
scraped partition.

A daily scrape of 2.4M product records being committed to an Apache Iceberg table. The transaction creates a new snapshot without touching historical data.

Apache Iceberg · S3 · Time Travel
// init transaction
job.id: "scrape-catalog-v7"
table: "prod.catalog.raw_items"
current_snapshot: 84920114829

// write data files
s3.put: "s3://df-lake/data/part-0001.parquet" ok
s3.put: "s3://df-lake/data/part-0002.parquet" ok
records_written: 2,410,992

// write metadata & commit
manifest.generate: "snap-84920114830.avro"
schema.validation: ok // no drift detected
commit.status: ok
new_snapshot: 84920114830

// audit
rows_added: 14,201
rows_updated: 84,112
rows_deleted: 412
time_travel_ref: "2026-05-19T08:00:00Z"
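
The `schema.validation` step in the log above can be approximated by a guard that runs before the commit. This is a rough sketch under stated assumptions: `EXPECTED_COLUMNS`, `MAX_NULL_RATE`, and `validate_batch` are hypothetical names, and the 5% threshold is illustrative rather than a DataFlirt setting.

```python
EXPECTED_COLUMNS = {"sku", "title", "price"}  # hypothetical schema
MAX_NULL_RATE = 0.05                          # hypothetical threshold


def validate_batch(rows, expected_columns=EXPECTED_COLUMNS, max_null_rate=MAX_NULL_RATE):
    """Return a list of problems; an empty list means the batch may be committed."""
    problems = []
    if not rows:
        return ["batch is empty"]
    for i, row in enumerate(rows):
        if set(row) != expected_columns:
            problems.append(f"row {i}: columns {sorted(row)} differ from expected schema")
            break  # one drift report is enough to block the commit
    for column in sorted(expected_columns):
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return problems


good = [{"sku": "A", "title": "Lamp", "price": 10.0}] * 20
broken = [{"sku": "A", "title": "Lamp", "price": None}] * 20  # selector stopped matching

print(validate_batch(good))    # []
print(validate_batch(broken))  # ['price: null rate 100% exceeds 5%']
```

A guard like this catches silent nulls before they become a snapshot; anything it misses is still recoverable through a rollback.
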
// 05 — failure recovery

Why pipelines
need rollbacks.

The most common reasons DataFlirt clients use time-travel queries to revert to a previous data version. Schema drift remains the primary driver of data corruption.

PIPELINES: 300+ active
RETENTION: 30 days
UPDATED: 2026-05-19
01 Schema drift / bad extraction
89% of rollbacks · Silent nulls written to production

02 Upstream data poisoning
72% of rollbacks · Target site published bad data

03 ML model reproducibility
64% of rollbacks · Training set needs exact historical state

04 Accidental deletion / DROP
41% of rollbacks · Human error in downstream transform

05 Regulatory audit
28% of rollbacks · Proving what data was held on date X
// 06 — our architecture

Append-only,

because updates are a myth.

In web scraping, you never truly update a record—you observe a new state. DataFlirt's delivery layer treats all extractions as append-only events. We write new Parquet files and update the Iceberg manifest to point to the latest state. If a selector breaks and we extract 10,000 null prices, the previous prices aren't overwritten. We simply roll the manifest pointer back to the previous snapshot, fix the selector, and re-run. Zero downtime, zero data loss.
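
The rollback described here is a metadata operation, which the following sketch tries to make concrete. `SnapshotPointer` and the snapshot names are hypothetical; in a real deployment the table format's own rollback or restore command does this work.

```python
class SnapshotPointer:
    """Tracks which snapshot readers see; history itself is never rewritten."""

    def __init__(self):
        self.history = []    # snapshot ids in commit order
        self.current = None  # id that readers resolve to

    def commit(self, snapshot_id):
        self.history.append(snapshot_id)
        self.current = snapshot_id

    def rollback_to(self, snapshot_id):
        if snapshot_id not in self.history:
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current = snapshot_id  # pointer move only; no data files are touched


pointer = SnapshotPointer()
pointer.commit("snap-good")
pointer.commit("snap-null-prices")  # broken selector wrote null prices

pointer.rollback_to("snap-good")
print(pointer.current)  # snap-good
print(pointer.history)  # ['snap-good', 'snap-null-prices']
```

The bad snapshot stays in `history`, so it remains available for debugging the selector after readers have been pointed back at the good state.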

Iceberg Table State

Snapshot metadata for a live pricing table.

table.name: df_pricing_global
format: Apache Iceberg v2
current_snapshot: snap-99281
parent_snapshot: snap-99280
retention.policy: 30 days
total_size: 4.2 TB
active_size: 850 GB


// 07 — FAQ

Common
questions.

About data versioning, open table formats, storage costs, and how DataFlirt ensures pipeline immutability.

What's the difference between data versioning and database backups?
Backups are coarse, monolithic copies taken at intervals (e.g., nightly). Data versioning is granular and transactional—every write operation creates a new version. You can query a versioned table exactly as it looked at 2:14 PM without restoring a 5TB database from cold storage.
Does versioning double my storage costs?
No. Modern formats like Delta Lake and Apache Iceberg use copy-on-write or merge-on-read mechanics, so each version stores only the files that changed. If a write touches files holding 1% of a 1TB dataset, the new version adds ~10GB of storage, not 1TB. Under copy-on-write, updates scattered across many files rewrite more data, so the overhead tracks files touched rather than rows changed.
How does DataFlirt handle versioning for delivered datasets?
We deliver data into S3/GCS using open table formats (Iceberg or Delta). Every delivery run commits a new snapshot. If you consume our data, you can use the time-travel syntax of your query engine, such as Snowflake or Databricks, to query historical states or roll back if your internal ETL breaks.
Why is versioning critical for machine learning?
Model reproducibility. If you train a pricing model on scraped data today, and need to debug it six months from now, you must train on the exact same data. If the dataset was overwritten with new prices, debugging is impossible. Versioning guarantees you can retrieve the exact training snapshot.
Can I version JSON or CSV files?
Technically yes, by tracking them with DVC (which keeps small pointer files in Git and the data itself in remote storage) or by using S3 object versioning. But it's highly inefficient for large datasets, because every change stores a whole new copy of the file. Table formats (Parquet + Iceberg/Delta) are designed specifically for tabular data versioning at petabyte scale.
How long should I retain historical versions?
For active scraping pipelines, 30 days is the standard operational retention: enough time to catch silent extraction failures and roll back. For compliance or ML training, specific snapshots are often tagged and retained indefinitely, while intermediate operational versions are vacuumed.
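
The retention policy above (a 30-day window plus indefinitely retained tagged snapshots) can be sketched as a vacuum planner. Everything here is an assumption for illustration: `plan_vacuum`, the snapshot dictionary keys, and the sample file names are hypothetical, and real table formats ship their own snapshot-expiry and vacuum commands.

```python
from datetime import datetime, timedelta, timezone


def plan_vacuum(snapshots, now, retention=timedelta(days=30)):
    """Split snapshots into kept/expired and list files that are safe to delete.

    Each snapshot is a dict with hypothetical keys:
    id, committed_at (datetime), files (set of file names), tagged (bool).
    """
    cutoff = now - retention
    latest_id = max(snapshots, key=lambda s: s["committed_at"])["id"]
    kept, expired = [], []
    for snap in snapshots:
        keep = snap["tagged"] or snap["committed_at"] >= cutoff or snap["id"] == latest_id
        (kept if keep else expired).append(snap)
    # A file may only be deleted when no retained snapshot still references it.
    referenced = set().union(*(s["files"] for s in kept))
    orphaned = set().union(*(s["files"] for s in expired)) - referenced
    return [s["id"] for s in kept], [s["id"] for s in expired], orphaned


now = datetime(2026, 5, 19, tzinfo=timezone.utc)
snapshots = [
    # Tagged snapshot, e.g. an ML training set that must stay reproducible.
    {"id": 1, "committed_at": now - timedelta(days=90), "files": {"a.parquet"}, "tagged": True},
    {"id": 2, "committed_at": now - timedelta(days=45), "files": {"a.parquet", "b.parquet"}, "tagged": False},
    {"id": 3, "committed_at": now - timedelta(days=2), "files": {"a.parquet", "c.parquet"}, "tagged": False},
]

kept, expired, orphaned = plan_vacuum(snapshots, now)
print(kept)      # [1, 3]
print(expired)   # [2]
print(orphaned)  # {'b.parquet'}
```

Only `b.parquet` is deletable: `a.parquet` belongs to the expired snapshot too, but retained snapshots still reference it.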