
What is Delta Lake?

Delta Lake is an open-source storage framework that brings ACID transactions, scalable metadata handling, and unified streaming and batch data processing to existing data lakes. For scraping pipelines, it solves the "append-only" problem by letting engineers safely upsert new records, enforce schema contracts, and time-travel through historical dataset versions without corrupting the underlying Parquet files.

Data Lakehouse · ACID Transactions · Parquet · Schema Evolution · Time Travel
// 02 — definitions

Transactions for
the data lake.

How to stop treating your scraped data like a fragile dump of CSVs and start treating it like a production database.


TL;DR

Delta Lake sits on top of your existing cloud storage (S3, GCS) and manages Parquet files via a transaction log. It guarantees that if a scraper crashes mid-write, your dataset isn't left in a corrupted, half-written state. It is the foundational layer that enables the medallion architecture (Bronze, Silver, Gold) for modern data pipelines.

01 Definition & structure

Delta Lake is an open-source storage layer that brings reliability to data lakes. At its core, a Delta table is just a directory in object storage (like S3) containing two things: a collection of data files in Parquet format, and a _delta_log directory containing a transaction log.

The transaction log is the source of truth. It records every change made to the table — which files were added, which were removed, and what the schema looks like. When a query engine reads a Delta table, it first reads the log to determine exactly which Parquet files to scan, ignoring any files that are partially written or marked for deletion.
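
The log-first read path described above can be sketched in a few lines of plain Python. This is a toy model, not the Delta Lake API: the commit dictionaries, file names, and the `active_files` helper are all hypothetical, and real commits carry much more metadata.

```python
def active_files(log):
    """Replay add/remove actions in commit order and return the live file set."""
    live = set()
    for commit in log:
        for path in commit.get("remove", []):
            live.discard(path)
        for path in commit.get("add", []):
            live.add(path)
    return live


# Hypothetical three-commit history for a scraped catalog table.
log = [
    {"version": 0, "add": ["part-0000.parquet"]},
    {"version": 1, "add": ["part-0001.parquet"]},
    {"version": 2, "remove": ["part-0000.parquet"], "add": ["part-0002.parquet"]},
]

# Everything physically present in the table directory, including a file
# a crashed scraper left behind and a file that a later commit removed.
files_on_disk = {
    "part-0000.parquet",
    "part-0001.parquet",
    "part-0002.parquet",
    "part-0003.parquet.tmp",
}

to_scan = active_files(log)
ignored = files_on_disk - to_scan

print(sorted(to_scan))  # ['part-0001.parquet', 'part-0002.parquet']
print(sorted(ignored))  # ['part-0000.parquet', 'part-0003.parquet.tmp']
```

The reader never lists the directory to decide what to scan. A file that is not referenced by a committed log entry is invisible, whatever state it is in on disk.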

02 ACID transactions on object storage

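Cloud object stores like S3 don't support in-place file modifications, and they offer no way to commit several files as one atomic unit. (S3 itself has been strongly consistent for reads since late 2020, so the gap is atomicity across files, not consistency.) If a scraper crashes partway through writing a multi-file batch, a downstream query that simply lists the directory might read an incomplete batch. Delta Lake provides ACID guarantees using optimistic concurrency control.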

Writers create new Parquet files invisibly. Only when the write is complete do they attempt to commit a new entry to the transaction log. If two scrapers try to modify the same data simultaneously, Delta detects the conflict, allows one to succeed, and forces the other to retry. Readers are never blocked by writers.
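
A minimal sketch of that commit race, assuming a local directory stands in for `_delta_log` and the filesystem's exclusive-create mode stands in for the atomic "put if absent" that the storage layer has to provide. The helper names are hypothetical, and a real writer would also check whether the winning commit logically conflicts with its own changes before retrying.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Highest committed version number, or -1 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def try_commit(log_dir, version, actions):
    """Create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" is exclusive create: it fails if another writer got there first.
        with open(path, "x") as fh:
            json.dump(actions, fh)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, read_version, actions, max_attempts=3):
    """Try to commit on top of `read_version`, moving forward after each lost race."""
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Lost the race. Re-read the log and aim for the next free version.
        version = latest_version(log_dir) + 1
    raise RuntimeError("gave up after repeated commit conflicts")


with tempfile.TemporaryDirectory() as log_dir:
    try_commit(log_dir, 0, {"add": ["part-0000.parquet"]})
    snapshot = latest_version(log_dir)  # both scrapers start from version 0

    v_a = commit_with_retry(log_dir, snapshot, {"add": ["part-a.parquet"]})
    v_b = commit_with_retry(log_dir, snapshot, {"add": ["part-b.parquet"]})
    committed = sorted(os.listdir(log_dir))

print(v_a, v_b)  # 1 2
```

Both writers prepared their data against version 0, but only one of them can own version 1. The second one notices the collision at commit time, not at write time, which is why readers and other writers are never blocked while data files are being produced.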

03 Schema enforcement & evolution

Data lakes often turn into data swamps because anyone can write any file format with any columns. Delta Lake enforces schema on write. If a scraper tries to insert a string into a numeric price column, the transaction fails before any data is committed.

However, schemas in scraping inevitably change. Delta supports safe schema evolution. By setting a write option such as mergeSchema, you can explicitly allow new columns to be added to the table. Delta updates the metadata in the transaction log, and all historical records simply return null for the new column when queried.
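
The two behaviours, reject by default and widen only on request, can be illustrated with a toy validator. This is a sketch in plain Python, not Delta's implementation: `validate_and_evolve`, `read_row`, and the `merge_schema` argument are hypothetical names chosen to mirror the idea.

```python
def validate_and_evolve(schema, record, merge_schema=False):
    """Return the (possibly widened) schema, or raise if the record violates it."""
    new_schema = dict(schema)
    for column, value in record.items():
        if column not in schema:
            if not merge_schema:
                raise ValueError(f"unexpected column: {column}")
            # Evolution is opt-in: the new column is recorded in the schema.
            new_schema[column] = type(value)
        elif value is not None and not isinstance(value, schema[column]):
            raise TypeError(
                f"{column}: expected {schema[column].__name__}, "
                f"got {type(value).__name__}"
            )
    return new_schema


def read_row(schema, stored_row):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {column: stored_row.get(column) for column in schema}


schema = {"sku_id": str, "price": float}
old_row = {"sku_id": "A-1", "price": 19.99}

# A string in the numeric price column is rejected before anything is stored.
try:
    validate_and_evolve(schema, {"sku_id": "A-2", "price": "19.99"})
    rejected = False
except TypeError:
    rejected = True

# With evolution enabled, a new column widens the schema instead of failing.
evolved = validate_and_evolve(
    schema,
    {"sku_id": "A-3", "price": 5.0, "currency": "EUR"},
    merge_schema=True,
)

print(rejected)                    # True
print(read_row(evolved, old_row))  # {'sku_id': 'A-1', 'price': 19.99, 'currency': None}
```

Note that the old row is never rewritten. The null for `currency` is produced at read time by projecting the stored data onto the newer schema.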

04 How DataFlirt handles it

We use Delta Lake as the delivery interface for enterprise clients. Instead of pushing raw JSON or CSV files that clients have to manually deduplicate, we maintain a Gold-tier Delta table in their S3 bucket. Our pipelines run MERGE operations to upsert new scraped records and update changed ones.

We run automated OPTIMIZE and VACUUM jobs weekly to compact small files and purge stale data outside the retention window, ensuring the client's Athena or Databricks queries remain lightning fast without ballooning their AWS storage bill.
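
As a rough illustration of what those two maintenance jobs decide, here is a toy planner in plain Python. It is a sketch under stated assumptions: the greedy grouping is not Delta's actual bin-packing strategy, and the function names, file names, dates, and the 30-day retention value are hypothetical.

```python
from datetime import datetime, timedelta


def plan_compaction(file_sizes, target_size):
    """Greedily group files into bins whose total size stays at or under target_size."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


def vacuum_candidates(tombstones, now, retention_days):
    """Return paths that were removed from the table more than retention_days ago."""
    cutoff = now - timedelta(days=retention_days)
    return [path for path, removed_at in tombstones.items() if removed_at < cutoff]


# 250,000 scraper outputs of 10 KiB each, compacted toward 1 GiB files.
small_files = [10 * 1024] * 250_000
bins = plan_compaction(small_files, target_size=1024 ** 3)

# Files already removed by earlier commits, with the time they were tombstoned.
tombstones = {
    "part-0001.parquet": datetime(2026, 4, 1),
    "part-0007.parquet": datetime(2026, 5, 10),
}
purgeable = vacuum_candidates(tombstones, now=datetime(2026, 5, 19), retention_days=30)

print(len(bins))   # 3
print(purgeable)   # ['part-0001.parquet']
```

The query engine goes from opening a quarter of a million objects to opening three, and only the tombstoned file that has aged out of the retention window is eligible for deletion. The more recent tombstone stays so that time travel to that period keeps working.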

05 Time travel and auditing

Because Delta Lake doesn't immediately delete old Parquet files when records are updated, you get "time travel" out of the box. You can query the table exactly as it looked at a specific timestamp or version number.

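This is invaluable for scraping pipelines. If a bad selector deployment corrupts a day's worth of pricing data, you don't need to restore from a backup. You run RESTORE TABLE catalog TO VERSION AS OF 142, and Delta writes a new commit that points the table back at the files from version 142, undoing the damage without copying data. This only works while those older files are still inside the VACUUM retention window.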
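
A toy model of both operations, in plain Python rather than the Delta API. The commit dictionaries and file names are hypothetical. In this sketch a restore is modeled as one more commit appended to the history, so the bad version stays auditable.

```python
def snapshot(log, version):
    """Live file set after replaying commits 0..version inclusive."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


def restore(log, version):
    """Append a commit that returns the table to `version`; earlier history is kept."""
    current = snapshot(log, len(log) - 1)
    target = snapshot(log, version)
    log.append({
        "operation": "RESTORE",
        "add": sorted(target - current),
        "remove": sorted(current - target),
    })
    return len(log) - 1


log = [
    {"operation": "WRITE", "add": ["prices-day1.parquet"]},
    {"operation": "WRITE", "add": ["prices-day2.parquet"]},
    # A bad selector deployment rewrites day 2 with corrupted prices.
    {"operation": "MERGE", "remove": ["prices-day2.parquet"],
     "add": ["prices-day2-bad.parquet"]},
]

as_of_v1 = snapshot(log, 1)      # time travel: read the table as of version 1
new_version = restore(log, 1)    # roll back by committing version 3
after_restore = snapshot(log, new_version)

print(new_version)               # 3
print(sorted(after_restore))     # ['prices-day1.parquet', 'prices-day2.parquet']
```

Version 2 is still in the log after the rollback, so the corrupted run can be inspected later. The restore only succeeds in practice if `prices-day2.parquet` has not yet been physically deleted.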

// 03 — the storage model

How Delta manages
file state.

Delta Lake doesn't overwrite files; it writes new ones and updates a JSON/Parquet transaction log. This is how DataFlirt calculates storage overhead and read latency for versioned datasets.

Effective storage cost (storage overhead model)
S = base_data + (change_rate × retention_days)
Time travel requires keeping stale Parquet files until the VACUUM command is run.

Read latency (query execution path)
L = log_parse_time + parquet_scan_time
Log parsing scales with the number of commits since the last checkpoint.

Compaction ratio (Delta Lake best practices)
C = small_files / target_file_size
OPTIMIZE commands merge small scraper outputs into efficient 1GB chunks.
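
A worked example of the first two formulas, as a small Python sketch. All input numbers are hypothetical, and the per-commit parse cost is an assumption introduced here to make "log parsing scales with the number of commits" concrete.

```python
def effective_storage_gb(base_data_gb, change_rate_gb_per_day, retention_days):
    """S = base_data + (change_rate x retention_days)."""
    return base_data_gb + change_rate_gb_per_day * retention_days


def read_latency_ms(commits_since_checkpoint, parse_ms_per_commit, parquet_scan_ms):
    """L = log_parse_time + parquet_scan_time, with a linear log-parse model."""
    log_parse_ms = commits_since_checkpoint * parse_ms_per_commit
    return log_parse_ms + parquet_scan_ms


# Hypothetical table: 500 GB live data, 4 GB rewritten per day, 30-day retention.
storage = effective_storage_gb(500, 4, 30)

# Hypothetical query: 9 commits since the last checkpoint at 15 ms each,
# plus 2,000 ms of Parquet scanning.
latency = read_latency_ms(9, 15, 2000)

print(storage)  # 620
print(latency)  # 2135
```

Under these assumptions, retention adds 120 GB on top of the live data, and log parsing is a small share of read time as long as checkpoints keep the number of trailing commits low.
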
// 04 — delta log trace

A scraping upsert
in the transaction log.

What happens when a daily pricing scraper updates existing records and appends new ones. The transaction log records the exact file additions and removals, ensuring readers only see committed data.

_delta_log · MERGE INTO · JSON commit
// commit 00000000000000000042.json
operation: "MERGE"
predicate: "target.sku_id = source.sku_id"

// files removed (tombstoned for time travel)
remove: { path: "part-0001...snappy.parquet", size: 1048576 }

// files added (new inserts + updated records)
add: { path: "part-0002...snappy.parquet", size: 1050112, records: 1420 }
add: { path: "part-0003...snappy.parquet", size: 84200, records: 115 }

// schema validation
schema.enforced: true
schema.drift_detected: false
commit.status: COMMITTED
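
The trace above is a stylized view. On disk, a Delta commit is a newline-delimited JSON file with one action per line, keyed by action type such as `add`, `remove`, or `commitInfo`. The sketch below rebuilds a simplified commit from the values in the trace and summarizes it. The `summarise_commit` helper is hypothetical, real actions carry more fields than shown, and the truncated file paths are kept exactly as they appear in the trace.

```python
import json

# Simplified commit body: one JSON action per line.
COMMIT = "\n".join([
    json.dumps({"commitInfo": {"operation": "MERGE"}}),
    json.dumps({"remove": {"path": "part-0001...snappy.parquet", "size": 1048576}}),
    json.dumps({"add": {"path": "part-0002...snappy.parquet", "size": 1050112}}),
    json.dumps({"add": {"path": "part-0003...snappy.parquet", "size": 84200}}),
])


def summarise_commit(text):
    """Count added and removed files and the net change in live bytes."""
    summary = {"operation": None, "files_added": 0, "files_removed": 0, "net_bytes": 0}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            summary["operation"] = action["commitInfo"].get("operation")
        elif "add" in action:
            summary["files_added"] += 1
            summary["net_bytes"] += action["add"]["size"]
        elif "remove" in action:
            summary["files_removed"] += 1
            summary["net_bytes"] -= action["remove"].get("size", 0)
    return summary


summary = summarise_commit(COMMIT)
print(summary)
# {'operation': 'MERGE', 'files_added': 2, 'files_removed': 1, 'net_bytes': 85736}
```

`net_bytes` describes the live table only. The removed file still occupies storage until it is vacuumed, so the bucket grows by the full size of the two added files in the meantime.
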
// 05 — failure modes

Where data lakes
get muddy.

Without Delta Lake, raw Parquet data lakes fail in predictable ways. The list below ranks them by how often they occur in legacy scraping pipelines.

Pipelines monitored: 300+ active
Storage format: Delta / Parquet
Updated: 2026-05-19

01 Small file problem
Performance killer · Scrapers writing 10KB files destroy query planning times

02 Schema drift corruption
Silent failure · Scraper adds a column, breaking downstream Athena queries

03 Dirty reads
Consistency issue · Querying a partition while a scraper is actively writing to it

04 Failed job partial writes
Data integrity · Scraper crashes, leaving orphan files that get queried

05 Concurrent write conflicts
Concurrency · Two scraper workers updating the same partition simultaneously

// 06 — our architecture

Append is easy,

upserting at scale is hard.

DataFlirt delivers datasets directly to client S3 buckets using Delta Lake. Instead of sending daily CSV dumps that require complex downstream deduplication, we perform a MERGE operation on the Delta table. Clients query the table and instantly see the latest state of the web, with full historical versioning preserved in the transaction log. If a target site changes its layout and corrupts a run, we simply roll back the Delta table to the previous version.
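
The upsert semantics behind that MERGE can be sketched in plain Python. This is a toy in-memory model, not the pipeline's actual code: `merge_upsert` is a hypothetical helper, the rows are invented, and `sku_id` is used as the match key only because it appears in the commit trace's predicate.

```python
def merge_upsert(target, source, key="sku_id"):
    """Return (merged_rows, inserted_count, updated_count), matching rows on `key`."""
    merged = {row[key]: row for row in target}
    inserted = updated = 0
    for row in source:
        existing = merged.get(row[key])
        if existing is None:
            inserted += 1          # WHEN NOT MATCHED: insert
        elif existing != row:
            updated += 1           # WHEN MATCHED and changed: update
        merged[row[key]] = row
    return list(merged.values()), inserted, updated


table = [
    {"sku_id": "A-1", "price": 19.99},
    {"sku_id": "A-2", "price": 5.00},
]
todays_scrape = [
    {"sku_id": "A-1", "price": 17.49},  # price changed
    {"sku_id": "A-2", "price": 5.00},   # unchanged
    {"sku_id": "A-3", "price": 42.00},  # new product
]

table, inserted, updated = merge_upsert(table, todays_scrape)
print(inserted, updated, len(table))  # 1 1 3

# Replaying the same scrape is a no-op, which is what removes the need
# for downstream deduplication.
table, inserted_again, updated_again = merge_upsert(table, todays_scrape)
print(inserted_again, updated_again, len(table))  # 0 0 3
```

With append-only CSV delivery, the second run would have added three duplicate rows. With a keyed merge, a retried or repeated run leaves the table unchanged.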

Delta Delivery Sync

Live metrics from a B2B catalog sync into a client's Delta Lake.

target.table s3://client-bucket/gold/catalog
operation MERGE INTO
records.inserted 12,405
records.updated 84,112
schema.evolution allowed
time_travel.retention 30 days
sync.status COMMITTED


// 07 — FAQ

Common
questions.

Common questions about Delta Lake, lakehouse architectures, and how DataFlirt delivers structured data.

What is the difference between Delta Lake and Parquet?
Parquet is a columnar file format; it just holds data. Delta Lake is a storage layer that sits on top of Parquet files. It adds a transaction log (_delta_log) that tracks which Parquet files belong to the current version of the table. You can't do ACID transactions or time travel with raw Parquet.
Why not just use Snowflake or BigQuery?
Snowflake and BigQuery are managed data warehouses. By default your data is stored in their proprietary formats and queried through their compute, and you pay a premium for that convenience. Delta Lake is an open format that lives in your own S3/GCS bucket. You own the data, and you can query it with any compatible compute engine (Databricks, Athena, Trino, DuckDB) without moving it.
How does time travel actually work?
When you update or delete records, Delta Lake doesn't delete the old Parquet files immediately. It writes new files and updates the transaction log to point to them. The old files remain in storage. When you query with VERSION AS OF 5, the engine simply reads the log for version 5 and scans the files that were active at that time.
What is the 'small file problem' in scraping?
Scrapers often run in micro-batches, writing thousands of tiny 10KB files to S3. When an analytics engine tries to query this, it spends more time opening files and reading metadata than actually processing data. Delta Lake solves this with the OPTIMIZE command, which safely compacts small files into large, efficient files of around 1GB. Compaction is committed as an ordinary transaction, so queries that are already running are not disrupted.
How does DataFlirt handle schema evolution in Delta?
We enforce strict schemas at the Bronze extraction layer. If a scraper finds a new field, it doesn't blindly write it. However, if a client requests a new column, we enable mergeSchema = true on the Delta write. Delta automatically updates the table schema, and historical records read as null for the new column without requiring a full table rewrite.
Can I query a Delta Lake table with standard SQL?
Yes. Most modern query engines support Delta Lake natively or through a connector or extension. You can point AWS Athena, Presto, Trino, or even a local DuckDB instance at the S3 path containing the _delta_log (some engines need the table registered in a catalog first) and query it using standard ANSI SQL.