
What is Diff-Based Data Delivery?

Diff-based data delivery is a pipeline pattern in which a scraper transmits only the records that have been added, modified, or deleted since the last successful extraction, rather than sending the entire dataset on every run. For high-frequency pipelines tracking millions of SKUs or real estate listings, it reduces egress costs, minimizes downstream database load, and isolates the exact state changes your business logic cares about.

Delta Delivery · Change Data Capture · Egress Optimization · ETL · Stateful Scraping
// 02 — definitions

Send the change,
not the state.

Why transmitting a 50 MB delta file beats dumping a 12 GB full-catalog CSV into your S3 bucket every hour.


TL;DR

Diff-based delivery (often called delta delivery) compares the current scrape against the previous run's state. It outputs three distinct record types: inserts, updates, and deletes. This approach shifts the compute burden of deduplication from the client's data warehouse to the scraping provider's infrastructure.

01 Definition & structure

In a standard scraping pipeline, every run produces a complete snapshot of the target. In a diff-based data delivery model, the pipeline compares the newly extracted data against the data from the previous run, delivering only the delta. A standard delta payload categorizes records into three buckets:

  • inserts — new primary keys that did not exist in the previous state.
  • updates — existing primary keys where the hash of the extracted fields has changed.
  • deletes — primary keys that existed in the previous state but are provably absent from the current scrape.
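The three buckets above can be sketched as a comparison of two key-to-hash maps. This is a minimal illustration, not DataFlirt's engine: the `Delta` type, the `classify` function, and the sample SKUs are hypothetical names chosen for the example.

```python
from typing import Dict, List, NamedTuple


class Delta(NamedTuple):
    inserts: List[str]
    updates: List[str]
    deletes: List[str]


def classify(previous: Dict[str, str], current: Dict[str, str]) -> Delta:
    """Compare two {primary_key: content_hash} maps and bucket the keys."""
    # New primary keys that did not exist in the previous state.
    inserts = sorted(k for k in current if k not in previous)
    # Existing primary keys whose content hash has changed.
    updates = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    # Keys that existed before but are absent now. Treating every missing
    # key as a delete is only safe when the scrape is known to be complete.
    deletes = sorted(k for k in previous if k not in current)
    return Delta(inserts, updates, deletes)


previous = {"SKU-1": "aaa", "SKU-2": "bbb", "SKU-3": "ccc"}
current = {"SKU-1": "aaa", "SKU-2": "bbX", "SKU-4": "ddd"}
delta = classify(previous, current)
print(delta)
```

Here `SKU-4` lands in inserts, `SKU-2` in updates, and `SKU-3` in deletes, while the unchanged `SKU-1` is dropped from the payload.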

02 How state is maintained

To compute a diff, the scraper needs memory of the previous run. This is achieved by maintaining a persistent key-value store (such as Redis) alongside the extraction workers. The store maps a unique identifier (e.g., a product SKU or property ID) to a deterministic hash of the record's contents. During extraction, the worker computes the hash of the new record and queries the store. If the key is missing, the record is queued as an insert and its hash is stored. If the stored hash matches, the record is discarded from the delivery queue. If it differs, the record is queued as an update and the store is overwritten with the new hash.
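A rough sketch of that lookup, assuming a plain Python dict as a stand-in for the key-value store and canonical JSON as the serialization that gets hashed. The function names `record_hash` and `check_and_update` are hypothetical; a production worker would issue the equivalent get/set calls against its real store.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def record_hash(record: Dict[str, Any]) -> str:
    """Deterministic SHA-256 over the extracted fields (key order is irrelevant)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_and_update(
    store: Dict[str, str], key: str, record: Dict[str, Any]
) -> Optional[str]:
    """Return "insert", "update", or None when the record is unchanged."""
    new_hash = record_hash(record)
    old_hash = store.get(key)
    if old_hash == new_hash:
        return None  # unchanged: drop from the delivery queue
    store[key] = new_hash  # remember the new state for the next run
    return "insert" if old_hash is None else "update"


store: Dict[str, str] = {}  # stand-in for the persistent key-value store
events = [
    check_and_update(store, "SKU-88422", {"price": 19.99, "in_stock": True}),
    check_and_update(store, "SKU-88422", {"in_stock": True, "price": 19.99}),
    check_and_update(store, "SKU-88422", {"price": 17.49, "in_stock": True}),
]
print(events)  # ['insert', None, 'update']
```

Sorting the keys before hashing is the design choice that matters here: without it, the same record serialized in a different field order would register as a change.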

03 The "Soft Delete" problem

The most dangerous failure mode in diff-based delivery is the false delete. If a proxy fails, a target site throws a 503, or an anti-bot challenge blocks a category page, the scraper won't see the records. A naive diff engine will assume those records were deleted by the target and issue a massive wave of delete events, wiping out your downstream database. Production diff engines must cross-reference missing keys against pipeline error logs to ensure a record is only marked deleted if its parent container was successfully fetched and parsed.
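One way to express that cross-check, as a sketch: a missing key only becomes a delete if the container it was last seen in was fetched and parsed cleanly in this run. The function `confirmed_deletes`, the idea of storing a parent container per key, and the sample paths are assumptions made for the illustration.

```python
from typing import Dict, Set, Tuple


def confirmed_deletes(
    previous_keys: Dict[str, str],  # primary key -> container it was last seen in
    seen_keys: Set[str],            # keys extracted in the current run
    ok_containers: Set[str],        # containers fetched and parsed without error
) -> Tuple[Set[str], Set[str]]:
    """Split missing keys into confirmed deletes and unverified ones."""
    confirmed: Set[str] = set()
    unverified: Set[str] = set()
    for key, container in previous_keys.items():
        if key in seen_keys:
            continue  # still present, nothing to decide
        if container in ok_containers:
            confirmed.add(key)   # page loaded fine and the record is gone
        else:
            unverified.add(key)  # page failed, so absence proves nothing
    return confirmed, unverified


previous_keys = {"SKU-1": "/laptops", "SKU-2": "/laptops", "SKU-3": "/phones"}
seen_keys = {"SKU-1"}
ok_containers = {"/laptops"}  # "/phones" returned a 503 in this run

confirmed, unverified = confirmed_deletes(previous_keys, seen_keys, ok_containers)
print(confirmed, unverified)  # {'SKU-2'} {'SKU-3'}
```

Unverified keys stay in the state store untouched, so they can be re-evaluated on the next successful run.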

04 How DataFlirt handles it

We run stateful diffing at the edge, immediately after the extraction and schema validation layers. Our diff engine allows clients to define custom hashing rules—for example, ignoring volatile fields like "stock counter" while strictly tracking "price". We also enforce a "delete confidence threshold": if a scrape run results in more than 5% of the catalog being marked for deletion, the delivery is automatically paused and flagged for manual engineering review to prevent catastrophic downstream data loss.
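The delete confidence threshold reduces to a single ratio check. The sketch below assumes the 5% figure from the text; the function name `should_pause_delivery` and the sample counts are hypothetical.

```python
def should_pause_delivery(
    delete_count: int, catalog_size: int, threshold: float = 0.05
) -> bool:
    """Return True when the share of deletes exceeds the threshold."""
    if catalog_size <= 0:
        raise ValueError("catalog_size must be positive")
    return delete_count / catalog_size > threshold


# A quiet run: 142 deletes against a catalog of roughly 1.2 million records.
print(should_pause_delivery(142, 1_204_991))     # False
# A suspicious run: about 5.8% of the catalog vanished at once.
print(should_pause_delivery(70_000, 1_204_991))  # True
```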

05 Did you know?

For enterprise data teams, the primary driver for diff-based delivery isn't saving network bandwidth—it's saving Snowflake or BigQuery compute credits. Running a massive MERGE or UPSERT operation on 10 million records every hour is incredibly expensive. By shifting the deduplication compute to the scraping provider's Redis cluster, data engineering teams often cut their ingestion warehouse costs by over 80%.
// 03 — the delta math

How much bandwidth
does a diff save?

The efficiency of diff-based delivery depends entirely on the volatility of the target dataset. DataFlirt calculates the churn rate to determine if delta delivery is mathematically and financially viable for a given target.

Dataset Churn Rate: C = (Inserts + Updates + Deletes) / Total_Records
If C > 0.8, diffing wastes compute. If C < 0.1, diffing saves massive bandwidth. (Source: DataFlirt Pipeline Heuristics)

Payload Reduction = 1 − (Delta_Bytes / Full_Export_Bytes)
Typical reduction for daily e-commerce pricing pipelines is 94–98%. (Source: DataFlirt Egress Metrics)

State Hash Comparison: H_current == H_previous  →  Drop
We hash the extracted fields, not the raw HTML, to avoid false positives from CSS changes. (Source: Extraction Engine Logic)
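As a worked example, the first two formulas can be applied to the figures from the electronics-catalog trace on this page (840 inserts, 3,120 updates, 142 deletes, a 1.8 MB delta against a 412 MB full export). Using the 1,204,991 loaded state keys as Total_Records is an assumption, and the function names are hypothetical.

```python
def churn_rate(inserts: int, updates: int, deletes: int, total_records: int) -> float:
    """C = (Inserts + Updates + Deletes) / Total_Records."""
    return (inserts + updates + deletes) / total_records


def payload_reduction(delta_bytes: float, full_export_bytes: float) -> float:
    """Payload Reduction = 1 - (Delta_Bytes / Full_Export_Bytes)."""
    return 1 - delta_bytes / full_export_bytes


c = churn_rate(840, 3_120, 142, 1_204_991)
# Both sizes are in MB, so the units cancel in the ratio.
r = payload_reduction(1.8, 412)

print(f"churn rate:        {c:.2%}")
print(f"payload reduction: {r:.2%}")
```

The churn comes out at roughly 0.34%, far below the 0.1 cutoff, and the payload shrinks by about 99.6%.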
// 04 — delta generation trace

Computing the diff
at the edge.

A live trace of a diff generation job for a B2B electronics catalog. The pipeline fetches 1.2 million URLs, but only delivers the 4,102 records that actually changed.

Stateful Scrape · SHA-256 Hashing · S3 Delivery
edge.dataflirt.io — live
CAPTURED
// load previous state
state.backend: "redis-cluster-04"
state.keys_loaded: 1,204,991

// extraction & hash comparison
record.id: "SKU-88421" hash.prev: "a7f9...2b1" hash.curr: "a7f9...2b1" // unchanged
record.id: "SKU-88422" hash.prev: "c3d1...9f4" hash.curr: "e8b2...1a0" // updated
record.id: "SKU-99100" hash.prev: null hash.curr: "f1a4...7c2" // inserted

// missing record detection (soft deletes)
missing_keys: 142
validation.page_errors: 0 // confirmed true deletes, not fetch failures

// delta compilation
delta.inserts: 840
delta.updates: 3,120
delta.deletes: 142
payload.size: "1.8 MB" // vs 412 MB full export
delivery.status: uploaded to s3://df-client-019/deltas/2026-05-19.json
// 05 — failure modes

Why diffs get
artificially bloated.

When a delta file is suddenly 10x its normal size, it rarely means the target site overhauled its inventory. It usually means the extraction layer is capturing volatile noise instead of stable data.

PIPELINES MONITORED · 180+ stateful
AVG DELTA SIZE · 4.2% of full
UPDATED · 2026-05-19
01 Unstable primary keys
fatal · Target rotates internal IDs daily, forcing 100% churn

02 Timestamp drift in payload
common · Scraping 'last_viewed' or 'fetched_at' breaks the hash

03 Dynamic sorting / pagination
structural · Records shift across pages, causing false deletes

04 A/B test variant leakage
noise · Price toggles between two values based on proxy IP

05 Whitespace / encoding shifts
formatting · Target changes minification, altering raw string hashes
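For the whitespace and encoding case, one common mitigation is to normalize string fields before hashing. The sketch below assumes Unicode NFC normalization plus whitespace collapsing is sufficient for the target; `normalize_value` and `stable_hash` are hypothetical names, and other failure modes in the list (unstable keys, A/B variants) need different fixes.

```python
import hashlib
import json
import re
import unicodedata
from typing import Any, Dict

_WHITESPACE = re.compile(r"\s+")


def normalize_value(value: Any) -> Any:
    """Canonicalize strings so formatting-only changes do not alter the hash."""
    if isinstance(value, str):
        value = unicodedata.normalize("NFC", value)    # unify composed/decomposed forms
        return _WHITESPACE.sub(" ", value).strip()     # collapse runs of whitespace
    return value


def stable_hash(record: Dict[str, Any]) -> str:
    """SHA-256 over the normalized, key-sorted record."""
    cleaned = {k: normalize_value(v) for k, v in record.items()}
    canonical = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {"title": "Caf\u00e9  USB-C   Hub\n", "price": "24.00"}
after = {"title": "Cafe\u0301 USB-C Hub", "price": "24.00"}  # same text, new formatting

print(stable_hash(before) == stable_hash(after))  # True
```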
// 06 — our architecture

Compute the delta upstream,
so your warehouse doesn't have to.

DataFlirt maintains a persistent state layer for every delta pipeline using a distributed Redis cluster. We don't just compare raw HTML; we compare the parsed, typed extraction output. If a target site changes its CSS framework but the underlying price and stock status remain identical, our diff engine registers zero changes. Your downstream ingestion only wakes up when real business value shifts.

diff-job-status.json

Live state metrics from a high-frequency real estate pipeline.

pipeline.id re-delta-UK-04
state.backend redis-cluster-eu
records.total 4,192,000
records.changed 18,402 (0.4% churn)
hash.strategy exclude_timestamps
delete.confidence 0.99 (verified)
egress.saved 1.4 GB (optimized)

// 07 — FAQ

Common
questions.

About state management, handling deletes, full-sync cadences, and how DataFlirt ensures delta integrity at scale.

What's the difference between diff-based delivery and Change Data Capture (CDC)?
CDC is a database-level concept where you read the transaction log (like PostgreSQL's WAL) to stream changes. Diff-based delivery is the web-scraping equivalent. Because we don't have access to the target's database logs, we have to synthesize CDC by fetching the current state, hashing the records, and comparing them against our own stored state from the previous run.
How do you handle deletes if a page just fails to load?
This is the hardest problem in stateful scraping. If a category page returns a 500 error, the records on it aren't deleted — they're just temporarily unreachable. DataFlirt uses a "missing threshold" and cross-references fetch errors. We only emit a delete event if the parent page loaded successfully (200 OK) and the specific SKU was absent, or if the SKU's direct URL returns a hard 404.
What happens if our downstream database gets out of sync with your state?
Every diff-based pipeline requires a "full sync" capability. By default, DataFlirt delivers delta files hourly or daily, but drops a complete, full-state snapshot into a separate S3 prefix once a week. If your warehouse drops a delta or gets corrupted, you just truncate your table and ingest the latest weekly full-sync file to reset your baseline.
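On the consuming side, the two recovery paths look roughly like this. The delta layout (an object with `inserts`, `updates`, and `deletes` keys) and the `id` field are assumptions made for the sketch, not a documented DataFlirt file format, and an in-memory dict stands in for the warehouse table.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_delta(table: Dict[str, Row], delta: Dict[str, List[Any]]) -> None:
    """Apply one delta file to the table in place."""
    for row in delta.get("inserts", []) + delta.get("updates", []):
        table[row["id"]] = row  # upsert by primary key
    for key in delta.get("deletes", []):
        table.pop(key, None)    # tolerate deletes for keys we never had


def reset_from_snapshot(table: Dict[str, Row], snapshot: List[Row]) -> None:
    """Truncate the table and reload it from a full-sync snapshot."""
    table.clear()
    table.update({row["id"]: row for row in snapshot})


table = {"A": {"id": "A", "price": 1}, "B": {"id": "B", "price": 2}}
apply_delta(
    table,
    {
        "inserts": [{"id": "C", "price": 3}],
        "updates": [{"id": "A", "price": 9}],
        "deletes": ["B"],
    },
)
after_delta = dict(table)

# Baseline reset: discard local state and trust the weekly snapshot.
reset_from_snapshot(table, [{"id": "A", "price": 1}])
print(after_delta, table)
```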
How does DataFlirt track state across millions of records?
We don't store the full text of previous scrapes in memory; that would be cost-prohibitive. We store a composite primary key and a deterministic SHA-256 hash of the extracted payload in Redis. When a new record is extracted, we hash it. If hash(new) == hash(stored), we drop it. If no stored hash exists, we emit an insert and store the hash. If the hashes differ, we emit an update and overwrite the stored hash.
Is diff-based delivery more expensive to run?
Compute-wise, yes. We have to maintain a highly available Redis cluster and perform millions of hash lookups per run. However, for clients ingesting data into Snowflake or BigQuery, the compute cost we charge for state management is almost always lower than the warehouse compute cost of UPSERTing 5 million identical records every day.
Can we exclude certain fields from triggering an update?
Yes. This is critical for fields like "views today" or "current active bidders" which change every second but have no structural value. In your DataFlirt schema contract, you can mark specific fields as ignore_in_diff. They will be extracted and delivered if the record updates for another reason, but their mutation alone won't trigger a delta event.
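The effect of such a marker can be sketched by hashing only the tracked fields while still delivering the full record. How `ignore_in_diff` is actually declared in a schema contract is not shown here; representing it as a set of field names, and the `diff_hash` helper itself, are assumptions for the illustration.

```python
import hashlib
import json
from typing import Any, Dict, FrozenSet


def diff_hash(record: Dict[str, Any], ignore_in_diff: FrozenSet[str]) -> str:
    """Hash only the fields that are allowed to trigger a delta event."""
    tracked = {k: v for k, v in record.items() if k not in ignore_in_diff}
    canonical = json.dumps(tracked, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


ignored = frozenset({"views_today"})

before = {"price": 120, "views_today": 15}
views_only = {"price": 120, "views_today": 982}     # only the volatile field moved
price_change = {"price": 115, "views_today": 990}   # a tracked field moved

print(diff_hash(before, ignored) == diff_hash(views_only, ignored))    # True
print(diff_hash(before, ignored) == diff_hash(price_change, ignored))  # False
```

When the price change triggers an update, the delivered record would still carry the current `views_today` value, since the exclusion applies only to the hash.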