
What is Change Data Capture (CDC)?

Change Data Capture (CDC) is an architectural pattern that identifies and tracks row-level changes in a database — inserts, updates, and deletes — and streams those events to downstream consumers in real time. In the context of scraping pipelines, CDC flips the extraction model from batch polling to event-driven delivery. Instead of querying a massive target catalog every night to see what changed, a CDC-enabled pipeline streams only the delta records, drastically reducing compute costs, egress bandwidth, and target server load.

Tags: Data Engineering, Event Streaming, Delta Delivery, Kafka, Debezium

// 02 — definitions

Stream the deltas.

Why pulling the entire dataset every 24 hours is a massive waste of compute, bandwidth, and time.


TL;DR

CDC captures database mutations as they happen and writes them to an immutable log. For data buyers, this means receiving a continuous feed of price drops, out-of-stock events, and new product listings within seconds of the scraper detecting them, rather than waiting for a daily batch dump.

01 Definition & structure

Change Data Capture (CDC) is a set of software design patterns for identifying and tracking changed data so that downstream systems can act on it. In modern data engineering, the term almost always refers to log-based CDC, where a tool like Debezium reads the database's transaction log (the Write-Ahead Log, or WAL, in PostgreSQL) and publishes every insert, update, and delete as an event to a message broker like Kafka.

02 How it works in practice

When a scraper updates a product's price in the database, the database engine writes that transaction to its log. The CDC connector reads this log asynchronously, constructs a JSON or Avro payload containing the before state and the after state, and pushes it to a Kafka topic. Downstream consumers read this topic and apply the changes to their own data warehouses or trigger real-time alerts.
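As a rough illustration of what a downstream consumer does with such a payload, the sketch below assumes a simplified Debezium-style envelope with `op`, `before`, `after`, and `ts_ms` keys. The field values and the `changed_fields` helper are hypothetical, and real payloads carry additional metadata.

```python
import json


def changed_fields(event: dict) -> dict:
    """Return {column: (old, new)} for columns whose value differs."""
    before = event.get("before") or {}  # None for inserts
    after = event.get("after") or {}    # None for deletes
    return {
        col: (before.get(col), after.get(col))
        for col in sorted(set(before) | set(after))
        if before.get(col) != after.get(col)
    }


# Hypothetical update event, shaped like a simplified change envelope.
raw = json.dumps({
    "op": "u",
    "ts_ms": 1716124800000,
    "before": {"sku": "SKU-1", "price": 1299.0, "stock": True},
    "after": {"sku": "SKU-1", "price": 999.0, "stock": True},
})

event = json.loads(raw)
diff = changed_fields(event)
print(event["op"], diff)
```

Only the `price` column appears in `diff`, which is the piece a warehouse loader or alerting rule would act on.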

03 CDC in scraping pipelines

Scraping is inherently a state-reconciliation problem. You are constantly fetching the current state of a website and comparing it to your known state. CDC isolates the extraction layer from the delivery layer. The scrapers just blindly upsert data into a database. The CDC layer handles the complex logic of figuring out what actually changed and notifying the client, ensuring the client only processes net-new information.
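One practical wrinkle with blind upserts is that a re-scrape can rewrite a row without changing anything a client cares about, and the log still records an update. A minimal sketch of filtering those out follows. It assumes the `before` image holds the full previous row (in PostgreSQL this generally requires the table to use REPLICA IDENTITY FULL), and the column names in `VOLATILE_COLUMNS` are hypothetical.

```python
# Bookkeeping columns the scraper rewrites on every run (hypothetical names).
VOLATILE_COLUMNS = {"scraped_at", "crawl_id"}


def is_meaningful(event: dict, ignore=VOLATILE_COLUMNS) -> bool:
    """True if the event changes at least one non-volatile column."""
    if event["op"] != "u":
        return True  # inserts and deletes always count
    before, after = event["before"], event["after"]
    return any(
        before.get(col) != after.get(col)
        for col in after
        if col not in ignore
    )


events = [
    # Re-scrape with no real change: only scraped_at moved.
    {"op": "u",
     "before": {"sku": "A", "price": 10.0, "scraped_at": 1},
     "after": {"sku": "A", "price": 10.0, "scraped_at": 2}},
    # Genuine price change.
    {"op": "u",
     "before": {"sku": "A", "price": 10.0, "scraped_at": 2},
     "after": {"sku": "A", "price": 9.0, "scraped_at": 3}},
]

meaningful = [e for e in events if is_meaningful(e)]
print(len(meaningful), meaningful[0]["after"]["price"])
```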

04 How DataFlirt handles it

We run Debezium connectors attached to our primary PostgreSQL clusters. Every pipeline has a dedicated Kafka topic. For clients who want real-time feeds, we provide secure Kafka consumer credentials. For clients who prefer batch, we use Kafka Connect to sink the CDC events into hourly or daily delta files in S3. This unified architecture means we never have to write custom diffing logic per client.
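To make the batch path concrete, here is a small sketch of bucketing change events into hourly deltas. It is not DataFlirt's sink configuration: the event shape (`key`, `ts_ms`, `price`) and both helper functions are assumptions for illustration, and keeping only the last event per key within an hour is one possible design choice, not something the text above specifies.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts_ms: int) -> str:
    """Map an epoch-millisecond timestamp to its UTC hour label."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:00Z")


def build_hourly_deltas(events):
    """Group events by hour, keeping the last event per key in each hour."""
    buckets = defaultdict(dict)
    for ev in sorted(events, key=lambda e: e["ts_ms"]):
        buckets[hour_bucket(ev["ts_ms"])][ev["key"]] = ev
    return {hour: list(latest.values()) for hour, latest in buckets.items()}


base = 1716124800000  # 2024-05-19T13:20:00Z
events = [
    {"key": "SKU-1", "ts_ms": base, "price": 1299.0},
    {"key": "SKU-1", "ts_ms": base + 60_000, "price": 999.0},
    {"key": "SKU-2", "ts_ms": base + 120_000, "price": 50.0},
    {"key": "SKU-1", "ts_ms": base + 3_600_000, "price": 1099.0},
]

deltas = build_hourly_deltas(events)
for hour, rows in sorted(deltas.items()):
    print(hour, [(r["key"], r["price"]) for r in rows])
```

Compacting to the last event per key keeps files small but discards intermediate states within the hour. A client who needs the full timeline would want every event written instead.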

05 The silent failure of batch processing

If you rely on a daily batch export, you are blind to intra-day volatility. If a competitor drops their price at 10:00 AM and raises it back at 4:00 PM, a midnight batch export will show no change. CDC records every state mutation the scraper writes, so you get a ledger of each observed fluctuation rather than a single point-in-time snapshot.
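The 10:00 AM / 4:00 PM scenario can be replayed in a few lines. The prices and the `price_at` helper below are made up for illustration.

```python
# Hypothetical intra-day events for one SKU: (hour_of_day, new_price).
start_price = 999.0
events = [(10, 899.0), (16, 999.0)]


def price_at(hour: int, start: float, events) -> float:
    """Replay events up to and including `hour` to get the price then."""
    price = start
    for ev_hour, ev_price in events:
        if ev_hour <= hour:
            price = ev_price
    return price


snapshot_yesterday = start_price
snapshot_today = price_at(24, start_price, events)

# The daily diff compares two midnight snapshots and finds nothing.
print("batch sees a change:", snapshot_today != snapshot_yesterday)
# The event log still holds both mutations, including the temporary low.
print("events captured:", len(events))
print("lowest price seen:", min(p for _, p in events))
```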

// 03 — the efficiency math

Why batch doesn't scale.

CDC reduces pipeline latency and compute overhead by orders of magnitude. Here is how DataFlirt calculates the efficiency gains of delta streaming versus full-table polling.

Egress reduction = E_saved = 1 − (mutations / total_records)

Typically >95% savings for e-commerce catalogs where most prices remain static daily. Source: DataFlirt pipeline metrics.

CDC Latency = L_cdc = t_delivery − t_scrape

Time from the scraper writing to the database to the client receiving the event. Source: streaming SLO.

DataFlirt Delta Ratio = Δ = records_changed / records_scraped

Used to automatically scale Kafka partition counts during high-volatility events. Source: internal scaling heuristic.
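The three formulas above translate directly into code. The sketch below plugs in invented numbers (a 1,000,000-record catalog with 30,000 mutations, and a 250 ms gap between scrape and delivery) purely as a worked example.

```python
def egress_saved(mutations: int, total_records: int) -> float:
    """E_saved = 1 - (mutations / total_records)."""
    return 1 - mutations / total_records


def cdc_latency_ms(t_scrape_ms: int, t_delivery_ms: int) -> int:
    """L_cdc = t_delivery - t_scrape, in milliseconds."""
    return t_delivery_ms - t_scrape_ms


def delta_ratio(records_changed: int, records_scraped: int) -> float:
    """Delta = records_changed / records_scraped."""
    return records_changed / records_scraped


# Illustrative inputs only; not measured values.
saved = egress_saved(mutations=30_000, total_records=1_000_000)
latency = cdc_latency_ms(t_scrape_ms=1716124800000, t_delivery_ms=1716124800250)
ratio = delta_ratio(records_changed=30_000, records_scraped=1_000_000)

print(f"egress saved: {saved:.1%}")
print(f"latency: {latency} ms")
print(f"delta ratio: {ratio:.3f}")
```

When every scraped record is compared against the full catalog, the delta ratio and the egress saving are complements: a ratio of 0.03 corresponds to a 97% saving.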

// 04 — the event stream

A price drop, captured and streamed.

A live trace of a CDC event generated when a scraper detects a price change on a target e-commerce site, processed through Debezium and Kafka.

Debezium · Kafka · JSON payload

// WAL entry detected
op: "u" // update operation
ts_ms: 1716124800000
table: "scraped_products"

// before state
before.sku: "B08N5WRWNW"
before.price: 1299.00
before.stock: true

// after state
after.sku: "B08N5WRWNW"
after.price: 999.00 // price drop detected
after.stock: true

// delivery routing
topic: "client_042_price_alerts"
partition: 4
offset: 891244
status: DELIVERED

// 05 — implementation hurdles

Where CDC pipelines break.

Ranked by frequency of incidents across DataFlirt's streaming infrastructure. CDC is powerful but introduces complex state management challenges that batch processing avoids.

ACTIVE STREAMS · · · 300+ pipelines

EVENT VOLUME · · · · 1.2B / day

UPDATED · · · · · · 2026-05-19

Schema evolution handling: schema registry mismatches when upstream fields change

Snapshotting large tables: initial load timeouts on multi-terabyte datasets

Consumer lag: downstream systems failing to process the firehose

Tombstone record handling: hard deletes breaking downstream aggregations

Transaction boundaries: partial updates splitting across multiple events
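Deletes and tombstones are a common place for consumers to go wrong, so here is a minimal sketch of applying a keyed change stream to an in-memory replica. It assumes Debezium-style op codes (`c`, `u`, `d`, and `r` for snapshot reads) and that a delete event is followed by a tombstone, meaning a record with the same key and a null value. The `apply_event` helper and the sample stream are hypothetical.

```python
def apply_event(state: dict, key: str, value) -> None:
    """Apply one keyed change event to an in-memory table replica."""
    if value is None:
        # Tombstone: same key, null value. Treat it as an idempotent
        # delete instead of trying to read value["op"].
        state.pop(key, None)
        return
    op = value["op"]
    if op in ("c", "u", "r"):  # create, update, snapshot read
        state[key] = value["after"]
    elif op == "d":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op!r}")


stream = [
    ("A", {"op": "c", "after": {"sku": "A", "price": 10.0}}),
    ("B", {"op": "c", "after": {"sku": "B", "price": 20.0}}),
    ("A", {"op": "u", "after": {"sku": "A", "price": 9.0}}),
    ("A", {"op": "d", "after": None}),
    ("A", None),  # tombstone that follows the delete
]

state = {}
for key, value in stream:
    apply_event(state, key, value)

print(state)
```

A consumer that assumes every record has a non-null value would crash on the last record, and one that ignores `d` events would keep counting the deleted row in its aggregations.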

// 06 — our architecture

Capture everything, stream only what matters.

DataFlirt's extraction workers write raw records to a distributed PostgreSQL cluster. We use Debezium to tail the Write-Ahead Log (WAL), converting every insert and update into a Kafka event. This decouples our scraping fleet from our delivery layer. Clients can subscribe to the raw firehose, or we can materialize the stream into daily delta files delivered to S3. You never pay to ingest data that hasn't changed.

CDC Stream Health

Live metrics for a high-frequency pricing pipeline.

topic df.stream.pricing.v2

events.per_sec 4,200

replication.lag 12ms

schema.registry connected

dead_letter_queue 0

consumer.status lagging

// 07 — FAQ

Common questions.

About log-based CDC, schema registries, consumer lag, and how DataFlirt delivers real-time scraping data.


What is the difference between CDC and a regular API webhook?

CDC operates at the database storage engine level (reading the transaction log), while webhooks operate at the application level. Because it reads the log, CDC sees changes in commit order, captures every mutation (including deletes), and can resume from its last log position if the application or connector crashes. Webhooks depend on application code firing correctly, so they are prone to race conditions and dropped payloads.

How does CDC work for web scraping if the target site doesn't offer CDC?

We don't run CDC on the target site's database — we run it on ours. Our scraping fleet continuously fetches the target site and upserts the records into our internal PostgreSQL cluster. Our CDC infrastructure then tails our own database, identifying the exact fields that changed, and streams those deltas to you.

Why not just use an updated_at column to query for changes?

Query-based CDC (polling an updated_at timestamp) puts heavy read load on the database, misses hard deletes entirely, sees only the latest state of rows that changed more than once between polls, and adds latency of up to one polling interval. Log-based CDC reads the Write-Ahead Log (WAL) directly, so it runs no queries against the tables and captures every intermediate state change shortly after commit.
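A toy simulation makes the blind spots of polling visible. The table, timestamps, and `poll_changes` helper below are hypothetical stand-ins for a `WHERE updated_at > :since` query.

```python
def poll_changes(table: dict, since: int) -> list:
    """Query-based CDC: rows whose updated_at is newer than `since`."""
    return [row for row in table.values() if row["updated_at"] > since]


table = {
    "A": {"sku": "A", "price": 10.0, "updated_at": 100},
    "B": {"sku": "B", "price": 20.0, "updated_at": 100},
}
last_poll = 100

# Between two polls: A changes twice, B is hard-deleted.
table["A"].update(price=8.0, updated_at=110)
table["A"].update(price=9.0, updated_at=120)
del table["B"]

changes = poll_changes(table, last_poll)
print(changes)
```

The poll returns a single row for A at its final price. The intermediate 8.0 state and the deletion of B never appear, because neither leaves a row with a newer timestamp behind.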

Do I need Kafka to consume CDC data from DataFlirt?

No. While we offer direct Kafka topic access for enterprise clients, we also sink CDC streams into S3 or GCS as Parquet or JSONL delta files. You get the efficiency of CDC without needing to manage a streaming architecture on your end.

What happens if my consumer goes offline?

Kafka retains events for a configurable period (typically 7 days on our clusters). When your consumer comes back online, it resumes reading from its last committed offset. As long as it returns within the retention window, no data is lost and you don't need to trigger a full re-scrape.
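The resume behaviour can be sketched without a broker. The `LogConsumer` class below is a toy model of offset tracking over an append-only list, not a Kafka client API, and it does not model retention expiry.

```python
class LogConsumer:
    """Toy consumer that tracks a committed offset into an append-only log."""

    def __init__(self, log: list):
        self.log = log
        self.committed = 0  # offset of the next event to read

    def poll(self, max_events: int = 100) -> list:
        batch = self.log[self.committed:self.committed + max_events]
        # Commit only after the batch has been handed over for processing.
        self.committed += len(batch)
        return batch


log = ["e0", "e1", "e2"]
consumer = LogConsumer(log)
first = consumer.poll()

# Consumer goes offline; the producer keeps appending.
log.extend(["e3", "e4"])

# Consumer returns and picks up exactly where it left off.
second = consumer.poll()
print(first, second)
```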

How do you handle schema changes in a CDC stream?

We use a Schema Registry with Avro or Protobuf serialization. When a target site changes and our extraction schema evolves, the registry enforces compatibility rules (e.g., backward compatibility). Downstream consumers are alerted to the new schema version without breaking the existing stream.
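As a simplified picture of what a backward-compatibility rule checks, the sketch below asks whether a reader using the new schema could still decode data written with the old one. It is loosely modelled on Avro-style resolution (a newly added field needs a default, a removed field is ignored) and is deliberately conservative about types. The schema dictionaries and `is_backward_compatible` are hypothetical and not a registry API.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Can a reader on `new_fields` decode data written with `old_fields`?

    Each schema maps field name -> {"type": ..., optional "default": ...}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            # New field: old data lacks it, so the reader needs a default.
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            # Conservative: treats any type change as incompatible.
            return False
    # Fields dropped from the new schema are simply ignored by the reader.
    return True


v1 = {"sku": {"type": "string"}, "price": {"type": "double"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad = {**v1, "currency": {"type": "string"}}
v2_dropped = {"sku": {"type": "string"}}

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad))
print(is_backward_compatible(v1, v2_dropped))
```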

Related glossary terms

Delta File Delivery

Apache Kafka

Real-Time Data Pipeline

Data Latency