← Glossary / Stream Processing

What is Stream Processing?

Stream processing is the continuous ingestion, transformation, and delivery of data as it is generated, rather than waiting for scheduled batch intervals. In scraping infrastructure, it shifts the paradigm from "crawl the catalog every night" to "emit a price change event the millisecond a product page updates." For data buyers, it means trading the latency of daily S3 dumps for the immediacy of a live Kafka topic or webhook feed.

Data EngineeringReal-TimeKafkaFlinkEvent-Driven
// 02 — definitions

Data in
motion.

Why waiting for the midnight batch job is no longer viable for pricing intelligence, inventory tracking, or news aggregation.

Ask a DataFlirt engineer →

TL;DR

Stream processing treats data as an unbounded, continuous flow of events. Instead of querying a database for what happened yesterday, a stream processor evaluates logic against data as it arrives. It requires fundamentally different infrastructure — message brokers like Kafka and compute engines like Flink — but reduces data latency from hours to milliseconds.

01Definition & structure

Stream processing is a data engineering paradigm where data is processed continuously as it is produced. Instead of storing data in a database and querying it later (batch processing), logic is applied to the data while it is in transit.

A typical stream processing architecture consists of three components:

  • Producers — scrapers or sensors emitting events.
  • Brokers — distributed logs (like Kafka) that buffer the events durably.
  • Processors — compute engines (like Flink) that consume the log, apply transformations, and push the results to a sink.
02Event time vs. Processing time

A critical concept in stream processing is the distinction between time domains. Event time is when the data was actually generated (e.g., the moment the scraper parsed the price). Processing time is when the stream engine evaluates the record.

Because networks are unreliable and scrapers use distributed proxies, events often arrive out of order. Robust stream processors use event time and "watermarks" (a threshold indicating that no older events are expected) to ensure calculations are accurate regardless of ingestion delays.

03The statefulness problem

Stateless operations (like filtering out null values) are trivial in a stream. Stateful operations (like calculating a moving average, or deduplicating against previous records) are hard. The stream processor must maintain memory of past events.

If the processor crashes, that state is lost unless it is durably backed up. Modern stream engines solve this by periodically snapshotting their internal state to distributed storage, allowing them to recover exactly where they left off without data loss or duplication.

04How DataFlirt handles it

We treat every extraction as an event. Our scraping fleet publishes raw records directly to Kafka topics. Flink clusters consume these topics to enforce schema contracts, drop duplicates, and normalize fields in real time.

This decoupled architecture means our extraction workers never block waiting for a database write, and our clients receive data via webhooks or cross-account Kafka peering with sub-200ms latency from the moment of extraction.

05Did you know?

Stream processing isn't just for speed — it's often used for cost reduction. By applying stateful deduplication in the stream, you can filter out redundant data (like a product whose price hasn't changed) before it ever reaches your expensive data warehouse. We routinely see clients drop 80% of their ingestion volume simply by moving deduplication upstream into the stream processor.

// 03 — stream metrics

Measuring data
in motion.

Stream processing performance is defined by throughput, latency, and the correctness of windowed aggregations. DataFlirt monitors these continuously across all live feeds.

End-to-end Latency = L = tdeliverytextraction
Time from the scraper parsing the DOM to the client receiving the payload. DataFlirt SLA
Throughput = T = events_processed / time_window
Must consistently exceed ingestion rate to prevent unbounded queue growth. Stream Processing Fundamentals
Watermark Delay = W = tsystemmax_event_time
Measures how far behind real-time the processor's event-time clock has drifted. Apache Flink Metrics
// 04 — stream execution trace

Processing a live
price change event.

A live trace from a Flink worker processing a stream of B2B e-commerce scrapes. It validates the schema, checks state for deduplication, and routes the event.

Apache FlinkStateful DedupWebhook Sink
edge.dataflirt.io — live
CAPTURED
// stream ingestion
topic: "raw-scrapes-b2b"
offset: 8492011

// schema validation
event.type: "price_update"
event.payload: { sku: "XJ-900", price: 45.99 }
schema.match: true

// stateful deduplication
state.previous_price: 45.99
diff.detected: false // dropping duplicate

// next event
event.payload: { sku: "XJ-900", price: 42.50 }
diff.detected: true

// sink delivery
sink: "webhook-client-04"
delivery.status: 200 OK
latency.e2e: 142ms
// 05 — stream bottlenecks

Where latency
creeps in.

The primary sources of delay and failure in a continuous stream processing pipeline. Ranked by frequency of impact across DataFlirt's real-time feeds.

EVENTS/SEC ·  ·  ·  ·  ·  140k peak
P99 LATENCY ·  ·  ·  ·    210ms
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Backpressure from slow sinks

delivery layer · Client webhook takes 2s to respond, blocking the stream
02

Stateful memory exhaustion

compute layer · Windowing too many unique keys without TTL eviction
03

Schema evolution mismatches

validation layer · Target site changes format, breaking downstream parsers
04

Out-of-order event buffering

logic layer · Waiting for late-arriving scrapes to close a time window
05

Broker partition rebalancing

infra layer · Kafka reassigning partitions during node scaling
// 06 — our architecture

Sub-second delivery,

without sacrificing schema enforcement.

DataFlirt's stream processing backbone is built on Apache Kafka and Flink. When a scraper extracts a record, it doesn't write to a database — it emits an event. Flink consumes this stream, applies schema validation, performs stateful deduplication against the last known value, and routes the payload to the client's webhook or Kafka topic. If a record fails validation, it is routed to a dead-letter queue instantly, never blocking the main pipeline.

flink-job-manager

Live telemetry from a production stream processing job.

job.status RUNNING
events.in 12,405/sec
events.out 312/secdeduped
watermark.lag 45ms
backpressure none
checkpoint.status COMPLETED

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About real-time data pipelines, stateful processing, delivery semantics, and how DataFlirt operates streaming infrastructure at scale.

Ask us directly →
What is the difference between batch and stream processing? +
Batch processing runs on a schedule (e.g., every midnight) over a bounded, finite dataset. Stream processing runs continuously over an unbounded dataset, reacting to events as they arrive. Batch is for historical analysis; stream is for operational immediacy.
Can I consume streaming data without hosting my own Kafka cluster? +
Yes. While enterprise clients often peer their Kafka clusters directly with ours, we also deliver streaming data via Webhooks, Server-Sent Events (SSE), or continuous micro-batch writes to cloud storage (e.g., appending JSON Lines to an S3 bucket every 60 seconds).
How do you handle out-of-order scrapes? +
Through event-time processing and watermarks. We timestamp every record at the exact moment of extraction. The stream processor uses these timestamps — not the time the record arrived at the server — to sequence events, holding a buffer (watermark) to allow late-arriving records to be processed in the correct order.
Is stream processing more expensive than batch? +
In terms of raw compute, yes. Stream processing requires always-on infrastructure (brokers, task managers) rather than ephemeral instances that spin down when a job finishes. However, for use cases like algorithmic trading or live inventory monitoring, the business value of sub-second latency vastly outweighs the infrastructure premium.
How does DataFlirt ensure exactly-once delivery? +
We use Apache Flink's distributed snapshotting (checkpointing) combined with idempotent sinks. If a worker node crashes, the system rolls back to the last successful checkpoint and replays the stream. Because the delivery sink (like an upsert in a database) is idempotent, replaying the data doesn't create duplicates.
What happens to the stream if the target site blocks the scraper? +
The stream volume drops. Because stream processing is continuous, a sudden drop in throughput is an immediate anomaly. Our observability layer detects this deviation within seconds, triggering automated proxy rotation or alerting an on-call engineer, long before a batch job would have even started.
$ dataflirt scope --new-project --target=stream-processing READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h