← Glossary / Apache Flink

What is Apache Flink?

Apache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams. In the context of web scraping and data engineering, it's the infrastructure layer that transforms raw, continuous firehoses of scraped events into structured, deduplicated, and aggregated datasets in real time. Unlike batch processors that run on a schedule, Flink treats data as an infinite stream, allowing pipelines to react to price changes or inventory drops the millisecond they are extracted.

Stream ProcessingStateful ComputationReal-TimeEvent-DrivenData Engineering

// 02 — definitions

Continuous
computation.

Why waiting for a nightly batch job is no longer viable for high-frequency pricing, inventory, and news pipelines.

Ask a DataFlirt engineer →

TL;DR

Apache Flink processes data streams in real time with exactly-once state consistency. It sits downstream of message brokers like Kafka, consuming raw scraped records, applying complex windowing or deduplication logic, and sinking the clean data to a lakehouse or operational database without batch latency.

01Definition & structure

Apache Flink is an open-source, unified stream-processing and batch-processing framework. In a modern data stack, it is used to execute complex, stateful computations over unbounded data streams. It features exactly-once semantics, event-time processing, and highly scalable state management via RocksDB. For data engineering teams, it is the tool of choice when data must be transformed, enriched, or aggregated the moment it is generated.

02How it works in practice

A typical Flink pipeline consists of a Source, Transformations (Operators), and a Sink. In a scraping context, the Source is usually an Apache Kafka topic receiving raw JSON payloads from the crawler fleet. Flink operators parse the JSON, filter out malformed records, deduplicate against historical state, and compute rolling aggregates. Finally, the Sink writes the clean records to an operational database (like PostgreSQL) or a data lakehouse (like Apache Iceberg) for downstream consumption.

03Windowing and Event Time

Flink excels at handling out-of-order data through its concept of Event Time and Watermarks. If a scraper is delayed by a proxy timeout, the record it produces might arrive late to the stream. Flink groups data into "windows" (tumbling, sliding, or session) based on the timestamp embedded in the data itself, not the time the server received it. Watermarks tell Flink when it is safe to close a window and emit the result, ensuring accurate aggregations despite network chaos.

04How DataFlirt handles it

We deploy Flink clusters on Kubernetes to handle our highest-tier real-time delivery SLAs. When clients subscribe to live pricing feeds or inventory alerts, we route the extraction output directly into Flink. Our custom operators maintain a stateful cache of the last known price for millions of SKUs. When a new scrape arrives, Flink compares it to the state; if there is a delta, an alert is pushed to the client's webhook in under 50 milliseconds.

05Batch vs Stream unification

Historically, data teams maintained two separate codebases: one for real-time streams (e.g., Storm) and one for historical batch processing (e.g., Hadoop). Flink unifies this by treating batch processing as a special case of stream processing — a stream that happens to have a defined end. This allows engineers to write a data transformation logic once and execute it against both live Kafka feeds and historical S3 archives with identical semantics.

// 03 — stream metrics

Measuring
stream health.

Stream processing requires monitoring both throughput and latency. DataFlirt tracks these metrics per Flink job to ensure real-time SLAs are met even during massive crawl spikes.

Processing Latency = L = t_processed − t_ingested

Time taken for a record to pass through the Flink DAG. Target < 50ms. Stream performance baseline

Event-Time Skew = S = t_system − t_watermark

Measures how far behind reality the stream processor is due to late data. Flink watermark monitoring

State Size = M = keys × bytes_per_state

Memory required for stateful operations like 24-hour deduplication. RocksDB backend sizing

// 04 — flink job trace

A real-time
price aggregation DAG.

Trace of a Flink job consuming raw product scrapes, deduplicating by SKU, calculating 5-minute moving averages, and sinking to an Iceberg table.

Kafka SourceTumbling WindowIceberg Sink

edge.dataflirt.io — live

CAPTURED

// job initialization
job.id: "flink-price-agg-v4"
parallelism: 16
state.backend: "RocksDB"

// source: consuming raw scrapes
kafka.topic: "raw-scrapes-in"
records.ingested: 4,205/sec
watermark.delay: 12ms

// operator: stateful deduplication
operator.dedup: "keyBy(sku).filter(isNewPrice)"
state.size: 4.2 GB
records.dropped: 3,810/sec // no price change

// operator: tumbling window
window.type: "TumblingEventTime(5m)"
late_data.dropped: 12 // arrived past allowed lateness

// sink: writing to lakehouse
iceberg.commit: SUCCESS
checkpoint.duration: 412ms
pipeline.status: RUNNING

// 05 — failure modes

Where stream
jobs break.

Ranked by frequency of incidents in continuous scraping pipelines. Stateful stream processing introduces complexities that batch jobs simply don't have.

JOBS MONITORED · · · 120+ active

AVG LATENCY · · · · 45ms

UPDATED · · · · · · 2026-05-19

01

State backend bloat

OOM risk · Unbounded keys in deduplication exhaust RocksDB memory

02

Checkpoint timeouts

I/O bound · Slow writes to S3 cause snapshot failures and restarts

03

Watermark delays

logic error · Late-arriving scraped data stalls event-time windows

04

Backpressure

throughput · Sink database is slower than the Kafka source feed

05

Serialization overhead

CPU bound · Inefficient JSON parsing consumes worker threads

// 06 — our architecture

Sub-second latency,

at millions of events per hour.

DataFlirt uses Apache Flink to power our real-time delivery tier. When a client needs instant notifications for competitor price drops or out-of-stock events, batch processing is useless. We route raw extraction events through Kafka into Flink, where stateful operators maintain the last-known state of every SKU. If a new scrape shows a price delta, Flink emits an alert event instantly. Exactly-once semantics ensure you never get duplicate alerts, even if a worker node crashes mid-stream.

Flink Job Status

Live metrics from a production price-monitoring stream.

job.name df-price-monitor-eu

uptime 42d 18h

throughput 12,400 msg/s

backpressure none

checkpoint.status COMPLETED

late_records 0.01%

delivery.latency 38ms

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About stream processing, state management, exactly-once semantics, and how DataFlirt handles real-time data pipelines.

Ask us directly →

What is the difference between Apache Flink and Apache Spark? +

Spark was built for batch processing and handles streams via "micro-batching" (processing small chunks of data at intervals). Flink is a native stream processor that handles events one by one as they arrive. For true sub-second latency in scraping pipelines, Flink is the superior architecture. Spark is better suited for massive historical data transformations.

Do I need Flink for a daily scraping pipeline? +

No. If you are scraping a catalog once a day and delivering a CSV, Flink is massive overkill. Use Apache Airflow and standard batch processing. Flink is designed for continuous, unbounded data feeds where latency matters — like financial tick data, live inventory monitoring, or real-time news aggregation.

How does Flink handle late-arriving scraped data? +

Through watermarks and allowed lateness. Because scrapers can experience network delays or proxy retries, an event scraped at 10:00 might not hit Kafka until 10:02. Flink uses event-time processing (the timestamp of the scrape) rather than processing-time (when Flink saw it), ensuring windows aggregate correctly even if data arrives out of order.

Is Flink used for the actual web scraping? +

No. Flink is a data processing engine, not a web crawler. The fetch and extraction layers (using Playwright, Scrapy, or Go) push their structured output into a message broker like Kafka. Flink consumes from Kafka to clean, enrich, and route the data. It sits strictly in the post-extraction data engineering layer.

How does DataFlirt ensure exactly-once delivery? +

We use Flink's distributed checkpointing mechanism combined with transactional sinks (like Apache Iceberg or Kafka transactional producers). If a Flink worker crashes, the system rolls back to the last successful checkpoint and replays the stream from that exact offset. You never receive duplicate records or miss a scrape.

Are there legal implications to real-time data processing? +

The speed of processing doesn't change the legality of the scraping itself — public data remains public. However, real-time pipelines are often used for high-frequency trading, ticket scalping, or aggressive inventory hoarding. These use cases face intense scrutiny and aggressive anti-bot countermeasures from targets. We strictly vet real-time use cases to ensure they comply with target ToS and our own ethical guidelines.

$ dataflirt scope --new-project --target=apache-flink READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h