← Glossary / Apache Kafka

What is Apache Kafka?

Q: How do you handle scraping full pages that exceed Kafka's 1MB limit?

You have two options: increase max.message.bytes (which degrades cluster performance), or use the Claim Check pattern. In the Claim Check pattern, the scraper writes the heavy HTML blob to an S3 bucket and publishes a tiny message to Kafka containing the S3 URI. The consumer reads the URI from Kafka and streams the payload directly from S3.

Apache Kafka is a distributed event streaming platform used in scraping infrastructure to decouple the fetch, extract, and delivery layers. Instead of a scraper writing directly to a database and crashing when the connection drops, it publishes raw HTML or parsed JSON payloads to a Kafka topic. It acts as the high-throughput, fault-tolerant nervous system of a data pipeline, ensuring no scraped record is ever lost in transit.

Event StreamingPub/SubDecouplingHigh ThroughputData Engineering

// 02 — definitions

Decouple the
pipeline.

Why point-to-point scraping architectures fail at scale, and how an immutable commit log prevents data loss during downstream outages.

Ask a DataFlirt engineer →

TL;DR

Apache Kafka replaces brittle, synchronous scraping pipelines with an asynchronous, event-driven architecture. Scrapers act as producers, pushing raw or extracted data to topics. Downstream consumers (parsers, validators, database loaders) read at their own pace. If your database goes down, scrapers keep running; the data just waits in Kafka.

01Definition & structure

Apache Kafka is a distributed event streaming platform. In a scraping context, it acts as a buffer between systems operating at different speeds. It consists of Topics (categories of data, like "raw-html"), Partitions (shards of a topic for parallel processing), Producers (scrapers pushing data), and Consumers (extractors reading data). Unlike traditional queues, Kafka stores messages on disk for a configured retention period, allowing multiple consumers to read the same data independently.

02How it works in practice

A typical Kafka-backed scraping pipeline has three stages. First, a fleet of lightweight fetchers requests URLs and publishes the raw HTTP responses to an ingest topic. Second, a consumer group of extraction workers reads this topic, applies CSS/XPath selectors, and publishes structured JSON to a parsed-data topic. Finally, a delivery worker reads the parsed topic and bulk-inserts the records into a data warehouse like Snowflake or BigQuery.

03Partitioning for scraping scale

Kafka guarantees message ordering only within a single partition. If you need to process updates for a specific product in the exact order they were scraped, you must use the product ID or URL as the message key. This ensures all updates for that item route to the same partition. However, if you key by domain name, a massive site like Amazon will flood a single partition, causing severe consumer lag while other partitions sit idle.

04How DataFlirt handles it

We run multi-tenant Kafka clusters to orchestrate millions of daily scrape jobs. To protect the pipeline, we enforce strict schema validation at the producer level using a Schema Registry. If a scraper tries to publish a payload that doesn't match the expected Avro or Protobuf schema, the message is rejected before it ever enters Kafka, preventing downstream consumers from crashing on malformed data.

05The message size trap

Kafka is optimized for millions of small messages (under 100 KB). Scraping full-page HTML, base64-encoded images, or PDF documents often exceeds the default 1MB limit. Forcing Kafka to handle 10MB payloads causes severe memory pressure and garbage collection pauses on the brokers. The standard fix is the Claim Check pattern: store the heavy blob in S3 and pass only the lightweight S3 URL through the Kafka topic.

// 03 — the streaming math

Sizing a scraping
cluster.

Kafka scales horizontally, but only if you partition correctly. DataFlirt's infrastructure team uses these models to provision topics for high-volume catalog crawls.

Required Partitions = P = max(T_p / p, T_c / c)

Target throughput divided by single-producer (p) or single-consumer (c) capacity. Kafka scaling heuristic

Consumer Lag = L = Offset_latest − Offset_consumer

If L grows continuously, your extraction workers are slower than your fetch workers. Pipeline health metric

Retention Storage = S = Rate × Size × Time × Replicas

Disk required to store raw HTML for 7 days of replayability. Capacity planning

// 04 — pipeline trace

From raw bytes
to parsed records.

A trace of a single scraped product page flowing through a Kafka-backed architecture. The fetcher produces raw HTML, the extractor consumes it and produces a structured JSON record.

Producer APIConsumer GroupOffset Commit

edge.dataflirt.io — live

CAPTURED

// producer: fetch worker
topic: "raw-html-ingest"
key: "example.com/product/123"
payload_size: 142 KB
status: ACK_ALL

// consumer: extraction worker
group_id: "extractor-group-v4"
offset_commit: 8492011
schema_validation: PASS

// producer: delivery worker
topic: "parsed-records-out"
payload: {"price": 29.99, "stock": true}
lag_monitor: 12ms // healthy

// 05 — failure modes

Where streaming
pipelines break.

Kafka is bulletproof, but scraping workloads introduce unique anti-patterns. Ranked by frequency of incidents in poorly tuned data pipelines.

CLUSTER UPTIME · · · 99.99%

AVG MSG SIZE · · · · 45 KB

UPDATED · · · · · · 2026-05-19

Message size limits exceeded

configuration · Raw HTML blobs exceeding the default 1MB max.message.bytes limit

Consumer lag / extraction bottleneck

compute · Fetch workers outpace CPU-heavy parsing workers

Partition skew

routing · Keying by domain sends 90% of traffic to a single partition

Poison pill messages

schema · Unhandled deserialization errors blocking the consumer queue

Offset commit failures

logic · Processing timeouts causing duplicate record extraction

// 06 — architecture

Fetch fast,

process at your own pace.

In a synchronous pipeline, a slow database insert slows down your web scraper, increasing the risk of proxy timeouts and bot detection. By introducing Kafka, DataFlirt isolates the volatile network layer from the heavy compute layer. Our fetchers dump raw bytes into Kafka and immediately move to the next URL. Extraction workers consume those bytes asynchronously, scaling up and down based on topic lag without ever touching the target website.

kafka-topic-status

Live metrics for a high-volume raw HTML ingestion topic.

topic.name df-raw-ingest-prod

partitions 128

replication.factor 3fault-tolerant

retention.ms 6048000007 days

consumer.lag 4,102 msgsauto-scaling

max.message.bytes 52428805MB override

throughput.in 14.2 MB/s

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about using Kafka for web scraping, handling large payloads, and managing consumer lag.

Ask us directly →

Why use Kafka instead of RabbitMQ or Redis for scraping? +

Kafka is an immutable commit log, while RabbitMQ and Redis are ephemeral message queues. If your extraction logic has a bug and corrupts data, Kafka allows you to reset your consumer offset and replay the raw HTML from yesterday. With RabbitMQ, once the message is consumed, it's gone — you'd have to re-scrape the target website.

How do you handle scraping full pages that exceed Kafka's 1MB limit? +

You have two options: increase max.message.bytes (which degrades cluster performance), or use the Claim Check pattern. In the Claim Check pattern, the scraper writes the heavy HTML blob to an S3 bucket and publishes a tiny message to Kafka containing the S3 URI. The consumer reads the URI from Kafka and streams the payload directly from S3.

Does Kafka guarantee exactly-once processing for scraped data? +

Kafka supports exactly-once semantics via transactional APIs, but it's usually overkill for scraping. We rely on at-least-once delivery combined with idempotent downstream operations. If a consumer crashes and re-processes a message, the database layer handles it via an upsert (INSERT ON CONFLICT), ensuring no duplicates in the final dataset.

How does DataFlirt monitor and manage consumer lag? +

We export Kafka consumer group metrics to Prometheus. If the lag on the raw-html-ingest topic exceeds our 30-second threshold, our Kubernetes cluster automatically spins up additional extraction pods to chew through the backlog. Once the lag drops, the pods scale back down.

What is a 'poison pill' in a scraping pipeline? +

A poison pill is a message (like a malformed JSON payload or unexpected binary data) that causes the consumer to crash during deserialization. Because Kafka requires consumers to process messages in order, the consumer restarts, hits the same message, and crashes again. We prevent this by routing failed deserializations to a Dead Letter Queue (DLQ) and advancing the offset.

Is Kafka overkill for a small scraping project? +

Yes. If you are scraping 10,000 pages a day on a single server, stick to a simple Postgres table or Redis queue. Kafka introduces significant operational overhead. It becomes necessary when you hit millions of pages per day, require distributed workers, or need multiple independent systems (e.g., a parser, an ML classifier, and an archiver) to consume the same scraped data.

$ dataflirt scope --new-project --target=apache-kafka READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

What is Apache Kafka?

Decouple thepipeline.

TL;DR

Sizing a scrapingcluster.

From raw bytesto parsed records.

Where streamingpipelines break.

Message size limits exceeded

Consumer lag / extraction bottleneck

Partition skew

Poison pill messages

Offset commit failures

Fetch fast,

kafka-topic-status

Stay ahead of the pipeline

Data engineeringintel, weekly.

Commonquestions.

Tell us whatto extract.We do the rest.

Related glossary terms

Message Queue

Dead Letter Queue

Apache Flink

Real-Time Data Pipeline