← Glossary / Message Queue

What is Message Queue?

Message queue is the asynchronous communication layer that decouples URL discovery from data extraction in a scraping pipeline. Instead of a crawler fetching a page and immediately parsing it, it pushes the URL payload to a queue where idle worker nodes consume it at their own pace. This architecture absorbs traffic spikes, handles retries without blocking the main thread, and ensures that a temporary database outage doesn't crash an entire multi-day crawl.

InfrastructureRabbitMQKafkaAsynchronousDecoupling
// 02 — definitions

Decouple fetch
from extract.

Why synchronous scraping scripts fail at scale, and how queues introduce the backpressure needed to survive network volatility.

Ask a DataFlirt engineer →

TL;DR

A message queue acts as a buffer between pipeline stages. Crawlers push discovered URLs to the queue; fetchers pull URLs, download HTML, and push raw bytes to another queue; extractors pull bytes and push structured data. If the database goes down, the queue just fills up — no data is lost, and workers resume when the sink recovers.

01Definition & structure
A message queue is an asynchronous service-to-service communication method used in distributed systems. In a scraping context, it consists of:
  • Producers — components that create tasks (e.g., a crawler finding new URLs).
  • Broker — the central server (RabbitMQ, Redis, SQS) that holds the messages in memory or on disk.
  • Consumers — worker nodes that pull tasks from the queue, execute them (fetch/extract), and acknowledge completion.
This architecture ensures that producers and consumers don't need to know about each other, allowing them to scale independently.
02Why synchronous scraping breaks
A basic Python script runs synchronously: it fetches a URL, waits for the response, parses the HTML, writes to a database, and then moves to the next URL. If the database write takes 5 seconds, the network connection sits idle. If the script crashes on URL 500, you lose the state of the remaining 9,500 URLs. Queues solve this by turning the pipeline into a series of independent, non-blocking steps.
03Backpressure and rate limiting
Queues provide natural backpressure. If a target site slows down and your fetchers can only process 10 requests per second, but your crawler is discovering 100 URLs per second, the queue absorbs the difference. The crawler doesn't crash from memory exhaustion, and the target site isn't DDOSed. You control the exact concurrency by limiting the number of consumer workers attached to the queue.
04How DataFlirt handles it
We use a multi-tier broker architecture. RabbitMQ handles the control plane: URL discovery, retry loops, and proxy rotation signals. Apache Kafka handles the data plane: storing raw HTML payloads and extracted JSON records. This separation ensures that a massive spike in data volume (Kafka) never impacts the latency of critical control messages (RabbitMQ). Our workers scale dynamically based on the depth of the RabbitMQ queues.
05The poison pill problem
A "poison pill" is a message that consistently crashes the consumer attempting to process it (e.g., a malformed JSON payload that triggers an unhandled exception). If the consumer crashes before acknowledging the message, the broker requeues it. The next consumer picks it up, crashes, and the cycle repeats, eventually taking down the entire worker fleet. The solution is tracking delivery counts and routing messages to a Dead Letter Queue after a strict retry limit.
// 03 — queue dynamics

How deep should
your queue get?

Queue depth is a function of producer rate versus consumer rate. DataFlirt monitors these metrics to trigger auto-scaling events before memory limits are breached or data freshness SLAs are violated.

Queue Depth (Backlog) = D = ∫ (RinRout) dt
Accumulated difference between publish rate and acknowledgment rate. Queueing Theory
Little's Law (Latency) = L = λ × W
Average items in queue = arrival rate × average wait time. Operations Research
Worker Scaling Threshold = Wtarget = ⌈ D / (Tmax × Crate) ⌉
Workers needed to clear depth D within max acceptable time T. DataFlirt auto-scaler logic
// 04 — broker trace

Routing 10,000 URLs
through RabbitMQ.

A live trace of a URL payload moving from the discovery exchange to a worker queue, including a failed fetch that triggers a retry cycle.

AMQP 0-9-1RabbitMQPrefetch=10
edge.dataflirt.io — live
CAPTURED
// producer publish
exchange: "scrape.urls.direct"
routing_key: "target_amazon_in"
payload: {"url": "...", "retry_count": 0}

// consumer delivery
queue: "worker.fetch.high_priority"
delivery_tag: 84921
worker_id: "node-fetch-04"

// worker execution (HTTP 429 encountered)
status: HTTP 429 Too Many Requests
action: basic.nack(requeue=false)

// dead-letter routing for retry
x-death.reason: "rejected"
routed_to: "queue.retry.delay_60s" // TTL applied
status: message preserved for backoff
// 05 — failure modes

Where message
queues break.

Ranked by frequency of incidents across distributed scraping architectures. Unbounded queues and unacknowledged messages are the silent killers of pipeline memory.

BROKER INCIDENTS ·  ·  ·  Trailing 12 months
SEVERITY ·  ·  ·  ·  ·    Pipeline stall
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Consumer starvation (OOM)

Memory exhaustion · Publishers outpace consumers; unbounded queue crashes broker
02

Unacknowledged buildup

Prefetch limits · Workers crash without sending basic.ack; messages hang in limbo
03

Poison pill loops

Infinite retries · Malformed payload crashes worker, gets requeued, crashes next worker
04

Connection churn

TCP exhaustion · Opening a new AMQP connection per message instead of pooling
05

Network partition

Split-brain · Clustered brokers lose sync during transient network drops
// 06 — DataFlirt's broker layer

Buffer everything,

drop nothing.

DataFlirt runs a dual-broker architecture. High-throughput, append-only workloads like raw HTML storage use Apache Kafka, where consumer groups can replay days of data if an extraction schema changes. For complex routing — like prioritizing CAPTCHA-solving tasks or managing exponential backoff for rate-limited targets — we use RabbitMQ. Every message carries a trace ID, ensuring we can track a single extracted price back to the exact millisecond it was queued.

Broker health metrics

Live telemetry from our primary RabbitMQ cluster during a peak e-commerce crawl.

cluster.status 3 nodes · healthy
connections.active 1,240
publish.rate 14,200 msg/s
deliver.rate 14,185 msg/s
queue.depth.max 45,000 msgs
dlq.size 124 msgs
memory.alarm false

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About broker selection, worker scaling, retry logic, and how DataFlirt ensures zero data loss across distributed scraping pipelines.

Ask us directly →
Redis vs RabbitMQ vs Kafka for scraping? +
Redis (via Bull/Celery) is great for simple task queues but lacks advanced routing. RabbitMQ is the gold standard for scraping: it handles complex routing keys, priority queues, and delayed retries natively. Kafka is best for event streaming and log aggregation — we use it to store raw HTML payloads so we can replay extraction without re-fetching the target.
What happens if a worker crashes mid-scrape? +
If configured correctly, nothing is lost. The worker pulls the message but doesn't send an acknowledgment (basic.ack) until the data is safely written to the database. If the worker's TCP connection drops before the ack, the broker automatically requeues the message and delivers it to another healthy worker.
How do you handle rate limits with a queue? +
We use Dead Letter Exchanges (DLX) combined with message TTLs. If a worker hits a 429 Too Many Requests, it rejects the message. The broker routes it to a "delay queue" with a 60-second TTL. When the TTL expires, the message is automatically routed back to the active queue. This creates a non-blocking exponential backoff loop.
What is a Dead Letter Queue (DLQ)? +
A DLQ is where messages go to die — safely. If a URL fails to fetch after 5 retries, or if the payload is malformed and crashes the parser (a "poison pill"), the broker routes it to the DLQ. Engineers monitor the DLQ to debug edge cases without the bad messages clogging up the main pipeline.
How does DataFlirt scale workers based on queue depth? +
We use Kubernetes Event-driven Autoscaling (KEDA) tied to RabbitMQ metrics. If the worker.fetch queue depth exceeds 10,000 messages, KEDA spins up additional pod replicas. As the queue drains back to baseline, the pods scale down to zero, optimizing cloud compute costs.
Should I queue URLs or raw HTML? +
Both, but in different queues. The "Discovery Queue" holds lightweight URLs. The fetcher consumes URLs and pushes heavy HTML payloads to the "Extraction Queue". This prevents a slow CSS selector from blocking a fast HTTP request, allowing you to scale fetchers and extractors independently.
$ dataflirt scope --new-project --target=message-queue READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h