What is a Retry Queue?

A retry queue is a dedicated message broker channel that temporarily holds failed scraping tasks—timeouts, 429s, proxy drops—before re-injecting them into the worker pool. Without a robust retry mechanism, transient network failures translate directly into permanent data loss. It is the shock absorber between an unpredictable target server and a rigid data delivery contract, ensuring temporary hiccups don't compromise dataset completeness.

Message Broker · Fault Tolerance · RabbitMQ · Exponential Backoff · Data Completeness
// 02 — definitions

Catching the fall.

How scraping infrastructure handles the inevitable reality that the internet is hostile, proxies drop connections, and target servers rate-limit aggressively.

TL;DR

A retry queue isolates failed requests from the main ingestion pipeline. Instead of blocking active workers or dropping the URL entirely, the task is parked with a backoff delay. It is the architectural difference between a 98% complete dataset and a 99.99% complete dataset.

01 Definition & structure
A retry queue is a secondary message broker channel (often implemented in RabbitMQ, Redis, or Kafka) running parallel to the main task queue. It contains the original request payload alongside critical metadata: the current attempt count, the last error code received, and the scheduled execution time for the next attempt.
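The message shape described above can be sketched as a small envelope type. This is a minimal illustration, not DataFlirt's actual schema: the class name `RetryEnvelope` and its field names are hypothetical, chosen to mirror the payload and the three metadata fields listed in the definition.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RetryEnvelope:
    """Hypothetical retry-queue message: original payload plus retry metadata."""
    payload: dict                      # the original request, carried unchanged
    attempt: int = 0                   # attempts made so far
    last_error: Optional[str] = None   # e.g. "HTTP 503" or "ProxyTimeout"
    not_before: float = 0.0            # Unix time of the next scheduled attempt

    def to_json(self) -> str:
        # Serialise for the broker; every field is JSON-friendly.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "RetryEnvelope":
        return cls(**json.loads(raw))


envelope = RetryEnvelope(
    payload={"task": "fetch_product_page", "url": "https://target.com/p/1042"},
    attempt=1,
    last_error="ProxyTimeout",
    not_before=time.time() + 15,
)
restored = RetryEnvelope.from_json(envelope.to_json())
```

Keeping the metadata next to the payload means any worker can pick up the message and decide what to do without consulting shared state.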
02 How it works in practice
When a worker hits a transient error (like a 503 Service Unavailable), it acknowledges the message from the main queue but publishes a clone to the retry queue with attempt = attempt + 1. A delayed exchange holds the message for a calculated duration. Once the Time-To-Live (TTL) expires, the message is routed back to an available worker for another try.
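A rough sketch of that handoff, kept free of any broker client so it runs on its own. The function name `build_retry_publish` and the message fields are assumptions. The `x-delay` header, in milliseconds, is how RabbitMQ's delayed-message exchange plugin reads the delay.

```python
import copy


def build_retry_publish(message: dict, error: str, delay_seconds: float) -> dict:
    """Build the clone a worker would publish to the retry exchange.

    The original message is left untouched so it can still be acknowledged
    on the main queue.
    """
    clone = copy.deepcopy(message)
    clone["attempt"] = message.get("attempt", 0) + 1
    clone["last_error"] = error
    return {
        "body": clone,
        # The delayed-message exchange holds the message for `x-delay` ms.
        "headers": {"x-delay": int(delay_seconds * 1000)},
    }


original = {"task": "fetch_product_page", "url": "https://target.com/p/1042", "attempt": 0}
publish = build_retry_publish(original, "HTTP 503", delay_seconds=15)
```

One design point worth considering: publishing the clone before acknowledging the original means a worker crash between the two steps produces a duplicate rather than a lost task.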
03 Backoff strategies and jitter
Fixed-interval retries (e.g., trying every 5 seconds) will hammer a struggling server. Exponential backoff doubles the delay with each attempt, giving the target breathing room. Adding jitter (randomized variance) prevents the "thundering herd" problem, where hundreds of retried workers wake up at the exact same millisecond and immediately crash the target again.
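A minimal sketch of capped exponential backoff with additive jitter. The parameter names and defaults (`base`, `cap`, a 25% jitter fraction) are illustrative assumptions, not DataFlirt's production values.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The delay doubles with each attempt, is capped at `cap`, and then gets a
    random extra of up to `jitter` times the capped delay so that retries
    from many workers do not line up on the same instant.
    """
    rng = rng or random
    exponential = min(cap, base * 2 ** attempt)
    return exponential + rng.uniform(0, jitter * exponential)


# With base=5 the un-jittered delays run 5, 10, 20, 40, ... up to the 300 s cap.
delays = [backoff_delay(a, base=5.0) for a in range(8)]
```

Passing a seeded `random.Random` as `rng` makes the schedule reproducible in tests.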
04 How DataFlirt handles it
We maintain distinct retry queues per failure domain. A proxy timeout gets retried immediately on a new IP. An HTTP 429 gets a 5-minute exponential backoff. An HTTP 403 Forbidden skips the retry queue entirely and alerts an engineer, as retrying a block usually results in a burned IP. Our scheduler dynamically adjusts these rules based on real-time target health.
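The per-domain rules above could be expressed as a small policy table. This is a sketch under assumptions: the queue names, the `RetryPolicy` type, and the reading of "5-minute" as a starting delay are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    queue: Optional[str]   # retry queue name; None means "never retry"
    delay_s: float         # starting delay before the next attempt
    exponential: bool      # grow the delay on each further attempt?
    rotate_proxy: bool     # switch to a fresh IP before retrying?
    alert: bool = False    # notify an engineer instead of retrying?


# Queue names are hypothetical; the behaviour follows the rules in the text.
POLICIES = {
    "proxy_timeout": RetryPolicy("retry.proxy", 0.0, False, True),
    # 300 s is the 5-minute figure from the text, treated here as the
    # starting delay (an assumption; it could equally be a ceiling).
    "http_429": RetryPolicy("retry.rate_limit", 300.0, True, False),
    "http_403": RetryPolicy(None, 0.0, False, False, alert=True),
}


def policy_for(failure_domain: str) -> RetryPolicy:
    """Look up the retry rules for a failure domain (KeyError if unknown)."""
    return POLICIES[failure_domain]
```

Keeping the rules in data rather than in branching code is one way a scheduler could adjust them at runtime.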
05 The poison pill problem
If a target permanently removes a page (returning a 404), retrying it 10 times wastes compute and proxy bandwidth. Retry queues must strictly filter by HTTP status code. Unresolvable tasks that hit their maximum attempt limit are routed to a Dead Letter Queue (DLQ) for manual inspection or automated pruning.
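The filtering rule can be sketched as a single routing function. The status sets and the limit of 5 attempts are illustrative assumptions. Sending unknown statuses to the DLQ is a conservative choice made for this example, not something the text specifies.

```python
TRANSIENT_STATUSES = {429, 502, 503, 504}   # worth another attempt
PERMANENT_STATUSES = {400, 404}             # retrying cannot help
MAX_ATTEMPTS = 5                            # hypothetical limit


def next_hop(status: int, attempt: int, max_attempts: int = MAX_ATTEMPTS) -> str:
    """Decide where a task goes after a response.

    `attempt` counts attempts already made, including the one that just
    returned `status`.
    """
    if 200 <= status < 300:
        return "done"
    if status == 403:
        return "alert"          # blocked: skip the retry queue, flag a human
    if status in PERMANENT_STATUSES:
        return "dlq"            # poison pill: never enters the retry queue
    if status in TRANSIENT_STATUSES:
        return "retry" if attempt < max_attempts else "dlq"
    return "dlq"                # unknown status: park it for inspection
```

For example, `next_hop(404, 1)` routes straight to the DLQ, while `next_hop(503, 1)` earns another attempt.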
// 03 — the backoff math

Calculating the wait time.

Retrying immediately usually guarantees a second failure. DataFlirt's scheduler calculates dynamic delays based on the error class and historical target recovery times.

Exponential Backoff with Jitter: $T = \min(T_{\max},\ \text{base} \cdot 2^{\text{attempt}}) + \text{jitter}$
Standard algorithm to prevent thundering herds when a target recovers. Source: network routing principles.
Queue Depth Velocity: $\Delta Q = \text{rate}_{\text{in}} - (\text{rate}_{\text{success}} + \text{rate}_{\text{dlq}})$
If ΔQ stays positive for >5 mins, the target is down or the proxy pool is burned. Source: DataFlirt infrastructure SLO.
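A small sketch of how the velocity and the 5-minute rule might be computed from periodic samples. The function names, the sample format, and the example rates are assumptions made for illustration.

```python
from typing import List, Tuple


def queue_depth_velocity(rate_in: float, rate_success: float, rate_dlq: float) -> float:
    """Net growth of the retry queue in messages per second."""
    return rate_in - (rate_success + rate_dlq)


def sustained_growth(samples: List[Tuple[float, float]], window_s: float = 300.0) -> bool:
    """True if the velocity has been positive for more than `window_s` seconds.

    `samples` is a list of (timestamp_seconds, velocity) pairs in ascending
    time order; only the unbroken positive streak ending at the latest
    sample counts.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    streak_start = None
    for ts, velocity in reversed(samples):
        if velocity <= 0:
            break
        streak_start = ts
    return streak_start is not None and latest_ts - streak_start > window_s


# Illustrative rates: 112 msg/s in, 108 succeeded, 12 dead-lettered.
velocity = queue_depth_velocity(112, 108, 12)   # negative: the queue is draining
```

A negative velocity means the queue is draining. Only a positive streak longer than the window should raise the alarm.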
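Ultimate Success Probability: $P(\text{success}) = 1 - \prod_{i=1}^{n} P(\text{fail}_i)$
Why 3 attempts (the first try plus 2 retries) push a 90% per-attempt success rate to 99.9%, assuming failures are independent. Source: probability theory.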
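The arithmetic is easy to check. The sketch below assumes every attempt fails independently with probability 0.1 (a 90% per-attempt success rate). The function name is hypothetical.

```python
from math import prod


def ultimate_success(fail_probs):
    """P(success) = 1 - product of the per-attempt failure probabilities."""
    return 1 - prod(fail_probs)


three_attempts = ultimate_success([0.1] * 3)   # first try plus 2 retries
four_attempts = ultimate_success([0.1] * 4)    # first try plus 3 retries
print(f"{three_attempts:.4f} {four_attempts:.4f}")   # prints: 0.9990 0.9999
```

Each extra attempt adds roughly one more "nine", but only while failures stay independent. A target that is fully down fails every attempt, and no number of retries helps.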
// 04 — worker trace

A URL's journey through the queue.

Trace of a single product URL hitting a proxy timeout, entering the retry queue, applying backoff, and succeeding on the third attempt.

RabbitMQ · Celery worker · exponential backoff
// attempt 1: main queue
task: fetch_product_page url: "https://target.com/p/1042"
proxy: residential_US_14
error: ProxyTimeout (10000ms)
action: route_to_retry attempt: 1 delay: 15s

// attempt 2: retry queue (15s later)
task: fetch_product_page url: "https://target.com/p/1042"
proxy: residential_US_89
error: HTTP 429 Too Many Requests
action: route_to_retry attempt: 2 delay: 60s // exponential backoff applied

// attempt 3: retry queue (60s later)
task: fetch_product_page url: "https://target.com/p/1042"
proxy: residential_US_211
status: HTTP 200 OK
action: extract_and_deliver
queue_status: ACK // removed from broker
// 05 — failure distribution

Why tasks enter the retry queue.

The distribution of transient errors that trigger a retry across DataFlirt's infrastructure. Hard errors (like 404s or 400s) bypass this and go straight to the DLQ.

SAMPLE SIZE: 18.4M retries
WINDOW: 7d trailing
UPDATED: 2026-05-19
01 Proxy connection timeout · ~42.1% · Residential node went offline mid-request
02 HTTP 429 Too Many Requests · ~28.4% · Target rate limit hit; requires backoff
03 Target server 503/502/504 · ~15.2% · Upstream infrastructure struggling
04 CAPTCHA challenge presented · ~9.8% · Session burned; retry on clean IP
05 DOM selector timeout · ~4.5% · Page loaded but JS render stalled
// 06 — queue architecture

Don't block the workers, route the failures.

A naive scraper calls time.sleep() when it hits a rate limit, paralyzing the worker thread and wasting compute. DataFlirt's architecture is fully asynchronous. When a worker encounters a transient failure, it immediately publishes the task to a delayed exchange in RabbitMQ and moves to the next URL. The retry queue handles the waiting. This keeps our worker CPU utilization above 85% even when a target site is heavily throttling us.
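The difference can be illustrated with an in-memory stand-in for the delayed exchange. This is a toy simulation on a logical clock, not broker code: `DelayedQueue`, `run`, and the one-second fetch time are all assumptions.

```python
import heapq
import itertools


class DelayedQueue:
    """In-memory stand-in for a delayed exchange: tasks become visible at `due_at`."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker so tasks are never compared

    def __len__(self):
        return len(self._heap)

    def publish(self, task, due_at):
        heapq.heappush(self._heap, (due_at, next(self._order), task))

    def next_due(self):
        return self._heap[0][0]

    def pop_due(self, now):
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


def run(urls, fails_once, delay=60.0, seconds_per_fetch=1.0):
    """Simulate one worker; URLs in `fails_once` fail on their first attempt."""
    main, retry = list(urls), DelayedQueue()
    clock, done, failed = 0.0, [], set()
    while main or len(retry):
        main.extend(retry.pop_due(clock))   # broker re-injects tasks that are due
        if not main:
            clock = retry.next_due()        # nothing to do: skip ahead in time
            continue
        url = main.pop(0)
        clock += seconds_per_fetch
        if url in fails_once and url not in failed:
            failed.add(url)
            retry.publish(url, clock + delay)   # park it and move on, no sleep
            continue
        done.append((url, clock))
    return done


timeline = run(["a", "b", "c"], fails_once={"b"})
```

In this run, "c" completes at t=3 s instead of waiting out the 60-second delay on "b", which comes back and succeeds at t=63 s.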

Retry Exchange Status

Live metrics from the primary retry broker on a retail pipeline.

broker RabbitMQ 3.12
exchange.type x-delayed-message
messages.queued 14,205
publish.rate 112 msg/s
ack.rate 108 msg/s
dlq.routed 12 msg/s
worker.utilization 88%

// 07 — FAQ

Common questions.

About retry logic, backoff strategies, queue management, and how DataFlirt prevents transient errors from ruining datasets.

What is the difference between a retry queue and a dead letter queue (DLQ)?
A retry queue holds tasks that failed due to transient issues (timeouts, 503s) and will be attempted again. A Dead Letter Queue (DLQ) holds tasks that have permanently failed—either because they hit a hard error like a 404, or because they exhausted their maximum retry attempts. The DLQ requires human or automated intervention; the retry queue resolves itself.
How many times should a scraper retry a URL?
It depends entirely on the error class. A proxy connection timeout should be retried 3–5 times immediately on new IPs. An HTTP 429 should be retried 2–3 times with a long exponential backoff. An HTTP 404 or 400 should be retried zero times. Blindly retrying every error 5 times is a great way to waste bandwidth and get banned.
Why use a dedicated retry queue instead of just sleeping in the code?
A blocking sleep holds the worker thread for the whole delay, and even a non-blocking sleep inside a coroutine keeps a concurrency slot occupied. If you have 100 workers and they all hit a 429 and sleep for 60 seconds, your pipeline throughput drops to zero for that minute. A dedicated queue frees the worker to process other URLs while the broker handles the delay logic.
How does DataFlirt handle target-wide outages?
We use circuit breakers. If the retry queue depth spikes and the ratio of 503s exceeds 20% across the fleet, the circuit breaker trips. We pause the main queue and stop publishing to the retry queue to prevent hammering a downed target. Once health checks pass, the queues resume.
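A simplified sketch of the 503-ratio half of that rule. The queue-depth condition is left out for brevity, and the class name, window size, and method names are assumptions, not DataFlirt's implementation.

```python
from collections import deque


class CircuitBreaker:
    """Trips when the share of 503s in a sliding window exceeds `threshold`."""

    def __init__(self, threshold: float = 0.20, window: int = 100):
        self.threshold = threshold
        self.results = deque(maxlen=window)   # True for each 503 observed
        self.open = False                     # open circuit = publishing paused

    def error_ratio(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, status: int) -> None:
        self.results.append(status == 503)
        window_full = len(self.results) == self.results.maxlen
        if not self.open and window_full and self.error_ratio() > self.threshold:
            self.open = True

    def allow_publish(self) -> bool:
        return not self.open

    def health_check_passed(self) -> None:
        # Close the circuit and start from a clean window.
        self.open = False
        self.results.clear()


breaker = CircuitBreaker(threshold=0.20, window=10)
for status in [200] * 7 + [503] * 3:   # 30% of the window is 503s
    breaker.record(status)
tripped = not breaker.allow_publish()
```

Waiting for a full window before tripping avoids opening the circuit on the first unlucky response.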
Can retries cause duplicate data in the final dataset?
Yes. If the extraction succeeded but the database write timed out, a naive retry will insert the record twice. We prevent this by making all pipeline writes idempotent: every record is hashed on its unique identifiers, and we use upserts (INSERT ... ON CONFLICT) at the delivery layer.
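A minimal sketch of an idempotent write, using SQLite's upsert syntax as a stand-in for the delivery-layer database. The table layout, the `record_key` helper, and the choice of source plus product ID as the unique identifiers are assumptions for illustration.

```python
import hashlib
import sqlite3


def record_key(source: str, product_id: str) -> str:
    """Stable hash of the fields that identify a record."""
    return hashlib.sha256(f"{source}|{product_id}".encode()).hexdigest()


def deliver(conn, source: str, product_id: str, url: str, price: float) -> None:
    """Insert the record, or overwrite it if the same key was already written."""
    conn.execute(
        """
        INSERT INTO products (key, url, price) VALUES (?, ?, ?)
        ON CONFLICT(key) DO UPDATE SET url = excluded.url, price = excluded.price
        """,
        (record_key(source, product_id), url, price),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (key TEXT PRIMARY KEY, url TEXT, price REAL)")

# The same record delivered twice, as a retried task would do.
deliver(conn, "target.com", "1042", "https://target.com/p/1042", 19.99)
deliver(conn, "target.com", "1042", "https://target.com/p/1042", 19.99)
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Because the key is derived from the record's identity rather than from the attempt, a replayed write lands on the same row instead of creating a second one.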
What happens if the retry queue itself fills up?
Queue depth is a primary scaling metric. If the retry queue grows faster than it drains, we autoscale worker pods. If the bottleneck is the target server (e.g., persistent 429s), throwing more workers at it makes it worse. In that case, we dynamically extend the TTL and backoff multipliers to stretch the load over a longer window.