
What is Redis Queue?

A Redis Queue is a task queue built on Redis, an in-memory data structure store, and used as a high-throughput message broker to distribute URLs and extraction tasks across a fleet of scraping workers. Because it operates entirely in RAM, a single instance can handle over 100,000 push/pop operations per second with sub-millisecond latency. In distributed scraping, it is the central nervous system that prevents duplicate fetches, manages retries, and ensures workers never sit idle waiting for I/O.

Message Broker · Distributed Crawling · In-Memory · Task Queue · State Management
// 02 — definitions

The central
dispatcher.

How distributed scraping fleets coordinate millions of concurrent requests without stepping on each other's toes or fetching the same URL twice.

TL;DR

A Redis queue decouples URL discovery from data extraction. Crawlers push discovered links to Redis; idle workers pop them off to fetch and parse. It provides atomic operations, meaning two workers will never pop the same URL, and its native Set data structures make O(1) deduplication trivial at scale.

01 Definition & structure
A Redis Queue leverages the in-memory data structures of Redis to manage tasks across distributed systems. In a scraping context, it typically relies on three native structures:
  • Lists — Used as the actual queue (FIFO or LIFO) where URLs wait to be fetched.
  • Sets — Used for deduplication. A worker adds each URL hash with SADD and only pushes the URL to the list if the command reports the hash as new.
  • Sorted Sets — Used for priority queues, allowing high-value URLs (like category indexes) to jump ahead of deep product pages.
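Because Redis executes commands on a single thread, each individual command is atomic: two workers can never pop the same item, and only one of them will see SADD report a given hash as new. Sequences of several commands are not atomic by default, so check-then-act logic should rely on a single command's return value, a MULTI/EXEC transaction, or a Lua script.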
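The priority-queue idea can be sketched in plain Python. The `PriorityFrontier` class below is a hypothetical in-memory stand-in that mimics how a Sorted Set orders members by score, lowest first, in the way ZPOPMIN does. The class name, method names, and score values are assumptions made for illustration and do not belong to any real Redis client.

```python
import heapq
import itertools


class PriorityFrontier:
    """In-memory stand-in for a Redis Sorted Set used as a priority queue."""

    def __init__(self):
        self._heap = []
        self._scores = {}                  # member -> score, like the ZSET index
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, url, score):
        """Behaves like ZADD with NX: members already queued are left alone."""
        if url in self._scores:
            return 0
        self._scores[url] = score
        heapq.heappush(self._heap, (score, next(self._counter), url))
        return 1

    def pop_min(self):
        """Behaves like ZPOPMIN: return the (url, score) with the lowest score."""
        if not self._heap:
            return None
        score, _, url = heapq.heappop(self._heap)
        del self._scores[url]
        return url, score


frontier = PriorityFrontier()
frontier.add("https://target.com/p/123", score=2)           # deep product page
frontier.add("https://target.com/category/shoes", score=0)  # category index
added_again = frontier.add("https://target.com/p/123", score=2)  # duplicate
first = frontier.pop_min()
```

Here the category index is popped first even though it was queued second. One difference worth noting: a real Sorted Set breaks score ties lexicographically by member, whereas this sketch breaks them by insertion order.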
02 How it works in practice
A "seed" worker pushes the initial start URLs into the Redis list. Hundreds of extraction workers connect to Redis and issue a blocking pop command (BRPOP). When a URL appears, Redis hands it to exactly one worker. That worker fetches the page, extracts the data, and finds new links. It hashes the new links, checks the Redis deduplication set, and pushes any unseen links back into the queue. This cycle continues until the queue is empty and no worker is still processing a page that could yield new links.
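The cycle above can be sketched as a single-process simulation. `FakeRedis` is a hypothetical in-memory stand-in whose three methods are named after the Redis commands they imitate (SADD, LPUSH, RPOP); `enqueue_if_new`, `crawl`, and the three-page `SITE` map are likewise assumptions for illustration, and a non-blocking pop replaces BRPOP so the loop can terminate.

```python
import hashlib
from collections import deque


class FakeRedis:
    """Tiny in-memory stand-in exposing the three commands the cycle needs."""

    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, member):
        """Return 1 if the member was newly added, 0 if already present."""
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        """Push onto the head of the list and return the new length."""
        items = self.lists.setdefault(key, deque())
        items.appendleft(value)
        return len(items)

    def rpop(self, key):
        """Pop from the tail, or return None when the list is empty."""
        items = self.lists.get(key)
        return items.pop() if items else None


def enqueue_if_new(r, url):
    """Dedupe and enqueue: the SADD return value says whether the URL is new."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    if r.sadd("crawl:dupefilter", digest) == 1:
        r.lpush("crawl:requests", url)
        return True
    return False


def crawl(r, seeds, fetch_links):
    """Run the seed -> pop -> fetch -> dedupe -> push cycle until the queue drains."""
    for url in seeds:
        enqueue_if_new(r, url)
    visited = []
    while (url := r.rpop("crawl:requests")) is not None:
        visited.append(url)
        for link in fetch_links(url):
            enqueue_if_new(r, link)
    return visited


# Hypothetical three-page site standing in for real HTTP fetches.
SITE = {
    "https://target.com/": ["https://target.com/a", "https://target.com/b"],
    "https://target.com/a": ["https://target.com/b", "https://target.com/"],
    "https://target.com/b": [],
}
visited = crawl(FakeRedis(), ["https://target.com/"], lambda u: SITE[u])
```

Each page is visited once even though the pages link back to each other. In this sketch the dedupe check and the push are two separate steps, so a worker that dies between them would mark a URL as seen without queueing it; a production setup might wrap both in one script or transaction.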
03 The OOM failure mode
The most common failure in distributed scraping is queue bloat. URL discovery is fast (parsing HTML), but extraction is slow (network I/O, anti-bot delays). If each processed page yields 100 new links but the fleet only processes 10 pages per second, roughly 1,000 URLs enter the queue every second while only 10 leave, so before deduplication the backlog grows by about 990 URLs per second for as long as discovery continues. Because Redis stores everything in RAM, it will eventually hit its memory limit and crash (Out of Memory). Proper architecture requires backpressure: pausing the discovery workers when the queue depth exceeds a safe threshold.
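A small simulation makes the effect of backpressure concrete. This is a simplified sketch that ignores deduplication and treats the fleet as one fixed-rate consumer; the function name, the starting depth, and the 5,000-task threshold are illustrative assumptions.

```python
def simulate_queue(seconds, pages_per_sec=10, links_per_page=100, max_depth=None):
    """Return the queue depth after each second of a simplified crawl.

    Every processed page removes one URL and, unless discovery is paused,
    adds links_per_page new ones. Deduplication is ignored (worst case).
    """
    depth = pages_per_sec  # assume the queue starts with one second of work
    history = []
    for _ in range(seconds):
        processed = min(depth, pages_per_sec)
        depth -= processed
        # Backpressure: skip pushing new links while the queue is too deep.
        paused = max_depth is not None and depth >= max_depth
        if not paused:
            depth += processed * links_per_page
        history.append(depth)
    return history


unbounded = simulate_queue(60)
bounded = simulate_queue(60, max_depth=5_000)
```

Without a threshold the depth climbs by 990 tasks every second. With the threshold it overshoots by at most one second of discovery and then drains until it drops back under the limit.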
04 How DataFlirt handles it
We treat Redis strictly as ephemeral state. Our pipelines use RedisBloom for deduplication, keeping memory usage flat regardless of crawl size. We implement strict queue depth limits — if the pending queue hits 5 million tasks, discovery pauses automatically. Furthermore, we use the reliable queue pattern: tasks are leased to workers with a TTL. If a worker pod is killed by the cloud provider, the lease expires and the task is safely returned to the pool.
05 Did you know?
Redis is so fast that network latency is almost always the bottleneck, not the CPU. A standard Redis instance can handle over 100,000 operations per second. If your scraping queue feels slow, it is usually because your workers are located in a different AWS region or availability zone than your Redis cluster, adding network round-trip time to every single push and pop. At 10-20ms per round trip, a worker that issues commands one at a time is capped at 50-100 operations per second on that connection, no matter how fast Redis itself is.
// 03 — queue dynamics

How fast can
you dispatch?

Redis throughput is rarely the bottleneck in a scraping pipeline. Memory capacity and network latency between the broker and the worker fleet dictate your architectural limits.

Queue length over time: L(t) = L(t-1) + (rate_push - rate_pop)
If push > pop consistently, you will eventually hit an OOM crash. (Basic queuing theory)
Memory footprint: M = queued_urls × avg_payload_bytes + redis_overhead
A 10M URL queue with 500B payloads needs ~5GB of RAM for the payloads alone, before Redis overhead. (Redis capacity planning)
DataFlirt worker scaling: W = queue_depth / (target_drain_time × worker_throughput)
Our auto-scaler uses this to spin up Kubernetes pods dynamically. (DataFlirt HPA logic)
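The memory and scaling formulas translate directly into code. The sketch below is a plain restatement of them, not DataFlirt's actual autoscaler; the function names, the 10-minute drain target, and the 20 tasks per second per worker are assumed values chosen only to produce a worked example.

```python
import math


def queue_memory_bytes(queued_urls, avg_payload_bytes, redis_overhead_bytes=0):
    """M = queued_urls * avg_payload_bytes + redis_overhead."""
    return queued_urls * avg_payload_bytes + redis_overhead_bytes


def workers_needed(queue_depth, target_drain_seconds, worker_throughput_per_sec):
    """W = queue_depth / (target_drain_time * worker_throughput), rounded up."""
    return math.ceil(queue_depth / (target_drain_seconds * worker_throughput_per_sec))


# 10M URLs at 500 bytes each, overhead left at zero for the payload-only figure.
payload_gb = queue_memory_bytes(10_000_000, 500) / 1e9

# Assumed inputs: drain 1,402,811 tasks in 600 s at 20 tasks/s per worker.
workers = workers_needed(1_402_811, target_drain_seconds=600,
                         worker_throughput_per_sec=20)
```

Rounding up matters here: a fractional worker count rounded down would miss the drain target.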
// 04 — redis-cli monitor

Dispatching tasks
at 12,000 ops/sec.

A live trace of a Redis cluster coordinating a distributed crawl across 400 worker nodes. Notice the atomic deduplication check before pushing to the queue.

redis-cli · O(1) complexity · atomic ops · BRPOP
// Worker 1: Discovers a new URL, checks dedupe set
SADD "crawl:dupefilter" "8f2a9b... (hash)"
(integer) 1 // 1 means new, 0 means already seen

// Worker 1: Pushes to the pending queue
LPUSH "crawl:requests" "{url: 'https://target.com/p/123', depth: 2}"
(integer) 482910 // Current queue length

// Worker 42: Waiting for work, pops the task
BRPOP "crawl:requests" 0
1) "crawl:requests"
2) "{url: 'https://target.com/p/123', depth: 2}"

// Worker 17: Crashes mid-fetch. Its task is still in the processing list, so the reliable queue pattern (LMOVE) re-queues it.
LMOVE "crawl:processing" "crawl:requests" RIGHT LEFT
"{url: 'https://target.com/p/099', retries: 1}" // Task re-queued
// 05 — failure modes

Why queues
collapse.

Redis is bulletproof until you misconfigure it. Ranked by frequency of pipeline outages across unmanaged distributed scraping setups.

INCIDENTS ANALYSED: 1,200+ pipelines
PRIMARY CAUSE: Memory exhaustion
UPDATED: 2026-05-19
01 Out of Memory (OOM): fatal crash · Discovery outpaces extraction; RAM fills up
02 Unacknowledged task leaks: zombie state · Workers crash without releasing processing locks
03 Dedupe set bloat: slow degradation · Keeping 500M seen URLs in RAM without Bloom filters
04 Network partition: idle workers · Workers lose connection to the broker
05 AOF fsync blocking: latency spikes · Disk persistence blocking the single-threaded event loop
// 06 — our architecture

Decoupled state,

stateless workers.

DataFlirt runs a multi-tier Redis architecture. URL discovery and deduplication happen in a high-memory cluster using Bloom filters to keep the RAM footprint flat. The actual task dispatch runs on compute-optimized Redis nodes with strict TTLs. If a worker node dies mid-fetch, the task lease expires and another worker picks it up. The workers themselves are entirely stateless — they just wake up, ask Redis for work, and execute.

redis-cluster-metrics

Live telemetry from a production queue managing a 50M page crawl.

cluster.state ok
memory.used 14.2 GB / 32 GB
queue.depth 1,402,811 tasks
drain.rate 8,420 ops/sec
connected_clients 400 worker nodes
evicted_keys 0
dedupe.bloom_filter active · 0.01% error rate

// 07 — FAQ

Common
questions.

About message brokers, deduplication strategies, memory management, and how DataFlirt scales distributed crawls.

Why use Redis instead of RabbitMQ or Kafka for scraping?
RabbitMQ and Kafka are excellent for durable, guaranteed message delivery. Redis is often a better fit for scraping because it provides native data structures (Sets for deduplication, Sorted Sets for priority queues) alongside the queue itself, which dedicated message brokers lack. Scraping tasks are usually ephemeral: if you lose the queue, you just restart the crawl. The O(1) deduplication capability of Redis Sets is a large part of why Redis is so widely used for web crawling.
How do you handle deduplication without running out of RAM?
A naive approach stores every visited URL string in a Redis Set. At 100 million URLs, that consumes gigabytes of RAM. Production pipelines hash the URLs (e.g., SHA-1) before storing them, or better yet, use RedisBloom (Bloom filters). A Bloom filter can track 100 million items with a 1% false positive rate using just ~120MB of memory.
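The ~120MB figure can be checked with the standard sizing formulas for a classic Bloom filter, m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below is only that arithmetic; `bloom_size` is a hypothetical helper, and the memory RedisBloom actually allocates may differ from this theoretical optimum.

```python
import math


def bloom_size(n_items, false_positive_rate):
    """Return (bits, hash_count) for an optimally sized classic Bloom filter."""
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = round(bits / n_items * math.log(2))
    return bits, hashes


# 100 million URLs at a 1% false positive rate.
bits, hashes = bloom_size(100_000_000, 0.01)
megabytes = bits / 8 / 1e6
```

That works out to roughly 9.6 bits per URL, compared with 160 bits for a raw SHA-1 digest before any Set overhead. The trade-off is that a false positive makes the crawler skip a URL it has never actually seen.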
What happens if a worker crashes while processing a URL?
If you use a plain pop (LPOP, RPOP, or BRPOP), the task exists only in the worker's memory once it leaves the list, so a crash loses it for good. Reliable queues use the RPOPLPUSH pattern (LMOVE since Redis 6.2, which deprecated RPOPLPUSH). The task is atomically moved from the "pending" list to a "processing" list. If the worker finishes, it removes the task from the processing list. If the worker crashes, a watchdog process eventually times out the task and moves it back to the pending queue.
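The pending/processing pattern with a watchdog can be sketched in memory as follows. `ReliableQueue` and its method names are hypothetical, and the lease timestamps are kept in a plain dict for brevity; against a real Redis the claim step would be a single LMOVE, and where the lease times are stored is a design choice this sketch does not settle.

```python
import time
from collections import deque


class ReliableQueue:
    """In-memory sketch of the pending/processing pattern with a lease timeout."""

    def __init__(self, lease_seconds):
        self.pending = deque()
        self.processing = {}  # task -> time the lease was taken
        self.lease_seconds = lease_seconds

    def push(self, task):
        self.pending.appendleft(task)

    def claim(self, now=None):
        """Move one task from pending to processing (the LMOVE step)."""
        if not self.pending:
            return None
        task = self.pending.pop()
        self.processing[task] = time.monotonic() if now is None else now
        return task

    def ack(self, task):
        """Worker finished: drop the task from the processing list."""
        self.processing.pop(task, None)

    def requeue_expired(self, now=None):
        """Watchdog: return tasks whose lease has expired to the pending queue."""
        now = time.monotonic() if now is None else now
        expired = [t for t, started in self.processing.items()
                   if now - started >= self.lease_seconds]
        for task in expired:
            del self.processing[task]
            self.pending.appendleft(task)
        return expired


q = ReliableQueue(lease_seconds=30)
q.push("https://target.com/p/099")
q.push("https://target.com/p/123")
done = q.claim(now=0)   # worker A claims a task and finishes it
q.ack(done)
lost = q.claim(now=0)   # worker B claims a task, then crashes without acking
recovered = q.requeue_expired(now=31)
```

After the watchdog runs, the task held by the crashed worker is back in the pending queue. A task can therefore be processed more than once if a slow worker outlives its lease, so handlers under this pattern should be safe to repeat.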
How does DataFlirt scale workers based on queue depth?
We use Kubernetes Event-driven Autoscaling (KEDA) tied directly to Redis list lengths. If the crawl:requests list grows beyond 10,000 items per active worker, the cluster automatically provisions more headless browser or HTTP worker pods. As the queue drains, the fleet scales back down to zero to minimize compute costs.
Can I use Redis for storing the extracted data?
You can, but you shouldn't. Redis is expensive RAM. It is designed for fast, ephemeral state and coordination. Extracted data should be streamed directly to a durable, disk-based sink like S3, PostgreSQL, or BigQuery. Using Redis as a database for scraped records is the fastest way to trigger an OOM crash.
What eviction policy is best for scraping queues?
For task queues, you must use noeviction. If Redis runs out of memory, it should return an error so your application can pause pushing tasks. If you use allkeys-lru or similar cache eviction policies, Redis evicts whole keys to make room, so it can silently delete the entire pending list or the deduplication set, causing your crawler to skip pages or re-crawl URLs in an endless loop.