
What is Redis?

Redis is an in-memory data structure store used in scraping pipelines as a high-throughput message broker, deduplication cache, and distributed state manager. Because it keeps all data in RAM and executes commands on a single thread, it typically delivers sub-millisecond latency for queue operations. For distributed crawlers, it's the coordination layer that prevents multiple workers from fetching the same URL or stepping on each other's rate limits.

In-Memory · Message Queue · Deduplication · State Management · Rate Limiting
// 02 — definitions

State at wire speed.

How distributed scraping fleets coordinate tasks, track seen URLs, and enforce global rate limits without bottlenecking on disk I/O.


TL;DR

Redis acts as the central nervous system for distributed scraping. It holds the URL frontier, maintains the Bloom filter for deduplication, and tracks rolling rate-limit windows per target domain. Without an in-memory store like Redis, coordinating hundreds of concurrent workers becomes an I/O nightmare.

01 Definition & structure
Redis (Remote Dictionary Server) is an open-source, in-memory key-value data store. Unlike traditional databases that serve reads and writes from disk, Redis keeps the working dataset in RAM, so reads and writes typically complete in under a millisecond. It supports data structures such as Strings, Hashes, Lists, Sets, and Sorted Sets, which makes it versatile for managing the state of distributed systems.
02 The URL Frontier & Queues
In scraping, Redis is most commonly used to manage the URL frontier. A crawler pushes discovered URLs to a Redis List using LPUSH, and worker nodes pull URLs to scrape using RPOP. Because Redis executes commands one at a time, each pop is atomic: two workers never receive the same list entry, which keeps task distribution clean across the fleet.
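The LPUSH/RPOP pattern described above can be sketched in a few lines. This is a minimal illustration, assuming `r` is a redis-py client created with `decode_responses=True`; the function names are hypothetical and the key name is borrowed from the trace later on this page.

```python
from typing import Optional

FRONTIER_KEY = "pipeline:target:queue"  # key name taken from the redis-cli trace


def enqueue_url(r, url: str, key: str = FRONTIER_KEY) -> int:
    """Producer side: LPUSH adds the URL at the head and returns the new list length."""
    return r.lpush(key, url)


def next_url(r, key: str = FRONTIER_KEY) -> Optional[str]:
    """Consumer side: RPOP removes one URL from the tail, or returns None when empty."""
    return r.rpop(key)
```

Pushing at the head and popping from the tail gives FIFO ordering. Workers that should wait on an empty queue instead of polling could swap RPOP for the blocking BRPOP command.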
03 Deduplication at scale
Crawlers naturally discover the same URLs repeatedly. To prevent infinite loops, every URL must be checked against a "seen" list. A Redis Set (SADD) works for small crawls, but it stores every URL in full, so memory grows with each new URL and becomes costly at millions of entries. Production pipelines use the RedisBloom module, which compresses the deduplication state into a memory-efficient probabilistic filter.
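To make the "probabilistic filter" idea concrete, here is a small in-process Bloom filter. It is an illustration of the technique only, not RedisBloom's implementation; the class name and the double-hashing scheme are assumptions chosen for brevity.

```python
import hashlib


class TinyBloom:
    """In-process Bloom filter; illustrates the idea, not RedisBloom's internals."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # | 1 keeps the stride non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Set the item's bits; return True if it was definitely not present before."""
        was_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                was_new = True
                self.bits[byte] |= mask
        return was_new

    def __contains__(self, item: str) -> bool:
        """False means definitely unseen; True means probably seen."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

The filter never stores the URL itself, only a handful of bits per URL, which is where the memory saving comes from. The trade-off is that a lookup can occasionally report an unseen URL as seen, so a small fraction of pages may be skipped.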
04 How DataFlirt handles it
We run dedicated Redis Clusters for our URL frontiers, completely isolated from the Redis instances we use for rate limiting. This prevents a massive queue backlog from evicting critical rate-limit tokens. We also rely heavily on Lua scripting to execute check-and-set operations (like acquiring a rate-limit token) atomically, eliminating race conditions between our distributed workers.
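As an illustration of an atomic check-and-set in Lua, the sketch below implements a simple fixed-window counter. It is not DataFlirt's script, and it is deliberately simpler than a token bucket. It assumes `r` is a redis-py client; the `ratelimit:` key prefix and the function name are hypothetical.

```python
# Fixed-window limiter as a Lua script; Redis runs the whole script atomically.
# KEYS[1] = rate-limit key, ARGV[1] = allowed requests, ARGV[2] = window in ms.
ACQUIRE_TOKEN_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return 0
end
return 1
"""


def acquire_token(r, domain: str, allowed: int, window_ms: int) -> bool:
    """Return True if this worker may fetch from `domain` right now."""
    key = f"ratelimit:{domain}"  # hypothetical key naming
    # redis-py signature: eval(script, numkeys, *keys_and_args)
    return r.eval(ACQUIRE_TOKEN_LUA, 1, key, allowed, window_ms) == 1
```

Because the increment, the expiry, and the comparison run inside one script, no other worker's command can interleave between them. The PEXPIRE call also ensures the counter key disappears once the window ends.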
05 The single-threaded trap
Because Redis processes commands sequentially on a single thread, running an O(N) command like KEYS * or fetching a massive Set with SMEMBERS will block the entire server. While that command runs, every other worker trying to pop a URL or check a rate limit is stalled. In production, always use cursor-based iterators like SCAN and SSCAN.
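A cursor-based replacement for KEYS * might look like the following sketch. It assumes `r` is a redis-py client with `decode_responses=True`; the function name and default batch hint are illustrative.

```python
from typing import Iterator


def iter_keys(r, pattern: str = "*", batch_hint: int = 500) -> Iterator[str]:
    """Yield keys matching `pattern` using SCAN, one small batch per round trip."""
    cursor = 0
    while True:
        # SCAN returns (next_cursor, keys); COUNT is only a hint, not a hard limit.
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=batch_hint)
        yield from keys
        if cursor == 0:  # a zero cursor means the iteration is complete
            break
```

Each SCAN call does a bounded amount of work, so other workers' commands get served between batches. SCAN may return the same key more than once, so callers that need uniqueness should deduplicate on their side.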
// 03 — memory math

Sizing the frontier.

Redis memory is finite and expensive. Estimating the RAM footprint of your URL queue and deduplication set is mandatory before launching a multi-million page crawl.

Bloom Filter Size = $M = -\dfrac{n \ln p}{(\ln 2)^2}$
n = items, p = false positive rate, M = filter size in bits. Drastically reduces memory vs standard Sets. RedisBloom Module
Queue Memory = Mem = queued_urls × (avg_url_bytes + 64)
64 bytes is a rough allowance for Redis overhead per list item; the actual figure depends on the list encoding. Redis Memory Optimization
Rate Limit Token = TTL = window_size_ms / allowed_requests
Leaky bucket interval. Keys must expire to prevent memory leaks. Distributed Rate Limiting
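The three sizing rules can be turned into a small calculator. This is a sketch under stated assumptions: the Bloom size uses the standard optimal formula m = -n · ln(p) / (ln 2)^2 in bits, the 64-byte overhead is the approximation quoted above, and the example workload numbers are made up for illustration.

```python
import math


def bloom_bits(n_items: int, fp_rate: float) -> int:
    """Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def bloom_hashes(n_items: int, m_bits: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


def queue_bytes(queued_urls: int, avg_url_bytes: int, overhead: int = 64) -> int:
    """Queue estimate from the text: queued_urls * (avg_url_bytes + overhead)."""
    return queued_urls * (avg_url_bytes + overhead)


def token_interval_ms(window_size_ms: int, allowed_requests: int) -> float:
    """Leaky-bucket spacing between requests for one target."""
    return window_size_ms / allowed_requests


if __name__ == "__main__":
    m = bloom_bits(100_000_000, 0.01)  # 100M URLs at a 1% false-positive rate
    print(f"bloom: {m / 8 / 1e6:.0f} MB, k={bloom_hashes(100_000_000, m)}")
    print(f"queue: {queue_bytes(10_000_000, 80) / 1e9:.2f} GB")
    print(f"interval: {token_interval_ms(1000, 5):.0f} ms")
```

For these inputs the filter comes to roughly 120 MB with 7 hash functions, a 10-million-URL queue of 80-byte URLs to 1.44 GB, and 5 requests per 1000 ms window to one request every 200 ms.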
// 04 — redis-cli

Coordinating 400 workers in real time.

A live trace of a worker node popping a task, checking the deduplication filter, and acquiring a rate-limit token before fetching.

redis-cli · O(1) operations · Lua scripting
// 1. Pop next URL from the frontier
> RPOP pipeline:target:queue
"https://target.com/product/12345"

// 2. Check if already scraped (Bloom Filter)
> BF.EXISTS pipeline:target:seen "https://target.com/product/12345"
(integer) 0 // not seen

// 3. Acquire rate limit token via Lua script
> EVALSHA 8f9a... 1 target.com 5 1000
(integer) 1 // token acquired, safe to fetch

// 4. Mark URL as seen
> BF.ADD pipeline:target:seen "https://target.com/product/12345"
(integer) 1

// worker executes HTTP fetch...
> HINCRBY pipeline:target:stats 200_ok 1
(integer) 1429
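The same sequence, seen from the worker's side, could be sketched as below. It assumes `r` is a redis-py client with `decode_responses=True` and the RedisBloom module loaded on the server. `acquire_token` and `fetch` are hypothetical callables supplied by the caller, and requeueing on a failed token check is an assumption, not something the trace shows.

```python
from typing import Callable, Optional

QUEUE = "pipeline:target:queue"
SEEN = "pipeline:target:seen"
STATS = "pipeline:target:stats"


def run_one_task(r, acquire_token: Callable[[], bool],
                 fetch: Callable[[str], int]) -> Optional[str]:
    """Run one pass of the traced sequence; return a short outcome label."""
    url = r.rpop(QUEUE)                                  # 1. pop next URL
    if url is None:
        return None                                      # frontier is empty
    if r.execute_command("BF.EXISTS", SEEN, url) == 1:   # 2. dedup check
        return "skipped"
    if not acquire_token():                              # 3. rate-limit gate
        r.lpush(QUEUE, url)                              # assumption: requeue for later
        return "throttled"
    r.execute_command("BF.ADD", SEEN, url)               # 4. mark URL as seen
    status = fetch(url)                                  # worker executes HTTP fetch
    # "200_ok" matches the trace; the other field name is hypothetical.
    field = "200_ok" if status == 200 else f"http_{status}"
    r.hincrby(STATS, field, 1)
    return "fetched"
```

Note that the existence check and the add are two separate commands, so two workers holding duplicate queue entries could both pass step 2. Since BF.ADD itself reports whether the item was newly added, a stricter variant could rely on its return value alone.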
// 05 — bottlenecks

Where Redis chokes.

Redis is fast, but it executes commands on a single thread. Misusing data structures or running expensive commands blocks the entire instance, stalling every worker that depends on it.

MAX OPS/SEC: ~120k per shard
LATENCY SLO: < 1.5ms
UPDATED: 2026-05-19
01 O(N) command blocking
KEYS *, SMEMBERS · Blocks the single thread, stalling all workers

02 Memory eviction / OOM
RAM exhaustion · Crashes workers or silently drops deduplication history

03 Network bandwidth
Saturation · Storing large HTML payloads instead of just URLs

04 Hot key concentration
Unbalanced load · All workers hitting a single rate-limit key simultaneously

05 Connection pool exhaustion
Socket limits · Too many idle worker connections left open
// 06 — our architecture

Ephemeral state, durable execution.

At DataFlirt, we treat Redis strictly as ephemeral state. It holds the active URL frontier, deduplication filters, and rate-limit counters. It never holds extracted data. By decoupling the coordination state (Redis) from the delivery sink (S3 or Kafka), we can lose a Redis node, rebuild the frontier from the durable database, and resume the crawl with minimal disruption. State is temporary; data is permanent.
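The recovery path described above could be sketched like this. It is a hypothetical helper, not DataFlirt's tooling: `pending_urls` stands in for whatever query against the durable database yields not-yet-crawled URLs, and `r` is assumed to be a redis-py client.

```python
from itertools import islice
from typing import Iterable


def rebuild_frontier(r, pending_urls: Iterable[str],
                     key: str = "pipeline:target:queue",
                     batch_size: int = 1000) -> int:
    """Reload not-yet-crawled URLs into a fresh Redis list; return how many were pushed."""
    it = iter(pending_urls)
    pushed = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return pushed
        r.lpush(key, *batch)  # one LPUSH per batch keeps round trips low
        pushed += len(batch)
```

Batching matters here because a rebuild can involve millions of URLs, and one network round trip per URL would dominate recovery time. The deduplication filter would need a similar rebuild from the list of already-crawled URLs.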

Redis Cluster Status

Live metrics from the frontier cluster of a high-volume retail crawl.

cluster.state ok
keys.total 42,194,021
memory.used 18.4 GB
ops_per_sec 84,200
eviction.policy volatile-lru
blocked_clients 0
hit_rate 98.4%

// 07 — FAQ

Common questions.

Common questions about using Redis for scraper coordination, memory management, and distributed rate limiting.

Why use Redis instead of RabbitMQ or Kafka for scraping queues?
Redis is a data structure store, not just a queue. While Kafka is better for durable, append-only logs (like extracted data), Redis allows O(1) deduplication, priority sorting via Sorted Sets (ZSETs), and atomic rate limiting. It's the Swiss Army knife of crawler coordination.
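Priority ordering with a Sorted Set might look like the sketch below. It assumes `r` is a redis-py client with `decode_responses=True`; the key name, the function names, and the convention that lower scores are served first are all illustrative choices.

```python
from typing import Optional

PRIORITY_KEY = "pipeline:target:priority"  # hypothetical key name


def enqueue_with_priority(r, url: str, priority: float,
                          key: str = PRIORITY_KEY) -> None:
    """ZADD stores the URL with a score; lower scores are served first here."""
    r.zadd(key, {url: priority})


def pop_most_urgent(r, key: str = PRIORITY_KEY) -> Optional[str]:
    """ZPOPMIN atomically removes and returns the lowest-scored URL, if any."""
    popped = r.zpopmin(key, 1)  # list of (member, score) pairs
    return popped[0][0] if popped else None
```

A side effect of this structure is that members are unique: adding a URL that is already queued only updates its score, so the queue cannot hold the same URL twice.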
What happens if Redis runs out of memory during a massive crawl?
If configured with noeviction, Redis rejects write commands with an OOM error, which crashes workers that don't handle it. If configured with an LRU policy, it silently evicts the least recently used keys, which can include your deduplication state, so you start re-crawling URLs. Always monitor memory and use Bloom filters for massive sets.
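One way to monitor memory from the producer side is to read the `used_memory` and `maxmemory` fields of INFO and back off before Redis reaches its limit. This is a sketch assuming `r` is a redis-py client; the 85% threshold and the function names are hypothetical.

```python
def memory_usage_ratio(r) -> float:
    """Return used_memory / maxmemory, or 0.0 when no maxmemory limit is set."""
    info = r.info("memory")
    limit = int(info.get("maxmemory", 0))
    return int(info["used_memory"]) / limit if limit else 0.0


def should_pause_producers(r, threshold: float = 0.85) -> bool:
    """Hypothetical guard: stop enqueueing new URLs once memory passes the threshold."""
    return memory_usage_ratio(r) >= threshold
```

Pausing URL discovery while workers drain the queue is one possible reaction. Alerting on the same ratio works too and does not require changes to the crawler.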
Is Redis suitable for storing the extracted HTML or JSON?
No. Storing multi-megabyte HTML payloads in Redis will saturate your network bandwidth and exhaust RAM quickly. Store extracted payloads in S3 or a blob store, and only keep the URL or object ID in Redis.
How does DataFlirt handle global rate limiting across hundreds of IPs?
We use Redis Lua scripts to evaluate token bucket algorithms atomically. Every worker requests a token for a specific target domain from Redis before dispatching the HTTP request. This ensures we never exceed the target's Crawl-delay or our internal stealth thresholds, regardless of how many workers are active.
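For readers unfamiliar with the algorithm, here is the token bucket logic as a plain in-process class. It is an illustration of the technique only: in a distributed setup the token count and last-update timestamp would live in Redis and be updated inside a Lua script, and the names below are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket; a Lua script could keep the same state in Redis."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The capacity sets how large a burst is tolerated and the refill rate sets the sustained request rate. Running the refill and the decrement as one atomic step is what prevents two workers from spending the same token.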
Can I use Redis for persistent data?
Redis supports RDB snapshots and AOF (Append Only File) for persistence, but it's fundamentally an in-memory store. If you need strong durability guarantees for your scraped data, use PostgreSQL, ClickHouse, or Snowflake.
What is a Bloom filter and why is it used in Redis?
A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set. It uses a fraction of the memory of a standard Redis Set. We use the RedisBloom module to track hundreds of millions of "seen" URLs in a few hundred megabytes of RAM (roughly 1.2 bytes per URL at a 1% false-positive rate), accepting that small, configurable error rate.