
What is Scrapy-Redis?

Scrapy-Redis is a library that replaces Scrapy's default in-memory request queue and duplicate filter with a centralized Redis backend. It enables distributed crawling by allowing multiple Scrapy spiders across different machines to share a single, persistent state. Without it, scaling a Scrapy project means partitioning URLs manually; with it, you get a unified, fault-tolerant cluster where worker nodes can be spun up or killed without losing the crawl state or duplicating requests.

Distributed Crawling · Redis · Message Queue · Python · State Persistence
// 02 — definitions

State across the cluster.

How a simple Python library turns a single-node scraper into a distributed, fault-tolerant crawling fleet.


TL;DR

Scrapy-Redis swaps Scrapy's local scheduler and dupefilter for Redis-backed equivalents. It allows dozens of spider instances to consume from the same URL queue and check the same set of seen fingerprints. It's the standard path for scaling Python scraping pipelines beyond a single machine's CPU and memory limits.

01 Definition & structure
Scrapy-Redis is an extension for the Scrapy framework that replaces its local, memory-bound components with Redis-backed equivalents. Specifically, it overrides the scheduler (which holds the queue of URLs to visit) and the dupefilter (which tracks which URLs have already been seen). By moving these components to a centralized Redis instance, multiple Scrapy processes can run concurrently on different servers while acting as a single, unified crawler.
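
As a rough illustration, the two overrides described above are usually switched on from the project's settings module. The sketch below uses the class paths and setting names documented by the scrapy-redis project; the Redis URL is a placeholder, so adjust it for your own deployment.

```python
# settings.py (sketch): swap Scrapy's local scheduler and dupefilter
# for the Redis-backed classes shipped with scrapy-redis.

# Scheduler that stores the request queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Duplicate filter that stores request fingerprints in a Redis SET.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter in Redis when the spider closes,
# so a later run can resume instead of starting over.
SCHEDULER_PERSIST = True

# Placeholder address: point this at your own Redis instance.
REDIS_URL = "redis://localhost:6379/0"
```

Every worker that loads the same settings and points at the same Redis instance joins the same crawl.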
02 How it works in practice
You start by pushing a seed URL into a Redis list. All connected Scrapy workers poll this list. One worker pops the URL, fetches the page, and extracts new links. Instead of queuing those new links locally, the worker pushes them back into Redis. Before a link is added to the queue, the worker's scheduler checks the request fingerprint against the shared dupefilter set in Redis to ensure it hasn't been seen yet. This cycle continues, allowing you to add or remove worker nodes at any time without disrupting the crawl.
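
The cycle above can be sketched with a toy, single-process model. This is not the scrapy-redis code: a `deque` and a `set` stand in for the shared Redis list and dupefilter set, and the `LINKS` graph is a hypothetical replacement for real HTTP fetches.

```python
from collections import deque

# Toy in-process stand-ins for the two shared Redis structures.
queue = deque()   # plays the role of the shared request list
seen = set()      # plays the role of the dupefilter SET

# Hypothetical link graph used instead of real HTTP fetches.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

def enqueue(url):
    """Queue a URL only if it has not been seen before."""
    if url in seen:
        return False
    seen.add(url)
    queue.append(url)
    return True

def run_worker(name, crawled):
    """Pop one URL, 'fetch' it, and push its unseen links back to the queue."""
    if not queue:
        return False
    url = queue.popleft()
    crawled.append((name, url))
    for link in LINKS[url]:
        enqueue(link)
    return True

crawled = []
enqueue("/")  # seed URL
workers = ["worker-1", "worker-2"]
turn = 0
# Alternate between workers until the shared queue is empty.
while run_worker(workers[turn % len(workers)], crawled):
    turn += 1

print(crawled)
```

Each URL is crawled exactly once even though two workers share the queue, because the seen-check happens before anything is enqueued.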
03 The deduplication bottleneck
The default Scrapy-Redis duplicate filter uses a Redis SET to store the SHA1 fingerprint of every request as a 40-character hex string. This makes the filter exact (no false positives short of a hash collision), but it scales poorly. A crawl of 100 million pages will consume over 4 GB of Redis RAM just for the duplicate filter. For enterprise-scale crawls, this memory footprint becomes the primary bottleneck, pushing engineering teams toward probabilistic data structures such as Bloom filters.
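
To make the per-URL cost concrete, here is a simplified fingerprint sketch. It is not Scrapy's exact fingerprinting routine (which also canonicalizes the URL); `simple_fingerprint` and `is_new` are hypothetical helpers, and a Python `set` stands in for the Redis SET.

```python
import hashlib

def simple_fingerprint(method, url, body=b""):
    """Simplified request fingerprint: SHA1 over method, URL and body.

    Not Scrapy's exact algorithm, which also canonicalizes the URL.
    """
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(url.encode("utf-8"))
    h.update(body)
    return h.hexdigest()

seen = set()  # stand-in for the Redis SET

def is_new(fingerprint):
    """Mimic SADD semantics: True only the first time a member is added."""
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

fp = simple_fingerprint("GET", "https://target.com/category/all")
print(len(fp))     # 40 hex characters stored per request
print(is_new(fp))  # True: first sighting
print(is_new(fp))  # False: duplicate
```

Every distinct request adds one 40-character member to the set, which is why memory grows linearly with the number of discovered URLs.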
04 How DataFlirt handles it
We run custom forks of Scrapy-Redis designed for massive scale. We replace the standard RFPDupeFilter with a Redis-backed Bloom filter, slashing memory usage by 97%. Because a single Redis thread maxes out around 100k operations per second, we shard our request queues and Bloom filters across a multi-node Redis Cluster. This architecture allows our pipelines to sustain 10,000+ requests per second without central state saturation.
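
For readers unfamiliar with the data structure, the following is a minimal in-memory Bloom filter sketch. It is not DataFlirt's implementation: the class and its parameters are illustrative, and a Redis-backed variant would typically keep the bit array in a Redis string and use `SETBIT`/`GETBIT` instead of a local `bytearray`.

```python
import hashlib
import math

class BloomFilter:
    """In-memory Bloom filter sketch (illustrative, not a production class)."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two slices of one SHA1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if at least one of its bits was previously unset."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bloom = BloomFilter(capacity=10_000, error_rate=0.01)
print(bloom.add("https://target.com/category/all"))   # True: not seen before
print("https://target.com/category/all" in bloom)     # True
print(bloom.num_bits / 10_000 / 8)                    # bytes per URL at this error rate
```

Under the standard sizing formula, a 1% target error rate costs about 9.6 bits (roughly 1.2 bytes) per URL; tighter error rates need proportionally more bits per URL, so the memory saving depends on the false-positive rate you can tolerate.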
05 Did you know?
Scrapy-Redis distributes the requests, not the parsing workload. If one worker pops a URL that returns a massive, complex HTML document, and another pops a tiny JSON API response, the first worker will block on parsing while the second worker races ahead. To maximize cluster efficiency, keep your parsing logic highly optimized; otherwise you end up with a cluster where some nodes are CPU-bound on parsing while others sit underused.
// 03 — cluster sizing

How much Redis do you need?

Scrapy-Redis memory consumption scales linearly with the number of discovered URLs. DataFlirt's infrastructure teams use these models to provision Redis clusters before launching billion-page crawls.

Redis memory (default dupefilter): M = U × 42 bytes
40-byte hex-encoded SHA1 fingerprint + Redis SET overhead per URL, where U is the number of URLs. 100M URLs ≈ 4.2 GB RAM. Source: Redis memory profiling.
Queue throughput: T = W × (1 / L_avg)
Throughput equals worker count W times the inverse of average request latency L_avg. Source: Little's Law.
DataFlirt Bloom filter savings: M_bloom = U × 1.2 bytes
97% memory reduction for 1B+ URL crawls by replacing the default SET. Source: internal benchmark.
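
The three models above translate directly into a small calculator. The function names are hypothetical, and the worker count and latency in the usage lines are made-up inputs chosen only to show the arithmetic.

```python
def dupefilter_bytes(urls, bytes_per_url=42):
    """M = U x 42 bytes: default SET-based dupefilter."""
    return urls * bytes_per_url

def bloom_bytes(urls, bytes_per_url=1.2):
    """M_bloom = U x 1.2 bytes: Bloom-filter dupefilter."""
    return urls * bytes_per_url

def throughput(workers, avg_latency_s):
    """T = W x (1 / L_avg), in requests per second."""
    return workers / avg_latency_s

GB = 1e9
urls = 100_000_000

set_gb = dupefilter_bytes(urls) / GB
bloom_gb = bloom_bytes(urls) / GB
saving = 1 - bloom_gb / set_gb

print(f"SET dupefilter:   {set_gb:.2f} GB")
print(f"Bloom dupefilter: {bloom_gb:.2f} GB")
print(f"Saving:           {saving:.1%}")
# Hypothetical cluster: 100 workers, 0.5 s average request latency.
print(f"Throughput:       {throughput(100, 0.5):.0f} req/s")
```

Because all three models are linear, doubling the URL count doubles the memory estimate, and doubling the worker count doubles the throughput estimate as long as Redis itself is not the limiting factor.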
// 04 — cluster logs

Bootstrapping a 100-node crawl.

A look at the Redis monitor and Scrapy worker logs as a distributed crawl initializes and begins consuming the shared queue.

redis-cli monitor · scrapy-redis · worker-042
// Redis: Seed URL injection
LPUSH mycrawler:start_urls "https://target.com/category/all"

// Worker 042: Boot sequence
[scrapy-redis] Connected to Redis: redis://10.0.1.55:6379
[scrapy-redis] Using RedisScheduler and RFPDupeFilter

// Redis: Worker pops URL and checks dupefilter
LPOP mycrawler:requests
SADD mycrawler:dupefilter "a9f8e7d6c5b4a3f2e1d0" -> (integer) 1

// Worker 042: Fetch and extract
[scrapy.core.engine] Crawled (200) <GET https://target.com/category/all>

// Redis: Worker pushes discovered links
LPUSH mycrawler:requests "..." "..." "..." -> (integer) 482

// Cluster status
[scrapy-redis] Queue depth: 482
[scrapy-redis] Dupefilter size: 1
Cluster throughput: 1,204 req/s
// 05 — scaling limits

Where Scrapy-Redis bottlenecks.

While Scrapy-Redis solves the single-machine problem, it introduces centralized state bottlenecks. These are the most common failure modes when scaling past 10,000 requests per second.

Pipelines monitored: 140+ clusters
Max queue depth: 2.4B URLs
Updated: 2026-05-19
01 Redis single-thread CPU limit
throughput cap · O(1) operations still saturate at ~100k ops/s
02 Dupefilter memory exhaustion
OOM risk · Sets grow linearly, crashing the Redis instance
03 Network I/O overhead
bandwidth · Payload size between workers and Redis
04 Unbalanced priority queues
latency · ZSET operations are O(log N), slowing down pops
05 Zombie workers holding state
data loss · Connection drops leaving partial reads
// 06 — production architecture

Centralized state, decentralized execution.

Standard Scrapy-Redis fails at enterprise scale because a single Redis instance cannot hold a billion SHA1 fingerprints in memory or handle the ZSET sorting overhead of a massive priority queue. DataFlirt replaces the standard RFPDupeFilter with a Redis-backed scalable Bloom filter, and shards the request queues across a Redis Cluster. This allows us to scale worker nodes dynamically based on target rate limits, without the central state ever becoming the bottleneck.
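
One simple way to picture sharding is to route each request fingerprint to a shard with a stable hash. The sketch below is an assumption for illustration, not DataFlirt's routing scheme: the shard count, key layout, and helper names are hypothetical. Redis Cluster itself assigns keys to 16384 hash slots using CRC16, so an application-level scheme like this one only decides which key a fingerprint belongs to.

```python
import zlib

NUM_SHARDS = 6  # illustrative shard count

def shard_for(fingerprint, num_shards=NUM_SHARDS):
    """Map a request fingerprint to a shard index with a stable hash."""
    return zlib.crc32(fingerprint.encode("utf-8")) % num_shards

def queue_key(spider, fingerprint):
    """Hypothetical key layout: one request queue per shard."""
    return f"{spider}:requests:{shard_for(fingerprint)}"

def dupefilter_key(spider, fingerprint):
    """Hypothetical key layout: one Bloom filter per shard."""
    return f"{spider}:bloom:{shard_for(fingerprint)}"

fp = "a9f8e7d6c5b4a3f2e1d0"  # sample fingerprint string from the log excerpt
print(queue_key("mycrawler", fp))
print(dupefilter_key("mycrawler", fp))
```

Because the same fingerprint always maps to the same shard, the duplicate check and the enqueue for a given request touch one shard only, which keeps per-node load and memory roughly proportional to 1 / NUM_SHARDS.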

scrapy_redis.settings.py

DataFlirt's optimized Scrapy-Redis configuration for high-throughput pipelines.

SCHEDULER: df_redis.scheduler.ClusterScheduler
DUPEFILTER_CLASS: df_redis.dupefilter.BloomFilter
REDIS_CLUSTER_NODES: 6 shards · 3 replicas
SCHEDULER_PERSIST: True
REDIS_BLOOM_CAPACITY: 5,000,000,000
REDIS_BLOOM_ERROR: 0.000001
PIPELINE_THROUGHPUT: 14,200 req/s


// 07 — FAQ

Common questions.

About distributed crawling, Redis memory management, and how DataFlirt scales Scrapy pipelines to billions of URLs.

What is the difference between Scrapy and Scrapy-Redis?
Scrapy is a web crawling framework that runs on a single machine, keeping its request queue and seen-URL list in local memory. Scrapy-Redis is a plugin that overrides those local components, moving the queue and duplicate filter to a Redis server. This allows multiple Scrapy instances to share the same crawl state.
Does Scrapy-Redis handle proxy rotation and anti-bot bypass?
No. Scrapy-Redis only handles state distribution (queues and deduplication). You still need separate middleware for proxy rotation, TLS fingerprinting, and CAPTCHA solving. At DataFlirt, our Scrapy-Redis workers route all outbound requests through our proprietary proxy mesh to handle anti-bot challenges.
How do you prevent Redis from running out of memory on large crawls?
The default Scrapy-Redis duplicate filter uses a Redis SET of SHA1 hashes. For 100 million URLs, this consumes about 4 GB of RAM. We replace the default filter with a Redis-backed Bloom filter, which reduces memory consumption by over 95% while keeping the false-positive rate bounded by a configurable target.
Can I pause and resume a Scrapy-Redis crawl?
Yes. Because the state lives in Redis, you can stop all your Scrapy workers, and the queue and duplicate filter will remain intact (provided SCHEDULER_PERSIST = True). When you restart the workers, they will reconnect to Redis and resume exactly where they left off.
Is scraping with a distributed cluster legal?
The legality of scraping depends on the target data, jurisdiction, and compliance with terms of service, not the architecture of your crawler. However, distributed crawlers can easily overwhelm target servers (a DDoS-like effect). We strictly enforce concurrency limits and respect robots.txt Crawl-delay directives across our entire cluster to ensure lawful, non-disruptive access.
How does DataFlirt scale Scrapy-Redis beyond a single Redis instance?
Standard Scrapy-Redis doesn't support Redis Cluster out of the box. We maintain a custom fork that shards request queues and Bloom filters across a multi-node Redis Cluster. This allows us to scale horizontally to thousands of worker nodes and billions of URLs without hitting the CPU or memory limits of a single Redis thread.