What is a Deduplication Queue?

A deduplication queue is the stateful gatekeeper of a web crawler, responsible for ensuring that a specific URL or payload is only processed once per crawl cycle. It sits between the link discovery phase and the fetch execution phase. Without a robust deduplication layer, crawlers fall into infinite pagination loops, waste proxy bandwidth on redundant header links, and inadvertently launch denial-of-service attacks against target servers.

Redis · Bloom Filter · Crawl Frontier · Idempotency · Cost Optimization
// 02 — definitions

Stop fetching
the same bytes.

The memory layer that remembers what your crawler has already seen, preventing infinite loops and wasted proxy bandwidth.

TL;DR

A deduplication queue checks every discovered URL against a fast, in-memory ledger before allowing it to be fetched. It is the primary defense against spider traps, pagination loops, and redundant compute. In a distributed crawl, this queue must be globally accessible and capable of handling tens of thousands of existence checks per second.

01 Definition & structure
A deduplication queue (or "seen set") is an in-memory data structure that records every URL or payload identifier a crawler has processed. Before a new job is added to the active fetch queue, its identifier is checked against this set. If it exists, the job is dropped. It is typically implemented using Redis Sets for exact matching, or Bloom filters for memory-constrained, massive-scale crawls.
02 How it works in practice
When a worker parses an HTML page, it might extract 200 links. It normalizes these links, hashes them (e.g., SHA-256), and sends a batch request to the deduplication store. The store replies with a boolean array indicating which hashes are genuinely new. The worker discards the known hashes and pushes only the new ones to the message broker (like RabbitMQ or Kafka) for fetching.
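The batch check described above can be sketched as follows. This is a minimal illustration, not DataFlirt's implementation: `InMemorySeenStore` is a hypothetical stand-in for the shared store (a Redis Set in production), and `filter_new_links` and `url_hash` are names invented for the example.

```python
import hashlib


class InMemorySeenStore:
    """Hypothetical stand-in for the shared dedup store (e.g. a Redis Set)."""

    def __init__(self):
        self._seen = set()

    def add_batch(self, hashes):
        """Record each hash; return one flag per input: True if new, False if seen."""
        flags = []
        for h in hashes:
            is_new = h not in self._seen
            self._seen.add(h)
            flags.append(is_new)
        return flags


def url_hash(normalized_url):
    """SHA-256 hex digest of an already-normalized URL."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


def filter_new_links(store, normalized_urls):
    """Return only the URLs the store has not recorded before."""
    # Collapse duplicates within the page first, preserving discovery order.
    unique_urls = list(dict.fromkeys(normalized_urls))
    flags = store.add_batch([url_hash(u) for u in unique_urls])
    return [u for u, is_new in zip(unique_urls, flags) if is_new]


store = InMemorySeenStore()
page_one = ["https://example.com/", "https://example.com/a", "https://example.com/a"]
page_two = ["https://example.com/", "https://example.com/b"]
print(filter_new_links(store, page_one))  # both distinct URLs are new
print(filter_new_links(store, page_two))  # only /b is new
```

Only the URLs returned by `filter_new_links` would be pushed to the message broker; everything else is dropped before any fetch is scheduled.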
03 The normalization prerequisite
A deduplication queue is entirely blind to semantics; it only compares bytes. If you feed it /item?color=red and /item?color=blue, it sees two different jobs. If the target site serves the exact same HTML for both, you've wasted a request. Robust deduplication requires a strict, target-specific URL normalization layer that strips irrelevant query parameters, standardizes protocols, and removes anchor fragments before the hash is generated.
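A normalization step of this kind might look like the sketch below, built on Python's standard `urllib.parse`. The `STRIP_PARAMS` list and the `normalize_url` name are assumptions for illustration; which parameters are safe to strip, and whether `www.` or trailing slashes can be collapsed, has to be decided per target.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed per-target rules; these parameter names are illustrative only.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "session_id", "ref"}


def normalize_url(url, strip_params=STRIP_PARAMS, force_scheme="https"):
    """Return a canonical form of url suitable for hashing."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not a default http/https port.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    # Drop ignorable parameters, then sort the rest so ordering cannot change the hash.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in strip_params
    )
    path = parts.path or "/"
    # The fragment is replaced with "" because it is never sent to the server.
    return urlunsplit((force_scheme, host, path, urlencode(query), ""))


print(normalize_url("HTTP://Example.com/item?utm_source=twitter&color=red#reviews"))
```

Note that `color` survives in this sketch: whether `/item?color=red` and `/item?color=blue` are the same page is a property of the target site, so it belongs in the per-target rule set rather than in generic code.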
04 How DataFlirt handles it
We use a multi-layered approach. Every worker node runs an in-memory LRU cache to instantly drop highly repetitive links (like the site logo or main navigation) without hitting the network. For global state, we use a sharded Redis cluster executing atomic SADD operations via pipelines. For recurring pipelines, we namespace the Redis keys by crawl-epoch, allowing old state to expire automatically via TTLs without requiring manual cache invalidation.
05 The spider trap problem
Without deduplication, crawlers fall into "spider traps." A classic example is a dynamically generated calendar widget where the "Next Month" link generates a new, valid URL infinitely (e.g., /events?month=12&year=2099). While deduplication stops exact URL loops, mitigating dynamic spider traps requires combining the deduplication queue with depth limits and strict URL pattern matching.
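One way to combine the three guards is sketched below. `MAX_DEPTH`, the `ALLOWED_PATHS` patterns, and `should_enqueue` are hypothetical names and values chosen for the example; a plain Python set stands in for the deduplication store.

```python
import re
from urllib.parse import urlsplit

MAX_DEPTH = 6  # assumed limit; tune per target
# Hypothetical allowlist: only these path shapes are considered worth crawling.
ALLOWED_PATHS = [
    re.compile(r"^/products/[\w-]+$"),
    re.compile(r"^/category/[\w-]+$"),
]


def should_enqueue(url, depth, seen):
    """Apply depth limit, pattern allowlist, then exact deduplication."""
    if depth > MAX_DEPTH:
        return False  # too many hops from the seed URL
    path = urlsplit(url).path
    if not any(pattern.match(path) for pattern in ALLOWED_PATHS):
        return False  # e.g. /events?month=12&year=2099 never matches
    if url in seen:
        return False  # exact duplicate
    seen.add(url)
    return True


seen = set()
print(should_enqueue("https://example.com/products/blue-mug", 2, seen))
print(should_enqueue("https://example.com/events?month=12&year=2099", 2, seen))
```

The cheap checks run first, so trap URLs are rejected without ever being written to the seen set.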
// 03 — the math

How much memory
does memory cost?

Storing millions of seen URLs requires efficient data structures. For small crawls, a Redis Set of SHA-256 hashes works. For billion-page crawls, DataFlirt shifts to Bloom filters to keep memory bounded.

Redis Set Memory (Exact): M = N × (hash_bytes + redis_overhead)
10M URLs hashed via SHA-256 (32 bytes) + overhead ≈ 850 MB. Source: Redis memory profiling
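As a worked check of that figure, the sketch below plugs numbers into the formula. The 53-byte overhead is not a measured Redis value; it is back-derived from the 850 MB example (850 MB / 10M members = 85 bytes per member, minus the 32-byte hash) and assumes decimal megabytes. Real per-member overhead varies with Redis version and encoding.

```python
def redis_set_bytes(n_urls, hash_bytes=32, overhead_bytes=53):
    """M = N * (hash_bytes + redis_overhead), in bytes.

    overhead_bytes=53 is an assumption inferred from the 850 MB example,
    not a documented Redis constant.
    """
    return n_urls * (hash_bytes + overhead_bytes)


print(redis_set_bytes(10_000_000) / 1e6)     # 850.0 (MB)
print(redis_set_bytes(1_000_000_000) / 1e9)  # 85.0 (GB) at a billion URLs
```

The second line shows why an exact set stops being practical at web scale: the cost grows linearly with the number of URLs.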
Bloom Filter False Positive Rate: $p \approx \left(1 - e^{-kn/m}\right)^{k}$
Where k = hash functions, n = elements, m = bits. Trades exactness for a fixed memory budget of m bits, chosen up front and independent of URL length. Source: Burton H. Bloom, 1970
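To make the formula concrete, the sketch below evaluates it alongside the standard sizing formulas that follow from it: m = −n·ln(p) / (ln 2)² bits for a target rate p, and k = (m/n)·ln 2 hash functions. The function names are invented for the example.

```python
import math


def false_positive_rate(k, n, m):
    """p ≈ (1 - e^(-kn/m))^k for k hash functions, n elements, m bits."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_bits(n, p):
    """Bits needed to hold n elements at target false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hash_count(n, m):
    """Number of hash functions that minimizes p for a given m and n."""
    return max(1, round(m / n * math.log(2)))


n = 1_000_000_000          # one billion URLs
m = optimal_bits(n, 0.01)  # target: 1% false positives
k = optimal_hash_count(n, m)
print(m / 8 / 1e9, "GB")   # roughly 1.2 GB of bits
print(k, false_positive_rate(k, n, m))
```

At a 1% target the filter needs a little under 10 bits per URL, no matter how long the URLs are.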
DataFlirt Dedupe Efficiency: E = 1 − (redundant_fetches / total_discovered_links)
E > 0.98 across our active discovery pipelines as of v2026.5. Source: Internal SLO
// 04 — the frontier trace

Filtering 10k links
in 40 milliseconds.

A live trace of a worker parsing a category page, normalizing discovered URLs, and checking them against the global deduplication store before enqueuing.

Redis Pipeline · SHA-256 · URL Normalization
edge.dataflirt.io — live
// link extraction complete
worker.id: "node-042-eu"
links.extracted: 142

// normalization phase
strip_params: ["utm_source", "session_id", "ref"]
force_scheme: "https"
drop_fragments: true

// redis pipeline check (SADD)
redis.cmd: "SADD crawl:IN:seen hash1 hash2 ... hash142"
redis.latency: 4.2ms
redis.response: [1, 0, 0, 1, 0, 1, ...] // 1 = new, 0 = seen

// enqueue phase
links.dropped: 138 // already seen (header links, footer, etc.)
links.enqueued: 4 // new product pages
queue.depth: 842,105
// 05 — failure modes

Why duplicates
still leak through.

Even with a fast deduplication queue, redundant fetches happen if the normalization logic fails to account for target-specific URL structures before hashing.

PIPELINES MONITORED · 300+ active
DEDUPE ENGINE · Redis Cluster
UPDATED · 2026-05-19
01 Unstripped tracking parameters
utm_*, gclid · Changes the hash, bypassing the queue

02 Session IDs in URLs
?sid=12345 · Dynamic per fetch, creates infinite loops

03 Protocol & WWW mismatch
http vs https · example.com vs www.example.com

04 Trailing slash inconsistencies
/path vs /path/ · Server treats as same, hash treats as different

05 Distributed race conditions
check-then-act · Two workers check the same URL simultaneously
// 06 — our architecture

Global state,
local latency.

DataFlirt runs a two-tier deduplication architecture. Workers maintain a small local LRU cache that immediately absorbs links repeated from page to page (like the 'Home' link found on every single page), backed by a sharded Redis cluster for global state. This prevents hot-key bottlenecks when 500 concurrent workers all discover the exact same navigation link at the exact same millisecond. We hash URLs using SHA-256 post-normalization, ensuring the Redis keyspace remains uniform regardless of URL length.
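The two-tier lookup can be sketched as below. This is an illustration under assumptions, not the production code: `LocalLRU`, `TwoTierDedupe`, and the `global_add` callback are hypothetical names, and a plain set stands in for the sharded Redis cluster.

```python
from collections import OrderedDict


class LocalLRU:
    """Small per-worker cache of recently seen URL hashes."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._items = OrderedDict()

    def seen_before(self, key):
        """Return True on a hit; otherwise record the key and return False."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return True
        self._items[key] = True
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return False


class TwoTierDedupe:
    def __init__(self, global_add, capacity=10_000):
        # global_add(key) must return True if the key was new in the shared store.
        self.local = LocalLRU(capacity)
        self.global_add = global_add
        self.global_calls = 0

    def is_new(self, key):
        if self.local.seen_before(key):
            return False  # answered locally, no network round trip
        self.global_calls += 1
        return self.global_add(key)


shared = set()  # stand-in for the global store


def global_add(key):
    is_new = key not in shared
    shared.add(key)
    return is_new


dedupe = TwoTierDedupe(global_add, capacity=2)
print(dedupe.is_new("home"), dedupe.is_new("home"), dedupe.global_calls)
```

A local miss is never trusted as "new": the LRU only short-circuits known duplicates, so evicting an entry costs one extra round trip but never a duplicate fetch.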

dedupe-worker-04.log

Metrics from a single worker node during a high-concurrency discovery crawl.

worker.status: active // healthy
local_lru.hits: 4,291/sec // saved network I/O
redis.sadd_ops: 850/sec
redis.p99_latency: 6.1ms
urls.normalized: 100% // strict mode
race_conditions: 0.01% // mitigated via pipeline
memory.utilization: 84% // stable

// 07 — FAQ

Common
questions.

About memory management, URL normalization, distributed state, and how DataFlirt prevents infinite crawl loops.

Should I use a Bloom filter or a Redis Set?
For crawls under 50 million URLs, use a Redis Set. It provides exact deduplication (zero false positives) and allows you to easily inspect the frontier. For web-scale crawls (billions of URLs), the memory footprint of a Set becomes prohibitive, and a Bloom filter is the practical choice. You accept that a small fraction of genuinely new URLs will be misreported as already seen and skipped (false positives) in exchange for a massive reduction in RAM. A Bloom filter never produces false negatives, so a URL that has been recorded is never fetched twice.
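For readers who have not worked with one, a Bloom filter can be sketched in a few lines. This is a teaching example, not a production filter: the `BloomFilter` class and its `add_if_new` method are invented names, and the k bit positions are derived by double hashing from a single SHA-256 digest, which is one common choice among several.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; m_bits and k_hashes are chosen up front."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add_if_new(self, item):
        """Set the item's bits; return True if at least one bit was previously unset."""
        was_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                was_new = True
                self.bits[byte] |= mask
        return was_new


# About 9,586 bits and 7 hashes hold ~1,000 URLs at roughly a 1% false positive rate.
bf = BloomFilter(m_bits=9586, k_hashes=7)
print(bf.add_if_new("https://example.com/a"))  # True: first sighting
print(bf.add_if_new("https://example.com/a"))  # False: all bits already set
print(len(bf.bits), "bytes")
```

A `False` result means "probably seen": occasionally a new URL will collide on all k bits and be skipped, which is the false positive the answer above describes.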
How do you handle incremental crawls where URLs need to be re-fetched?
We don't clear the deduplication queue; we use TTLs (Time-To-Live) or epoch-prefixed keys. If a product catalog needs to be re-crawled every 24 hours, the Redis key for a URL is set to expire in 23 hours. Redis TTLs apply to whole keys rather than to individual Set members, so per-URL expiry means storing each URL hash as its own key; with a shared Set, the TTL goes on the epoch-scoped Set key instead. Alternatively, we append the crawl run ID to the hash (e.g., hash(url + "run_42")), creating a fresh deduplication space for the new cycle without dropping historical state.
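Both scoping strategies reduce to small helper functions. The sketch below is hypothetical: the key layout `crawl:<project>:<epoch>:seen` and the function names are assumptions for illustration, not DataFlirt's actual naming scheme.

```python
import hashlib


def seen_set_key(project, epoch):
    """Epoch-prefixed key: each crawl cycle gets its own Set, which can carry a TTL."""
    return f"crawl:{project}:{epoch}:seen"


def run_scoped_hash(normalized_url, run_id):
    """Mix the run ID into the hash so each run has its own dedup space."""
    # A NUL separator keeps ("a", "b1") and ("ab", "1") from producing the same input.
    data = f"{run_id}\x00{normalized_url}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


print(seen_set_key("IN", "2026-05-19"))
print(run_scoped_hash("https://example.com/item", "run_42")[:16])
print(run_scoped_hash("https://example.com/item", "run_43")[:16])
```

The same URL hashes differently under `run_42` and `run_43`, so a new run sees it as unseen while the previous run's state stays intact for auditing.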
What is URL normalization and why is it critical here?
Normalization is the process of stripping meaningless variations from a URL before hashing it. https://site.com/item?utm_source=twitter and https://site.com/item return the exact same HTML, but hash differently. If you don't strip tracking parameters, session IDs, and trailing slashes before checking the deduplication queue, the queue is useless.
How does DataFlirt handle distributed race conditions?
If two workers discover the same new URL at the exact same time, a naive "check-then-act" pattern will result in both workers enqueuing it. We use Redis SADD (Set Add), which is atomic. SADD returns 1 if the element was added, and 0 if it already existed. The worker only enqueues the URL if Redis returns 1.
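The difference between check-then-act and an atomic add can be demonstrated without a Redis server. In the sketch below, `FakeSetStore` is a hypothetical thread-safe stand-in whose `sadd` mimics the single-member return value of Redis SADD (1 if added, 0 if already present); `enqueue_if_new` is likewise an invented name.

```python
import threading


class FakeSetStore:
    """Stand-in for Redis: sadd returns 1 if the member was added, 0 if it existed."""

    def __init__(self):
        self._sets = {}
        self._lock = threading.Lock()

    def sadd(self, key, member):
        with self._lock:  # check and insert happen as one atomic step
            members = self._sets.setdefault(key, set())
            if member in members:
                return 0
            members.add(member)
            return 1


def enqueue_if_new(store, key, url_hash, enqueue):
    """Enqueue only when the store reports the hash as newly added."""
    if store.sadd(key, url_hash) == 1:
        enqueue(url_hash)
        return True
    return False


store = FakeSetStore()
enqueued = []
workers = [
    threading.Thread(
        target=enqueue_if_new,
        args=(store, "crawl:IN:seen", "hash1", enqueued.append),
    )
    for _ in range(50)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(enqueued))  # 1: exactly one worker wins the race
```

There is no separate "does it exist?" query for a second worker to slip in behind: the write itself is the check, and its return value decides who enqueues.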
Can I deduplicate by content instead of URL?
Yes, but it happens later in the pipeline. URL deduplication prevents the HTTP request from happening at all. Content deduplication (hashing the DOM or extracted payload) happens after the fetch, and is used to detect when a site serves the same content on multiple distinct URLs (e.g., duplicate product listings). You need both.
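A post-fetch content check might look like the following sketch. The canonicalization here (collapse whitespace, lowercase) is a deliberately simple assumption; real pipelines usually fingerprint the extracted fields or a cleaned DOM. `ContentDeduper` and `content_fingerprint` are hypothetical names.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash the payload after removing trivial formatting differences."""
    canonical = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ContentDeduper:
    def __init__(self):
        self._first_url = {}  # fingerprint -> first URL that served it

    def check(self, url, text):
        """Return the URL that first served this content, or None if it is new."""
        fingerprint = content_fingerprint(text)
        if fingerprint in self._first_url:
            return self._first_url[fingerprint]
        self._first_url[fingerprint] = url
        return None


deduper = ContentDeduper()
print(deduper.check("/item?color=red", "Blue  Mug\n$12"))   # None: new content
print(deduper.check("/item?color=blue", "blue mug $12"))    # /item?color=red
```

Recording which URL first served a payload also gives the normalization layer feedback: repeated hits for the same pair of URL shapes suggest a parameter that could be stripped.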
What happens if the deduplication queue runs out of memory?
If Redis reaches its memory limit under an eviction policy such as allkeys-lru, it starts evicting keys. The crawler then "forgets" URLs it has already seen and re-fetches them, which can reopen loops that deduplication had closed. We monitor Redis memory strictly and use maxmemory-policy noeviction for dedupe clusters. Under noeviction, Redis rejects writes with an out-of-memory error once the limit is hit instead of dropping data, so the crawl pauses and alerts an engineer to scale the cluster, rather than silently looping.