What is URL Deduplication?

URL deduplication is the process of identifying and discarding redundant links before they enter a crawler's fetch queue. Because modern web applications inject session IDs, tracking parameters, and dynamic routing into their links, a single product page might be reachable via hundreds of distinct URLs. Without an aggressive deduplication layer, your pipeline will waste proxy bandwidth fetching the same HTML repeatedly, inflating costs and polluting downstream datasets.

Crawling · Queue Management · Normalization · Bloom Filters · Cost Optimization
// 02 — definitions

Stop fetching the same page.

The mechanics of normalizing, hashing, and filtering URLs at scale so your crawler only spends proxy bandwidth on net-new content.


TL;DR

URL deduplication ensures a crawler visits each unique page exactly once per run. It involves stripping tracking parameters, sorting query strings, and checking the normalized URL against a fast lookup structure such as a Redis set or a Bloom filter. Failing to deduplicate properly is a leading cause of infinite crawl loops and blown proxy budgets.

01 Definition & structure

URL deduplication is the mechanism a crawler uses to remember which pages it has already visited or queued. When a new link is extracted from a page, it is checked against a "seen registry". If it exists, the link is discarded. If it is new, it is added to the registry and pushed to the fetch queue.
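The check-then-enqueue step described above can be sketched in a few lines. This is a minimal single-process illustration, assuming an in-memory set as the seen registry and a deque as the fetch queue; the function name `enqueue_new_links` and the example URLs are hypothetical.

```python
from collections import deque

def enqueue_new_links(links, seen, queue):
    """Push only unseen links onto the queue; return how many were added."""
    added = 0
    for url in links:
        if url in seen:
            continue        # already visited or queued: discard
        seen.add(url)       # record first, so later repeats are dropped
        queue.append(url)
        added += 1
    return added

seen = set()
queue = deque()
first = enqueue_new_links(["https://shop.com/a", "https://shop.com/b"], seen, queue)
second = enqueue_new_links(["https://shop.com/b", "https://shop.com/c"], seen, queue)
```

The second batch adds only one URL because `https://shop.com/b` is already in the registry.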

Because raw URLs are unreliable identifiers, deduplication requires a normalization step first. Without normalization, /item?id=1&ref=home and /item?ref=home&id=1 would be treated as two different pages, causing redundant fetches.

02 The Normalization Pipeline

Before checking the registry, a robust pipeline applies a sequence of transformations to the URL:

  • Fragment removal: Stripping anything after the # symbol, as fragments are client-side only.
  • Parameter filtering: Removing known analytics tags (utm_*, gclid) and session identifiers.
  • Query sorting: Alphabetizing the remaining query parameters to ensure consistent ordering.
  • Path normalization: Resolving ../ segments and standardizing trailing slashes.
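The four transformations above can be sketched with Python's standard library. This is one possible implementation, not a definitive one: the drop list `DROP_PATTERNS` is a hypothetical example, removing the trailing slash is an assumed policy, and the function expects absolute or root-relative URLs without credentials in the host part.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical drop list for illustration; real rules need tuning per target.
DROP_PATTERNS = ("utm_*", "gclid", "ref", "session", "sid")

def normalize_url(url, drop_patterns=DROP_PATTERNS):
    parts = urlsplit(url)
    # Path normalization: resolve ./ and ../ segments. normpath also removes a
    # trailing slash, which is the assumed trailing-slash policy in this sketch.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if path.startswith("//"):      # normpath preserves a leading double slash
        path = "/" + path.lstrip("/")
    # Parameter filtering, then query sorting.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(fnmatchcase(key, pattern) for pattern in drop_patterns)
    )
    # Fragment removal: rebuild the URL with an empty fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )

canonical = normalize_url("https://shop.com/item?ref=tw&id=42&session=abc#top")
```

With these assumed rules, `canonical` is `https://shop.com/item?id=42`, and `/item?id=1&ref=home` and `/item?ref=home&id=1` both normalize to `/item?id=1`.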

03 Memory Management at Scale

For small crawls, a simple in-memory hash set is sufficient. But as the crawl frontier grows into the millions, storing full URL strings becomes prohibitively expensive. Production systems hash the normalized URL (e.g., using SHA-256) and store only the fixed-size digest, which is 32 bytes for SHA-256 regardless of URL length. For extreme scale, probabilistic data structures like Bloom filters are used: at a 0.1% false-positive rate, a Bloom filter can track 100 million URLs in roughly 180 MB of RAM.
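Storing digests instead of URL strings might look like the sketch below. The class name `HashedSeenSet` and its `add_if_new` method are hypothetical; the only library call used is `hashlib.sha256`.

```python
import hashlib

def url_digest(normalized_url: str) -> bytes:
    """Return the 32-byte SHA-256 digest of an already-normalized URL."""
    return hashlib.sha256(normalized_url.encode("utf-8")).digest()

class HashedSeenSet:
    """Seen registry that keeps fixed-size digests rather than URL strings."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, normalized_url: str) -> bool:
        """Return True if the URL was new (and is now recorded), else False."""
        digest = url_digest(normalized_url)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

    def __len__(self):
        return len(self._digests)

registry = HashedSeenSet()
was_new = registry.add_if_new("https://shop.com/item?id=42")
was_new_again = registry.add_if_new("https://shop.com/item?id=42")
```

Every entry costs the same 32 bytes of payload whether the URL is 40 characters or 400, which is what makes memory use predictable.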

04 How DataFlirt handles it

We decouple normalization from the crawl workers. Our edge nodes extract raw URLs and pass them to a dedicated normalization service. This service applies target-specific rules (e.g., "keep the variant parameter for Target A, drop it for Target B") and checks our centralized Redis cluster. By centralizing the seen state, we can scale our fetch workers horizontally without any risk of duplicate work.
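A target-specific ruleset of the kind described here could be expressed as plain configuration. The sketch below is an assumption about shape only, not DataFlirt's actual service: the hostnames, the `TARGET_RULES` table, and the parameter names are all illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical per-target rules; hostnames and parameter names are illustrative.
TARGET_RULES = {
    "target-a.example": {"drop": {"ref", "session"}},             # keeps "variant"
    "target-b.example": {"drop": {"ref", "session", "variant"}},  # drops "variant"
}
DEFAULT_RULE = {"drop": {"ref", "session"}}

def canonicalize(url):
    """Apply the drop list registered for the URL's host, then sort the query."""
    parts = urlsplit(url)
    rule = TARGET_RULES.get(parts.hostname or "", DEFAULT_RULE)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in rule["drop"]
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )

a = canonicalize("https://target-a.example/p?variant=red&ref=x")
b = canonicalize("https://target-b.example/p?variant=red&ref=x")
```

Here `a` keeps `variant=red` while `b` collapses to the bare product path, so the same parameter is content on one target and noise on another.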

05 The Infinite Loop Trap

The most dangerous failure mode of poor deduplication is the infinite loop. This happens when a site dynamically generates links with unique parameters on every page load (like a timestamp or a randomized token). If your normalizer doesn't strip these dynamic parameters, every link looks unique, the seen registry grows without bound, and your crawler keeps fetching the same template until it exhausts your proxy budget.
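One way to spot such parameters before they cause a loop is to load the same page twice and compare the links it emits. The heuristic below is a hypothetical sketch, not a method described in this article: a parameter is flagged when none of its values from the first load reappear in the second.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit

def volatile_params(first_load, second_load):
    """Return (path, param) pairs whose values changed completely between loads."""
    def collect(links):
        values = defaultdict(set)
        for url in links:
            parts = urlsplit(url)
            for key, value in parse_qsl(parts.query, keep_blank_values=True):
                values[(parts.path, key)].add(value)
        return values

    a, b = collect(first_load), collect(second_load)
    # Stable parameters (ids, variants) share values across loads; per-load
    # tokens and timestamps share none.
    return {slot for slot in a.keys() & b.keys() if a[slot].isdisjoint(b[slot])}

first = ["https://shop.com/item?id=1&ts=100", "https://shop.com/item?id=2&ts=100"]
second = ["https://shop.com/item?id=1&ts=200", "https://shop.com/item?id=2&ts=200"]
flagged = volatile_params(first, second)
```

In this example only `ts` on `/item` is flagged. The output is best treated as a list of candidates for the drop rules, to be reviewed by a person before anything is stripped.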

// 03 — the math

How much memory does a queue need?

Storing millions of seen URLs in memory requires efficient data structures. For very large crawls, DataFlirt uses Bloom filters, accepting a tiny false-positive rate in exchange for a large reduction in memory.

Naive Set Memory = n × avg_url_length
10M URLs at 100 bytes each = ~1 GB of RAM just for the seen set. (Source: standard hash set)

Bloom Filter Size: m = −(n · ln p) / (ln 2)²
m bits for n items with false-positive rate p. 10M URLs at 0.1% FPR ≈ 18 MB (about 17 MiB). (Source: Burton H. Bloom, 1970)

Deduplication Ratio = 1 − (unique_fetched / total_discovered)
A high ratio means the site has poor internal linking hygiene. The DataFlirt average is 68%. (Source: DataFlirt Crawl Metrics)
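The sizing arithmetic can be checked numerically. The sketch below uses the standard Bloom filter formulas m = −n·ln(p)/(ln 2)² for the bit count and k = (m/n)·ln 2 for the number of hash functions; the function names are hypothetical, and the 32-of-100 figures in the ratio example are made-up inputs chosen to land on 68%.

```python
import math

def bloom_bits(n, p):
    """Bits needed for n items at false-positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def optimal_hash_count(m, n):
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round(m / n * math.log(2)))

def dedup_ratio(unique_fetched, total_discovered):
    """Share of discovered URLs that were dropped as duplicates."""
    return 1 - unique_fetched / total_discovered

n, p = 10_000_000, 0.001
m = bloom_bits(n, p)
megabytes = m / 8 / 1_000_000     # decimal megabytes
k = optimal_hash_count(m, n)
ratio = dedup_ratio(32, 100)
```

For 10 million URLs at a 0.1% false-positive rate this gives just under 18 MB and 10 hash functions, compared with roughly 1 GB for the naive set of raw strings.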
// 04 — normalization pipeline

Stripping the noise before the hash.

A raw URL discovered in the DOM passes through our normalization ruleset before being checked against the seen registry.

URL parsing · query sorting · Redis check
// 1. raw discovered URL
url.raw: "https://shop.com/item?ref=tw&id=42&session=abc#top"

// 2. normalization phase
rule.strip_fragment: applied
rule.drop_params: ["ref", "session", "utm_*"]
rule.sort_query: applied

// 3. canonical form
url.canonical: "https://shop.com/item?id=42"

// 4. seen check (Redis)
hash.sha256: "8f4e...2b1a"
redis.sismember: 1 (true)
action: DROP // already in queue
// 05 — duplication sources

Where duplicate URLs come from.

The most common reasons a single logical page generates multiple distinct URLs in a crawl queue. Ranked by frequency across DataFlirt's e-commerce pipelines.

Pipelines analyzed: 412 active
Avg dedupe rate: 68.4%
Updated: 2026-05-19
  1. Tracking parameters (utm_*, ref, source): marketing tags appended to internal links.
  2. Session IDs (sid, session): state tracking injected into href attributes.
  3. Query parameter order (?a=1&b=2 vs ?b=2&a=1): functionally identical, but the raw strings hash to different values.
  4. Pagination variants (page=1 vs omitted): the first page often has multiple valid routes.
  5. Protocol / WWW (http vs https, www vs bare host): mixed absolute links across the target site.
// 06 — our architecture

Normalize aggressively, hash quickly, scale horizontally.

DataFlirt handles deduplication at the edge of the crawl frontier. Before a URL ever touches the distributed message queue, it passes through a target-specific normalization ruleset. We strip known junk parameters, sort the remainder, and hash the result. For crawls under 5 million pages, we use Redis sets for absolute precision. For massive 100M+ page discovery crawls, we switch to distributed Bloom filters, trading a 0.001% false-positive rate for a 98% reduction in memory overhead.

Deduplication worker metrics

Live telemetry from a URL normalization node on a real estate pipeline.

worker.id norm-node-04
urls.processed 14,205/sec
urls.dropped 9,841/sec
dedupe.ratio 69.2%
redis.latency 0.8ms
bloom.fpr 0.0001%
queue.status healthy

// 07 — FAQ

Common questions.

About normalization rules, Bloom filters, infinite loops, and how DataFlirt manages massive crawl frontiers without blowing up memory.

Why not just use a standard Python set for seen URLs?
A standard set works fine for 10,000 URLs. At 10 million URLs, the memory overhead of string objects in Python becomes a bottleneck, and a single-node set prevents you from scaling crawler workers horizontally. Production pipelines use Redis or Bloom filters so multiple workers can check the same seen registry concurrently.
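A shared registry typically relies on an atomic "add and tell me if it was new" operation. In Redis, SADD returns the number of members actually added, so a single call both checks and records. The sketch below assumes a client object with a redis-py style `sadd(name, *values)` method; `InMemorySetStore` is a stand-in so the example runs without a server, and the key name `crawl:seen` is hypothetical.

```python
import hashlib

class InMemorySetStore:
    """Stand-in exposing only sadd(); replace with a real Redis client in production."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        target = self._sets.setdefault(name, set())
        before = len(target)
        target.update(values)
        return len(target) - before   # like SADD: count of newly added members

def claim_url(client, normalized_url, key="crawl:seen"):
    """Return True only for the first worker to register this URL."""
    digest = hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()
    return client.sadd(key, digest) == 1

store = InMemorySetStore()
worker_one = claim_url(store, "https://shop.com/item?id=42")
worker_two = claim_url(store, "https://shop.com/item?id=42")
```

Because the check and the write are one operation, two workers that discover the same link at the same moment cannot both enqueue it.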
What is a Bloom filter and why use it for deduplication?
A Bloom filter is a probabilistic data structure that tells you if an item is 'definitely not' in a set, or 'probably' in a set. It uses a fraction of the memory of a standard hash set. The trade-off is a tiny false-positive rate (e.g., 0.001%), meaning you might accidentally skip 1 in 100,000 valid pages. For massive discovery crawls, that trade-off is highly profitable.
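To make the "definitely not" versus "probably" behavior concrete, here is a small teaching implementation. It is a sketch, not production code: it derives its bit positions from one SHA-256 digest using double hashing (a swapped-in simplification rather than independent hash functions), and the class name and parameters are hypothetical.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.size = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force a non-zero step
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any unset bit means "definitely not seen"; all set means "probably seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    seen.add(f"https://shop.com/item?id={i}")
```

Every URL that was added is always reported as present. A URL that was never added is usually reported as absent, and occasionally reported as present, which is the false positive that causes a valid page to be skipped.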
How do you handle URLs that look different but serve the same content?
URL normalization. Before hashing, we strip fragments (#top), force lowercase on hostnames, remove trailing slashes, and drop known tracking parameters (utm_source, session_id). We also alphabetically sort query parameters so ?color=red&size=L hashes identically to ?size=L&color=red.
What happens if you deduplicate too aggressively?
You miss data. If you configure your normalizer to strip a parameter like variant_id thinking it's a tracking code, your crawler will only fetch the first color of a product and drop the rest as duplicates. Normalization rules must be tailored and tested per target.
How does DataFlirt prevent infinite crawl loops?
Infinite loops usually occur when a site dynamically generates URLs (e.g., a calendar appending ?month=next infinitely). We prevent this using strict URL depth limits, pattern-based exclusion rules (Disallow: /*?month=*), and anomaly detection that flags when a single domain generates an unexpected spike in unique URLs.
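The three guards mentioned in this answer can be combined into a single admission check. The sketch below is illustrative only: `CrawlGuard`, its default limits, and the fixed per-host budget (a crude stand-in for real anomaly detection) are assumptions, and it expects to be called only for URLs that have already passed deduplication.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

class CrawlGuard:
    def __init__(self, max_depth=6, exclude=(r"\?month=",), host_budget=10_000):
        self.max_depth = max_depth
        # Regex equivalent of an exclusion rule such as Disallow: /*?month=*
        self.exclude = [re.compile(pattern) for pattern in exclude]
        self.host_budget = host_budget
        self.unique_per_host = Counter()

    def allow(self, url, depth):
        """Return True if the URL may be queued."""
        if depth > self.max_depth:
            return False                      # depth limit
        if any(pattern.search(url) for pattern in self.exclude):
            return False                      # pattern-based exclusion
        host = urlsplit(url).hostname or ""
        self.unique_per_host[host] += 1
        # Budget exceeded: treat as a spike in unique URLs for this host.
        return self.unique_per_host[host] <= self.host_budget

guard = CrawlGuard(max_depth=2, host_budget=2)
```

A production system would more likely raise an alert than silently reject when the budget is hit, since a spike may also indicate a large legitimate catalogue.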
Do you deduplicate across different crawl runs?
It depends on the pipeline objective. For a one-off discovery crawl, the seen set persists for the whole run. For a daily incremental scrape, we clear the seen set at the start of each day, but use HTTP conditional requests (ETags) to avoid re-downloading HTML that hasn't changed since yesterday.
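With conditional requests, the crawler sends the ETag it stored on the previous run in an If-None-Match header, and a server that supports validation replies 304 Not Modified with no body when the page is unchanged. The bookkeeping might look like the sketch below; the cache layout and function names are hypothetical, and the HTTP call itself is left out so the example stays self-contained.

```python
def conditional_headers(cache, url):
    """Build request headers from the ETag stored on a previous run, if any."""
    entry = cache.get(url)
    if entry and entry.get("etag"):
        return {"If-None-Match": entry["etag"]}
    return {}

def record_response(cache, url, status, etag=None, body=None):
    """Return (body, changed). A 304 reuses the cached body."""
    if status == 304:
        return cache[url]["body"], False
    cache[url] = {"etag": etag, "body": body}
    return body, True

cache = {}
url = "https://shop.com/item?id=42"

# Day 1: nothing cached, so the request is unconditional.
day1_headers = conditional_headers(cache, url)
_, day1_changed = record_response(cache, url, 200, etag='"v1"', body="<html>v1</html>")

# Day 2: the stored ETag is sent; the server answers 304.
day2_headers = conditional_headers(cache, url)
day2_body, day2_changed = record_response(cache, url, 304)
```

The URL is still requested on day 2, but the response carries no HTML, which is where the bandwidth saving comes from.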