What is Recursive Crawling?

Recursive crawling is the process of extracting URLs from a fetched web page and adding them to a queue to be fetched subsequently, creating a self-sustaining discovery loop. Unlike sitemap-driven crawls that rely on publisher-provided lists, recursive spiders traverse the site's actual link graph. It is the foundational mechanism for broad web discovery, but without strict depth limits and deduplication it tends to degenerate into infinite loops and pipeline crashes.

URL Discovery · Graph Traversal · Queue Management · Depth Limits · Deduplication
// 02 — definitions

Follow the
links.

How a scraper navigates the web's graph by treating every page as both a data source and a directory of future targets.

TL;DR

Recursive crawling parses HTML for href attributes, normalises the URLs, filters them against scope rules, and pushes them to a frontier queue. It is essential for targets without sitemaps, but requires aggressive deduplication and depth bounding to prevent the crawler from getting trapped in calendar widgets or infinite pagination loops.

01 Definition & structure

Recursive crawling is a graph traversal technique where a scraper fetches a seed URL, extracts all outbound links, and adds the unseen ones to a queue. The scraper then fetches the next URL from the queue, repeating the process until the queue is empty or a predefined limit is reached.

It requires three core components:

  • Extractor — parses the DOM and pulls href attributes.
  • Normaliser & Filter — cleans the URLs and drops out-of-scope links (e.g., external domains).
  • Frontier Queue — stores the pending URLs and tracks the "visited" state to prevent infinite loops.
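
The three components above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production crawler: `crawl`, `LinkExtractor`, and the injected `fetch` callable are hypothetical names, scope is reduced to "same host as the seed", and `max_pages` is an assumed safety limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Extractor: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed, fetch, max_pages=100):
    """Crawl from `seed`; `fetch(url)` returns HTML text or None."""
    host = urlsplit(seed).netloc
    frontier = deque([seed])  # pending URLs
    visited = {seed}          # every URL that has ever been queued
    order = []                # URLs in the order they were fetched
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Normaliser: resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            # Filter: stay on the seed's host and skip URLs already queued.
            if urlsplit(link).netloc == host and link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```

Passing `fetch` in as a parameter keeps the loop testable against an in-memory dictionary of pages; a real crawl would plug in an HTTP client with timeouts and rate limits.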

02 The frontier queue

The frontier is the brain of a recursive crawl: it dictates the traversal strategy. A First-In-First-Out (FIFO) queue results in a breadth-first search, exploring the site level by level. A Last-In-First-Out (LIFO) stack results in a depth-first search, plunging down a single path until it hits a dead end. Most production crawlers use breadth-first queues, often augmented with priority scoring to fetch high-value category pages before deep product variants.
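
The effect of the queue discipline is easiest to see on a toy graph. The sketch below assumes a hypothetical six-page site (`GRAPH`) and a `traverse` helper; the only difference between the two strategies is which end of the deque is popped.

```python
from collections import deque

# Hypothetical site graph: page -> links found on that page.
GRAPH = {
    "home": ["cat-a", "cat-b"],
    "cat-a": ["a1", "a2"],
    "cat-b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def traverse(seed, strategy):
    """Return the visit order for strategy 'fifo' or 'lifo'."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO pops the oldest entry (breadth-first);
        # LIFO pops the newest entry (depth-first).
        page = frontier.popleft() if strategy == "fifo" else frontier.pop()
        order.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(traverse("home", "fifo"))  # ['home', 'cat-a', 'cat-b', 'a1', 'a2', 'b1']
print(traverse("home", "lifo"))  # ['home', 'cat-b', 'b1', 'cat-a', 'a2', 'a1']
```

With FIFO both category pages are fetched before any product page; with LIFO the crawler finishes one branch before it touches the other.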

03 Scope and depth bounding

Without boundaries, a recursive crawler will attempt to download the entire internet. Scope bounding restricts the crawler to specific domains, subdomains, or URL path patterns (e.g., only URLs containing /product/). Depth bounding tracks how many clicks away a URL is from the seed. If the depth limit is 5, any link found on a level-5 page is discarded, preventing the crawler from falling into endless dynamic directories.
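
Both bounds can be expressed as a single admission check applied to every extracted link. The sketch below is illustrative: `admit`, `ALLOWED_HOSTS`, and the host name are assumptions, while the `/product/` pattern and the depth limit of 5 come from the paragraph above.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"shop.example.com"}  # hypothetical scope
PATH_MARKER = "/product/"             # path pattern from the text
MAX_DEPTH = 5                         # depth limit from the text


def admit(url, parent_depth):
    """Return the link's depth if it should be queued, otherwise None."""
    depth = parent_depth + 1
    if depth > MAX_DEPTH:
        return None  # depth bound: links found on a level-5 page are dropped
    parts = urlsplit(url)
    if parts.hostname not in ALLOWED_HOSTS:
        return None  # scope bound: wrong domain or subdomain
    if PATH_MARKER not in parts.path:
        return None  # scope bound: path pattern does not match
    return depth
```

Storing the depth alongside each queued URL is what makes the bound enforceable; a frontier that holds bare URLs has no way to tell how far a page is from the seed.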

04 How DataFlirt handles it

We treat recursive discovery and data extraction as separate pipeline phases. Our discovery workers aggressively crawl the target using lightweight HTTP clients, pushing normalised URLs through a distributed Bloom filter into a central Kafka topic. Once the discovery phase maps the target's boundaries, the heavy extraction workers (often running headless browsers) consume the deduplicated queue. This separation prevents expensive browser instances from wasting time on URL discovery.

05 The infinite calendar trap

A classic spider trap is a dynamically generated calendar widget. The page for "May 2026" contains a link to "June 2026", which links to "July 2026", ad infinitum. Because the URL changes every time (?month=06&year=2026), exact-match deduplication fails. The crawler will happily queue pages for the year 25,000 unless a strict depth limit or URL pattern exclusion is enforced.
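
One possible guard, shown here as an assumption rather than a description of any particular crawler, is to collapse the numbers in a URL's query string into a template and cap how many URLs may be queued per template. `TemplateLimiter` and its default limit of 24 are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


class TemplateLimiter:
    """Caps how many URLs sharing one query-string template may be queued."""

    def __init__(self, limit=24):
        self.limit = limit
        self.counts = Counter()

    @staticmethod
    def template(url):
        parts = urlsplit(url)
        # Collapse digit runs so ?month=06&year=2026 and ?month=07&year=2026
        # map to the same key.
        query = re.sub(r"\d+", "{n}", parts.query)
        return parts.netloc + parts.path + "?" + query

    def allow(self, url):
        key = self.template(url)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

The cap also applies to legitimate numeric parameters such as `?page=3`, so the limit needs tuning per target. Calendars that encode the date in the path rather than the query string would need a path-based exclusion rule instead.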

// 03 — crawl math

How fast does
the queue grow?

Recursive crawls expand exponentially. Managing a frontier queue requires modeling the branching factor and deduplication efficiency to provision enough memory and worker concurrency.

Queue Growth Rate: $\Delta Q = R \cdot (b \cdot (1 - d) - 1)$
R=req/s, b=links/page, d=dedup rate. If ΔQ > 0, the queue is growing. (Standard Crawl Frontier Model)
Maximum Crawl Depth: $D_{\max} = \log_b(N)$
N=target pages, b=effective branching factor. Determines when to prune. (Graph Theory)
DataFlirt Bloom Filter Memory: $M = -\dfrac{n \cdot \ln(p)}{(\ln 2)^2}$
n=expected URLs, p=false positive rate. Keeps 'visited' state out of RAM. (Internal Infrastructure Sizing)
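
The three formulas can be evaluated directly. In the sketch below the function names are hypothetical and the inputs are illustrative assumptions (50 req/s, 86 in-scope links per page of which 71 are duplicates, a one-million-page target with branching factor 10, a 1% false positive rate). The Bloom filter result M is a size in bits.

```python
import math


def queue_growth(R, b, d):
    """Net URLs added to the queue per second: R * (b * (1 - d) - 1)."""
    return R * (b * (1 - d) - 1)


def max_depth(N, b):
    """Depth at which a tree with branching factor b reaches N pages."""
    return math.log(N, b)


def bloom_bits(n, p):
    """Bloom filter size in bits for n items at false positive rate p."""
    return -(n * math.log(p)) / (math.log(2) ** 2)


# Each fetch removes one URL and adds b * (1 - d) = 15 novel ones.
print(round(queue_growth(R=50, b=86, d=71 / 86)))  # 700 URLs/s of net growth
print(round(max_depth(N=1_000_000, b=10)))         # 6 levels
print(bloom_bits(n=100_000_000, p=0.01) / 8 / 1e6)  # size in megabytes
```

At a 1% false positive rate the last formula works out to roughly 9.6 bits per URL, regardless of how long the URLs are.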
// 04 — the frontier

Processing a single
node expansion.

A worker fetches a category page, extracts product links, normalises them, and pushes unseen URLs to the Redis frontier.

Redis Queue · Bloom Filter · Depth: 3
edge.dataflirt.io — live
CAPTURED
// fetch node
GET /category/electronics HTTP/2
status: 200 OK

// extract & normalise
links.raw: 142
links.in_scope: 86 // dropped external & mailto
links.normalised: 86 // stripped fragments & tracking params

// deduplication (Bloom Filter)
filter.check_batch: 86
filter.seen: 71
filter.novel: 15

// queue push
redis.lpush: 15
queue.depth_level: 4
queue.total_size: 1,402,819
worker.status: ready
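
The deduplication step in the trace above can be approximated with a small in-process Bloom filter. This is a sketch under assumptions: `BloomFilter` and `novel_urls` are hypothetical names, the filter lives in a local `bytearray` rather than a distributed store, and positions are derived by double hashing a SHA-256 digest.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.m = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def novel_urls(bloom, urls):
    """Return the URLs not yet in the filter and record them as seen."""
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter never reports a seen URL as novel, but it can occasionally report a novel URL as seen, so a small fraction of pages (bounded by the chosen false positive rate) may be skipped.
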
// 05 — failure modes

Where recursive
crawls die.

Unbounded recursion is a memory leak waiting to happen. These are the most common traps that cause recursive spiders to crash or get permanently blocked.

CRAWLS MONITORED: 12k+ daily
TRAP RATE: 14% of targets
UPDATED: 2026-05-19

01 Infinite pagination loops
logic flaw · page=9999 returns page 1 content

02 Calendar / Date widgets
spider trap · next_month.php generates infinite URLs

03 Session IDs in URLs
state leak · defeats exact-match deduplication

04 Faceted search permutations
combinatorics · color=red&size=L vs size=L&color=red

05 Subdomain wildcards
DNS trap · routing random subdomains to dynamic pages

// 06 — queue architecture

Distributed frontiers,

scaling recursion across thousands of workers.

A naive recursive crawler keeps its visited list in memory. At a million URLs it already consumes hundreds of megabytes; at ten million it reaches gigabytes and the worker risks running out of memory. DataFlirt decouples the fetcher from the frontier. Workers are stateless: they extract links and push them to a central Kafka topic. A dedicated routing tier normalises the URLs, checks a distributed Bloom filter, enforces depth limits, and schedules the novel URLs back onto the fetch queues based on domain rate limits.
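
The routing tier's responsibilities can be sketched in a single process. Everything here is an illustrative assumption rather than DataFlirt's implementation: `FrontierRouter` is a hypothetical class, a Python `set` stands in for the distributed Bloom filter, normalisation is reduced to dropping fragments, and the per-domain rate limit is a fixed minimum delay on a simulated clock.

```python
import heapq
from urllib.parse import urldefrag, urlsplit


class FrontierRouter:
    def __init__(self, max_depth, min_delay):
        self.max_depth = max_depth
        self.min_delay = min_delay  # seconds between fetches per domain
        self.seen = set()           # stand-in for the distributed Bloom filter
        self.next_slot = {}         # domain -> earliest time of its next fetch
        self.schedule = []          # heap of (fetch_time, seq, url)
        self.seq = 0                # tie-breaker that preserves arrival order

    def submit(self, url, depth, now=0.0):
        """Called for every link a worker reports; returns True if scheduled."""
        url, _ = urldefrag(url)  # minimal normalisation
        if depth > self.max_depth or url in self.seen:
            return False
        self.seen.add(url)
        domain = urlsplit(url).netloc
        fetch_time = max(now, self.next_slot.get(domain, now))
        self.next_slot[domain] = fetch_time + self.min_delay
        heapq.heappush(self.schedule, (fetch_time, self.seq, url))
        self.seq += 1
        return True

    def pop_ready(self, now):
        """Return every URL whose scheduled fetch time has arrived."""
        ready = []
        while self.schedule and self.schedule[0][0] <= now:
            ready.append(heapq.heappop(self.schedule)[2])
        return ready
```

Because all state lives in the router, workers only need to call `submit` with what they found and fetch whatever `pop_ready` hands back, which is what allows them to stay stateless.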

Frontier Router Status

Live metrics from a distributed recursive crawl.

crawl.id rec-ecom-092
queue.strategy breadth-first
bloom_filter.size 512 MB
urls.discovered 42,109,441
urls.novel 3,812,004
trap.detected calendar_widget
worker.nodes 140 active

// 07 — FAQ

Common
questions.

About graph traversal, deduplication, spider traps, and how DataFlirt manages unbounded discovery at scale.

Should I use breadth-first or depth-first traversal?
Breadth-first search (BFS) is the standard for web crawling. It ensures you capture the most important, highly-linked pages (like categories and top-level products) before getting bogged down in deep, obscure pagination tails. Depth-first search (DFS) is prone to getting stuck in infinite loops or spider traps before it extracts any meaningful data.
How do you deduplicate URLs with tracking parameters?
Through aggressive URL normalisation before the deduplication check. We strip fragments (#section), remove known tracking parameters (utm_source, gclid), and sort query parameters alphabetically. ?b=2&a=1 and ?a=1&b=2 must hash to the same value, or your queue will explode with duplicates.
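
A minimal version of that normalisation step, using only `urllib.parse`, might look like the sketch below. The `normalise` function is hypothetical, and the tracking list is an assumption: it drops any `utm_` parameter plus `gclid` and `fbclid`, the last of which is not named above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"gclid", "fbclid"}  # utm_* is matched by prefix below


def normalise(url):
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
        and key.lower() not in TRACKING_PARAMS
    ]
    params.sort()  # ?b=2&a=1 and ?a=1&b=2 become identical
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(params),
        "",  # drop the fragment
    ))


print(normalise("https://Example.com/p?b=2&a=1&utm_source=x#section"))
# https://example.com/p?a=1&b=2
```

Running every URL through the same function before the visited check means the deduplication layer only ever sees one spelling of each page.
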
Is recursive crawling legal?
Accessing public data is generally lawful, but aggressive recursion without rate limits can trigger Computer Fraud and Abuse Act (CFAA) or trespass to chattels claims if it degrades the target's server performance. Respecting robots.txt and implementing strict concurrency limits are essential for compliance.
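
On the robots.txt side, Python's standard library already ships a parser. The sketch below parses an inline, made-up robots.txt so it runs without network access; the agent name `example-crawler` and the rules themselves are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would download it per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-crawler"):
    """True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.com/product/1"))      # True
print(allowed("https://example.com/cart/checkout"))  # False
print(parser.crawl_delay("example-crawler"))         # 2
```

The `Crawl-delay` value, where a site declares one, is a natural input for the per-domain rate limit in the scheduler.
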
How does DataFlirt handle infinite pagination loops?
We use content hashing. If the extracted payload of page N hashes to the exact same value as page N-1 (or if the item count drops to zero but the "Next" link is still present), our router automatically prunes that branch and stops queueing subsequent pages.
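
The check described above can be sketched as a generator that stops paginating when the payload repeats or comes back empty. The names `payload_hash`, `paginate`, and the injected `fetch_items` callable are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def payload_hash(items):
    """Hash the extracted items of one page into a single digest."""
    joined = "\n".join(items)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def paginate(fetch_items, max_pages=1000):
    """Yield (page, items) until pagination stops producing new content."""
    previous = None
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        if not items:
            break  # item count dropped to zero
        current = payload_hash(items)
        if current == previous:
            break  # page N hashes to the same value as page N-1
        previous = current
        yield page, items
```

Comparing only against the previous page catches sites that keep serving the last page forever. A site that wraps back to page 1 would need the hashes of all earlier pages kept in a set instead.
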
Why use recursive crawling if the site has a sitemap?
Sitemaps are often stale, incomplete, or missing entirely. Publishers frequently fail to update them when new inventory is added, or they intentionally omit certain pages. Recursion provides the ground truth of what is actually linked and accessible on the site.
How do you manage memory for the 'visited' list at scale?
We never use local memory arrays or standard database tables for the visited set. For crawls under 10 million URLs, we use Redis sets. For massive crawls, we use distributed Bloom filters, which can check membership for billions of URLs in milliseconds using a fixed, predictable memory footprint.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h