
What is URL Discovery Rate?

URL discovery rate is the velocity at which a crawler identifies new, unvisited links within a target domain or sitemap structure. It dictates the upper bound of your pipeline's throughput — you cannot extract data faster than you can find the pages that host it. For large-scale e-commerce or real estate pipelines, optimizing discovery is often more critical than raw extraction speed, as stale queues lead to missed inventory.

Crawling · Throughput · Sitemaps · Queue Management · Pipeline Metrics
// 02 — definitions

Feeding the
extraction queue.

The speed at which your crawler maps the target's surface area, determining how fresh your downstream data can actually be.


TL;DR

URL discovery rate measures how many unique, valid target URLs a crawler adds to the extraction queue per second. It is bounded by sitemap availability, pagination depth, and the target's internal link structure. A low discovery rate starves high-concurrency extraction workers, leaving expensive compute idling while waiting for links.

01 Definition & pipeline role

URL discovery rate is the metric defining how quickly a crawler can identify valid target pages and push them into an extraction queue. It is the first bottleneck in any scraping pipeline.

If you have 100 extraction workers capable of scraping 10 pages per second each, your extraction capacity is 1,000 URLs/sec. But if your discovery crawler is stuck navigating a slow pagination sequence and finds only 15 URLs/sec, your pipeline is starved: 98.5% of your extraction capacity sits idle.
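
The arithmetic behind that example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `idle_fraction` is made up for this page and is not part of any DataFlirt tooling.

```python
def idle_fraction(workers: int, pages_per_worker: float, discovery_rate: float) -> float:
    """Share of extraction capacity left unused when discovery is the bottleneck."""
    capacity = workers * pages_per_worker  # URLs/sec the extraction pool could handle
    if capacity <= 0:
        raise ValueError("extraction capacity must be positive")
    used = min(discovery_rate, capacity)   # extraction cannot outrun discovery
    return 1.0 - used / capacity


# 100 workers x 10 pages/sec, fed by a crawler that finds 15 URLs/sec
print(f"{idle_fraction(100, 10, 15):.1%}")  # 98.5%
```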

02 Sitemaps vs. HTML traversal

There are two primary methods for discovery, with vastly different rate profiles:

  • Sitemap ingestion: Fetching sitemap.xml files. This is a batch process. A single HTTP request can yield up to 50,000 URLs, the per-file cap in the sitemap protocol. The discovery rate is effectively limited only by network bandwidth and XML parsing speed.
  • HTML traversal: Loading category pages, extracting product links, and following the "Next Page" button. This is sequential. You cannot fetch page 3 until page 2 returns. The discovery rate is severely limited by the target server's response time and rate limits.
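
To make the batch nature of sitemap ingestion concrete, here is a minimal sketch that pulls every `<loc>` entry out of a sitemap document using only the Python standard library. The sample XML and the `target.com` URLs are placeholders, and a real pipeline would also need to fetch the file, handle gzip, and follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value from a urlset or sitemapindex document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]


# Placeholder document; a real sitemap file may hold up to 50,000 entries
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://target.com/p/1</loc></url>
  <url><loc>https://target.com/p/2</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
# ['https://target.com/p/1', 'https://target.com/p/2']
```

One parse yields the whole batch, which is why the rate is bounded by bandwidth and parsing speed rather than by round trips.
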

03 The canonicalization trap

A high raw discovery rate is useless if the URLs are duplicates. E-commerce sites often append tracking parameters or category paths to URLs (e.g., /shoes/sneaker-x vs /sale/sneaker-x). If your crawler doesn't normalize these to a canonical format before adding them to the queue, your discovery rate looks artificially high, but you end up paying to extract the exact same data multiple times.
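
A rough sketch of query-string normalization is shown below. The list of tracking parameters is an assumption chosen for illustration, and the function only handles parameter noise: collapsing path variants such as /shoes/sneaker-x and /sale/sneaker-x needs a site-specific rule (for example, a product ID or the page's declared canonical link), which this sketch does not attempt.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; tune per target site
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid", "ref"}


def canonicalize(url: str) -> str:
    """Lowercase the host, drop tracking params and fragments, sort the query."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    query.sort()  # parameter order should not create a "new" URL
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


a = canonicalize("https://Shop.example.com/shoes/sneaker-x/?utm_source=mail&size=9&color=red#reviews")
b = canonicalize("https://shop.example.com/shoes/sneaker-x?size=9&color=red&gclid=abc")
print(a)       # https://shop.example.com/shoes/sneaker-x?color=red&size=9
print(a == b)  # True
```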

04 How DataFlirt handles queue feeding

We strictly decouple discovery from extraction. Our discovery nodes are lightweight Go processes optimized for fast HTTP/2 multiplexing and XML/HTML parsing. They feed a central Redis cluster.

Before any URL hits the queue, it passes through a normalization layer and a distributed Bloom filter that rejects URLs already seen. A Bloom filter never lets a previously seen URL through, though it can occasionally reject a new one as a false positive. If the queue depth drops below a critical threshold, our orchestrator automatically spins up additional discovery workers to parallelize category traversal, ensuring the heavy extraction workers never starve.
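
For readers unfamiliar with the data structure, the following is a small in-memory Bloom filter, written in Python for brevity rather than the Go the discovery nodes are described as using. It is a single-process sketch with assumed sizing, not the distributed filter described above.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from one SHA-256 digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add_if_new(self, item: str) -> bool:
        """Record item; return True if it was (probably) not seen before."""
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new


seen = BloomFilter()
candidates = ["https://target.com/p/1", "https://target.com/p/2", "https://target.com/p/1"]
queued = [u for u in candidates if seen.add_if_new(u)]
print(queued)  # the repeated URL is dropped
```

URLs should be normalized before they reach the filter, otherwise two spellings of the same page hash to different positions and both get queued.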

05 Did you know?

Many modern anti-bot systems (like Akamai and DataDome) apply stricter rate limits to category and search pages than they do to individual product pages. They know that scrapers must hit category pages heavily during the discovery phase. By throttling pagination requests, they can effectively choke a scraper's throughput without blocking it entirely.

// 03 — the math

How fast are you
finding targets?

Discovery metrics separate the cost of navigating a site from the cost of extracting its data. DataFlirt monitors these ratios to auto-scale discovery workers independently from extraction pools.

Effective Discovery Rate (standard crawl metric)
$R_{\text{disc}} = \dfrac{\text{URLs}_{\text{found}} - \text{URLs}_{\text{dupe}}}{T_{\text{crawl}}}$
Net new URLs added to the queue per second. Duplicates must be filtered before counting.

Queue Starvation Ratio (DataFlirt pipeline health SLO)
$S = \dfrac{R_{\text{extract}}}{R_{\text{disc}}}$
If S > 1, extraction workers will eventually idle waiting for the crawler.

Discovery Cost Overhead (FinOps scraping model)
$C = \dfrac{\text{Req}_{\text{nav}} \times \text{Cost}_{\text{req}}}{\text{URLs}_{\text{target}}}$
The proxy and compute cost of navigating category pages just to find one product URL.
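
The three metrics translate directly into code. The sketch below uses function names invented for this page. The starvation example uses the discovery and extraction rates from the trace on this page, while the inputs to the other two calls are made-up numbers.

```python
def effective_discovery_rate(urls_found: int, urls_dupe: int, crawl_seconds: float) -> float:
    """Net new URLs per second, after duplicates are removed."""
    return (urls_found - urls_dupe) / crawl_seconds


def starvation_ratio(extract_rate: float, discovery_rate: float) -> float:
    """S > 1 means extraction drains the queue faster than discovery fills it."""
    return extract_rate / discovery_rate


def discovery_cost_overhead(nav_requests: int, cost_per_request: float, target_urls: int) -> float:
    """Navigation cost attributed to each discovered target URL."""
    return nav_requests * cost_per_request / target_urls


print(effective_discovery_rate(24, 4, 2.0))        # 10.0 (assumed 2-second crawl window)
print(round(starvation_ratio(45.0, 12.4), 1))      # 3.6
print(discovery_cost_overhead(500, 0.001, 10_000)) # assumed cost of 0.001 per request
```
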
// 04 — queue ingestion trace

From sitemap to
deep pagination.

A live trace of a discovery worker populating a Redis queue. Notice the massive rate difference between batch sitemap ingestion and sequential HTML pagination crawling.

Redis Queue · Bloom Filter · Async Crawl
// phase 1: sitemap ingestion
fetch: "https://target.com/sitemap_index.xml"
parse: 42 sub-sitemaps found
queue.add: 420,000 URLs (batch)
rate.sitemap: 14,000 URLs/sec

// phase 2: deep pagination traversal (fallback)
fetch: "https://target.com/category/electronics?page=1"
extract.links: 24 product URLs, 1 next_page URL
filter.bloom: 4 dupes dropped, 20 queued
rate.html: 12 URLs/sec // bounded by sequential page loads

// queue health monitor
metric.discovery_rate: 12.4 URLs/sec
metric.extraction_rate: 45.0 URLs/sec
alert: queue starvation imminent (S = 3.6)
action: scaling discovery workers to 8
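
The trace does not show how the orchestrator chooses a worker count. As one hypothetical policy, a monitor could size the discovery pool so that combined discovery throughput covers the extraction rate with some headroom. The function name, the `headroom` parameter, and the assumption that discovery scales linearly across workers are all illustrative.

```python
import math


def discovery_workers_needed(
    extract_rate: float,
    per_worker_discovery_rate: float,
    headroom: float = 2.0,  # assumed safety factor, not a documented setting
) -> int:
    """Workers required so discovery outpaces extraction by the headroom factor."""
    if per_worker_discovery_rate <= 0:
        raise ValueError("per-worker discovery rate must be positive")
    return max(1, math.ceil(extract_rate * headroom / per_worker_discovery_rate))


# Rates taken from the queue health monitor lines in the trace
print(discovery_workers_needed(45.0, 12.4))                # 8
print(discovery_workers_needed(45.0, 12.4, headroom=1.0))  # 4
```

Linear scaling only holds when there are enough independent categories to traverse in parallel; a single deep pagination chain cannot be split this way.
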
// 05 — discovery bottlenecks

What limits your
discovery rate.

The structural and technical barriers that prevent a crawler from finding URLs faster. Ranked by frequency of impact across DataFlirt's retail and real estate pipelines.

Pipelines monitored: 450+ active
Avg sitemap rate: 8k+ URLs/s
Avg HTML rate: 15 URLs/s

01 Deep pagination chains (structural): Page 2 must load before Page 3 is known
02 Sitemap absence / staleness (structural): Forces reliance on slow HTML crawling
03 JavaScript-rendered links (technical): Requires headless browser just to find hrefs
04 Anti-bot rate limits (security): Category pages often have stricter limits than products
05 High duplication / canonicals (compute): Wasting cycles processing URLs that map to the same item
// 06 — architecture

Separate discovery from extraction,

because they scale on entirely different curves.

A common anti-pattern is coupling URL discovery and data extraction in the same worker process. Discovery is highly sequential — you must load page 1 to find the link to page 2. Extraction is embarrassingly parallel — 10,000 product pages can be scraped simultaneously. DataFlirt decouples these phases. Lightweight, low-cost discovery workers traverse categories and sitemaps, feeding a distributed Redis queue. Heavyweight extraction workers, equipped with residential proxies and headless browsers if necessary, consume that queue. This prevents expensive extraction compute from blocking on sequential navigation.
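
The decoupling can be illustrated with a toy producer/consumer pipeline. In this sketch an in-process `queue.Queue` stands in for the Redis queue, threads stand in for separate worker fleets, and the "extraction" step is a placeholder string operation, so it shows the shape of the design rather than a production setup.

```python
import queue
import threading


def run_pipeline(pages: list[list[str]], num_extractors: int = 4) -> list[str]:
    """One sequential discovery producer feeding several parallel extractors."""
    url_queue: queue.Queue = queue.Queue()
    results: list[str] = []
    lock = threading.Lock()

    def discover() -> None:
        # Sequential by nature: each page is visited only after the previous one
        for links in pages:
            for url in links:
                url_queue.put(url)
        for _ in range(num_extractors):
            url_queue.put(None)  # one stop signal per extractor

    def extract() -> None:
        while (url := url_queue.get()) is not None:
            record = f"scraped:{url}"  # stand-in for a real fetch + parse
            with lock:
                results.append(record)

    threads = [threading.Thread(target=discover)]
    threads += [threading.Thread(target=extract) for _ in range(num_extractors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(run_pipeline([["/p/1", "/p/2"], ["/p/3"]])))
# ['scraped:/p/1', 'scraped:/p/2', 'scraped:/p/3']
```

Because the two sides only share the queue, either pool can be resized without touching the other.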

Discovery Worker Status

Live telemetry from a dedicated discovery node feeding a retail pipeline.

worker.type discovery-node-04
target.domain example-retail.in
discovery.method html-pagination
rate.current 18.5 URLs/sec
queue.depth 142,050
bloom_filter.dupe 14.2% dropped
status feeding extraction pool

// 07 — FAQ

Common
questions.

Common questions about queue management, sitemap vs. HTML crawling, and scaling discovery.

What is the difference between discovery and extraction?
Discovery is the process of finding the URLs you want to scrape (e.g., navigating category pages to collect product links). Extraction is the process of downloading those specific URLs and parsing the data (price, title, stock) from them. They require different scaling strategies and proxy profiles.
Why is my discovery rate so much slower than my extraction rate?
Because discovery is often sequential. If a category has 10,000 items displayed at 20 per page, you must make 500 sequential HTTP requests to find all the links. You cannot request page 500 until you know its URL, which you often only get from page 499. Extraction, however, can process all 10,000 discovered links in parallel.
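
The sequential dependency shows up clearly in code. In the sketch below, `fetch_page` is a hypothetical callable that returns the product links on a page plus the next-page URL, and `FAKE_SITE` is a stand-in for real HTTP responses.

```python
from typing import Callable, Optional

# (product links on the page, next-page URL or None)
Page = tuple[list[str], Optional[str]]


def crawl_category(start_url: str, fetch_page: Callable[[str], Page], max_pages: int = 1000) -> list[str]:
    """Follow next-page links one at a time, collecting product URLs."""
    found: list[str] = []
    url: Optional[str] = start_url
    pages = 0
    while url is not None and pages < max_pages:
        links, url = fetch_page(url)  # the next URL is unknown until this returns
        found.extend(links)
        pages += 1
    return found


# Stand-in for a two-page category listing
FAKE_SITE: dict[str, Page] = {
    "/c?page=1": (["/p/1", "/p/2"], "/c?page=2"),
    "/c?page=2": (["/p/3"], None),
}

print(crawl_category("/c?page=1", FAKE_SITE.__getitem__))
# ['/p/1', '/p/2', '/p/3']
```

Each loop iteration blocks on one response, so total discovery time grows with the number of pages regardless of how many workers sit idle downstream.
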
How do sitemaps affect discovery rate?
Sitemaps bypass the sequential pagination problem entirely. Instead of making 500 requests to find 10,000 links, you make one request to an XML file and ingest all 10,000 links instantly. When available and fresh, sitemaps increase discovery rates by orders of magnitude.
How does DataFlirt handle infinite scroll for discovery?
We rarely use headless browsers to physically scroll pages just to find links. Instead, we intercept the underlying XHR or Fetch requests that the page uses to load the next batch of items. We then replicate those API calls directly, parsing the JSON responses to extract URLs at a fraction of the compute cost.
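
As a rough illustration of that approach, the sketch below walks a cursor-paginated JSON endpoint and yields item URLs. The response shape (`items`, `url`, `next_cursor`) and the `fetch_json` callable are hypothetical; real endpoints differ per site and have to be read off the page's network traffic.

```python
import json
from typing import Callable, Iterator, Optional


def urls_from_api(fetch_json: Callable[[Optional[str]], str]) -> Iterator[str]:
    """Yield item URLs from a cursor-paginated JSON API until no cursor remains."""
    cursor: Optional[str] = None
    while True:
        payload = json.loads(fetch_json(cursor))
        for item in payload["items"]:
            yield item["url"]
        cursor = payload.get("next_cursor")
        if not cursor:
            break


# Stand-in responses keyed by cursor (None means the first request)
FAKE_API = {
    None: '{"items": [{"url": "/p/1"}, {"url": "/p/2"}], "next_cursor": "abc"}',
    "abc": '{"items": [{"url": "/p/3"}], "next_cursor": null}',
}

print(list(urls_from_api(FAKE_API.get)))
# ['/p/1', '/p/2', '/p/3']
```
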
Is it legal to crawl a site just to discover URLs?
Yes, URL discovery is the fundamental mechanism of web crawling, identical to how search engines operate. As long as you respect robots.txt directives, honor Crawl-delay, and do not bypass authentication to access private areas, mapping public URLs is standard practice.
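
Checking robots.txt before queueing a URL can be done with Python's standard `urllib.robotparser`. The robots.txt content and the `example-bot` user agent below are placeholders. Note that Crawl-delay is a widely used extension rather than part of the core robots.txt standard, so not every site sets it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt; in practice this is fetched from the target host
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /account/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("example-bot", "https://target.com/category/electronics?page=1"))  # True
print(rp.can_fetch("example-bot", "https://target.com/account/orders"))               # False
print(rp.crawl_delay("example-bot"))                                                  # 2
```
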
How do you prevent infinite crawler traps during discovery?
We implement strict depth limits, URL pattern matching, and global Bloom filters. If a site dynamically generates infinite calendar URLs or recursive category filters, the pattern matcher flags the anomaly, and the Bloom filter ensures we never queue the same canonical destination twice.
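
A simplified version of the depth and pattern checks might look like the following. The thresholds and regular expressions are assumptions picked for illustration; real rules are tuned per site, and the deduplication step is left to a separate seen-URL filter.

```python
import re
from urllib.parse import parse_qsl, urlsplit

MAX_DEPTH = 6         # assumed link-depth limit
MAX_QUERY_PARAMS = 4  # assumed cap on stacked filter parameters
TRAP_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),  # endlessly generated date pages
    re.compile(r"(/[^/]+)\1{2,}"),         # same path segment repeated 3+ times
]


def should_queue(url: str, depth: int) -> bool:
    """Reject URLs that look like crawler traps before they reach the queue."""
    if depth > MAX_DEPTH:
        return False
    parts = urlsplit(url)
    if len(parse_qsl(parts.query)) > MAX_QUERY_PARAMS:
        return False  # likely a recursive filter combination
    return not any(p.search(parts.path) for p in TRAP_PATTERNS)


print(should_queue("https://target.com/category/shoes?page=2", depth=2))  # True
print(should_queue("https://target.com/calendar/2031/05", depth=1))       # False
print(should_queue("https://target.com/c/c/c/c", depth=1))                # False
print(should_queue("https://target.com/category/shoes", depth=7))         # False
```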