
What is Web Crawling?

Web crawling is the automated process of systematically following hyperlinks across pages to discover and fetch URLs at scale. Where HTML scraping extracts data from a known URL, a crawler builds and traverses the URL graph — starting from seed URLs, following links, respecting crawl boundaries, and managing the frontier of pages yet to be visited. For data pipelines, the crawl layer determines coverage: what you don't crawl, you don't extract.

Infrastructure · URL Frontier · Crawl Budget · Politeness · Coverage

// 02 — definitions

Follow every link.

A crawler is a URL state machine. The complexity isn't in the fetching — it's in the frontier management, deduplication, politeness enforcement, and knowing when to stop.


TL;DR

Web crawling systematically discovers and fetches pages by following links from a set of seed URLs. The core challenges are frontier management (what to fetch next), deduplication (don't fetch the same URL twice), politeness (don't hammer a server), and coverage (find everything relevant without fetching everything irrelevant). A poorly designed crawler either misses data or gets blocked — usually both.

01 Definition & components

A web crawler has four core components:

Frontier — the queue of URLs to fetch, ordered by priority
Fetcher — the HTTP client (or browser) that retrieves pages
Link extractor — parses fetched pages to discover new URLs
URL store — tracks what's been seen to prevent re-fetching

The crawl loop: pop a URL from the frontier, fetch it, extract links, add unseen in-scope links to the frontier, mark the URL as fetched. Repeat until the frontier is empty or the budget is exhausted.
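The loop above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `crawl` function, its parameters, and the in-memory `SITE` dictionary standing in for the network are all assumptions made for the example, and the regex link extractor is a simplification of real HTML parsing.

```python
import re
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urljoin, urlsplit


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],
    extract_links: Callable[[str], Iterable[str]],
    in_scope: Callable[[str], bool],
    budget: int,
) -> list[str]:
    """Run the crawl loop until the frontier is empty or the budget is spent."""
    frontier = deque(seeds)      # frontier: URLs waiting to be fetched (FIFO order)
    seen = set(frontier)         # URL store: every URL ever enqueued
    fetched: list[str] = []

    while frontier and len(fetched) < budget:
        url = frontier.popleft()             # pop a URL from the frontier
        body = fetch(url)                    # fetcher
        fetched.append(url)                  # mark the URL as fetched
        for href in extract_links(body):     # link extractor
            link = urljoin(url, href)        # resolve relative links
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
    return fetched


# Hypothetical three-page site used instead of real HTTP requests.
SITE = {
    "https://shop.example/": '<a href="/a"></a><a href="/b"></a>'
                             '<a href="https://other.example/x"></a>',
    "https://shop.example/a": '<a href="/b"></a><a href="/"></a>',
    "https://shop.example/b": "",
}

fetched_urls = crawl(
    seeds=["https://shop.example/"],
    fetch=lambda url: SITE.get(url, ""),
    extract_links=lambda body: re.findall(r'href="([^"]+)"', body),
    in_scope=lambda url: urlsplit(url).hostname == "shop.example",
    budget=10,
)
print(fetched_urls)
```

Passing the fetcher, extractor, and scope check in as functions keeps the loop itself small and makes each component replaceable, for example swapping the stub fetcher for an HTTP client or a headless browser.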

02 Frontier management strategies

How you order the frontier determines what gets crawled when budget runs out:

BFS (breadth-first) — crawl by link depth. Good for discovering the full scope of a site.
DFS (depth-first) — follow one path to its end before backtracking. Rarely useful for data pipelines.
Priority-weighted — assign scores to URLs based on URL patterns, historical yield, or content signals. Fetch high-value pages first. This is the right approach for production pipelines with budget constraints.
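A priority-weighted frontier can be sketched with a binary heap. The URL patterns and scores in `RULES` below are hypothetical values chosen for illustration; a real scheduler would derive them from historical yield or content signals, as described above.

```python
import heapq
import itertools
import re

# Hypothetical scoring rules, checked in order; the first match wins.
# Higher score means the URL is fetched sooner.
RULES = [
    (re.compile(r"[?&]page=\d+"), 10),   # pagination: low value
    (re.compile(r"/product/"), 100),     # product detail pages: high value
    (re.compile(r"/category/"), 50),     # category listings: medium value
]


def score(url: str) -> int:
    for pattern, value in RULES:
        if pattern.search(url):
            return value
    return 1  # anything unrecognised goes to the back of the queue


class PriorityFrontier:
    """Frontier that always pops the highest-scoring unseen URL."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker: insertion order
        self._seen: set[str] = set()

    def push(self, url: str) -> bool:
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score(url), next(self._order), url))
        return True

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = PriorityFrontier()
for u in [
    "https://shop.example/about",
    "https://shop.example/category/shoes?page=2",
    "https://shop.example/category/shoes",
    "https://shop.example/product/123",
]:
    frontier.push(u)

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

If the budget runs out halfway through, the pages left unfetched are the lowest-scoring ones, which is the property BFS and DFS ordering cannot give you.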

03 Crawl scope and boundary control

Without explicit scope boundaries, a crawler will follow links off-domain, into login pages, into infinite pagination loops, and into URL parameter traps that generate millions of near-identical pages.

Define scope with: domain allowlist, URL path patterns (include /products/*, exclude /account/*), max crawl depth, and URL parameter normalisation rules. A ?sort=asc and ?sort=desc version of the same page should resolve to one canonical URL in your seen-set, not two separate crawl jobs.
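As a rough sketch, the scope rules above can be expressed as a small configuration plus two functions. The host `shop.example`, the depth limit of 5, and the choice to ignore only the `sort` parameter are assumptions for the example, not recommended values.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical scope configuration for one crawl job.
ALLOWED_HOSTS = {"shop.example"}
INCLUDE_PATHS = ["/products/*"]
EXCLUDE_PATHS = ["/account/*"]
IGNORED_PARAMS = {"sort"}   # parameters that do not change which items a page shows
MAX_DEPTH = 5


def in_scope(url: str, depth: int) -> bool:
    """Return True if the URL passes the host, depth, and path rules."""
    parts = urlsplit(url)
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    if depth > MAX_DEPTH:
        return False
    if any(fnmatchcase(parts.path, pat) for pat in EXCLUDE_PATHS):
        return False
    return any(fnmatchcase(parts.path, pat) for pat in INCLUDE_PATHS)


def normalise(url: str) -> str:
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


asc = normalise("https://shop.example/products/shoes?sort=asc")
desc = normalise("https://shop.example/products/shoes?sort=desc")
print(asc == desc, asc)
```

Both sort variants collapse to the same string, so the seen-set records one entry and the frontier receives one crawl job.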

04 How DataFlirt architects crawl jobs

We separate crawl jobs from extraction jobs architecturally. The crawler produces a URL manifest — a list of pages to extract from, with metadata about each URL's priority and expected content type. The extraction workers consume the manifest independently.

This means a crawl failure doesn't lose already-extracted data, and an extraction failure doesn't re-trigger a crawl. The two failure modes are isolated, monitored separately, and recovered independently. Most pipeline outages we've seen in other setups come from conflating the two layers.
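One way to picture the handoff is a manifest serialised as JSON Lines, written by the crawler and read by extraction workers. The field names (`url`, `priority`, `content_type`) and the JSON Lines format are assumptions for this sketch; the text above only states that the manifest carries each URL's priority and expected content type.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical field names, chosen for illustration.
    url: str
    priority: str
    content_type: str


def write_manifest(entries: Iterable[ManifestEntry]) -> str:
    """Crawler side: serialise one JSON object per line."""
    return "\n".join(json.dumps(asdict(entry), sort_keys=True) for entry in entries)


def read_manifest(text: str) -> Iterator[ManifestEntry]:
    """Extraction side: parse entries without knowing how they were discovered."""
    for line in text.splitlines():
        if line.strip():
            yield ManifestEntry(**json.loads(line))


crawled = [
    ManifestEntry("https://shop.example/product/123", "high", "product"),
    ManifestEntry("https://shop.example/category/shoes", "medium", "category"),
]
manifest_text = write_manifest(crawled)
work_items = list(read_manifest(manifest_text))
print(work_items[0].url)
```

Because the manifest is plain data, a failed extraction run can be restarted from the same text without touching the crawler, and a failed crawl leaves earlier manifests intact.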

05 The URL trap that kills coverage

Faceted navigation, the filter UI on e-commerce category pages, generates an exponential URL space. A page with 10 filter dimensions, each with 5 values, has 5¹⁰ = 9,765,625 (≈ 9.8 million) possible URL combinations, most pointing to the same 200 products in different sort orders.

Crawlers that don't normalise these URLs will spend their entire budget on filter permutations and never reach actual product pages. The fix: identify facet parameters by pattern (?color=, ?size=, ?page=), strip or canonicalise them before enqueueing, and crawl the canonical category URL once.
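The arithmetic and the fix can both be checked in a short sketch. The facet names and values are made up for the example, and this version strips only `color` and `size` while keeping `page`, on the assumption that paginated results list different products; which parameters are safe to strip is site-specific.

```python
from itertools import product
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# 10 filter dimensions with 5 values each.
print(5 ** 10)  # 9765625, just under 10 million combinations

# Hypothetical facet parameters for one site.
FACET_PARAMS = {"color", "size"}


def canonical_category_url(url: str) -> str:
    """Strip facet parameters so every filter permutation maps to one URL."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FACET_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


colors = ["red", "blue", "green", "black", "white"]
sizes = ["6", "7", "8", "9", "10"]
permutations = [
    f"https://shop.example/men/shoes?color={c}&size={s}"
    for c, s in product(colors, sizes)
]
canonical = {canonical_category_url(u) for u in permutations}
print(len(permutations), "->", len(canonical))
```

Even with only two dimensions, 25 filter URLs collapse to a single canonical category URL, which is the one that should be enqueued.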

// 03 — the model

How crawlers manage coverage.

Crawl efficiency is a tradeoff between coverage, freshness, and cost. These three models define the core decisions every crawler makes — and what DataFlirt's crawl scheduler optimises against for each pipeline.

Crawl coverage: C = pages_fetched / pages_in_scope

Coverage below 0.95 means you're missing data. Know your scope before measuring. (Standard crawl metric)

Crawl efficiency: E = useful_pages / total_pages_fetched

Low efficiency means crawl budget is being wasted on duplicates and out-of-scope pages. (DataFlirt scheduler SLO)

Politeness delay: D = max(robots_crawl_delay, server_response_time × k)

k ≈ 5–10. Politeness is both ethical and strategic: hammering gets you blocked. (robots.txt Crawl-delay directive, a widely used extension that RFC 9309 itself does not define)
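The three formulas translate directly into code. This sketch uses the variable names from the formulas; the default `k = 5.0` is an assumption picked from the low end of the stated 5–10 range, and times are in seconds.

```python
def coverage(pages_fetched: int, pages_in_scope: int) -> float:
    """C = pages_fetched / pages_in_scope"""
    return pages_fetched / pages_in_scope


def efficiency(useful_pages: int, total_pages_fetched: int) -> float:
    """E = useful_pages / total_pages_fetched"""
    return useful_pages / total_pages_fetched


def politeness_delay(
    robots_crawl_delay: float, server_response_time: float, k: float = 5.0
) -> float:
    """D = max(robots_crawl_delay, server_response_time * k), in seconds."""
    return max(robots_crawl_delay, server_response_time * k)


# A 2 s robots.txt delay against a 340 ms response time:
print(politeness_delay(2.0, 0.34, k=5))    # robots value dominates
print(politeness_delay(2.0, 0.34, k=10))   # response-time term dominates
print(coverage(940, 1000) < 0.95)          # True: below the 0.95 threshold
```

Note that the choice of k decides which term wins: for this example the robots.txt delay dominates only while k stays below roughly 5.9 (2 s / 0.34 s).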

// 04 — crawl frontier trace

Seed URL to crawl frontier.

Frontier state for a category-level crawl on a B2C e-commerce site. Shows URL discovery, deduplication, and politeness queue management across one crawl cycle.

BFS frontier · domain-scoped · politeness enforced


// seed
seed.url: "myntra.com/men/shoes"
scope: "domain:myntra.com, path:/men/*"

// frontier — cycle 1
frontier.size: 1
fetched: "myntra.com/men/shoes"
links.discovered: 847
links.in_scope: 312
links.deduplicated: 289 // 23 already seen
frontier.new_size: 289

// politeness
robots.crawl_delay: 2s
response_time.p95: 340ms
effective_delay: 2.0s // robots value dominates (at k = 5: 340ms × 5 = 1.7s < 2s)

// cycle 1 result
pages.fetched: 1
pages.queued: 289
estimated_completion: ~9.6 min at 0.5 req/s

// 05 — crawl failure modes

Why crawlers miss pages.

Coverage failures in production crawls. Most missed pages trace back to frontier management failures, not fetch failures — the crawler never tried to fetch them because they were never added to the frontier.

Crawl jobs tracked: 150+ active · Window: 30d trailing · Updated: 2026-05-19

01 JS-rendered navigation: links in onClick, not in href
02 Pagination gaps: infinite scroll, JS-loaded pages
03 Canonical / dupe filtering: over-aggressive dedup drops variants
04 robots.txt scope errors: Disallow rules misread or ignored
05 Session-gated URLs: URL only visible when logged in

// 06 — DataFlirt's crawl scheduler

Crawl smart, not just fast.

DataFlirt's crawler uses a priority-weighted frontier — product detail pages are fetched before pagination, fresh category pages before known-stable ones. Crawl budget is allocated per domain based on historical yield rate: domains where 90% of pages produce usable records get more budget than domains where 40% are duplicates or low-value. This makes coverage decisions explicit, not accidental.
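To make yield-based allocation concrete, here is one simple rule: split a shared request budget in proportion to each domain's historical yield rate. The proportional rule, the function name, and the domains are assumptions for illustration only; the text does not specify how DataFlirt's scheduler actually computes allocations.

```python
def allocate_budget(total_requests: int, yield_rates: dict[str, float]) -> dict[str, int]:
    """Split a request budget across domains in proportion to yield rate.

    yield_rates maps domain -> fraction of fetched pages that produced
    usable records on previous runs (0.0 to 1.0).
    """
    total_yield = sum(yield_rates.values())
    if total_yield == 0:
        return {domain: 0 for domain in yield_rates}
    return {
        domain: round(total_requests * rate / total_yield)
        for domain, rate in yield_rates.items()
    }


# Hypothetical domains: one where 90% of pages are usable, one where 40% are.
budgets = allocate_budget(13_000, {"a.example": 0.9, "b.example": 0.4})
print(budgets)
```

A proportional split is only one option. A production scheduler might also set per-domain floors so that low-yield domains are still sampled, or cap allocations by each site's politeness delay.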

Crawl scheduler state

Priority queue status for one e-commerce domain crawl job.

domain flipkart.com

frontier.size 14,280 URLs

priority.product high · 8,440 URLs

priority.category medium · 4,210 URLs

priority.pagination low · 1,630 URLs

dedup.seen 92,441 URLs

budget.remaining 28,000 req · today

politeness.delay 1.5s effective


// 07 — FAQ

Common questions.

About frontier management, crawl budget, politeness rules, handling JavaScript navigation, and how DataFlirt achieves high coverage without triggering rate limits.


What's the difference between a crawler and a scraper?

A crawler discovers URLs. A scraper extracts data from URLs. Most production pipelines do both — crawl to find every product page, scrape to extract the product data from each one. The distinction matters architecturally: crawl logic and extraction logic have different failure modes and should be separate components with separate monitoring.

How do I crawl sites that use JavaScript for navigation?

You need to either execute the JavaScript (headless browser) or reverse-engineer the underlying API calls the JS is making. Most modern SPAs make XHR or fetch calls to JSON APIs to load navigation — intercept those in browser devtools. Direct API access is dramatically faster than rendering every page just to extract navigation links.

What is crawl budget and why does it matter?

Crawl budget is the number of pages you can fetch from a target before hitting rate limits, IP blocks, or your own cost ceiling. Managing it means prioritising high-value pages (product detail, pricing) over low-value ones (pagination, static content), deduplicating aggressively, and not re-fetching pages that haven't changed since the last crawl.

What does 'politeness' mean in crawling and do I have to follow it?

Politeness means respecting the crawl delay specified in robots.txt and not sending requests faster than the server can handle them. Ignoring it gets you blocked — not for ethical reasons but practical ones. A server that sees 50 req/s from one IP will rate-limit it within seconds. Politeness is also the right thing to do for shared infrastructure.
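A per-host rate limiter is one way to enforce this. The sketch below is an assumption-laden illustration: the class name, the default delay, and the injectable `clock` and `sleep` parameters (there so the behaviour can be tested without real waiting) are all choices made for the example.

```python
import time
from typing import Callable
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(
        self,
        default_delay: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._default = default_delay
        self._clock = clock
        self._sleep = sleep
        self._delays: dict[str, float] = {}        # per-host overrides, e.g. from robots.txt
        self._next_allowed: dict[str, float] = {}  # earliest time each host may be hit again

    def set_delay(self, host: str, seconds: float) -> None:
        self._delays[host] = seconds

    def wait(self, url: str) -> float:
        """Block until the URL's host may be requested; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        pause = max(0.0, ready_at - now)
        if pause > 0:
            self._sleep(pause)
        delay = self._delays.get(host, self._default)
        self._next_allowed[host] = max(now, ready_at) + delay
        return pause


limiter = HostRateLimiter(default_delay=1.0)
limiter.set_delay("shop.example", 2.0)  # hypothetical crawl delay for this host
print(limiter.wait("https://shop.example/a"))  # first request to a host: no wait
```

Calling `wait(url)` immediately before each fetch keeps requests to one host spaced out while leaving other hosts unaffected, so a multi-domain crawl does not stall on its slowest target.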

How do I handle duplicate URLs and avoid crawling the same page twice?

Normalise each URL to a canonical form before adding it to the frontier: strip tracking parameters (utm_*, ref=), normalise trailing slashes, resolve redirects, and lowercase the host. Store a hash of the normalised URL in a seen-set (a Redis SET works well at scale). Check at enqueue time, not at fetch time: a duplicate that reaches the queue has already cost you frontier space and scheduling work.
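A minimal version of that normalise-then-hash check might look like the following. It uses an in-memory Python set as a stand-in for the Redis SET, skips redirect resolution because that needs a network round trip, and treats trailing slashes as insignificant, which is an assumption that does not hold on every site.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"ref"}


def normalise(url: str) -> str:
    """Canonicalise a URL: drop tracking params, trailing slash, and host case."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"   # keep the root path as "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class SeenSet:
    """In-memory stand-in for a shared seen-set such as a Redis SET."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before (and record it)."""
        digest = hashlib.sha256(normalise(url).encode("utf-8")).hexdigest()
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True


seen = SeenSet()
first = seen.add_if_new("https://Shop.Example/products/42/?utm_source=mail&id=7")
second = seen.add_if_new("https://shop.example/products/42?id=7&ref=home")
print(first, second)
```

The second URL is rejected because both variants normalise to the same string. With Redis, the reply from `SADD` (1 for a new member, 0 for an existing one) gives the same check-and-record step in a single round trip.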

How does DataFlirt handle crawl coverage guarantees?

We define scope explicitly before each crawl job — URL patterns, depth limits, and exclusion rules. Coverage is measured as fetched URLs vs the estimated scope size, tracked per run. When coverage drops below 95%, the job is flagged for review. We also run periodic full re-crawls to catch pages that were missed in incremental runs due to scope boundary edge cases.
