
What is a Focused Crawler?

Focused crawling is a crawl strategy that restricts traversal to pages topically relevant to a predefined domain of interest — ignoring all outbound links that score below a relevance threshold. Instead of mapping the entire web graph, it stays on-topic: a crawler targeting product listings never follows the "About us" or "Press" links that would waste quota and dilute your dataset. For data pipelines, it's the difference between 10k relevant records and 10k records of noise.

Crawling · Relevance Scoring · Link Budget · Focused Traversal · Topical Crawl
// 02 — definitions

Stay on topic.

How a crawler decides, at every link, whether to follow or discard — and why getting that decision wrong burns quota on pages that will never land in your dataset.


TL;DR

A focused crawler scores every discovered URL against a topic model before fetching it. Links below the threshold are dropped. This keeps crawl depth manageable and keeps your dataset clean without post-hoc filtering. The classifier is typically a naive Bayes model or a lightweight embedding-similarity check, not a full LLM call per URL. DataFlirt uses content-hash caching to avoid re-scoring pages we've already classified.

01 Definition & structure
A focused crawler has two components working in tandem: a frontier evaluator that scores URLs before fetching them, and a topic model that defines what "relevant" means for this crawl. The topic model is built from seed pages you supply — the crawler embeds them and uses the centroid as the relevance target. The frontier evaluator scores every discovered link against that centroid and only queues links that exceed the threshold θ. Everything below threshold is dropped without a fetch, preserving your quota for pages that will actually yield data.
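The two components described above can be sketched in a few lines. This is a minimal illustration, not DataFlirt's implementation: `embed` is a toy bag-of-tokens function standing in for a real embedding model, and `centroid`, `cosine`, `should_enqueue`, and the seed URLs are hypothetical names chosen for the example. Because the embedding is a toy, the θ used here is not comparable to thresholds tuned against real sentence embeddings.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Mapping


def embed(text: str) -> Counter:
    """Toy embedding: token counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def centroid(vectors: List[Mapping[str, float]]) -> Dict[str, float]:
    """Average the seed vectors component-wise to form the topic vector."""
    totals: Dict[str, float] = {}
    for vec in vectors:
        for token, weight in vec.items():
            totals[token] = totals.get(token, 0.0) + weight
    return {token: weight / len(vectors) for token, weight in totals.items()}


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_enqueue(url: str, topic: Mapping[str, float], theta: float) -> bool:
    """Frontier evaluator: queue the link only if it clears the threshold."""
    return cosine(embed(url), topic) >= theta


# Hypothetical seed pages, used only to build the topic centroid.
SEEDS = [
    "/electronics/mobiles/iphone-15-pro",
    "/electronics/mobiles/galaxy-s24",
    "/electronics/laptops/macbook-air",
]
TOPIC = centroid([embed(seed) for seed in SEEDS])
```

With this toy setup, `should_enqueue("/electronics/mobiles/pixel-8", TOPIC, 0.4)` is true, while `/about/careers` shares no tokens with the seeds and scores 0.0, so it is dropped without a fetch.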
02 How it works in practice
The crawl starts from seed URLs. Each fetched page is parsed for outbound links. For each link, the evaluator scores relevance using URL tokens, anchor text, and (optionally) a quick fetch of the page's <title> and meta description. Links scoring at or above θ go into the priority queue, sorted by score descending, so the highest-relevance links get fetched first. Links that score below the threshold are logged and discarded. The cycle repeats until the frontier is empty or the fetch budget is exhausted.
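The loop described above might look roughly like this. It is a sketch under stated assumptions: fetching is simulated by a `get_links` callback, scoring by a `score` callback, and the function name `focused_crawl` is hypothetical. The demo scores are the similarity values shown in the link scoring trace on this page; the tiny `SITE` link graph is invented for the example.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def focused_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    theta: float,
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Return (fetched, dropped) URLs for a threshold-gated, best-first crawl."""
    seeds = list(seeds)
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched: List[str] = []
    dropped: List[str] = []

    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)  # a real crawler would issue the HTTP request here
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            relevance = score(link)
            if relevance >= theta:
                heapq.heappush(frontier, (-relevance, link))
            else:
                dropped.append(link)  # logged, never fetched
    return fetched, dropped


# Scores taken from the trace on this page; the link graph is hypothetical.
SCORES = {
    "/electronics/mobiles/iphone-15-pro": 0.91,
    "/about/careers": 0.14,
    "/electronics/mobiles?page=2": 0.88,
    "/blog/top-10-smartphones-2026": 0.69,
}
SITE = {"/electronics/mobiles": list(SCORES)}

fetched, dropped = focused_crawl(
    seeds=["/electronics/mobiles"],
    get_links=lambda url: SITE.get(url, []),
    score=lambda url: SCORES.get(url, 0.0),
    theta=0.72,
    budget=10,
)
```

A heap keeps the "highest relevance first" ordering cheap as the frontier grows, and the `seen` set prevents the same link from being scored twice.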
03 Threshold tuning & the precision/recall tradeoff
The threshold θ is the parameter that matters most. Set it too high and recall drops: you miss relevant pages that scored just below the cut. Set it too low and precision drops: you crawl noise and your dataset bloats. The standard approach is to hold out a validation set of known-relevant URLs from the seed (plus a sample of known-irrelevant URLs, since precision cannot be measured from positives alone), sweep θ from 0.5 to 0.9, and pick the value that maximises F1. This takes minutes against a seed sample and prevents hours of wasted crawl time in production.
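A threshold sweep of this kind can be sketched as follows. The function names `f1_at` and `best_threshold` are hypothetical, and the labelled `VALIDATION` scores are invented purely to make the example runnable. The sketch assumes the validation set contains irrelevant examples as well as relevant ones, because precision is undefined without negatives.

```python
from typing import List, Tuple

# (similarity score, is_relevant) pairs. Illustrative values, not real data.
VALIDATION: List[Tuple[float, bool]] = [
    (0.91, True), (0.88, True), (0.80, True), (0.74, True),
    (0.69, False), (0.60, False), (0.55, True), (0.14, False),
]


def f1_at(theta: float, scored: List[Tuple[float, bool]]) -> float:
    """F1 of the rule 'enqueue if score >= theta' on a labelled sample."""
    tp = sum(1 for s, relevant in scored if s >= theta and relevant)
    fp = sum(1 for s, relevant in scored if s >= theta and not relevant)
    fn = sum(1 for s, relevant in scored if s < theta and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scored: List[Tuple[float, bool]],
    lo: float = 0.5,
    hi: float = 0.9,
    step: float = 0.01,
) -> float:
    """Sweep theta over [lo, hi]; ties go to the lowest theta (favours recall)."""
    steps = round((hi - lo) / step)
    # Build candidates from integer steps to avoid floating-point drift.
    candidates = [round(lo + i * step, 2) for i in range(steps + 1)]
    return max(candidates, key=lambda theta: f1_at(theta, scored))


theta_star = best_threshold(VALIDATION)
```

On this sample the sweep settles on θ = 0.7, the lowest value that excludes both mid-scoring negatives while keeping four of the five positives.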
04 How DataFlirt handles it
We run a calibration pass on every new focused crawl pipeline before touching production. You give us seed URLs; we embed them, sweep thresholds, and present a precision/recall curve for your sign-off. Our frontier evaluator caches embeddings by content-hash, so paginated URLs like /category?page=47 score in under 1 ms once the pattern is cached. We also run a post-crawl audit — sampled pages from the drop list — to confirm nothing relevant was filtered out.
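The caching idea can be illustrated with a small sketch. The page does not say exactly what is hashed, so this example makes an assumption: the cache key is a SHA-256 hash of the URL with its `page` query parameter removed, which is one way paginated URLs could share a cached embedding. `EmbeddingCache`, `normalise`, and the `ignored` parameter are hypothetical names, not DataFlirt's API.

```python
import hashlib
from typing import Callable, Dict, List, Sequence
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalise(url: str, ignored: Sequence[str] = ("page",)) -> str:
    """Drop query parameters that do not change a page's topic (assumed: 'page')."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


class EmbeddingCache:
    """Memoise embeddings under a hash of the normalised input."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> List[float]:
        canonical = normalise(url)
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(canonical)  # the expensive call
        return self._store[key]


# Placeholder embedding function so the example runs without a model.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("/category?page=46")
cache.get("/category?page=47")  # served from cache: same normalised URL
```

After the two lookups above, the cache reports one miss and one hit, since both URLs normalise to `/category`.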
05 Common misconception: focused crawling is just URL allowlisting
URL allowlisting — hardcoding path patterns like /products/ — is the naive version. It breaks the moment a site restructures its URLs or uses opaque identifiers. True focused crawling uses semantic relevance, which means it discovers pages you didn't know existed and still filters them correctly. The difference matters on large catalogs: allowlist-based crawls typically miss 15–40% of product pages that don't match the expected URL pattern.
// 03 — the model

How relevance gets scored.

Focused crawlers live and die by their relevance function. Set the threshold too tight and you miss adjacent pages that would still yield data. Set it too loose and you're back to breadth-first crawling. DataFlirt's pipeline planner tunes the threshold per target domain using a seed-page sample before the crawl starts.

URL relevance score: R(u) = sim(embed(u), embed(topic)); fetch u only if R(u) ≥ θ
Only fetch u if cosine similarity to the topic vector clears threshold θ. (Chakrabarti et al., 1999, Focused Crawling)
Link budget efficiency: E = relevant_pages_fetched / total_pages_fetched
E → 1.0 means zero wasted fetches. Naive crawlers average E ≈ 0.05–0.15 on large sites. (DataFlirt pipeline benchmarks, 2026)
Classifier threshold tuning: θ* = argmax_θ F1(precision(θ), recall(θ))
Maximise F1 on a seed sample before committing to a full crawl run. (Standard IR practice)
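As a worked example of the efficiency formula, the snippet below plugs in the page counts quoted in the FAQ on this page (a 10M-page site with 50k relevant pages, 150k–300k fetches for a focused crawl, 2M+ for breadth-first). It assumes, for illustration only, that every one of the 50k relevant pages ends up being fetched; the function name `efficiency` is hypothetical.

```python
def efficiency(relevant_pages_fetched: int, total_pages_fetched: int) -> float:
    """Link budget efficiency E = relevant fetched / total fetched."""
    if total_pages_fetched == 0:
        return 0.0  # nothing fetched, so no efficiency to report
    return relevant_pages_fetched / total_pages_fetched


RELEVANT = 50_000  # assumed: all relevant pages are found

focused_best = efficiency(RELEVANT, 150_000)     # about 0.33
focused_worst = efficiency(RELEVANT, 300_000)    # about 0.17
breadth_first = efficiency(RELEVANT, 2_000_000)  # 0.025
```

Under that assumption, the focused crawl lands roughly 7 to 13 times higher on E than the breadth-first run.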
// 04 — link scoring trace

Every link scored, before it's fetched.

A trace from a focused crawl targeting e-commerce product pages on a large Indian retail site. The frontier evaluator scores each discovered URL before it enters the fetch queue.

θ = 0.72 · topic: product-listings · seed pages: 48
// frontier evaluator — batch 7
url: "/electronics/mobiles/iphone-15-pro"
embed_sim: 0.91 → ENQUEUE

url: "/about/careers"
embed_sim: 0.14 → DROP (below θ)

url: "/electronics/mobiles?page=2"
embed_sim: 0.88 → ENQUEUE

url: "/blog/top-10-smartphones-2026"
embed_sim: 0.69 → DROP (0.03 below θ)

// batch summary
urls_evaluated: 214
urls_enqueued: 61
efficiency.E: 0.285 // up from 0.11 at θ=0.50
classifier_ms_avg: 2.1 ms // cached embeddings
// 05 — relevance signals

What drives the classifier.

Focused crawlers use a stack of signals to score URLs before fetching. The stronger the signal, the more it moves the relevance score. Rankings below reflect median importance across DataFlirt's e-commerce and financial data pipelines.

Pipelines measured: 38 active
Avg efficiency (E): 0.31
Updated: 2026-05-19
01 URL path tokens: high signal · /product/ vs /about/ tells you most
02 Anchor text on the link: strong signal · the link label from the parent page
03 Parent page content: medium signal · topic of the page that contained the link
04 Domain / subdomain: weak signal · shop.example.com vs blog.example.com
05 HTTP content-type: gating signal · skip PDFs and images before scoring
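One plausible way to combine these signals is a weighted sum with the content-type check applied first as a gate. The sketch below is an assumption about how such a stack could be wired together: the weights are invented and only follow the ranking order above, and `combined_score`, `WEIGHTS`, and `SKIP_PREFIXES` are hypothetical names.

```python
from typing import Dict, Optional

# Content types gated out before any scoring happens.
SKIP_PREFIXES = ("image/", "application/pdf")

# Hypothetical weights that follow the ranking order; they sum to 1.0.
WEIGHTS = {
    "url_path": 0.4,
    "anchor_text": 0.3,
    "parent_page": 0.2,
    "domain": 0.1,
}


def combined_score(
    signals: Dict[str, float], content_type: Optional[str] = None
) -> Optional[float]:
    """Weighted sum of per-signal similarities, or None if gated out."""
    if content_type and content_type.startswith(SKIP_PREFIXES):
        return None  # skip PDFs and images without scoring them
    # Missing signals count as 0.0 rather than raising.
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


example = combined_score(
    {"url_path": 0.9, "anchor_text": 0.8, "parent_page": 0.5, "domain": 1.0}
)
```

With these weights the example link scores 0.80, and the same link served as `application/pdf` would return `None` before any similarity is computed.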
// 06 — our approach

Tight scope, zero wasted fetches.

We configure focused crawlers per-target using a seed set you provide — typically 20–50 representative URLs. Our frontier evaluator caches embeddings by content-hash so repeat URL patterns (pagination, filters) score in under 1 ms. The threshold is auto-tuned against a precision/recall curve on the seed before the first production fetch runs.

Focused crawl — live config

Runtime parameters for a focused crawl pipeline against a large marketplace.

topic.model: dataflirt-embed-v3 (cached)
threshold.θ: 0.72 (auto-tuned)
seed.pages: 48 URLs
efficiency.E: 0.31 avg (on target)
frontier.size: 12,400 URLs queued
drop.rate: 69% (noise filtered)
classifier.p99_ms: 4.2 ms (within SLO)


// 07 — FAQ

Common questions.

About focused crawling, relevance thresholds, topic models, and how DataFlirt scopes crawls before running them in production.

How do you define the topic for the crawler?
You provide seed URLs — typically 20–50 representative pages from the target. We embed them, average the vectors, and use that centroid as the topic representation. You can also provide explicit topic keywords, which we embed and blend with the seed centroid at a configurable weight.
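The blending step could be as simple as a linear interpolation between the two vectors. This is a sketch under that assumption, not a description of DataFlirt's method: `blend`, `unit`, and `keyword_weight` are hypothetical names, and both inputs are normalised to unit length first so the weight means the same thing regardless of vector magnitude.

```python
import math
from typing import List, Sequence


def unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def blend(
    seed_centroid: Sequence[float],
    keyword_vec: Sequence[float],
    keyword_weight: float,
) -> List[float]:
    """Interpolate between seed centroid and keyword embedding, then renormalise."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    if len(seed_centroid) != len(keyword_vec):
        raise ValueError("vectors must have the same dimension")
    seeds, keywords = unit(seed_centroid), unit(keyword_vec)
    mixed = [
        (1.0 - keyword_weight) * s + keyword_weight * k
        for s, k in zip(seeds, keywords)
    ]
    return unit(mixed)


topic_vector = blend([1.0, 0.0], [0.0, 1.0], keyword_weight=0.5)
```

A weight of 0.0 leaves the seed centroid untouched and 1.0 uses the keywords alone; with two orthogonal inputs, an equal blend points halfway between them.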
What happens if the threshold is too tight?
You miss pages. A common failure mode is filtering out paginated listing pages because they share few tokens with a typical product page. We catch this by running a seed-page recall check before production — if known relevant URLs score below θ, we widen the threshold and re-check.
Is focused crawling slower than breadth-first?
Per-URL, yes — there's a scoring step before each fetch. In practice the total wall time is often lower because you're fetching far fewer pages. On a 10M-page site targeting 50k relevant pages, focused crawling typically fetches 150k–300k pages vs 2M+ for breadth-first.
Do you use LLMs to score relevance?
No, not per URL. Full LLM inference per link would cost ~$0.001 per URL, which is unworkable at scale. We use cached sentence embeddings scored by cosine similarity. LLMs come in during pipeline setup for seed analysis and threshold tuning, not during the live crawl.
Can you handle sites where URLs give no topical signal?
Yes, but we fall back to fetching and scoring the page content instead of the URL alone. This increases fetch volume by roughly 30–40% but keeps classifier accuracy high. Sites with hash-based or UUID-based URLs (e.g. /p/a3f9c2b1) usually require this mode.
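Deciding when to switch modes needs some test for "this URL carries no topical signal". The heuristic below is an illustrative assumption, not DataFlirt's rule: it treats a path segment as opaque if it is a long hex run, a UUID, or a plain number, and falls back to content scoring when no segment contains a readable word. `scoring_mode`, `has_topical_signal`, and `min_word_len` are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# Segments that look like identifiers: long hex runs, numeric IDs, UUIDs.
OPAQUE_SEGMENT = re.compile(
    r"[0-9a-f]{8,}|\d+|[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def has_topical_signal(url: str, min_word_len: int = 3) -> bool:
    """True if any path segment contains a readable word of min_word_len letters."""
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    for segment in segments:
        if OPAQUE_SEGMENT.fullmatch(segment):
            continue  # identifier-like segment, no topic information
        words = re.findall(r"[a-z]+", segment.lower())
        if any(len(word) >= min_word_len for word in words):
            return True
    return False


def scoring_mode(url: str) -> str:
    """Score by URL when it is informative, otherwise fetch and score content."""
    return "url" if has_topical_signal(url) else "content"
```

For the example in the answer above, `scoring_mode("/p/a3f9c2b1")` returns `"content"`, while a descriptive path such as `/electronics/mobiles/iphone-15-pro` stays in `"url"` mode.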
What's the difference between focused crawling and a sitemap scrape?
Sitemap scrapes trust the site to tell you what's relevant — useful when the sitemap is accurate and scoped. Focused crawling discovers pages the sitemap may not list (dynamic pages, A/B variants, category sub-pages) and filters by topic rather than URL inclusion. We use both in combination where sitemaps exist.