
What is Link Extraction?

Link extraction is the parsing step that reads fetched HTML (or JavaScript-rendered DOM) and pulls out URLs to feed into the crawler's frontier — the mechanism by which a crawl expands beyond its seed set. Do it naively and your frontier fills with pagination duplicates, session tokens, and offsite noise; do it precisely and every enqueued URL is a high-probability path to content you actually want.

HTML parsing · URL discovery · Frontier population · DOM · Data
// 02 — definitions

Finding URLs
in fetched pages.

Link extraction is not just grabbing every href on the page. It's a filtering, normalizing, and scoping pipeline that decides which of the hundreds of URLs in a typical page are worth crawling at all.


TL;DR

Link extraction parses fetched HTML for anchor tags, canonical tags, sitemap references, and structured-data URLs, then normalizes them, drops out-of-scope paths, and pushes qualifying URLs to the frontier. On a JavaScript-rendered page, extraction runs against the final DOM state after script execution, not the raw HTML, so you need a real browser (or at least a DOM evaluator) to catch links injected by React, Vue, or lazy-loading scripts.

01 Definition & structure
Link extraction is a four-stage pipeline that runs on every successfully fetched page:
  • Parse — read the HTML or evaluated DOM and collect raw URL candidates from <a href>, <link rel=canonical>, JSON-LD @id fields, and any other configured sources
  • Normalize — strip fragments, sort query params, canonicalize trailing slashes, remove tracking tokens; convert relative URLs to absolute
  • Filter — apply scope rules: must match allowed patterns, must not match blocked patterns, must be within the configured host set
  • Deduplicate — check against the Bloom filter; URLs already seen or already in the frontier are dropped
What survives gets enqueued into the URL frontier with a priority score. On a typical retail page this reduces 80–150 raw links to 5–20 frontier-worthy URLs.
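The four stages above can be sketched with the Python standard library alone. This is a minimal illustration, not DataFlirt's implementation: `AnchorCollector`, `extract_links`, the prefix-based block rule, and the plain `set` standing in for the Bloom filter are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Parse stage: collect raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, allowed_hosts, blocked_prefixes, seen):
    """Parse -> normalize -> filter -> deduplicate. Returns URLs to enqueue."""
    collector = AnchorCollector()
    collector.feed(html)

    enqueue = []
    for href in collector.hrefs:
        # Normalize: resolve relative URLs against the page, drop the fragment.
        url, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlsplit(url)
        # Filter: http(s) only, allowed host, no blocked path prefix.
        if parts.scheme not in ("http", "https"):
            continue
        if parts.hostname not in allowed_hosts:
            continue
        if any(parts.path.startswith(prefix) for prefix in blocked_prefixes):
            continue
        # Deduplicate: a plain set stands in for the Bloom filter here.
        if url in seen:
            continue
        seen.add(url)
        enqueue.append(url)
    return enqueue
```

Passing the same `seen` set across pages is what makes the dedup stage work crawl-wide rather than per page.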
02 How it works in practice
The crawler's fetch worker returns HTML to the extraction pipeline immediately after a successful response. For static HTML, an lxml or BeautifulSoup parse takes ~5–20ms per page. For JavaScript-rendered pages, the Playwright instance that fetched the page evaluates document.querySelectorAll('a[href]') on the final DOM state; no second browser launch is needed since the page is already loaded. Extracted URLs flow into the normalizer synchronously, then into the scope filter, then into the Bloom check. The Bloom check is not the bottleneck at scale: even at 100M URLs, a well-sized filter runs in ~1µs per lookup, which is negligible compared to parse time.
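To make "well-sized" concrete, here is a small Bloom filter sketch that derives its bit count and hash count from the standard sizing formulas, m = −n·ln(p)/(ln 2)² and k = (m/n)·ln 2. The class name, the SHA-256 double-hashing scheme, and the 1% target rate in the usage note are assumptions for illustration; a production filter would likely use a faster non-cryptographic hash.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter sized from expected item count and target false-positive rate."""

    def __init__(self, n_items, target_fpr):
        # m = -n * ln(p) / (ln 2)^2 bits; k = (m / n) * ln 2 hash functions.
        self.n_bits = math.ceil(-n_items * math.log(target_fpr) / math.log(2) ** 2)
        self.n_hashes = max(1, round(self.n_bits / n_items * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))
```

With these formulas, a 1% false-positive target works out to roughly 9.6 bits and 7 hash functions per expected URL, whatever the total count.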
03 Static HTML vs. JavaScript-rendered extraction
The gap between static and rendered extraction is significant on modern retail sites. A product grid page on a React SPA may return raw HTML with zero product hrefs — the links are injected after DOMContentLoaded by JavaScript. In those cases:
  • Static extraction yields: navigation links, footer links, and script tags — none of which are product URLs
  • Playwright extraction yields: the full rendered product grid, lazy-loaded images with data-href attributes, and pagination controls
The cost is ~2–4 seconds of render overhead per page. For catalog-wide link discovery, that compounds — which is why we route JS-rendered sites through a dedicated Playwright pool and static sites through a lightweight HTTP client pool, never mixing them.
04 How DataFlirt handles it
We configure extraction per-target, not globally. Each pipeline specifies: which DOM sources to extract from, which URL patterns to allow and block, and whether to use the static or browser extraction path. Scope filters are compiled to regex automata at pipeline startup — not evaluated per-URL as strings, which would be 100x slower at volume. We track extraction precision as an SLO metric: if the 7-day rolling precision for any pipeline drops below 0.50 (more than half of enqueued URLs producing no useful data), it triggers a filter rule review. Most precision drops trace to new site sections the block-list doesn't cover yet.
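A rough sketch of the compile-once idea: allow and block patterns are each joined into a single alternation and compiled at startup, so the per-URL cost is two regex searches. `compile_scope` is a hypothetical helper, and the patterns are borrowed from the stage config shown later on this page. Note that Python's `re` is a backtracking engine rather than an automaton-based one, so this illustrates the precompile step, not the exact matcher.

```python
import re


def compile_scope(allow_patterns, block_patterns):
    """Compile allow/block lists once; return a fast per-path predicate."""
    allow_re = re.compile("|".join(f"(?:{p})" for p in allow_patterns))
    block_re = re.compile("|".join(f"(?:{p})" for p in block_patterns))

    def in_scope(path):
        # In scope: matches at least one allow pattern and no block pattern.
        return bool(allow_re.search(path)) and not block_re.search(path)

    return in_scope


# Built once at pipeline startup, then reused for every extracted URL.
in_scope = compile_scope(
    allow_patterns=[r"/p/", r"/product/", r"/item/"],
    block_patterns=[r"/cart/", r"/login/", r"/compare/"],
)
```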
05 The session token trap
One of the most expensive extraction mistakes is failing to strip session tokens from extracted URLs before the Bloom check. A site that appends ?sid=abc123 to every internal link will cause the Bloom filter to treat each session variant as a unique URL — you'll crawl the same product page hundreds of times, once per unique session token, before the dedup layer catches on. The fix is a URL normalization rule that strips known session parameter names before the Bloom check. We maintain a shared blocklist of ~80 common session and tracking parameters that are stripped unconditionally across all pipelines.
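A minimal sketch of such a stripping rule, using only `urllib.parse`. The four parameter names in `SESSION_PARAMS` are a hypothetical stand-in for the shared blocklist described above, not its actual contents.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical short list; the real blocklist is described as ~80 entries.
SESSION_PARAMS = frozenset({"sid", "sessionid", "phpsessid", "jsessionid"})


def strip_session_params(url, blocked=SESSION_PARAMS):
    """Drop blocked query parameters so session variants collapse to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in blocked
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two links that differ only in `sid` now produce the same string, so the Bloom check sees them as one URL.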
// 03 — the math

How extraction shapes
frontier growth.

These three relationships govern how link extraction affects crawl efficiency. The key insight is that extraction precision matters as much as recall — low-precision extraction wastes budget on useless URLs.

Frontier growth rate: F' = F + (L_extracted · P_pass) − L_dedup
F = current frontier size. L_extracted = raw links extracted. P_pass = fraction passing scope filter. L_dedup = already-seen URLs filtered out. (Standard crawler design)
Extraction precision: Prec = relevant URLs extracted / total URLs extracted
Low precision means the frontier fills with noise. Target Prec > 0.6 for focused product crawls. (IR evaluation framework)
Crawl amplification factor: A = avg_links_per_page · P_pass
A > 1 means the frontier grows faster than you fetch. A < 1 means the crawl will exhaust itself, so check your filter rules. (Internal DataFlirt metric)
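The three relationships translate directly into small functions. The function names and the numbers in the usage note are illustrative assumptions, not measurements from a real pipeline.

```python
def next_frontier_size(frontier, links_extracted, p_pass, links_dedup):
    """F' = F + (L_extracted * P_pass) - L_dedup."""
    return frontier + links_extracted * p_pass - links_dedup


def extraction_precision(relevant_urls, total_urls):
    """Prec = relevant URLs extracted / total URLs extracted."""
    return relevant_urls / total_urls if total_urls else 0.0


def amplification_factor(avg_links_per_page, p_pass):
    """A = avg_links_per_page * P_pass."""
    return avg_links_per_page * p_pass
```

For example, a page with 100 raw links and a 10% pass rate gives A = 10, so each fetch adds about ten candidates before dedup. A frontier of 1,000 that sees 2 of those candidates rejected as already seen grows to about 1,008 under the first formula.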
// 04 — what extraction produces

One page parsed,
URLs scored and filtered.

A trace from our extraction pipeline processing a single product listing page. 94 raw links found; 11 pass the scope filter and reach the frontier.

lxml parser · URL normalizer · scope filter
// raw extraction from DOM
page: "https://shop.example.in/category/electronics/"
links_found_raw: 94

// normalization pass
after_fragment_strip: 91 // 3 were #anchor-only
after_dedup_on_page: 74 // 17 duplicate hrefs
after_canonicalize: 68 // trailing slash + query param sort

// scope filter
offsite_dropped: 34 // external domains
blocked_pattern_dropped: 23 // /cart/ /account/ /wishlist/
already_in_frontier: 0 // Bloom says not seen

// frontier enqueue
urls_enqueued: 11 // product pages + 2 subcategory roots
extraction_precision: 0.73
// 05 — link sources

Where URLs hide
in a page.

Anchor tags are obvious. The other sources are where a lot of product URLs live — especially on modern JS-heavy retail sites that lazy-load product grids, inject canonical tags, or expose structured data with full URL references.

LINKS PER RETAIL PAGE · 60–200 raw
PASS RATE (product crawl) · 5–20%
JS-INJECTED LINKS · 20–60% on SPAs
01

Anchor tags (href)

highest volume · Navigation, pagination, product links — all mixed together
02

Canonical link tags

dedup signal · Points to the preferred URL — use for normalization, not just extraction
03

JSON-LD structured data

high precision · @type:Product URLs are almost always product detail pages
04

Lazy-loaded JS href injection

requires browser · React/Vue render product grids after scroll — invisible to HTTP clients
05

Sitemap references in HTML

structural · Some sites embed sitemap URLs in footer or head tags
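Of the sources above, JSON-LD is the least obvious to extract. A stdlib-only sketch follows: it collects `application/ld+json` script blocks, then walks the parsed JSON looking for nodes whose `@type` is `Product`. The class and function names are hypothetical. As simplifications, it assumes `@type` is a plain string (schema.org also allows a list) and that the product URL lives in `url` or `@id`.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed block: skip it rather than fail the page.


def product_urls(html):
    """Return URLs of every @type: Product node found in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    urls, stack = [], list(collector.blocks)
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            stack.extend(node)
        elif isinstance(node, dict):
            if node.get("@type") == "Product":
                url = node.get("url") or node.get("@id")
                if isinstance(url, str):
                    urls.append(url)
            stack.extend(node.values())
    return urls
```

The returned URLs still need the same normalize, filter, and dedup passes as anchor hrefs.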
// 06 — our extraction stack

Parser, normalizer,

scope filter, Bloom check.

Our extraction pipeline runs in four sequential stages, each independently configurable per target. The parser is lxml for static HTML and Playwright DOM evaluation for JS-rendered pages; we never mix them on the same target because the URL sets they produce differ enough to cause dedup collisions. The scope filter is a compiled regex set built from per-target config, evaluated against normalized URLs before the Bloom check so that out-of-scope URLs never consume a Bloom lookup or a slot in the filter.

Extraction pipeline — stage config

Configuration for a recurring Indian e-commerce target with a React SPA frontend.

parser         Playwright DOM eval · JS-rendered
sources        a[href], link[rel=canonical], JSON-LD
normalizer     fragment strip, query sort, trailing slash · active
scope.allow    /p/, /product/, /item/
scope.block    /cart/, /login/, /compare/
dedup.bloom    512M bits · 0.001% FPR · active
precision_7d   0.71 avg · within SLO

// 07 — FAQ

Common
questions.

About parsing strategies, JavaScript-rendered links, URL normalization, and keeping extraction precision high on complex retail targets.

Does link extraction need a full browser for JavaScript-rendered pages?
For sites where product links are injected by React, Vue, or any framework that renders after page load, yes: you need Playwright or Puppeteer to evaluate the final DOM state. A simple HTTP client fetches the raw HTML, which may contain only a loading skeleton and zero product hrefs. The cost is ~2–4 seconds of render time per page, which is why static-HTML extraction is preferred whenever the target site supports it.
What URL normalization should happen before the Bloom check?
At minimum: strip fragment identifiers, lowercase scheme and host, sort query parameters, canonicalize trailing slashes, and decode percent-encoded unreserved characters. Missing any of these means the same product URL appears as multiple distinct entries in your Bloom filter, wasting both dedup budget and crawl budget re-fetching the same pages. We also strip known tracking parameters (utm_source, ref, gclid) as part of normalization, before the Bloom check.
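A minimal normalizer covering most of that checklist, built on `urllib.parse`. Several choices here are assumptions for the sketch: trailing slashes are stripped (the opposite policy is equally valid if applied consistently), default ports and userinfo are dropped, and path percent-decoding is left out. `TRACKING_PARAMS` holds only the three names mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = frozenset({"utm_source", "ref", "gclid"})
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical string form of url for use as a dedup key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops userinfo and port
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path.rstrip("/") or "/"  # assumed policy: no trailing slash
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    # An empty fragment argument drops any #anchor from the result.
    return urlunsplit((scheme, host, path, urlencode(query), ""))
```

Because `parse_qsl` decodes and `urlencode` re-encodes, query values also end up with one consistent percent-encoding.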
How do I prevent link extraction from filling the frontier with pagination URLs?
Pattern-based scope filtering. Block URL patterns that match pagination conventions: /page/\d+, \?page=, \?p=, \?offset=. For sites with non-standard pagination, add a depth cap per host-queue — after 3 hops from a category root, stop following next-page links. This alone eliminates 40–60% of frontier noise on most retail targets.
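Both mechanisms can be sketched in a few lines. The regex combines the four patterns listed above. As an assumption, it also matches the parameters after `&`, since a pattern like `\?page=` alone misses `?sort=price&page=2`. `allow_next_page` is a hypothetical helper that reads "after 3 hops" as "follow while fewer than 3 hops have been taken".

```python
import re

# /page/<n>, or a page / p / offset parameter anywhere in the query string.
PAGINATION_RE = re.compile(r"/page/\d+|[?&](?:page|p|offset)=")


def is_pagination(url):
    """True if the URL matches a known pagination convention."""
    return bool(PAGINATION_RE.search(url))


def allow_next_page(hops_from_category_root, max_hops=3):
    """Depth cap for sites whose pagination does not match the patterns."""
    return hops_from_category_root < max_hops
```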
What's the difference between link extraction and sitemap parsing?
Sitemap parsing is a static, one-time (or periodic) read of an XML file the site provides; it enumerates pages the owner wants indexed. Link extraction is dynamic, running on every fetched page, discovering URLs the sitemap may not list. They're complementary: the sitemap gives you the known-good product set; link extraction finds new pages, updated category structures, and recently added products before the sitemap refreshes.
How do I handle rel=canonical correctly during extraction?
Canonical tags signal the preferred URL for a page. During extraction, use the canonical URL to normalize the current page's URL for dedup purposes — don't just extract it as a new frontier entry. If the canonical points to a different host, treat that as an out-of-scope signal, not a follow-this link. Mishandling canonical tags causes both re-crawl waste and incorrect dedup — two distinct URLs treated as one.
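One way to encode that rule is a helper that returns the key to use for dedup, plus a flag saying whether the page is still in scope. `resolve_canonical` and its return shape are assumptions for this sketch.

```python
from urllib.parse import urljoin, urlsplit


def resolve_canonical(page_url, canonical_href, allowed_hosts):
    """Return (dedup_key, in_scope) for a fetched page and its canonical href."""
    if not canonical_href:
        # No canonical tag: the page's own URL is the dedup key.
        return page_url, True
    canonical = urljoin(page_url, canonical_href)
    if urlsplit(canonical).hostname not in allowed_hosts:
        # Cross-host canonical: out-of-scope signal, and never enqueue it.
        return page_url, False
    # Same-host canonical: use it as the dedup key, not as a new frontier entry.
    return canonical, True
```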
Can link extraction introduce crawl traps, and how do you avoid them?
Yes. Calendar-based navigation (infinite date parameters), infinite scroll pagination, and URL parameters that cycle through combinations (sort × filter × page) are all crawl traps discoverable via link extraction. The fix is always a combination of URL normalization (collapse dynamic parameters), scope filtering (block known trap patterns), and per-host depth caps. Any host queue that grows past 500K URLs is a signal to review your extraction filter rules for that target.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h