
What is Anchor Text Extraction?

Anchor text extraction is the process of pulling the human-readable label from an HTML hyperlink — the text between <a> and </a> — along with its target URL, and storing both as structured fields. For scrapers, it's how you turn a page's link graph into navigable data: category hierarchies, product cross-links, pagination chains, and breadcrumb trails are all encoded in anchor text that a naive HTML-to-text dump silently discards.

Data · HTML Parsing · Link Graph · NLP · Navigation
// 02 — definitions

What the link actually says.

The anchor text is the negotiated label between the author and the reader — and a surprisingly dense signal about what the linked page contains, how the site is structured, and where your crawler should go next.


TL;DR

Anchor text extraction captures the visible link label, the href, and optionally the rel and title attributes from every <a> element on a page. Clean extraction handles multi-node anchors (icon + text, nested spans), deduplicates repeated links, separates internal from external targets, and normalises relative URLs. Skipping it means losing the site's own vocabulary for its content, which downstream NLP pipelines and link-graph traversal both depend on.

01 Definition & structure
An anchor element in HTML is <a href="...">text</a>. Extraction captures three things from every anchor node:
  • href — the raw attribute value, resolved to an absolute URL against the document base
  • text — the rendered visible label, obtained via innerText traversal (not textContent, which includes hidden nodes)
  • rel — the relationship array (nofollow, noopener, sponsored, etc.)
A fourth field, title, is extracted when present. The result is a typed link record, not a raw string.
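The typed link record described above could look roughly like the following in Python. This is a minimal sketch: the class name LinkRecord and the helper to_jsonl_line are hypothetical names chosen for illustration, not DataFlirt's actual schema code.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass(frozen=True)
class LinkRecord:
    """One extracted anchor: three core fields plus the optional title."""
    href: str                                     # absolute URL, already resolved
    text: str                                     # visible label, already normalised
    rel: List[str] = field(default_factory=list)  # e.g. ["nofollow", "noopener"]
    title: Optional[str] = None                   # None when the attribute is absent


def to_jsonl_line(record: LinkRecord) -> str:
    """Serialise one record as a single JSON line (hypothetical helper)."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Keeping rel as a list rather than a space-separated string means downstream code can test membership ("nofollow" in record.rel) without re-parsing the attribute.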
02 How it works in practice
After a page renders fully, the extractor calls document.querySelectorAll('a[href]'), iterates the NodeList, and for each node: resolves the href against document.baseURI, reads element.innerText.trim(), copies element.relList (a DOMTokenList) into an array, and drops any record where the text is empty or the href is a javascript: URI or a bare #. The result is one JSONL record per anchor, flushed to the pipeline before the browser context is closed.
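The same sequence of steps can be sketched offline with Python's standard library. This is an approximation under stated assumptions: it parses static HTML with html.parser rather than a rendered DOM, so it has no notion of CSS visibility and simply concatenates text nodes in place of innerText. The names AnchorCollector, extract_anchors and to_jsonl are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects href, text, rel and title for every <a href> in a document."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.records = []
        self._current = None  # anchor being built while inside an <a href>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:  # mirrors the a[href] selector
            return
        self._current = {
            "raw_href": href.strip(),
            "parts": [],
            "rel": (attr.get("rel") or "").split(),
            "title": attr.get("title"),
        }

    def handle_data(self, data):
        if self._current is not None:
            self._current["parts"].append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._current is None:
            return
        cur, self._current = self._current, None
        text = " ".join("".join(cur["parts"]).split())  # trim + collapse
        raw = cur["raw_href"]
        # Drop empty labels, javascript: URIs and bare "#" fragments.
        if not text or raw.lower().startswith("javascript:") or raw == "#":
            return
        self.records.append({
            "href": urljoin(self.base_url, raw),
            "text": text,
            "rel": cur["rel"],
            "title": cur["title"],
        })


def extract_anchors(html, base_url):
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.records


def to_jsonl(records):
    """One JSON object per line, in document order."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

For client-rendered pages this sketch would see only what is in the HTML it is given, so it would need the post-render markup as input rather than the initial response.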
03 Anchor text as a crawl signal
Anchor text isn't just data to collect; it's a signal for what to crawl next. A product listing page with anchors labelled "Next page" or "Page 3", pointing at hrefs like /category/shoes?page=4, tells your scheduler exactly where the continuation URLs are. Category nav anchors reveal the site's taxonomy. rel="nofollow" flags tell you which links the site explicitly excludes from its own authority graph, which is useful for deciding whether to follow them yourself.
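One way to turn these signals into a frontier ordering is sketched below. The label regex, the parameter set and the function names are assumptions made for illustration; real sites vary and would need per-site tuning, and whether to honour nofollow is a policy choice rather than a rule.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed label patterns such as "Next", "Next page", "Page 3" or a bare "3".
PAGINATION_LABEL = re.compile(r"^(next(\s+page)?|page\s+\d+|\d+)$", re.IGNORECASE)
# Assumed query parameters that commonly carry a page position.
PAGINATION_PARAMS = {"page", "offset"}


def is_pagination(link):
    """True when the label or the URL looks like a pagination step."""
    if PAGINATION_LABEL.match(link["text"].strip()):
        return True
    parsed = urlparse(link["href"])
    if PAGINATION_PARAMS & set(parse_qs(parsed.query)):
        return True
    return re.search(r"/page/\d+/?$", parsed.path) is not None


def crawl_candidates(links, follow_nofollow=False):
    """Filter by the nofollow policy, then put pagination links first."""
    allowed = [l for l in links if follow_nofollow or "nofollow" not in l["rel"]]
    # sorted() is stable, so document order is kept within each group.
    return sorted(allowed, key=lambda l: not is_pagination(l))
```

Each link here is a plain dict with href, text and rel keys, matching the record fields described earlier.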
04 How DataFlirt handles it
We run anchor extraction as a post-render step inside the browser context — not on the raw HTML. Our extractor handles icon-only anchors (dropped), mixed icon+text anchors (text-only retained), relative URL resolution (base-href aware), and deduplication across nav, body, and footer repetitions. Output is a typed link record in JSONL, delivered alongside the main page payload. For link-graph jobs, we aggregate across the full domain crawl and emit an edge list with anchor text as the edge label.
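Deduplication across nav, body, and footer repetitions could be approached as below. This is a sketch, not DataFlirt's implementation: it assumes that URL fragments should be ignored when comparing targets and that the first label seen is the one to keep, and dedupe_links is a hypothetical name.

```python
from urllib.parse import urldefrag


def dedupe_links(links):
    """Collapse records that point at the same URL, keeping document order."""
    merged = {}
    for link in links:
        # Assumption: "#top" and similar fragments do not make a new target.
        key, _fragment = urldefrag(link["href"])
        entry = merged.get(key)
        if entry is None:
            merged[key] = {**link, "href": key, "labels": [link["text"]], "count": 1}
        else:
            entry["count"] += 1
            if link["text"] not in entry["labels"]:
                entry["labels"].append(link["text"])
    return list(merged.values())
```

Keeping every distinct label alongside the count preserves the alternative wordings a site uses for the same target, which matters if the anchor text later feeds a vocabulary or graph-labelling step.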
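05 The innerText vs textContent trap
textContent returns the raw text of every descendant node, including nodes hidden with display:none or visibility:hidden, so a collapsed menu label or hidden template text inside an anchor will pollute your extracted text. innerText respects CSS visibility and display:none, giving you something much closer to what the user actually reads. One caveat: screen-reader-only labels such as <span class="sr-only">(opens in new tab)</span> are usually hidden by clipping rather than display:none, so innerText still returns them and they need an explicit filter. The difference is easy to miss in dev tools but silently corrupts 5–15% of anchor records on real e-commerce pages.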
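The gap between the two readings can be illustrated with a static-HTML sketch. A standard-library parser has no computed styles, so this version approximates visibility from explicit markers only: the hidden attribute, inline display:none or visibility:hidden, and an assumed list of screen-reader class names. It also assumes well-formed nesting. All names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumed class names for screen-reader-only text; adjust per site.
HIDDEN_CLASSES = {"sr-only", "visually-hidden"}


def _is_hidden(attrs):
    attr = dict(attrs)
    if "hidden" in attr:
        return True
    style = (attr.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        return True
    return bool(HIDDEN_CLASSES & set((attr.get("class") or "").split()))


def _squash(parts):
    return " ".join("".join(parts).split())


class LabelParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.all_text = []      # every text node, like textContent
        self.visible_text = []  # text outside hidden subtrees
        self._hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:    # void elements never get an end tag
            return
        if self._hidden_depth or _is_hidden(attrs):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        self.all_text.append(data)
        if not self._hidden_depth:
            self.visible_text.append(data)


def labels(fragment):
    """Return (raw_label, visible_label) for one anchor's markup."""
    parser = LabelParser()
    parser.feed(fragment)
    parser.close()
    return _squash(parser.all_text), _squash(parser.visible_text)
```

Comparing the two return values per anchor is a cheap way to measure how many records on a given site would be affected before choosing an extraction method.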
// 03 — the model

How link text becomes structured data.

Anchor extraction is a mapping problem, not a search problem. The three models below capture how DataFlirt's pipeline moves from raw DOM nodes to a typed, deduplicated link record ready for graph analysis or NLP.

Link record schema: Link = { href: URL, text: str, rel: str[], title: str? }
Minimum viable record: href + normalised text + rel attributes cover 95% of use cases. Source: HTML Living Standard, WHATWG
Text normalisation: text = trim(collapse_whitespace(innerText(node)))
innerText traversal handles nested spans, icon elements, and mixed-content anchors correctly. Source: HTML Living Standard, WHATWG (innerText)
Link density per page: D_link = |unique hrefs| / word_count
High D_link (> 0.15) flags navigation pages vs content pages, which is useful for crawl prioritisation. Source: DataFlirt crawler heuristics, 2025
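The text normalisation and link density formulas above translate almost directly into code. The sketch below assumes that word_count means whitespace-separated tokens in the page's body text, which the formula does not specify, and the function names are hypothetical.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def normalise_text(raw):
    """trim(collapse_whitespace(raw)): each whitespace run becomes one space."""
    return _WHITESPACE.sub(" ", raw).strip()


def link_density(hrefs, body_text):
    """D_link = |unique hrefs| / word_count; 0.0 for a page with no words."""
    words = len(body_text.split())
    return len(set(hrefs)) / words if words else 0.0


def looks_like_navigation(hrefs, body_text, threshold=0.15):
    """Apply the D_link > 0.15 heuristic from the text."""
    return link_density(hrefs, body_text) > threshold
```

For example, a page with 3 anchors pointing at 2 distinct URLs and 10 words of body text has D_link = 2 / 10 = 0.2, which is above the threshold.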
// 04 — extraction trace

One product page, every anchor dissected.

A live extraction trace from an e-commerce category page: 84 anchors, classified into four link types and normalised in a single pass.

Playwright DOM · XPath traversal · URL normalisation
edge.dataflirt.io — live
CAPTURED
// raw anchor nodes found
total_anchors: 84
empty_text_skipped: 11 // icon-only links
image_only_skipped: 6 // no accessible text

// classified by rel + href pattern
type.internal_nav: 34 // /category/, /brand/
type.product_link: 29 // /p/sku-*
type.external: 8 // rel="nofollow noopener"
type.pagination: 5 // ?page=2..6

// sample extracted records
href: "/p/mens-running-shoes-42"
text: "Men's Running Shoes"
rel: []
title: "Nike Air Zoom Pegasus 41"

// URL resolution
base_url: "https://example.com"
resolved: 67/67 relative URLs normalised
malformed: 2 javascript:void(0) discarded
// 05 — extraction factors

Where anchor extraction actually breaks down.

Anchor text extraction looks simple until you hit the DOM patterns that break naive implementations. These are the failure modes DataFlirt's parser handles explicitly, ranked by how often they corrupt production datasets.

AVG ANCHORS / PAGE: ~120
ICON-ONLY RATE: ~14%
RELATIVE HREFS: ~73%
01 Icon + text mixed anchors
~38% of links · SVG/img sibling with text node; innerText handles it, textContent doesn't
02 Relative URL resolution
~73% of hrefs · Must resolve against the document base, not the page URL
03 Duplicate href dedup
~22% overlap · Same URL linked multiple times per page (nav + body + footer)
04 javascript: and # hrefs
~9% of anchors · Non-navigable hrefs must be filtered before graph traversal
05 Lazy-rendered link lists
client-side only · Pagination and infinite scroll anchors absent from SSR HTML
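The second failure mode, resolving against the document base rather than the page URL, is easy to get subtly wrong, so a small sketch may help. It uses urljoin from Python's standard library; resolve_href is a hypothetical helper, and the base_href argument stands for the value of a <base href> element when the page has one.

```python
from urllib.parse import urljoin


def resolve_href(href, page_url, base_href=None):
    """Resolve href against the document base, which <base href> may override."""
    # The base itself may be relative, so resolve it against the page URL first.
    document_base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(document_base, href)
```

With a page at https://example.com/category/shoes/index.html, the relative href sku-1 resolves under /category/shoes/ when there is no base element, but under /p/ when the page declares a base href of /p/. Resolving against the page URL alone would silently produce the first answer in both cases.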
// 06 — our approach

DOM traversal, not regex on raw HTML.

Regex on raw HTML breaks on nested elements, encoded entities, and multiline attributes. DataFlirt's extraction layer uses a post-render DOM traversal via Playwright's evaluate API, walking every anchor node with innerText resolution and attribute normalisation before the record leaves the browser context.

anchor-extractor.config

Live extraction config for an e-commerce product listing pipeline.

renderer Playwright · Chromium 124
traversal DOM · querySelectorAll('a') · post-render
text_method innerText · whitespace collapsed
url_resolution base href aware · normalised
filter.empty icon-only, javascript:, #
rel_extraction nofollow / noopener flagged
output.format JSONL · href + text + rel + title


// 07 — FAQ

Common questions.

About anchor text extraction, link graph construction, and how DataFlirt handles the DOM edge cases that break naive implementations.

Why not just extract anchor text with a regex on the raw HTML?
Regex breaks on nested elements — an anchor with a <span> child, an icon, and a text node will confuse any pattern-based extractor. It also can't resolve relative URLs without the document's base, and it doesn't handle encoded entities or multiline attributes correctly. DOM traversal after rendering is slower but correct.
Does anchor text extraction work on JavaScript-rendered pages?
Only if you render first. Anchors injected by React, Vue, or any client-side router don't exist in the initial HTML response — you need a full Playwright or Puppeteer pass to materialise them before extraction.
How do you handle anchor text for pagination chains?
Pagination anchors are extracted and classified by URL pattern — ?page=, /page/2, ?offset= — then queued into the crawl frontier. We extract both the numeric label and the resolved absolute URL, so the downstream scheduler knows exactly which pages exist.
What's the difference between anchor text and the title attribute?
Anchor text is what the user sees and clicks. The title attribute is a tooltip — often more descriptive but not always present. Both are valuable: anchor text is the page author's chosen label for the target, while title often carries the full product name or article headline.
How do you use anchor text extraction for link graph analysis?
Each extracted link record — href + text — becomes a directed edge in the site's link graph. Anchor text is the edge label. Run it across a full domain crawl and you can reconstruct the site's category hierarchy, find orphaned pages, and identify which products get the most internal cross-linking.
Can anchor text extraction help with NLP pipelines downstream?
Yes. Anchor text is a curated vocabulary — the site's own terms for its content. It's cleaner than body text for training classifiers or building entity dictionaries. Feed it to a term-frequency pipeline and you get a fast approximation of the site's content taxonomy without touching the body copy.
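The link-graph answer above can be sketched as a labelled edge list built from per-page link records. This is an illustrative outline under simple assumptions: pages is a dict mapping each crawled URL to its extracted records, every href is already absolute, and the function names are hypothetical.

```python
from collections import Counter


def build_edges(pages):
    """Turn {source_url: [link records]} into (source, target, label) edges."""
    edges = []
    for source, links in pages.items():
        for link in links:
            edges.append((source, link["href"], link["text"]))
    return edges


def in_degree(edges):
    """Count how many edges point at each target URL."""
    return Counter(target for _source, target, _label in edges)


def orphans(pages, edges):
    """Crawled pages that no other crawled page links to."""
    linked = {target for source, target, _label in edges if source != target}
    return sorted(url for url in pages if url not in linked)
```

The in-degree counter gives the internal cross-linking ranking mentioned in the answer, and the orphan check only covers pages that were actually crawled, so its result depends on how the crawl was seeded.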