
What is HTML Scraping?

HTML scraping is the practice of fetching a web page's markup and extracting structured data from it: the HTML is parsed into a DOM tree and queried with CSS selectors or XPath, or, less robustly, matched with regex against the raw HTML string. For data pipelines, it's the foundational extraction layer: everything upstream (proxies, fingerprinting, JS rendering) exists to get you a clean HTML response, and everything downstream (parsing, dedup, delivery) depends on that response being the real page, not a bot wall.

Tags: Infrastructure · DOM · Parsing · CSS Selectors · XPath
// 02 — definitions

Parse the page.

HTML is a tree. Scraping is the art of navigating that tree reliably across sites that never intended to be navigated by machines — and that change without notice.


TL;DR

HTML scraping fetches a page and extracts data by querying the DOM. The hard part isn't parsing — parsers are solved. The hard part is getting a response that matches what a real browser renders, staying stable as sites change, and doing it at a rate the target tolerates. Most pipeline failures trace back to one of those three, not to the parser.

01 Definition & how it works

HTML scraping is a three-step operation: fetch the HTTP response, parse the HTML into a navigable tree, extract target nodes using selectors or traversal. The fetch returns bytes. The parser turns bytes into a DOM. The extractor queries the DOM for specific values.

The parser is almost never the failure point. Libraries like cheerio, lxml, BeautifulSoup, and html.parser handle malformed markup well. The failure points are: getting a response that contains the actual data (not a bot wall), and keeping selectors aligned as the site's HTML changes.
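The three steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production extractor: the `fetch` helper, the `IdTextExtractor` class, and the sample markup are assumptions made for the example, and a real pipeline would more likely use a selector-capable parser such as lxml or cheerio.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Step 1, fetch: return the raw response bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


class IdTextExtractor(HTMLParser):
    """Steps 2 and 3: walk the parsed tree and collect the text inside
    the first element whose id attribute equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text_by_id(html: str, target_id: str) -> Optional[str]:
    """Return the whitespace-normalised text of the element, or None if absent."""
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join("".join(parser.chunks).split())


# Inline sample standing in for fetch(url), so the sketch runs offline.
SAMPLE_RESPONSE = (
    b'<html><body><h1 id="productTitle"> boAt Rockerz 450 Pro </h1>'
    b'<div id="availability"><span>In stock</span></div></body></html>'
)
html_text = SAMPLE_RESPONSE.decode("utf-8")                # bytes -> text
title = extract_text_by_id(html_text, "productTitle")      # parse + extract
```

Returning `None` for a missing node, rather than an empty string, keeps "selector matched nothing" distinguishable from "field is genuinely empty" further down the pipeline.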

02 Raw HTML vs rendered HTML

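The server sends raw HTML over the wire. A browser then executes JavaScript, which may rewrite the DOM entirely before the user sees anything. For client-rendered SPAs built on React or Vue, the raw HTML is often just a loader skeleton, with all meaningful content injected client-side. Frameworks that render on the server, such as Next.js in its default modes, often do ship the content in the raw HTML, so test each target rather than assuming from the framework.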

Test for this before choosing your approach: fetch the URL with curl and check whether your target data is present. If it is, you don't need a browser. If it isn't, you need a JS-executing environment. Using a headless browser on static HTML wastes 5–10× the resources for no gain.
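The same check can be scripted. The sketch below assumes you already know a few strings that appear on the rendered page (a product title, a price) and simply tests whether they are present in the raw response; `fetch_raw_html`, `needs_browser`, and the sample markup are hypothetical names and data introduced for illustration.

```python
from typing import Dict, Sequence
from urllib.request import Request, urlopen


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Fetch the page without executing JavaScript, like a plain curl call."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; raw-html-probe)"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def probe_raw_html(raw_html: str, probes: Sequence[str]) -> Dict[str, bool]:
    """Map each expected string to whether it appears in the raw HTML."""
    return {probe: probe in raw_html for probe in probes}


def needs_browser(raw_html: str, probes: Sequence[str]) -> bool:
    """True if any expected value is missing from the raw HTML."""
    return not all(probe_raw_html(raw_html, probes).values())


# Two inline samples so the sketch runs offline.
STATIC_PAGE = '<h1 id="productTitle">boAt Rockerz 450 Pro</h1>'
SPA_SHELL = '<div id="root"></div><script src="/app.js"></script>'
PROBES = ["boAt Rockerz 450 Pro"]

static_needs_browser = needs_browser(STATIC_PAGE, PROBES)
shell_needs_browser = needs_browser(SPA_SHELL, PROBES)
```

A substring hit can also come from JSON embedded in a script tag rather than from visible markup. That still means a plain HTTP fetch is enough, but the extraction step would then parse the embedded JSON instead of querying the DOM.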

03 Selector strategies and stability

Selector stability is the primary long-term cost of HTML scraping. Prefer selectors in this order of stability:

  • Semantic IDs and data attributes: #productTitle, data-testid="price" (intentional, rarely renamed)
  • Structural semantic tags: h1, main, article (stable across redesigns)
  • Class names with semantic intent: .product-price (moderately stable)
  • Generated class names: .a3B7_x (break on every build deploy)
  • Positional XPath: //div[3]/span[2] (breaks on any layout change)

Always write a fallback selector. The first selector is optimistic; the fallback is what keeps the pipeline alive at 3am.
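A primary-plus-fallback lookup can be sketched as an ordered list of selectors tried in turn. The example uses the standard library's `xml.etree.ElementTree` and its limited XPath support, which assumes well-formed markup; real HTML would need a lenient parser such as lxml or cheerio. The `first_match` helper, the sample page, and the "In stock" value are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Sequence, Tuple


def first_match(root: ET.Element, selectors: Sequence[str]) -> Tuple[Optional[str], Optional[str]]:
    """Try selectors in order; return (text, selector_used) for the first
    selector that matches a node with non-empty text, else (None, None)."""
    for path in selectors:
        node = root.find(path)
        if node is None:
            continue
        text = " ".join("".join(node.itertext()).split())
        if text:
            return text, path
    return None, None


# Sample page where the primary stock selector no longer matches anything.
PAGE = """
<html><body>
  <h1 id="productTitle">boAt Rockerz 450 Pro</h1>
  <div id="merchantInfoFeature_feature_div"><span>In stock</span></div>
</body></html>
"""
root = ET.fromstring(PAGE)

STOCK_SELECTORS = [
    ".//*[@id='availability']/span",                 # primary (optimistic)
    ".//*[@id='merchantInfoFeature_feature_div']",   # fallback
]
stock, selector_used = first_match(root, STOCK_SELECTORS)
```

Returning the selector that actually matched is a deliberate choice: a field that is populated only through its fallback is an early drift signal worth logging, even though the record itself looks complete.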

04 How DataFlirt handles selector rot

Every pipeline we operate runs a continuous schema fingerprint check. On each run, extracted field yield is compared against the last 10 runs. A drop triggers a Slack alert to the extraction team within minutes — not hours.

We maintain primary and fallback selector pairs for every field, auto-tested on every deploy. When a site ships a breaking change, we're usually patching before the client's dashboard shows a gap. Mean time to restore yield after a structural change across our active pipelines: under 4 hours.
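Comparing each run's yield against a rolling window of recent runs could look roughly like the sketch below. It is not DataFlirt's implementation: the `YieldMonitor` class and the `max_drop` tolerance of 0.05 are assumptions chosen for illustration, and only the window of 10 runs comes from the description above.

```python
from collections import deque
from statistics import mean


class YieldMonitor:
    """Flags a run whose field yield falls too far below the recent baseline."""

    def __init__(self, window: int = 10, max_drop: float = 0.05) -> None:
        self.history = deque(maxlen=window)   # yields of recent healthy runs
        self.max_drop = max_drop              # assumed tolerance, not from the source

    def observe(self, run_yield: float) -> bool:
        """Record one run; return True if it should raise an alert."""
        alert = False
        if self.history:
            baseline = mean(self.history)
            alert = (baseline - run_yield) > self.max_drop
        if not alert:
            # Keep degraded runs out of the baseline so that a sustained
            # drop keeps alerting instead of becoming the new normal.
            self.history.append(run_yield)
        return alert


monitor = YieldMonitor(window=10, max_drop=0.05)
warmup_alerts = [monitor.observe(0.97) for _ in range(10)]
drop_alert = monitor.observe(0.80)       # selector drifted
recovered_alert = monitor.observe(0.96)  # after the patch
```

In this sketch the alert is just a boolean; wiring it to a chat or paging system is left out.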

05 The one thing naive scrapers always get wrong

They validate the scraper, not the output. A scraper that returns something for every record looks healthy in monitoring. A scraper that silently returns stale, wrong, or incomplete data because a selector drifted looks identical.

The correct monitoring target is field yield — the fraction of expected fields populated per record — not request success rate. A 200 OK with a bot-wall HTML body and a 200 OK with real product data are indistinguishable at the HTTP level. You have to validate what came back.
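The gap between the two metrics is easy to show on a toy batch. In this sketch, assuming records are plain dictionaries and treating `None`, empty strings, and empty lists as unpopulated, both requests return 200 but one is a bot wall that yields no fields. The `field_yield` helper and the sample records are hypothetical.

```python
from typing import Any, Iterable, Mapping, Sequence

EMPTY_VALUES = (None, "", [])


def field_yield(records: Iterable[Mapping[str, Any]], expected_fields: Sequence[str]) -> float:
    """Fraction of expected fields that are populated across all records."""
    records = list(records)
    if not records or not expected_fields:
        return 0.0
    populated = sum(
        1
        for record in records
        for field in expected_fields
        if record.get(field) not in EMPTY_VALUES
    )
    return populated / (len(records) * len(expected_fields))


EXPECTED = ["title", "price_inr", "rating", "stock"]
batch = [
    # Real product page.
    {"status": 200, "title": "boAt Rockerz 450 Pro", "price_inr": 1299,
     "rating": "4.1 out of 5 stars", "stock": "In stock"},
    # Bot wall served with 200 OK: the scraper "succeeded" but found nothing.
    {"status": 200, "title": None, "price_inr": None, "rating": None, "stock": None},
]

request_success_rate = sum(r["status"] == 200 for r in batch) / len(batch)
batch_yield = field_yield(batch, EXPECTED)
```

Here the request success rate is 1.0 while the field yield is 0.5, which is the signal a status-code dashboard would never surface.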

// 03 — the model

What makes a scraper reliable?

Scraper reliability is a product of three independent failure modes. DataFlirt's pipeline health score tracks all three continuously — a green pipeline needs all three above threshold simultaneously.

Parse success rate: PSR = 1 − (selector_misses + malformed_html + schema_drift) / total_requests
Target PSR > 0.97. Below 0.90 triggers schema review. (Source: DataFlirt pipeline SLO)

Extraction yield: Y = fields_extracted / (fields_expected × pages_fetched)
Yield drops before block rate rises, which makes it the first signal of selector rot. (Source: DataFlirt monitoring, 2026)

Schema drift velocity: SDV = Δselector_failures / Δtime
An SDV spike > 3× baseline triggers automated selector re-training. (Source: internal alerting SLO)
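As a worked example, the three formulas and their thresholds translate directly into a few functions. The input counts below are made up for illustration, the function names are hypothetical, and measuring SDV per hour is an assumption, since the formula only specifies a change in failures over a change in time.

```python
def parse_success_rate(selector_misses: int, malformed_html: int,
                       schema_drift: int, total_requests: int) -> float:
    """PSR = 1 - (selector_misses + malformed_html + schema_drift) / total_requests"""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1 - (selector_misses + malformed_html + schema_drift) / total_requests


def extraction_yield(fields_extracted: int, fields_expected: int, pages_fetched: int) -> float:
    """Y = fields_extracted / (fields_expected * pages_fetched)"""
    return fields_extracted / (fields_expected * pages_fetched)


def schema_drift_velocity(failures_before: int, failures_after: int, elapsed_hours: float) -> float:
    """SDV = change in selector failures / elapsed time (hours assumed here)."""
    return (failures_after - failures_before) / elapsed_hours


def psr_status(psr: float) -> str:
    """Apply the stated thresholds: target above 0.97, review below 0.90."""
    if psr > 0.97:
        return "on target"
    if psr < 0.90:
        return "schema review"
    return "below target"


def sdv_spike(sdv: float, baseline_sdv: float) -> bool:
    """True when SDV exceeds three times its baseline."""
    return sdv > 3 * baseline_sdv


# Hypothetical counts for one run of 1,000 pages with 4 expected fields each.
psr = parse_success_rate(selector_misses=12, malformed_html=3, schema_drift=5, total_requests=1000)
y = extraction_yield(fields_extracted=3880, fields_expected=4, pages_fetched=1000)
sdv = schema_drift_velocity(failures_before=10, failures_after=50, elapsed_hours=2.0)
```

With these inputs PSR is 0.98, Y is 0.97, and SDV is 20 failures per hour, which counts as a spike against a baseline of 5 per hour but not against a baseline of 10.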
// 04 — extraction trace

From HTTP response to structured record.

A single product page extraction on a retail pipeline. Shows the parse chain from raw HTML fetch through selector evaluation to the final output record.

cheerio parser · CSS selectors · retail pipeline
// fetch
request.url: "amazon.in/dp/B0CXYZ1234"
response.status: 200 OK
response.bytes: 284,912
response.type: text/html; charset=utf-8

// parse
parser: "cheerio@1.0"
dom.nodes: 4,821
selector.title: "#productTitle" // 1 match
selector.price: "span.a-price-whole" // 1 match
selector.rating: "span[data-hook=rating-out-of-text]" // 1 match
selector.stock: "#availability span" // 0 matches — schema drift

// fallback
selector.stock.fallback: "#merchantInfoFeature_feature_div" // 1 match

// output record
record.title: "boAt Rockerz 450 Pro"
record.price_inr: 1299
record.rating: "4.1 out of 5 stars"
record.yield: 0.97 // pipeline-level yield; on this page 3 of 4 fields matched primary selectors, stock came from the fallback
// 05 — failure modes

Where HTML scrapers break.

Ranked by frequency across DataFlirt's active retail and e-commerce pipelines. Selector rot and JS rendering failures account for over 70% of all extraction failures — the parser itself almost never fails.

Pipelines tracked: 300+ active
Window: 90d trailing
Updated: 2026-05-19

01 Selector rot / schema drift: site redesigns break CSS paths
02 JS-rendered content: data not in raw HTML at all
03 Bot wall / block response: CAPTCHA or empty body returned
04 Malformed / truncated HTML: incomplete transfers, gzip errors
05 Encoding / charset issues: UTF-8 vs Latin-1 mis-detection
// 06 — how DataFlirt manages selector drift

Selectors rot. Ours self-heal.

Every DataFlirt pipeline runs a parallel shadow extraction on 1% of traffic, comparing structured output against the previous run's schema fingerprint. When field yield drops below threshold, the selector is flagged and our extraction team ships a patch — typically within 4 hours. Clients see data continuity; they don't see the fire.

Selector health monitor

Live extraction health for one retail pricing pipeline.

pipeline.id: retail-pricing-IN-042
schema.version: v14
selector.title: #productTitle (stable)
selector.price: span.a-price-whole (stable)
selector.stock: #availability span (drifted)
fallback.active: yes · patched 2h ago
yield.current: 0.97
drift.alert: resolved


// 07 — FAQ

Common questions.

About parser choice, selector stability, JS rendering tradeoffs, and how DataFlirt keeps extraction yield above SLO as sites change.

Should I use CSS selectors or XPath for HTML scraping?
CSS selectors for most things: they're faster to write, easier to read, and sufficient for 90% of extraction tasks. Use XPath when you need to traverse upward in the DOM (select a parent based on a child's content) or work with XML-structured HTML. Avoid regex on raw HTML strings for anything structural, because it breaks on the first attribute reorder.

Do I need a full browser to scrape HTML?
Only if the data you need is injected by JavaScript after the initial load. Check by viewing the page source (Ctrl+U): if your target data is there, a plain HTTP fetch is enough and you don't need a browser. Headless Chrome adds 3–8× latency and cost. Use it only when the raw HTML genuinely doesn't contain the data.

How often do selectors break, and how do I handle it?
High-traffic retail sites redeploy weekly. Expect major selector breaks 2–4× per year per target, with minor attribute changes monthly. Handle it with layered fallback selectors, schema validation on every extraction run (not just spot checks), and alerting on yield drop rather than waiting for total failure.

What's the difference between raw HTML and rendered HTML?
Raw HTML is what the server sends over the wire. Rendered HTML is what the browser builds after executing JavaScript, which may add, remove, or substantially rewrite DOM nodes. For client-rendered React or Vue SPAs, the raw HTML is often just a shell with one div and a script tag, and you need a browser automation tool such as Playwright or Puppeteer to get the rendered version. Server-rendered frameworks such as Next.js often include the content in the raw HTML, so check before reaching for a browser.

Is HTML scraping legal?
Scraping publicly accessible HTML is generally lawful in most jurisdictions. In the US, hiQ v. LinkedIn (9th Circuit, 2022) held that scraping publicly accessible pages likely does not violate the Computer Fraud and Abuse Act, and the position is broadly consistent with EU and Indian law for public data. The boundaries are: don't bypass authentication, don't scrape personal data at scale without a legal basis, and don't violate a ToS in ways that cause demonstrable commercial harm.

How does DataFlirt handle sites that frequently change their HTML structure?
Shadow extraction on 1% of traffic compares field yield against the previous schema fingerprint on every pipeline run. A yield drop below 0.93 triggers automated fallback selector evaluation and a human patch review. Mean time to restore full yield after a schema change: under 4 hours across our active pipelines.