
What is Targeted Scraping?

Targeted scraping extracts a defined set of fields from a known document structure without fetching or rendering content you don't need. Where full-page scraping captures everything and parses later, targeted scraping encodes the schema upfront (specific CSS selectors, XPath expressions, or LD+JSON keys, i.e. JSON-LD served as application/ld+json) and stops as soon as those fields are resolved. The tradeoff is brittleness: a front-end redesign that moves your target element breaks extraction until selectors are updated.

Infrastructure · Data · Selectors · CSS / XPath · Schema
// 02 — definitions

Extract only
what you need.

Targeted scraping means encoding your schema upfront in selectors, and accepting that a DOM change on the target's side breaks your pipeline until you patch it.


TL;DR

Targeted scraping uses predefined CSS selectors, XPath, or structured-data keys to extract specific fields without rendering or storing the full page. It's 3–10x cheaper per record than full-page capture. The cost is selector maintenance: a template change on the target's side can silently null out fields until the null-rate alert fires and someone patches the selector.

01 Definition & structure
Targeted scraping is field-first extraction: you define the schema before you write the first line of crawler code, then build selectors that resolve exactly those fields from the DOM. The page is treated as a structured document, not a blob to capture.
  • selector chain — ordered list of expressions per field; first match wins
  • extraction layer — CSS, XPath, LD+JSON, or microdata parsers run against the fetched HTML
  • null rate monitor — tracks field-level coverage; alerts on degradation
  • fallback logic — if primary selector fails, secondary and tertiary are tried before the field is marked null
The schema is the contract between the pipeline and the downstream consumer — any change to the target's DOM that invalidates a selector breaks that contract.
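The selector chain and fallback logic described above can be sketched in a few lines. This is a minimal illustration, not DataFlirt's engine: `by_regex` is a hypothetical stand-in for a real CSS or XPath selector (used here only so the sketch has no third-party dependencies), and the sample HTML is invented.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes raw HTML and returns the field value, or None on a miss.
Extractor = Callable[[str], Optional[str]]


def by_regex(pattern: str) -> Extractor:
    """Hypothetical stand-in for a CSS/XPath selector; keeps the sketch dependency-free."""
    compiled = re.compile(pattern, re.S)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1) if match else None

    return extract


def resolve_field(html: str, chain: Sequence[Extractor]) -> Tuple[Optional[str], Optional[int]]:
    """Try each expression in priority order; return (value, level that resolved)."""
    for level, extract in enumerate(chain):
        value = extract(html)
        if value is not None and value.strip():
            return value.strip(), level  # first match wins
    return None, None  # every level missed, so the field is marked null


title_chain = [
    by_regex(r'<h1 class="pdp-title">(.*?)</h1>'),  # primary: class-based, fragile
    by_regex(r'itemprop="name"[^>]*>(.*?)<'),       # secondary: microdata attribute
]
html = '<h1 class="x9f2"><span itemprop="name">OnePlus 13 5G 512GB</span></h1>'
value, level = resolve_field(html, title_chain)
print(value, level)  # OnePlus 13 5G 512GB 1
```

Returning the level alongside the value is what makes a null rate monitor and drift alerts possible later: the record still resolves, but the pipeline knows the primary expression missed.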
02 How it works in practice
The crawler fetches the URL (via HTTP client or browser, depending on render requirements), passes the response body to the extraction layer, and runs each field's selector chain in priority order. The first expression that returns a non-null result wins. The extracted record is validated against the schema (type checks, range checks, required fields), and only valid records are written to the delivery sink. Invalid records go to a dead-letter queue for manual review. Render is skipped entirely on SSR pages — the response from httpx is passed directly to the parser, cutting per-page cost by 60–80% compared to Playwright.
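The validation step (type checks, range checks, required fields) and the dead-letter routing might look roughly like this. The schema, the bounds in `RANGES`, and the in-memory lists standing in for the delivery sink and dead-letter queue are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> (accepted types, required?)
SCHEMA = {
    "title": (str, True),
    "price": ((int, float), True),
    "rating": ((int, float), False),
}
RANGES = {"price": (0.01, 10_000_000), "rating": (0.0, 5.0)}  # assumed bounds


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the record is valid."""
    errors = []
    for name, (types, required) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if required:
                errors.append(f"{name}: required field is null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            errors.append(f"{name}: unexpected type {type(value).__name__}")
            continue
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                errors.append(f"{name}: {value} outside [{low}, {high}]")
    return errors


def route(records: List[Dict[str, Any]], sink: list, dead_letter: list) -> None:
    """Write valid records to the sink; park invalid ones with their errors for review."""
    for record in records:
        errors = validate(record)
        if errors:
            dead_letter.append({"record": record, "errors": errors})
        else:
            sink.append(record)


records = [
    {"title": "OnePlus 13 5G 512GB", "price": 12499.0, "rating": 4.4},
    {"title": "Broken page", "price": None},
    {"title": "Odd rating", "price": 10.0, "rating": 9.9},
]
sink, dead_letter = [], []
route(records, sink, dead_letter)
```

Keeping the error strings with the dead-lettered record means manual review starts from the failing check rather than from the raw HTML.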
03 Selector durability: why LD+JSON wins
LD+JSON structured data is the most durable extraction target because it's maintained by the publisher for SEO and schema.org compliance — not for your scraper. A site can rename every CSS class in a redesign without touching its LD+JSON. The downside is coverage: not every field is represented in structured data, and some sites omit or partially populate it. The practical pattern is LD+JSON first for product name, price, and rating; ARIA and data-* attributes as fallback; CSS class selectors only as a last resort with an aggressive null-rate alert threshold.
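As a rough sketch of the "LD+JSON first" step, the snippet below pulls name, price, and rating out of a schema.org Product block using only the Python standard library. The sample page is invented, and the sketch assumes a top-level Product object (or a list of objects); it does not handle `@graph` wrappers or nested types.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: treat as a miss so the fallback chain takes over


def product_fields(html: str) -> dict:
    """Return name, price, and rating from the first Product object, or {} on a miss."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        for item in block if isinstance(block, list) else [block]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offers = item.get("offers") or {}
                if isinstance(offers, list):
                    offers = offers[0] if offers else {}
                rating = item.get("aggregateRating") or {}
                return {
                    "name": item.get("name"),
                    "price": offers.get("price"),
                    "rating": rating.get("ratingValue"),
                }
    return {}


sample = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "OnePlus 13 5G 512GB",
 "offers": {"@type": "Offer", "price": "12499", "priceCurrency": "INR"},
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.4"}}
</script></head><body><h1 class="x9f2">class renamed in redesign</h1></body></html>"""
fields = product_fields(sample)
```

Note that the values come back as the publisher wrote them (here, strings), so type coercion belongs in the validation step rather than in the extractor.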
04 How DataFlirt handles it
We maintain a selector registry for each target: every field has a fallback chain with at least three expressions, and the engine logs which level resolved on each fetch. A shift in resolution level — primary failing, secondary picking up — is surfaced in our observability dashboard as a soft alert before null rates climb. For new targets, we run a mapping session against 500 sampled pages to validate selector coverage before going to production. Fields below 98% coverage at mapping time don't ship until the fallback chain is extended.
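The 98% gate at mapping time reduces to a per-field coverage count over the sampled pages. A minimal sketch, assuming each sampled page has already been extracted into a dict with `None` for unresolved fields (the function names and the synthetic sample are hypothetical):

```python
from typing import Any, Dict, List, Sequence


def mapping_coverage(records: Sequence[Dict[str, Any]], fields: Sequence[str]) -> Dict[str, float]:
    """Share of sampled pages on which each field resolved to a non-null value."""
    total = len(records)
    return {
        name: (sum(1 for r in records if r.get(name) is not None) / total if total else 0.0)
        for name in fields
    }


def fields_blocking_release(coverage: Dict[str, float], threshold: float = 0.98) -> List[str]:
    """Fields whose coverage is below the threshold and so need a longer fallback chain."""
    return sorted(name for name, share in coverage.items() if share < threshold)


# Synthetic 500-page sample: price misses on 5 pages, rating misses on 20.
sampled = [
    {"price": 12499.0 if i >= 5 else None, "rating": 4.4 if i >= 20 else None}
    for i in range(500)
]
coverage = mapping_coverage(sampled, ["price", "rating"])
blocking = fields_blocking_release(coverage)
print(coverage, blocking)
```

In this synthetic sample, price lands at 99% and passes, while rating lands at 96% and would be held back until another fallback expression is added.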
05 Did you know: targeted scraping can be slower per record
On JS-heavy single-page applications where every field is injected after the initial load, targeted scraping is paradoxically slower than full-page capture. You still have to launch a browser and wait for the target element to render — and aborting the render early once the element resolves adds orchestration complexity. On these targets, the practical approach is full-page capture with a targeted extraction pass against the stored HTML, combining the coverage guarantee of full-page with the parsing efficiency of targeted extraction.
// 03 — the selector model

How selectors
fail gracefully.

Selector reliability degrades as sites evolve. DataFlirt's selector engine uses fallback chains — multiple expressions per field — so a single DOM change doesn't null the entire record. The model below describes how field coverage is computed across a fallback chain.

Field coverage with fallback chain: $P_{\text{field}} = 1 - \prod_{i=1}^{n} (1 - p_i)$
The probability of resolving a field increases with each additional fallback selector, where $p_i$ is the probability that selector $i$ resolves. Basis: probability (product of independent failure rates).

Extraction cost vs. full-page: $C_{\text{targeted}} = C_{\text{full-page}} \times f_{\text{render}} \times f_{\text{storage}}$
Targeted extraction skips render and storage for ~60–80% of page weight on typical e-commerce targets. Source: DataFlirt pipeline benchmarks, 2026.

Selector decay rate: $D = N_{\text{breaks}} / (N_{\text{urls}} \times T_{\text{days}})$
Measures how often selectors break per URL per day, the primary maintenance cost signal. Source: internal SLO.
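To make the three formulas concrete, here is a small worked example. All input numbers are illustrative assumptions, not DataFlirt measurements, and the coverage formula only holds if selector failures are independent, which is optimistic when one redesign breaks several selectors at once.

```python
from math import prod
from typing import Sequence


def field_coverage(p: Sequence[float]) -> float:
    """P_field = 1 - prod(1 - p_i), assuming independent selector failures."""
    return 1 - prod(1 - p_i for p_i in p)


def targeted_cost(c_full_page: float, f_render: float, f_storage: float) -> float:
    """C_targeted = C_full-page * f_render * f_storage."""
    return c_full_page * f_render * f_storage


def decay_rate(n_breaks: int, n_urls: int, t_days: int) -> float:
    """D = N_breaks / (N_urls * T_days): selector breaks per URL per day."""
    return n_breaks / (n_urls * t_days)


# Three selectors that each resolve 90%, 80%, and 50% of the time on their own:
# the chain misses only when all three miss, 0.1 * 0.2 * 0.5 = 0.01.
chain_coverage = field_coverage([0.9, 0.8, 0.5])   # about 0.99

# Illustrative factors: half the render cost and 60% of the storage cost retained.
relative_cost = targeted_cost(1.0, 0.5, 0.6)       # about 0.30 of full-page cost

# 30 selector breaks across 1,000 URLs over 30 days.
breaks_per_url_day = decay_rate(30, 1_000, 30)     # about 0.001
```

The coverage example shows why even a weak last-resort selector is worth keeping: adding the 50% selector halves the residual miss rate from 2% to 1%.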
// 04 — targeted extraction trace

5 fields extracted.
No browser launched.

A targeted extraction run on a static-rendered product page using httpx + parsel. No headless browser — server-side rendering means a plain HTTP fetch is enough.

httpx 0.27 · parsel 1.9 · SSR target
// request
method: "GET"
url: "https://target.com/product/B0CX7FQKN4"
proxy: "residential_IN · ASN55836 · Reliance Jio" // clean session
response_time_ms: 310
content_length: 42,880 bytes

// selector chain: price field
selector[0]: "[data-qa='price']" // resolved: ₹12,499

// selector chain: title field
selector[0]: "h1.pdp-title" // miss — class renamed
selector[1]: "[itemprop='name']" // resolved: "OnePlus 13 5G 512GB"

// selector chain: rating
selector[0]: "script[type='application/ld+json'] > @aggregateRating" // resolved: 4.4

// extraction result
fields_resolved: 5 / 5 // full coverage
render_skipped: true
render_time_saved_ms: ~2,800
record_written_to: "s3://bucket/products/B0CX7FQKN4.json"
// 05 — selector strategies

How selectors
are ranked for resilience.

Not all selector strategies are equally durable. These are the extraction methods DataFlirt's selector engine uses, ranked by resilience to front-end changes on a typical e-commerce target.

AVG SELECTOR LIFESPAN: 47 days
FALLBACK COVERAGE: 98.2%
UPDATED: 2026-05-19
01 Structured data (LD+JSON)
most durable · publisher-maintained schema; breaks only if they remove structured data entirely

02 ARIA / data-* attributes
durable · accessibility attributes change less often than visual class names

03 microdata / itemprop
moderate · older but still common on product pages; less likely to change than CSS classes

04 CSS class selectors
fragile · design system refactors break these within weeks on actively maintained sites

05 XPath positional selectors
most fragile · any DOM restructure invalidates position-based paths; last resort only
// 06 — our approach

Fallback chains,

not single selectors — extraction that survives redesigns.

DataFlirt's selector engine never relies on a single expression per field. Every field has a fallback chain: LD+JSON first, ARIA/data attributes second, microdata third, CSS class last. The engine tries each in order and logs which level resolved — a shift in resolution level is the early warning that a front-end change is in progress. A selector that silently returns null is worse than one that throws an error.

selector-chain.config.json

Fallback selector chain for a product price field on a major Indian e-commerce target.

field: price
selector[0]: LD+JSON · offers.price (preferred)
selector[1]: [data-qa='price'] (fallback)
selector[2]: [itemprop='price']
selector[3]: .final-price__value (fragile, last resort)
null_rate.7d: 0.4% (within SLA)
pipeline.status: active · selector[0] resolving


// 07 — FAQ

Common
questions.

About targeted scraping, selector strategy, maintenance cost, and how DataFlirt keeps null rates below threshold at scale.

When should I use targeted scraping instead of full-page capture?
Use targeted scraping when you know exactly which fields you need, the target's DOM structure is reasonably stable, and render cost matters. If you're tracking 5 fields on 500,000 product pages daily, the compute difference between targeted and full-page is significant. If the schema is unknown or the site redesigns frequently, full-page with offline parsing is more durable.
How do you detect when a selector has broken?
We track null rates per field per target on a rolling 1-hour window. A null rate spike above a configurable threshold — typically 2% for price fields — triggers an alert and pauses delivery until the selector is reviewed. Silent nulls that stay below threshold are caught by the daily full-sweep comparison against the previous snapshot.
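A rolling-window null-rate check of this kind could be sketched as follows. The class name, the `min_samples` guard (added so a handful of early fetches cannot trip the alert), and the use of explicit timestamps are assumptions for illustration; the 1-hour window and 2% threshold come from the answer above.

```python
import time
from collections import deque
from typing import Optional


class NullRateMonitor:
    """Tracks the null rate of one field on one target over a rolling time window."""

    def __init__(self, window_seconds: float = 3600, threshold: float = 0.02, min_samples: int = 50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples  # assumed guard against noisy small samples
        self._events = deque()          # (timestamp, is_null), oldest first
        self._nulls = 0

    def record(self, is_null: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._events.append((now, is_null))
        self._nulls += int(is_null)
        self._evict(now)

    def _evict(self, now: float) -> None:
        """Drop events that have aged out of the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            _, was_null = self._events.popleft()
            self._nulls -= int(was_null)

    def null_rate(self) -> float:
        return self._nulls / len(self._events) if self._events else 0.0

    def should_pause(self) -> bool:
        """True when the window holds enough samples and the null rate exceeds the threshold."""
        return len(self._events) >= self.min_samples and self.null_rate() > self.threshold


monitor = NullRateMonitor()
for second in range(100):
    monitor.record(is_null=second < 3, now=float(second))  # 3 nulls in 100 fetches
pause_delivery = monitor.should_pause()                     # 3% exceeds the 2% threshold
```

Because eviction happens on every `record` call, the monitor stays constant-memory per window and recovers on its own once the patched selector starts resolving again.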
Do you need a headless browser for targeted scraping?
Only if the target field is injected by JavaScript. For server-side rendered pages, httpx or requests is enough and is dramatically faster — no browser launch, no render wait. For JS-injected content, we launch Playwright but abort the render as soon as the target element resolves, rather than waiting for full networkidle.
What's the maintenance cost of keeping selectors current?
On actively developed e-commerce sites, CSS class selectors break every 3–8 weeks on average. LD+JSON and ARIA selectors are more stable — lifespan of 6+ months is common. Our selector engine logs which level in the fallback chain resolved on each fetch, so we see drift before it causes nulls and can update the chain proactively.
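Spotting drift from resolution-level logs can be as simple as comparing how often the primary selector resolves now against a baseline. A minimal sketch, where the level logs are plain lists of integers (0 = primary) and the 10-point tolerance is an assumed value, not a DataFlirt setting:

```python
from collections import Counter
from typing import Dict, Iterable


def level_shares(levels: Iterable[int]) -> Dict[int, float]:
    """Fraction of fetches resolved at each fallback level (0 = primary)."""
    counts = Counter(levels)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()} if total else {}


def primary_drifted(baseline: Dict[int, float], current: Dict[int, float], tolerance: float = 0.10) -> bool:
    """True if the primary selector's share dropped by more than the tolerance."""
    return baseline.get(0, 0.0) - current.get(0, 0.0) > tolerance


last_week = level_shares([0] * 95 + [1] * 5)   # primary resolves 95% of fetches
today = level_shares([0] * 60 + [1] * 40)      # secondary is picking up the slack
soft_alert = primary_drifted(last_week, today)
```

In this example every record still resolves, so the null rate is unchanged, yet the drop in primary share from 95% to 60% signals that the chain is one template change away from nulls.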
Can you do targeted scraping on paginated results pages?
Yes — we extract the list of item URLs from each results page using a targeted selector, then apply targeted extraction to each item page. Pagination logic (next-page URL, cursor parameter, or offset increment) is handled by the crawler layer above the extractor.
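The split between crawler layer and extractor might be sketched like this. `fetch` and `parse_page` are injected callables (hypothetical names): in a real pipeline the first would make the HTTP request and the second would run the targeted selectors for item links and the next-page URL. Here both are faked with a dictionary so the sketch runs on its own.

```python
from typing import Any, Callable, Iterator, List, Optional, Tuple

ParsedPage = Tuple[List[str], Optional[str]]  # (item URLs, next-page URL or None)


def crawl_listing(
    start_url: str,
    fetch: Callable[[str], Any],
    parse_page: Callable[[Any], ParsedPage],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield item URLs page by page until there is no next page."""
    url: Optional[str] = start_url
    visited = set()
    # Stop on a missing next link, a pagination loop, or the page cap.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        item_urls, url = parse_page(fetch(url))
        yield from item_urls


# Fake two-page results listing standing in for real responses.
PAGES = {
    "/search?page=1": (["/product/a", "/product/b"], "/search?page=2"),
    "/search?page=2": (["/product/c"], None),
}
item_urls = list(
    crawl_listing(
        "/search?page=1",
        fetch=lambda url: PAGES[url],      # stand-in for an HTTP GET
        parse_page=lambda page: page,      # stand-in for selector extraction
    )
)
print(item_urls)  # ['/product/a', '/product/b', '/product/c']
```

The `visited` set and `max_pages` cap are defensive choices: some sites link the last page back to itself, and a results listing should never be allowed to crawl unbounded.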
How do you handle sites that obfuscate their class names?
CSS-in-JS and atomic CSS frameworks generate random or content-hashed class names that change on every build. In those cases we skip class selectors entirely and rely exclusively on structural HTML (tag hierarchy, attribute patterns), ARIA roles, LD+JSON, and text-content matching. It requires more upfront mapping work but produces selectors that survive build-time class regeneration.