
What is Data Extraction?

Data extraction is the step in a scraping pipeline where raw fetched content — HTML, JSON, XML, or binary — is parsed and transformed into structured records with typed fields. It sits between the fetch layer and the delivery layer: fetch gets you bytes, extraction gets you data. The distinction matters because extraction logic is where business value is defined, where schema drift causes silent failures, and where 80% of pipeline maintenance time is actually spent.

Data · Parsing · Schema · Transform · ETL
// 02 — definitions

Bytes in, records out.

Any pipeline can fetch a page. Extraction is what determines whether the pipeline actually produces usable data — or just a log of successful HTTP requests.


TL;DR

Data extraction transforms raw fetched content into typed, structured records. It's the most brittle layer of any scraping pipeline — selectors break, schemas drift, and APIs change format without warning. The difference between a production-grade pipeline and a weekend project is almost entirely in the extraction layer: validation, fallbacks, schema monitoring, and handling partial data gracefully.

01 Definition & where it sits in the pipeline

A scraping pipeline has three logical layers: fetch (HTTP, browser, API), extract (parse and structure), and deliver (store, transform, send). Data extraction is the middle layer. Its job is to take whatever the fetch layer returned and produce a record that matches a defined schema.

The extraction layer is responsible for: selecting target values from the parsed document, coercing them to the correct types, validating their presence and format, and handling the cases where expected fields are absent. It is not responsible for how the page was fetched or where the record goes next.
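
As a minimal sketch of those four responsibilities (select, coerce, validate, handle absence), the Python below runs a list of field specs over an already-parsed document. The `FieldSpec` and `ExtractionResult` names, the dict-shaped document, and the sample fields are illustrative assumptions, not DataFlirt's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class FieldSpec:
    name: str
    select: Callable[[dict], Optional[str]]  # pulls a raw value out of the parsed document
    coerce: Callable[[str], Any]             # converts the raw string to the target type
    required: bool = True


@dataclass
class ExtractionResult:
    record: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)


def extract(doc: dict, specs: list) -> ExtractionResult:
    result = ExtractionResult()
    for spec in specs:
        raw = spec.select(doc)
        # Absent or blank values become an explicit None, never an empty string.
        if raw is None or raw.strip() == "":
            result.record[spec.name] = None
            if spec.required:
                result.errors.append(f"{spec.name}: missing required field")
            continue
        try:
            result.record[spec.name] = spec.coerce(raw.strip())
        except (ValueError, TypeError) as exc:
            result.record[spec.name] = None
            result.errors.append(f"{spec.name}: cannot coerce {raw!r} ({exc})")
    return result


# Hypothetical field specs for a product page.
SPECS = [
    FieldSpec("title", lambda d: d.get("title"), str),
    FieldSpec("moq_mt", lambda d: d.get("moq"), lambda s: int(s.split()[0])),
    FieldSpec("seller", lambda d: d.get("seller"), str, required=False),
]

result = extract({"title": "Tata Steel H-Beam 150x75mm", "moq": "25 MT"}, SPECS)
print(result.record, result.errors)
```

The function knows nothing about how `doc` was fetched or where `result` goes next, which mirrors the layer boundary described above.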

02 Extraction from different source formats

The extraction approach depends on the source format:

  • HTML — CSS selectors or XPath on the parsed DOM. Most common, most brittle.
  • JSON APIs — path navigation (response.data[0].price). More stable than HTML, but API contracts change too.
  • XML / RSS — XPath or namespace-aware parsers. Common for product feeds and news sources.
  • PDFs — positional text extraction or table parsers. Fragile. Treat as last resort.
  • JavaScript variables — regex or AST parsing of inline <script> JSON blobs embedded in HTML.
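
Two of the formats above, JSON path navigation and inline `<script>` JSON blobs, can be sketched with the standard library alone. The helper names (`get_path`, `inline_json_blobs`), the sample payloads, and the assumption that the blob sits in a `type="application/json"` script tag are all illustrative; real pages often need a proper HTML parser rather than a regex.

```python
import json
import re
from typing import Any


def get_path(data: Any, path: list, default: Any = None) -> Any:
    """Walk nested dicts/lists by key or index; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Matches <script ... type="application/json" ...>...</script> blocks.
SCRIPT_JSON = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def inline_json_blobs(html: str) -> list:
    """Return every inline JSON blob that parses; malformed blobs are skipped."""
    blobs = []
    for match in SCRIPT_JSON.finditer(html):
        try:
            blobs.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue
    return blobs


# JSON API: the equivalent of response.data[0].price, but safe when a step is absent.
api_response = {"data": [{"price": 72400, "currency": "INR"}]}
price = get_path(api_response, ["data", 0, "price"])
missing = get_path(api_response, ["data", 3, "price"])

# Inline script blob embedded in HTML (hypothetical page).
page = '<html><script id="state" type="application/json">{"product": {"moq": 25}}</script></html>'
moq = get_path(inline_json_blobs(page)[0], ["product", "moq"])
print(price, missing, moq)
```

Returning a default instead of raising keeps a changed API contract visible as a missing field rather than a crashed worker.
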

03 Schema design and versioning

A schema defines what a valid extracted record looks like: field names, types, required vs optional, and value constraints. Without a versioned schema, extraction output drifts silently — fields get added, types change, and downstream consumers break without any upstream signal.

Version your schema. Treat a schema change as a deployment event. When the source site changes and your selectors need updating, bump the schema version and log what changed. This creates an audit trail that makes debugging downstream breakage tractable.
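
One way to make a schema change an explicit, loggable event is to keep each version as data and diff consecutive versions. The sketch below assumes hypothetical `FieldDef` and `Schema` classes and two made-up versions, `v6` and `v7`; the field changes between them are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldDef:
    py_type: type
    required: bool = True


@dataclass
class Schema:
    version: str
    fields: dict  # field name -> FieldDef

    def validate(self, record: dict) -> list:
        """Return human-readable violations; an empty list means the record is valid."""
        errors = []
        for name, spec in self.fields.items():
            value = record.get(name)
            if value is None:
                if spec.required:
                    errors.append(f"{name}: required field is missing")
            elif not isinstance(value, spec.py_type):
                errors.append(
                    f"{name}: expected {spec.py_type.__name__}, got {type(value).__name__}"
                )
        return errors


def changelog(old: Schema, new: Schema) -> list:
    """Describe what changed between two schema versions, for the audit trail."""
    prefix = f"{old.version} -> {new.version}"
    changes = []
    for name in sorted(new.fields.keys() - old.fields.keys()):
        changes.append(f"{prefix}: added {name}")
    for name in sorted(old.fields.keys() - new.fields.keys()):
        changes.append(f"{prefix}: removed {name}")
    for name in sorted(old.fields.keys() & new.fields.keys()):
        before, after = old.fields[name].py_type, new.fields[name].py_type
        if before is not after:
            changes.append(f"{prefix}: {name} changed {before.__name__} -> {after.__name__}")
    return changes


# Illustrative versions only.
SCHEMA_V6 = Schema("v6", {"title": FieldDef(str), "price": FieldDef(str)})
SCHEMA_V7 = Schema(
    "v7",
    {
        "title": FieldDef(str),
        "price": FieldDef(float),
        "price_unit": FieldDef(str, required=False),
    },
)

for line in changelog(SCHEMA_V6, SCHEMA_V7):
    print(line)
```

Logging the `changelog` output at deploy time gives downstream consumers the upstream signal the paragraph above calls for.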

04 How DataFlirt runs extraction at scale

We run schema validation on every record, not as a batch check. Each extracted record is validated against the versioned schema contract before it's written to the output store. Records that fail type checks or completeness thresholds are quarantined and flagged — never silently written as nulls, never silently dropped.

For high-volume pipelines (10M+ records/day), we use parallel extraction workers with shared selector configs stored in a central registry. A selector update deploys across all workers within 60 seconds without pipeline downtime.

05 The silent failure most pipelines miss

Type coercion failures. Consider a price field that returns "N/A" when a product is out of stock, "Price on request" when it's enterprise-only, and "₹1,299" when it's available, all from the same CSS selector. It looks like a working extractor until someone tries to compute a median price and gets a runtime error.

The fix: validate at extraction time. Every field gets a type assertion and a set of known-invalid sentinels. Anything that doesn't match goes to quarantine. The completeness metric tells you how often this happens; the type error log tells you exactly which records need attention.
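
A rough sketch of that fix for the price field is shown here: a sentinel set, a type assertion, and a quarantine list with a reason attached to each rejected record. The sentinel values come from the example above; the function names, the `reason` labels, and the rupee-only parsing are assumptions made to keep the example short.

```python
from dataclasses import dataclass, field

# Known non-numeric values this selector is expected to return sometimes.
PRICE_SENTINELS = {"n/a", "price on request"}


@dataclass
class Outcome:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def assert_rupee_amount(raw: str) -> float:
    """Type assertion: raises ValueError unless raw is a plain rupee amount."""
    cleaned = raw.replace("₹", "").replace(",", "").strip()
    return float(cleaned)


def route(records: list) -> Outcome:
    """Accept records with a numeric price; quarantine everything else with a reason."""
    outcome = Outcome()
    for record in records:
        raw = record.get("price")
        if raw is None:
            outcome.quarantined.append({**record, "reason": "missing"})
        elif raw.strip().lower() in PRICE_SENTINELS:
            outcome.quarantined.append({**record, "reason": "known_sentinel"})
        else:
            try:
                price = assert_rupee_amount(raw)
            except ValueError:
                outcome.quarantined.append({**record, "reason": "type_error"})
            else:
                outcome.accepted.append({**record, "price": price})
    return outcome


outcome = route(
    [
        {"sku": "a", "price": "₹1,299"},
        {"sku": "b", "price": "N/A"},
        {"sku": "c", "price": "Price on request"},
        {"sku": "d", "price": "call us"},
    ]
)
print(len(outcome.accepted), [q["reason"] for q in outcome.quarantined])
```

Separating `known_sentinel` from `type_error` matters in practice: the first is expected behaviour of the source, the second is a value nobody anticipated and is the one worth reviewing.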

// 03 — the model

What extraction quality looks like.

Extraction quality is measured across three dimensions simultaneously. A pipeline can have high completeness but low accuracy (returning stale cached values), or high accuracy but low consistency (same field extracted differently across pages). DataFlirt tracks all three per pipeline, per field.

Completeness: C = fields_present / (fields_expected × records)
Missing fields are worse than wrong fields: they're invisible failures. (DataFlirt extraction SLO)

Accuracy (via spot-check): A = 1 − (incorrect_values / spot_checked_records)
Spot-check 1% of records against source of truth per pipeline run. (Internal QA process)

Consistency: K = 1 − (σ_field_format / μ_field_format)
Price as string vs number across records breaks every downstream aggregation. (Schema validation layer)
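
The three formulas translate directly into code. This is a sketch under stated assumptions: a field counts as present when its value is not `None`, and because the text does not define how a field's format is quantified, `consistency` simply takes a list of numeric format scores supplied by the caller (a hypothetical input) and applies the σ/μ ratio to it.

```python
from statistics import mean, pstdev


def completeness(records: list, expected_fields: list) -> float:
    """C = fields_present / (fields_expected * records)."""
    if not records or not expected_fields:
        return 0.0
    present = sum(
        1 for record in records for name in expected_fields if record.get(name) is not None
    )
    return present / (len(expected_fields) * len(records))


def accuracy(incorrect_values: int, spot_checked_records: int) -> float:
    """A = 1 - incorrect_values / spot_checked_records."""
    if spot_checked_records <= 0:
        raise ValueError("need at least one spot-checked record")
    return 1 - incorrect_values / spot_checked_records


def consistency(format_scores: list) -> float:
    """K = 1 - sigma / mu over caller-supplied numeric format scores."""
    mu = mean(format_scores)
    if mu == 0:
        raise ValueError("mean format score is zero; K is undefined")
    return 1 - pstdev(format_scores) / mu


sample = [
    {"title": "a", "price": 1.0, "seller": "x"},
    {"title": "b", "price": None, "seller": "y"},
]
c = completeness(sample, ["title", "price", "seller"])  # 5 of 6 fields present
a = accuracy(incorrect_values=2, spot_checked_records=100)
k = consistency([1, 1, 1, 1])  # identical scores give zero spread
print(round(c, 3), round(a, 3), k)
```
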

// 04 — extraction pipeline trace

Raw HTML to delivered dataset.

End-to-end trace of one extraction job on a B2B pricing pipeline. Shows the transform steps from raw response through field validation to the output record written to S3.

JSON output · field validation · S3 delivery
// input
source.type: "html"
source.bytes: 184,320

// parse
dom.title: extracted "Tata Steel H-Beam 150x75mm"
dom.price: extracted "₹72,400/MT"
dom.moq: extracted "25 MT"
dom.seller: missing // selector drifted — fallback triggered
dom.seller.fallback: extracted "IndiaMART Direct"

// transform
price.raw: "₹72,400/MT"
price.numeric: 72400
price.currency: "INR"
price.unit: "MT"

// validate
schema.completeness: 1.0
schema.version: "v7" // match
output.destination: "s3://df-client-042/raw/2026-05-19/"
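
The seller fallback in the trace above can be approximated with an ordered selector chain. This sketch uses `xml.etree.ElementTree` and its limited XPath support only so that it runs on the standard library; it requires well-formed markup, and a real pipeline would more likely use an HTML parser. The selectors and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

# Ordered from preferred to last resort (hypothetical selectors).
SELLER_SELECTORS = [
    ".//span[@class='seller-name']",  # primary
    ".//div[@class='vendor']/a",      # fallback
]


def first_match(root: ET.Element, selectors: list) -> tuple:
    """Return (text, index of the selector that matched), or (None, None)."""
    for index, xpath in enumerate(selectors):
        node = root.find(xpath)
        if node is not None and node.text and node.text.strip():
            return node.text.strip(), index
    return None, None


# The primary selector no longer matches this page; the fallback does.
page = ET.fromstring(
    "<html><body>"
    "<h1>Tata Steel H-Beam 150x75mm</h1>"
    "<div class='vendor'><a href='#'>IndiaMART Direct</a></div>"
    "</body></html>"
)

seller, used = first_match(page, SELLER_SELECTORS)
print(seller, used)
```

A non-zero index is worth logging: it means the primary selector has drifted even though the record is still complete.
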

// 05 — failure modes

Where extraction jobs go wrong.

Ranked by share of extraction failures across DataFlirt's active pipelines. Schema drift is the dominant failure — it's slow, silent, and usually not caught until a downstream consumer notices the data is wrong.

Pipelines monitored: 300+ active
Schema checks: per run
Updated: 2026-05-19

01 Schema drift / selector rot · Site changes without notice
02 Type coercion failures · String-as-number, date formats
03 Missing optional fields · Conditional DOM sections absent
04 Encoding / normalisation · Unicode, whitespace, currency symbols
05 Pagination / multi-page state · Incomplete record across page split

// 06 — DataFlirt's extraction architecture

Extract once, validate continuously.

DataFlirt's extraction layer runs schema validation on every record, not just on pipeline startup. Field types, value ranges, and completeness are checked against a versioned schema contract. When a field fails validation, the record is quarantined and flagged — never silently dropped and never written to the client's dataset with a null that shouldn't be there.

Extraction job health

Live status of one extraction job mid-run on a manufacturing data pipeline.

job.id: extract-mfg-IN-017
records.processed: 12,441
completeness: 0.994
type_errors: 3 records
quarantined: 3 records · pending review
schema.version: v7 (current)
output.written: 12,438 records


// 07 — FAQ

Common questions.

About extraction architecture, schema management, type handling, and how DataFlirt maintains data quality across pipelines that change without warning.

What's the difference between data extraction and web scraping?
Web scraping is the full pipeline — fetch, extract, store, deliver. Data extraction is specifically the step that parses fetched content and produces structured records. You can extract from HTML, JSON APIs, PDFs, CSV files, or XML feeds. Scraping usually implies HTTP fetching as the source; extraction is agnostic to source format.
How do I handle fields that are sometimes present and sometimes missing?
Model them explicitly as optional in your schema with a known sentinel value — null for absent, not an empty string, not zero. Then track missing-rate per field over time. A field that's missing 2% of the time is expected; the same field missing 40% of the time is a selector failure masquerading as data. Those look identical if you don't track the rate.
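
Tracking the missing rate per field takes only a few lines. In this sketch the function names, the baseline rates, and the 10-point tolerance are assumptions chosen for illustration; the point is that the rate is compared against what is normal for that field.

```python
from collections import Counter


def missing_rates(records: list, optional_fields: list) -> dict:
    """Fraction of records in which each optional field is None or absent."""
    missing = Counter()
    for record in records:
        for name in optional_fields:
            if record.get(name) is None:
                missing[name] += 1
    total = len(records)
    return {name: (missing[name] / total if total else 0.0) for name in optional_fields}


def drifted_fields(rates: dict, baseline: dict, tolerance: float = 0.10) -> list:
    """Fields whose missing rate exceeds their usual baseline by more than tolerance."""
    return sorted(
        name for name, rate in rates.items() if rate - baseline.get(name, 0.0) > tolerance
    )


# Ten records from one run: seller is absent in four of them.
run = [{"seller": None if i < 4 else "IndiaMART Direct", "moq": "25 MT"} for i in range(10)]

rates = missing_rates(run, ["seller", "moq"])
alerts = drifted_fields(rates, baseline={"seller": 0.02, "moq": 0.0})
print(rates, alerts)
```
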
What's the right way to handle price strings like '₹72,400/MT'?
Parse them at extraction time, not downstream. Strip currency symbols, remove locale-specific thousand separators, normalise decimal separators, extract the unit of measure as a separate field. Storing "₹72,400/MT" as a string and letting the analytics team deal with it is a guarantee of inconsistency. Type coercion failures are the second most common extraction failure mode.
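
A minimal sketch of that parsing step follows. It assumes the price starts with one of three currency symbols and optionally ends in a `/unit` suffix; the symbol table, the regex, and the `Price` tuple are illustrative, and the thousands and decimal separators are passed in because they depend on the source's locale.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import NamedTuple, Optional

# Small illustrative symbol table; extend per source.
CURRENCY_SYMBOLS = {"₹": "INR", "$": "USD", "€": "EUR"}


class Price(NamedTuple):
    amount: Decimal
    currency: str
    unit: Optional[str]


PRICE_RE = re.compile(
    r"^\s*(?P<symbol>[₹$€])\s*(?P<number>[\d.,]+)\s*(?:/\s*(?P<unit>\w+))?\s*$"
)


def parse_price(raw: str, thousands: str = ",", decimal: str = ".") -> Price:
    """Split a price string into amount, ISO currency code, and unit of measure."""
    match = PRICE_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognised price format: {raw!r}")
    # Drop thousands separators first, then normalise the decimal separator to ".".
    number = match["number"].replace(thousands, "")
    if decimal != ".":
        number = number.replace(decimal, ".")
    try:
        amount = Decimal(number)
    except InvalidOperation:
        raise ValueError(f"unparseable amount in {raw!r}") from None
    return Price(amount, CURRENCY_SYMBOLS[match["symbol"]], match["unit"])


print(parse_price("₹72,400/MT"))
print(parse_price("€1.299,50", thousands=".", decimal=","))
```

`Decimal` is used instead of `float` so that amounts survive aggregation without binary rounding surprises; anything that raises `ValueError` is a candidate for quarantine rather than a silent null.
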
How often should I validate extraction output?
Every run, not spot checks. Schema validation — field presence, type correctness, value range sanity — should run on every record. Save manual spot-checks for accuracy validation (is the extracted value actually what the page shows). Completeness and type errors are automatable; accuracy requires human judgment but can be sampled at 1%.
Is it better to extract everything and filter later, or extract only what you need?
Extract what you need, with defined schema, and quarantine everything that fails validation. Extracting everything creates unbounded schema debt — you end up with 200-column datasets where 160 columns are noise and 40 are the actual data. Define the schema first, extract to it, version it when it changes.
How does DataFlirt handle schema changes on a live pipeline?
We version every schema contract. When a site change breaks a selector, our monitoring catches the completeness drop within minutes. We patch the selector, bump the schema version, and backfill any records that were affected during the gap. Clients receive a changelog entry; their data contract doesn't change.