
What is Data Lineage?

Data lineage is the complete historical record of a dataset's lifecycle — mapping its origin, transformations, and movements from raw extraction to final delivery. In scraping pipelines, it answers the critical question of why a specific field holds a specific value. Without lineage, debugging a downstream anomaly means guessing which extraction rule, type coercion, or deduplication step corrupted the record. With it, you have a deterministic audit trail that makes schema drift and parsing errors immediately traceable.

Data Engineering · Provenance · Audit Trail · dbt · Pipeline Observability
// 02 — definitions

Trace the mutation.

The map of how raw HTML bytes become structured business intelligence, and why blind data is a liability.


TL;DR

Data lineage tracks the flow of data from source to destination. It records the exact timestamp, scraper version, raw payload, and transformation logic applied to every field. In production, it's the difference between knowing a price is wrong and knowing exactly which CSS selector update broke it.

01 Definition & structure

Data lineage is the comprehensive map of a dataset's lifecycle. It records the origin of the data, what happens to it, and where it moves over time. In a modern data stack, lineage is typically represented as a Directed Acyclic Graph (DAG) where nodes are datasets or tables, and edges are the transformation jobs that move data between them.

A robust lineage system tracks:

  • Origin: The exact source URL, API endpoint, or raw file.
  • Transformations: Type coercions, string manipulations, and joins.
  • Movement: Which systems the data passed through (e.g., Kafka → S3 → Snowflake).
  • State: The schema version and scraper version active at the time of capture.
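
The four tracked properties can be carried together in one record type. The following is a minimal sketch in Python, assuming a hypothetical `LineageRecord` class; the class and field names are illustrative and are not DataFlirt's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Hypothetical container for origin, transformations, movement, and state."""
    origin: str                  # source URL, API endpoint, or raw file
    schema_version: str          # state: schema active at capture time
    scraper_version: str         # state: scraper active at capture time
    transformations: list = field(default_factory=list)  # ordered transform names
    movement: list = field(default_factory=list)         # systems passed through

    def add_transform(self, name: str) -> None:
        self.transformations.append(name)

    def moved_to(self, system: str) -> None:
        self.movement.append(system)


record = LineageRecord(
    origin="https://target.com/p/123",
    schema_version="v4.1.0",
    scraper_version="v2.4.1",
)
record.add_transform("strip_currency")
record.add_transform("cast_numeric")
for system in ("Kafka", "S3", "Snowflake"):
    record.moved_to(system)

print(" -> ".join(record.movement))
```

Keeping transformations and movement as ordered lists preserves the sequence in which steps were applied, which is what makes the record replayable during an audit.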

02 How it works in practice

In a scraping context, lineage starts the moment an HTTP response is received. The raw HTML is saved to an object store, and a unique job_id is generated. As the HTML is parsed, the extraction script appends metadata to the structured record. When the record is loaded into a data warehouse, tools like dbt take over, tracking every SQL transformation applied to the raw data to produce the final analytical tables. If a dashboard shows a sudden spike in prices, the data engineer follows the lineage graph backward from the dashboard, to the dbt model, to the raw table, and finally to the exact HTML payload to see if the target site changed its layout.
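
The backward walk from dashboard to raw payload can be sketched as a graph traversal. This example assumes a hypothetical `UPSTREAM` mapping with made-up node names; a real deployment would read these edges from dbt artifacts or a data catalog rather than a hard-coded dict.

```python
from collections import deque

# Hypothetical lineage graph: each node maps to the upstream nodes it is built from.
UPSTREAM = {
    "dashboard.price_trends": ["dbt.fct_prices"],
    "dbt.fct_prices": ["dbt.stg_prices"],
    "dbt.stg_prices": ["raw.prices"],
    "raw.prices": ["s3://raw-html/job-123.html.gz"],
}


def trace_back(node: str, upstream: dict) -> list:
    """Return every ancestor of `node`, nearest first (breadth-first walk)."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


path = trace_back("dashboard.price_trends", UPSTREAM)
for hop in path:
    print(hop)
```

The last element of `path` is the raw HTML payload, which is where the engineer checks whether the target site changed its layout.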

03 Table vs. Column-level lineage

Table-level lineage is a high-level view: it tells you that table_B is built using data from table_A. This is useful for impact analysis (e.g., "If I drop table_A, what breaks?").

Column-level lineage is granular: it tells you that table_B.discounted_price is derived specifically from table_A.raw_price minus table_A.discount_amount. For scraping pipelines, column-level lineage is critical because schema drift usually affects individual fields (like a changed CSS class for a price tag) rather than entire pages.
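
The relationship between the two views can be illustrated with a small sketch: column-level edges carry enough detail to derive the table-level view, but not the reverse. The `COLUMN_LINEAGE` structure and both helper functions are hypothetical, written only to mirror the `table_A` / `table_B` example above.

```python
# Hypothetical column-level lineage: derived column -> (expression, source columns).
COLUMN_LINEAGE = {
    "table_B.discounted_price": (
        "raw_price - discount_amount",
        ["table_A.raw_price", "table_A.discount_amount"],
    ),
}


def table_level(column_lineage: dict) -> dict:
    """Collapse column-level edges into table-level edges (target -> source tables)."""
    edges = {}
    for target, (_expr, sources) in column_lineage.items():
        target_table = target.split(".")[0]
        for source in sources:
            edges.setdefault(target_table, set()).add(source.split(".")[0])
    return edges


def columns_affected_by(source_column: str, column_lineage: dict) -> list:
    """List derived columns that read from `source_column`."""
    return sorted(
        target
        for target, (_expr, sources) in column_lineage.items()
        if source_column in sources
    )


print(table_level(COLUMN_LINEAGE))
print(columns_affected_by("table_A.raw_price", COLUMN_LINEAGE))
```

If the selector feeding `table_A.raw_price` drifts, `columns_affected_by` names the exact downstream field at risk instead of flagging the whole table.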

04 How DataFlirt handles it

We build lineage into the pipeline architecture from day one. Every extraction job generates a unique run ID. The raw HTTP response is gzipped and stored in S3, keyed by that run ID. The extracted JSON records include a metadata block containing the run ID, the schema version, and the scraper version. When we deliver data to your warehouse, this metadata is preserved. If you ever question a data point, you have the run metadata and the raw source material needed to audit our extraction logic.
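
A rough sketch of this pattern, not DataFlirt's implementation: the in-memory `OBJECT_STORE` dict stands in for S3, and the key layout, the `_lineage` block name, and the SHA-256 digest field are all assumptions added for illustration.

```python
import gzip
import hashlib
import json
import uuid

# In-memory dict standing in for an S3 bucket (assumption for this sketch).
OBJECT_STORE = {}


def store_raw(run_id: str, body: bytes) -> str:
    """Gzip the raw response and store it under a key derived from the run ID."""
    key = f"raw/{run_id}.html.gz"  # hypothetical key layout
    OBJECT_STORE[key] = gzip.compress(body)
    return key


def with_lineage(record, run_id, raw_key, body, schema_version, scraper_version):
    """Return a copy of the record with a lineage metadata block attached."""
    return {
        **record,
        "_lineage": {
            "run_id": run_id,
            "schema_version": schema_version,
            "scraper_version": scraper_version,
            "raw_key": raw_key,
            # Assumption: a content digest, so the payload can be verified later.
            "raw_sha256": hashlib.sha256(body).hexdigest(),
        },
    }


run_id = f"run_{uuid.uuid4().hex[:8]}"
html = b'<span class="price-tag">\xe2\x82\xb91,299.00</span>'
key = store_raw(run_id, html)
delivered = with_lineage({"price": "1299.00"}, run_id, key, html, "v4.1.0", "v2.4.1")

# Audit step: the stored payload must decompress to bytes matching the digest.
restored = gzip.decompress(OBJECT_STORE[delivered["_lineage"]["raw_key"]])
verified = hashlib.sha256(restored).hexdigest() == delivered["_lineage"]["raw_sha256"]
print(json.dumps(delivered["_lineage"], indent=2))
```

Storing a digest alongside the pointer is one way to detect a payload that was overwritten or truncated after capture.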

05 The compliance mandate

Beyond debugging, data lineage is often a practical requirement for regulatory compliance. Under frameworks like GDPR and CCPA, organizations must be able to demonstrate where personal data originated and how it propagated through their systems. If a user requests deletion, you must trace their data through every downstream table and cache. Without automated lineage, compliance audits become manual, error-prone forensic investigations.
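
Tracing a deletion request is the forward-direction counterpart of debugging. A minimal sketch, assuming a hypothetical `DOWNSTREAM` edge map with invented dataset names:

```python
# Hypothetical downstream edges: dataset -> datasets built from it.
DOWNSTREAM = {
    "raw.profiles": ["stg.profiles"],
    "stg.profiles": ["mart.contacts", "cache.profile_lookup"],
    "mart.contacts": ["export.crm_feed"],
}


def deletion_targets(source: str, downstream: dict) -> list:
    """Every dataset that may hold data derived from `source`, including itself."""
    visited = set()
    stack = [source]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(downstream.get(node, []))
    return sorted(visited)


targets = deletion_targets("raw.profiles", DOWNSTREAM)
for dataset in targets:
    print(dataset)
```

The returned list is only as complete as the edge map, so any copy made outside the recorded graph would be missed by this walk.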

// 03 — lineage metrics

Measuring traceability.

Lineage isn't just a graph; it's a measurable property of a pipeline. DataFlirt tracks lineage completeness to ensure every delivered record can be audited back to its raw HTTP response.

Lineage Completeness: C = traced_attributes / total_attributes
A score of 1.0 means every field maps to a known source and transform. (Data Governance Standard)

Mean Time to Root Cause: MTTR = t_identified - t_reported
High-fidelity lineage reduces MTTR from days to minutes. (DataOps SLO)

Transformation Depth: D = Σ transform_steps / record_count
Measures the complexity of the ETL/ELT pipeline per record. (Pipeline Observability)
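
The three metrics can be worked through with made-up numbers. This sketch uses hypothetical function names and sample values; averaging the time to root cause over several incidents is an assumption, since the formula above is stated for a single incident.

```python
from datetime import datetime


def lineage_completeness(traced_attributes: int, total_attributes: int) -> float:
    """C = traced_attributes / total_attributes."""
    return traced_attributes / total_attributes


def mean_time_to_root_cause(incidents: list) -> float:
    """Average of (t_identified - t_reported) in minutes over all incidents."""
    minutes = [
        (identified - reported).total_seconds() / 60
        for reported, identified in incidents
    ]
    return sum(minutes) / len(minutes)


def transformation_depth(steps_per_record: list) -> float:
    """D = sum of transform steps / record count."""
    return sum(steps_per_record) / len(steps_per_record)


# Sample values, invented for illustration.
c = lineage_completeness(18, 20)
incidents = [
    (datetime(2026, 5, 19, 9, 0), datetime(2026, 5, 19, 9, 12)),
    (datetime(2026, 5, 19, 14, 0), datetime(2026, 5, 19, 14, 8)),
]
mttr = mean_time_to_root_cause(incidents)
d = transformation_depth([3, 3, 4, 2])

print(c, mttr, d)
```

With these inputs, 18 of 20 attributes are traced, the two incidents took 12 and 8 minutes to diagnose, and the four records passed through 12 transform steps in total.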
// 04 — lineage trace

From raw DOM to Snowflake.

A column-level lineage trace for a single price field, showing the exact journey from a scraped HTML node to a delivered integer.

column-level · dbt-core · audit log
// 1. source fetch
job.id: "scrape-in-042"
source.url: "https://target.com/p/123"
raw.bytes: 142,048

// 2. extraction
node.selector: ".price-tag"
extracted.raw: "₹1,299.00"
scraper.version: "v2.4.1"

// 3. transformation (dbt)
model: "stg_prices"
step.strip_currency: "1,299.00"
step.cast_numeric: 1299.00
step.currency_code: "INR"

// 4. delivery
sink: "snowflake.raw_db.prices"
status: 200 OK
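
The transformation stage of this trace can be approximated in a few lines. In the trace the steps run inside a dbt model; the Python below is a stand-in that reproduces the same intermediate values, and `trace_price` and `CURRENCY_SYMBOLS` are hypothetical names.

```python
from decimal import Decimal

# Hypothetical symbol table; only the currency from the trace is included.
CURRENCY_SYMBOLS = {"₹": "INR"}


def trace_price(raw: str) -> dict:
    """Apply each transform step and keep every intermediate value."""
    symbol, amount = raw[0], raw[1:]
    if symbol not in CURRENCY_SYMBOLS:
        raise ValueError(f"unknown currency symbol: {symbol!r}")
    return {
        "extracted.raw": raw,
        "step.strip_currency": amount,
        # Decimal keeps the two decimal places exactly, unlike float.
        "step.cast_numeric": Decimal(amount.replace(",", "")),
        "step.currency_code": CURRENCY_SYMBOLS[symbol],
    }


trace = trace_price("₹1,299.00")
print(trace["step.cast_numeric"], trace["step.currency_code"])
```

Recording each intermediate value, rather than only the final number, is what lets an auditor see which step changed the data.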
// 05 — lineage gaps

Where the trail goes cold.

The most common points of failure in data lineage tracking across scraping pipelines. When these links break, data becomes untrusted.

PIPELINES AUDITED · 180+ active
TRACE DEPTH · column-level
UPDATED · 2026-05-19

01 Undocumented manual overrides
   silent failure · Ad-hoc scripts fixing data outside the DAG

02 Missing raw payload retention
   audit failure · Unable to verify if the source or the parser was wrong

03 Opaque type coercions
   transform gap · Implicit casting dropping precision silently

04 Unversioned extraction schemas
   schema drift · Selectors change without bumping the contract version

05 Cross-system ID mismatches
   join failure · Losing the primary key between extraction and warehouse
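
The opaque type coercion gap is easy to reproduce. The sketch below shows one possible guard, using a hypothetical `checked_cast_to_int` helper that refuses any cast that would drop a fractional part instead of truncating silently.

```python
from decimal import Decimal


def checked_cast_to_int(value: str) -> int:
    """Cast to int, but raise if the cast would lose precision."""
    exact = Decimal(value)
    as_int = int(exact)  # int() truncates toward zero without warning
    if Decimal(as_int) != exact:
        raise ValueError(f"lossy cast: {value!r} -> {as_int}")
    return as_int


print(checked_cast_to_int("1299.00"))

try:
    checked_cast_to_int("1299.99")
except ValueError as err:
    print(err)
```

Failing loudly at the cast turns a silent precision loss into an event that can be logged against the record's lineage.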
// 06 — DataFlirt's lineage graph

Every field audited, back to the exact HTTP response.

We treat data lineage as a first-class deliverable. Every record DataFlirt pushes to your warehouse includes metadata linking it to the specific extraction job, the schema version active at the time, and a pointer to the raw HTML payload in cold storage. If a downstream model flags an anomaly, you don't have to guess — you can query the exact state of the target site at the millisecond the data was captured.

lineage_metadata.json

Standard metadata payload appended to every delivered record.

record.id            rec_8f92a1b
job.run_id           run_0942
schema.version       v4.1.0
raw_payload.s3_uri   s3://df-cold/2026/05/19/raw.gz
transform.hash       a7d8e9f
lineage.status       fully-traced
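
A consumer of this payload might want to confirm that every lineage field is present before trusting a record. The check below is a sketch only: the `REQUIRED_KEYS` set and the `lineage_status` function are hypothetical, and treating a record as "fully-traced" when all five fields are non-empty is an assumption about what that status means.

```python
# Keys taken from the payload above; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "record.id",
    "job.run_id",
    "schema.version",
    "raw_payload.s3_uri",
    "transform.hash",
}


def lineage_status(metadata: dict) -> tuple:
    """Return (status, missing_keys) for a record's lineage metadata."""
    missing = sorted(key for key in REQUIRED_KEYS if not metadata.get(key))
    status = "fully-traced" if not missing else "incomplete"
    return status, missing


metadata = {
    "record.id": "rec_8f92a1b",
    "job.run_id": "run_0942",
    "schema.version": "v4.1.0",
    "raw_payload.s3_uri": "s3://df-cold/2026/05/19/raw.gz",
    "transform.hash": "a7d8e9f",
}

print(lineage_status(metadata))
```

Running the same check with `raw_payload.s3_uri` removed returns the "incomplete" status together with the name of the missing key.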


// 07 — FAQ

Common questions.

About data lineage, provenance, debugging pipelines, and how DataFlirt ensures traceability at scale.

What is the difference between data lineage and data provenance?
They are closely related but distinct. Data provenance focuses on the origin — where the data came from and who created it. Data lineage encompasses provenance but also maps the entire journey: every transformation, movement, and state change from origin to the final destination. Provenance is the birth certificate; lineage is the full biography.
Why do I need column-level lineage for scraped data?
Table-level lineage tells you that a table was populated by a specific scraper. Column-level lineage tells you that the price_usd column was derived from the .price-main CSS selector, stripped of the currency symbol, and multiplied by an exchange rate table. When a single field breaks, column-level lineage isolates the exact transform step responsible.
How long should we retain raw HTML for lineage purposes?
For active pipelines, retaining raw payloads for 7 to 30 days is standard. This provides a sufficient window to detect downstream anomalies, trace them back to the source, and replay the extraction with fixed selectors if necessary. Cold storage in S3/GCS makes this highly cost-effective.
How does DataFlirt expose lineage to clients?
We append a metadata object to every delivered record containing the job ID, timestamp, schema version, and a pointer to the raw payload. For enterprise clients, we also provide dbt-compatible lineage graphs that integrate directly into your existing data catalog (such as Atlan or DataHub).
Does tracking lineage slow down the extraction pipeline?
Not if architected correctly. Lineage metadata is generated passively during the extraction and transformation phases. The only overhead is writing the raw payloads to cold storage and appending a few bytes of metadata to the final JSON/Parquet records. The debugging time saved vastly outweighs the negligible compute cost.
How does lineage help with GDPR/CCPA compliance?
Compliance requires knowing exactly what personal data you hold, where it came from, and how it is used. Lineage provides the audit trail to prove that a specific record was scraped from a public directory on a specific date, under a specific legal basis, and tracks where that data propagated within your internal systems.