
What is Column-Level Lineage?

Column-level lineage is the granular tracking of data flow from a source system field through every transformation, join, and aggregation down to the final destination column. In scraping pipelines, it is the audit trail that proves a specific price or product attribute in your data warehouse originated from a specific DOM element on a target site, rather than a hallucinated fallback or a stale cache. When downstream analytics break, column-level lineage tells you exactly which upstream extraction rule failed.

Data Governance · ETL · Data Provenance · Schema Tracking · Audit Trail
// 02 — definitions

Trace the bytes.

Table-level lineage tells you a pipeline ran. Column-level lineage tells you exactly where the anomalous price in row 42 came from.


TL;DR

Column-level lineage maps the exact path of a single attribute from raw extraction to final delivery. Without it, debugging a downstream data anomaly requires reverse-engineering the entire ETL pipeline. Modern data stacks use tools like dbt and DataHub to maintain this mapping, ensuring compliance and accelerating root-cause analysis when schemas drift.

01 Definition & structure
Column-level lineage is a metadata graph that maps the exact journey of a specific data attribute from its origin to its final destination. While table-level lineage shows macro dependencies, column-level lineage parses SQL, Python, and extraction configurations to link individual fields. It answers the question: "If I change this specific CSS selector in the scraper, which specific columns in the executive dashboard will break?"
02 How it works in practice
Modern data stacks generate lineage by parsing transformation code into an abstract syntax tree (AST) and walking it. When a dbt model runs SELECT price AS usd_price, the parser records an edge from the upstream price column to usd_price. In a scraping context, the lineage graph must extend beyond the warehouse: the extraction engine must log the exact DOM node, API JSON path, or regex group that populated the raw landing table, bridging the gap between the public web and the internal data model.
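The alias case can be sketched in a few lines. This is a deliberately simplified stand-in, not a real AST parser: it uses a regular expression and assumes a single-table SELECT whose projections are bare columns or `column AS alias`. The function name `extract_column_edges` and the model names `raw_products` and `stg_products` are hypothetical.

```python
import re

# Matches "<source_column> AS <alias>" for one projection item.
ALIAS_RE = re.compile(r"(\w+)\s+AS\s+(\w+)", re.IGNORECASE)


def extract_column_edges(sql: str, source: str, target: str) -> list[tuple[str, str]]:
    """Return (upstream, downstream) column pairs for simple projections."""
    # Only look at the projection list between SELECT and FROM.
    match = re.search(r"SELECT\s+(.*?)\s+FROM\s", sql, re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    edges = []
    for item in match.group(1).split(","):
        item = item.strip()
        alias = ALIAS_RE.fullmatch(item)
        if alias:
            src_col, dst_col = alias.groups()
        elif re.fullmatch(r"\w+", item):
            # Bare column: the name passes through unchanged.
            src_col = dst_col = item
        else:
            # Expressions, function calls, and "*" need a real parser; skip them.
            continue
        edges.append((f"{source}.{src_col}", f"{target}.{dst_col}"))
    return edges


edges = extract_column_edges(
    "SELECT price AS usd_price, sku FROM raw_products",
    source="raw_products",
    target="stg_products",
)
print(edges)
```

Anything the sketch cannot resolve is skipped rather than guessed, which is the behaviour you want from a lineage parser: a missing edge is visible, a wrong edge is not.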
03 The compliance mandate
For enterprise data teams, column-level lineage is not just a debugging tool; it is often what makes regulatory obligations practical to meet. Under GDPR and CCPA, if a scraped dataset inadvertently captures PII (like an author's email address), you must be able to locate every downstream system that ingested that specific column to execute a deletion request. Without column-level lineage, compliance means dropping entire tables or relying on manual audits.
04 How DataFlirt handles it
We treat provenance as a first-class data type. Every record delivered by DataFlirt includes a metadata sidecar detailing the extraction schema version, the specific selector used for each field, and the timestamp of the fetch. We integrate directly with enterprise data catalogs (like DataHub and Alation) so your data engineering team can trace a warehouse anomaly straight back to our extraction logs without leaving their native tooling.
05 The "SELECT *" anti-pattern
The fastest way to destroy a lineage graph is using SELECT * in your transformation layer. Because the columns are not explicitly named in the code, static parsers cannot map the dependencies. When the upstream scraper adds or removes a field, the downstream view breaks silently or propagates garbage data. Explicit column declaration is the foundational rule of maintaining a healthy lineage graph.
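One way to enforce explicit column declaration is a lint step that flags star projections before they reach the warehouse. The sketch below is a regex heuristic, not a SQL parser: it assumes model SQL is available as plain strings, it ignores comments and string literals, and the model names are hypothetical.

```python
import re

# Matches "SELECT *", "SELECT DISTINCT *", "SELECT t.*", and ", t.*".
# It does not match COUNT(*) or arithmetic such as "price * rate".
STAR_RE = re.compile(
    r"(?:\bSELECT\s+(?:DISTINCT\s+)?|,\s*)(?:\w+\.)?\*",
    re.IGNORECASE,
)


def find_select_star(models: dict[str, str]) -> list[str]:
    """Return the names of models whose SQL contains a star projection."""
    return sorted(name for name, sql in models.items() if STAR_RE.search(sql))


models = {
    "stg_products": "SELECT sku, raw_price FROM raw.product_scrapes",
    "stg_all": "SELECT * FROM raw.product_scrapes",
    "fct_counts": "SELECT sku, COUNT(*) AS n FROM stg_products GROUP BY sku",
    "stg_join": "SELECT p.sku, r.* FROM products p JOIN rates r ON p.ccy = r.ccy",
}
offenders = find_select_star(models)
print(offenders)
```

A check like this can run in CI so that a star projection fails the build instead of silently removing edges from the lineage graph.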
// 03 — the lineage model

How complex is the graph?

Lineage complexity scales non-linearly with the number of transformations. DataFlirt tracks the exact provenance of every delivered attribute to guarantee auditability from the warehouse back to the raw HTML.

Lineage Depth: D = transformations + system_hops
The number of steps a field takes from source to destination. (Data Engineering standard)
Blast Radius: R = Σ downstream_dependencies(c)
The number of dashboards or models that break when column c changes. (Data Governance metric)
Provenance Confidence: C = mapped_columns / total_columns
DataFlirt maintains C = 1.0 for all managed extraction pipelines. (DataFlirt internal SLO)
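The three metrics can be computed from a lineage graph with very little code. The sketch below makes two assumptions that the definitions leave open: the graph is stored as a mapping from each column to its direct downstream columns, and blast radius counts every distinct column reachable downstream, not only direct children. The graph contents and column names are hypothetical.

```python
# Hypothetical lineage graph: each column maps to its direct downstream columns.
DOWNSTREAM = {
    "raw.product_scrapes.raw_price": ["stg_products.price"],
    "stg_products.price": ["fct_pricing.usd_price"],
    "fct_pricing.usd_price": ["dash_exec.avg_price", "ml_features.price_usd"],
}


def lineage_depth(transformations: int, system_hops: int) -> int:
    """D = transformations + system_hops."""
    return transformations + system_hops


def blast_radius(column: str) -> int:
    """R: number of distinct columns reachable downstream of `column`."""
    seen: set[str] = set()
    stack = [column]
    while stack:
        for child in DOWNSTREAM.get(stack.pop(), []):
            if child not in seen:  # skip repeats so shared paths count once
                seen.add(child)
                stack.append(child)
    return len(seen)


def provenance_confidence(mapped_columns: int, total_columns: int) -> float:
    """C = mapped_columns / total_columns."""
    if total_columns <= 0:
        raise ValueError("total_columns must be positive")
    return mapped_columns / total_columns


print(lineage_depth(transformations=2, system_hops=1))
print(blast_radius("raw.product_scrapes.raw_price"))
print(provenance_confidence(mapped_columns=50, total_columns=50))
```

Counting reachable columns rather than summing per-node dependency counts avoids double counting when two paths converge on the same dashboard column.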
// 04 — lineage trace

Tracing a price back to the DOM.

A downstream consumer flagged an anomalous price. The lineage graph traces the delivered column back through the warehouse, the transform layer, and the raw extraction job.

dbt-core · Snowflake · DataFlirt Extraction
// Querying lineage for fct_pricing.usd_price
target: "snowflake.analytics.fct_pricing.usd_price"

// Step 1: dbt transformation
upstream_model: "stg_products"
transform: "CAST(raw_price AS DECIMAL(10,2)) * exchange_rate"

// Step 2: Raw landing zone
upstream_table: "s3_raw.product_scrapes_v7.raw_price"
ingest_job: "df-load-042"

// Step 3: DataFlirt extraction payload
pipeline_id: "extract-mfg-IN-017"
selector_used: "div.price-box > span.current-price"
raw_value: "₹72,400/MT"
timestamp: "2026-05-19T08:14:22Z"

// Conclusion
root_cause: Currency symbol changed on target site, breaking regex
action: Update extraction schema v8
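The root cause in this trace, a parsing rule that stops matching after the currency symbol changes, can be reproduced in isolation. The trace does not show the actual regex, so both patterns below are hypothetical: `PRICE_V7` stands in for a rule that only accepts dollar prices, and `PRICE_V8` for an updated rule that accepts several symbols and an optional unit suffix.

```python
import re
from decimal import Decimal

# Hypothetical older rule: expects a dollar-denominated price only.
PRICE_V7 = re.compile(r"^\$(?P<amount>[\d,]+(?:\.\d+)?)$")
# Hypothetical updated rule: accepts a currency symbol and an optional unit.
PRICE_V8 = re.compile(
    r"^(?P<symbol>[$₹€£])(?P<amount>[\d,]+(?:\.\d+)?)(?:/(?P<unit>\w+))?$"
)


def parse_price(raw_value: str, pattern: re.Pattern) -> dict:
    """Parse a raw price string, failing loudly when the rule does not match."""
    match = pattern.match(raw_value.strip())
    if match is None:
        raise ValueError(f"extraction rule did not match raw_value={raw_value!r}")
    fields = match.groupdict()
    # Strip thousands separators before converting to a decimal amount.
    fields["amount"] = Decimal(fields["amount"].replace(",", ""))
    return fields


try:
    parse_price("₹72,400/MT", PRICE_V7)
except ValueError as exc:
    print("v7 rule failed:", exc)

print(parse_price("₹72,400/MT", PRICE_V8))
```

Raising on a non-match, instead of returning a default, is what lets the failure surface in the lineage trace rather than as a silently wrong price downstream.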
// 05 — lineage breakage

Where the graph goes dark.

Ranked by frequency of lineage loss across enterprise data pipelines. The most common breaks happen at the boundaries between systems where metadata is not passed along.

PIPELINES AUDITED · 1,200+
GRAPH COMPLETENESS · Industry avg 62%
UPDATED · 2026-05-19
01 SELECT * views: Obscures explicit column dependencies in SQL
02 Undocumented Python scripts: Pandas transformations outside the DAG
03 JSON flattening: Dynamic keys create untrackable columns
04 Manual CSV uploads: Breaks the automated provenance chain
05 Cross-system API calls: Enrichment steps without metadata logging
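The JSON flattening break in particular has a simple mitigation: flatten only declared paths, and record the path next to each column. The sketch below assumes a nested record shape and field names that are purely illustrative; `COLUMN_PATHS` and `flatten_with_lineage` are hypothetical names.

```python
from typing import Any

# Explicit column -> JSON path mapping. Keys not listed here never become columns.
COLUMN_PATHS = {
    "sku": ("product", "sku"),
    "raw_price": ("product", "pricing", "current"),
    "seller_name": ("seller", "name"),
}


def flatten_with_lineage(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, str]]:
    """Flatten only the declared paths and return (row, lineage) side by side."""
    row: dict[str, Any] = {}
    lineage: dict[str, str] = {}
    for column, path in COLUMN_PATHS.items():
        value: Any = record
        for key in path:
            # A missing key yields None rather than a new, untracked column.
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
        lineage[column] = "$." + ".".join(path)
    return row, lineage


record = {
    "product": {
        "sku": "A1",
        "pricing": {"current": "₹72,400/MT", "promo_2026": "x"},
    },
    "seller": {},
}
row, lineage = flatten_with_lineage(record)
print(row)
print(lineage)
```

The dynamic key `promo_2026` is ignored instead of becoming a column nobody declared, so the set of output columns stays fixed and every one of them has a recorded source path.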
// 06 — our architecture

Track the field, not just the file.

DataFlirt embeds lineage metadata directly into the extraction payload. When we parse a DOM element, the resulting JSON record includes the exact CSS selector, timestamp, and worker ID that generated it. This metadata flows through our transformation layer and is exposed in the final delivery manifest. If a downstream model flags an anomaly, you do not just know which pipeline ran — you know exactly which line of HTML produced the value.

Lineage Metadata Payload

Embedded provenance for a single extracted price field.

field.name           price_usd
source.url           https://target.com/p/123
extraction.rule      css: .price-val
worker.id            df-node-88a2
schema.version       v7.2
validation.status    passed
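For consumers who want to handle this payload in code, a typed record is one reasonable shape. The sketch below is an assumption about how such a payload could be modelled, not DataFlirt's actual delivery format: the class name `FieldLineage`, the `to_sidecar` helper, and the example value `"19.99"` are hypothetical, while the six attributes mirror the keys listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FieldLineage:
    """Provenance for one extracted field; attributes mirror the payload keys."""
    field_name: str
    source_url: str
    extraction_rule: str
    worker_id: str
    schema_version: str
    validation_status: str


def to_sidecar(value: str, lineage: FieldLineage) -> str:
    """Serialize a value together with its provenance as one JSON document."""
    return json.dumps({"value": value, "lineage": asdict(lineage)}, sort_keys=True)


lineage = FieldLineage(
    field_name="price_usd",
    source_url="https://target.com/p/123",
    extraction_rule="css: .price-val",
    worker_id="df-node-88a2",
    schema_version="v7.2",
    validation_status="passed",
)
sidecar = to_sidecar("19.99", lineage)  # "19.99" is an illustrative value
print(sidecar)
```

Freezing the dataclass keeps provenance immutable once a record has been emitted, which matches the audit-trail intent.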


// 07 — FAQ

Common questions.

About lineage tracking, compliance mandates, debugging, and how DataFlirt maintains provenance across complex scraping pipelines.

What is the difference between table-level and column-level lineage?
Table-level lineage shows that Table A feeds into Table B. It is useful for scheduling jobs. Column-level lineage shows that Column X in Table A populates Column Y in Table B. It is essential for debugging data quality issues, assessing the impact of schema changes, and proving compliance.
Why is column-level lineage hard to implement?
It requires parsing the actual transformation logic — SQL queries, Python scripts, or extraction rules — to understand how fields map to each other. A simple SELECT * or a dynamic JSON flattening operation can completely break automated lineage parsers because the output columns are not explicitly defined in the code.
How does column-level lineage help with GDPR or CCPA compliance?
Privacy regulations require you to know exactly where Personally Identifiable Information (PII) resides and how it is used. Column-level lineage allows you to tag a source column as PII and automatically propagate that tag to every downstream table, dashboard, and machine learning model that consumes it.
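Tag propagation of this kind is a graph traversal. The sketch below assumes lineage is available as a mapping from each column to its direct downstream columns; the edge data, column names, and the `propagate_tag` function are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: upstream column -> direct downstream columns.
EDGES = {
    "raw.articles.author_email": ["stg_articles.author_email"],
    "stg_articles.author_email": ["dim_authors.contact", "ml_features.author_hash"],
    "raw.articles.title": ["stg_articles.title"],
}


def propagate_tag(source: str, tag: str, tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Apply `tag` to `source` and to every column downstream of it."""
    queue = deque([source])
    while queue:
        column = queue.popleft()
        if tag in tags.setdefault(column, set()):
            continue  # already tagged; this also guards against cycles
        tags[column].add(tag)
        queue.extend(EDGES.get(column, []))
    return tags


tags = propagate_tag("raw.articles.author_email", "PII", {})
for column in sorted(tags):
    print(column, sorted(tags[column]))
```

The resulting set of tagged columns is the list of places a deletion request has to reach.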
How does DataFlirt maintain lineage when target site schemas drift?
We version our extraction schemas. When a target site changes and a selector breaks, we update the rule and bump the schema version. The delivered data includes this version metadata. Downstream consumers can query the lineage graph to see exactly which records were extracted under version 1 versus version 2.
Does tracking lineage slow down the extraction pipeline?
No. Lineage metadata is generated statically at pipeline compile time and appended as lightweight headers or JSON sidecars during runtime. The computational overhead is negligible. The real cost is in the storage and querying of the lineage graph in the data warehouse, not the extraction layer.
What happens when a downstream column is derived from multiple upstream columns?
The lineage graph branches. If profit = revenue - cost, the profit column has two upstream dependencies. A robust lineage tool will show both paths. When debugging profit, you must trace both the revenue extraction rule and the cost extraction rule to find the anomaly.
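The branching can be shown with Python's standard `ast` module. This sketch assumes the derivation is plain arithmetic over column names written in Python-compatible syntax; SQL expressions would need a SQL parser, and a function call such as `ROUND(x)` would be reported as a name here. The helper `upstream_columns` is hypothetical.

```python
import ast


def upstream_columns(expression: str) -> set[str]:
    """Return every name referenced in a simple arithmetic expression."""
    tree = ast.parse(expression, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


# profit depends on two upstream columns, so the graph has two incoming edges.
lineage = {"profit": upstream_columns("revenue - cost")}
print(sorted(lineage["profit"]))  # ['cost', 'revenue']
```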