← Glossary / Slowly Changing Dimension

What is Slowly Changing Dimension?

Q: How do you handle late-arriving data in an SCD Type 2 model?

Late-arriving data requires recalculating the valid_from and valid_to dates for the affected timeline. Modern table formats like Apache Iceberg or Delta Lake make this easier with MERGE operations, but it's computationally expensive. It's best to ensure scrapes are processed in strict chronological order.

Slowly changing dimension (SCD) is a data modeling technique used to manage and track historical changes in categorical data over time. In scraping pipelines, it's how you handle a product changing its category, a company moving headquarters, or a SKU updating its packaging. Instead of silently overwriting the past or creating duplicate chaos, SCD provides a structured framework for preserving state history so downstream analytics remain accurate.

Data ModelingData WarehouseState HistoryETLSCD Type 2

// 02 — definitions

Time-travel
for records.

How to store data that is mostly static but occasionally mutates, without destroying the historical context of your analytics.

Ask a DataFlirt engineer →

TL;DR

A slowly changing dimension defines how a data warehouse handles updates to existing records. Type 1 overwrites the old value, losing history. Type 2 creates a new row with effective dates, preserving the timeline. For scraped datasets where tracking change is the entire point — like monitoring competitor pricing or executive turnover — Type 2 is the industry standard.

01Definition & structure

A slowly changing dimension (SCD) is a concept from Kimball dimensional modeling that dictates how to handle updates to existing records in a data warehouse. When an attribute of an entity changes (e.g., a product's price, a company's address), you must decide whether to overwrite the old data, track the history, or ignore the change. The structure usually involves a surrogate key (an internal ID), a natural key (the source ID), and metadata columns to track the record's lifespan.

02The SCD Types

While there are up to 7 types, three dominate data engineering:

Type 1 (Overwrite): The old value is replaced. History is lost. Used for correcting errors (e.g., fixing a misspelled name).
Type 2 (Add Row): A new row is inserted with the new data. The old row is marked as expired using valid_from and valid_to dates. History is perfectly preserved.
Type 3 (Add Column): A new column is added for the "previous" value. Only tracks the current and immediately preceding state. Rarely used today.

03Why scraped data demands Type 2

When you buy or build a scraping pipeline, you are usually doing it to monitor change. If you scrape a competitor's catalog daily and use SCD Type 1, you will only ever know their prices today. You cannot run a query to see what their prices were during Black Friday last year. SCD Type 2 is mandatory for any pipeline where the historical state of the target is a business requirement.

04The surrogate key problem

Because SCD Type 2 creates multiple rows for the same entity, you cannot use the source's ID (the natural key) as the primary key in your database. You must generate a surrogate key — usually an auto-incrementing integer or a UUID hash — to uniquely identify each specific historical state of the entity. The natural key is retained to link the historical rows together.

05Did you know?

The standard convention for an active record's valid_to date is 9999-12-31. This allows analysts to query the current state using a simple BETWEEN clause without having to handle NULL values, which behave unpredictably in SQL joins and indexing.

// 03 — the logic

How to model
state changes.

Implementing SCD Type 2 requires specific metadata columns to track the active lifespan of a record. Here is the standard logic used when delivering historical state datasets.

Active Record Condition = is_current = TRUE AND valid_to = '9999-12-31'

The standard filter to query only the present state of the dimension. Kimball Dimensional Modeling

Type 2 Update Logic = UPDATE dim SET valid_to = T_now WHERE id = ID AND is_current = TRUE

Expire the old record immediately before inserting the new mutated record. Standard ETL pattern

Storage Growth Rate = S_total = S_initial + (S_initial × MutationRate × Time)

Type 2 tables grow linearly with the mutation rate. High mutation = bloat. Data Engineering heuristics

// 04 — the data contract

A product category
changes over time.

A trace of an SCD Type 2 implementation handling a scraped product that was reclassified by the target retailer. History is preserved via surrogate keys and date ranges.

SCD Type 2Surrogate KeysDelta Merge

edge.dataflirt.io — live

CAPTURED

// T0: Initial scrape ingestion
INSERT INTO dim_product (sk, sku, category, valid_from, valid_to, is_current)
VALUES (101, 'XJ-9', 'Electronics', '2025-01-01', '9999-12-31', TRUE)

// T1: Scraper detects category change to 'Smart Home'
// Step 1: Expire the old record
UPDATE dim_product SET valid_to = '2025-06-15', is_current = FALSE
WHERE sk = 101

// Step 2: Insert the new record
INSERT INTO dim_product (sk, sku, category, valid_from, valid_to, is_current)
VALUES (102, 'XJ-9', 'Smart Home', '2025-06-15', '9999-12-31', TRUE)

// Querying historical state (What was the category in March?)
SELECT category FROM dim_product
WHERE sku = 'XJ-9' AND '2025-03-01' BETWEEN valid_from AND valid_to
> "Electronics" // History successfully preserved

// 05 — implementation risks

Where SCD pipelines
break down.

Managing state history across millions of scraped records introduces specific engineering challenges. Ranked by frequency of pipeline failure across historical datasets.

PIPELINES · · · · · 120+ historical

MODEL · · · · · · Type 2 dominant

UPDATED · · · · · · 2026-05-19

Natural key instability

92% of failures · Target site changes their internal ID, breaking lineage

Late-arriving data

78% of failures · Out-of-order scrapes corrupting the valid_from timeline

Unbounded table growth

65% of failures · High-mutation fields bloating the dimension table

Overlapping date ranges

45% of failures · Logic errors causing multiple 'current' records

Schema drift

34% of failures · Target adds a field, requiring historical backfills

// 06 — delivery architecture

Deliver the delta,

let the warehouse handle the history.

Scraping pipelines shouldn't run UPDATE statements on your production data warehouse. Instead, DataFlirt delivers append-only delta files containing the exact state of the target at the time of the scrape. Your ingestion layer — using tools like dbt or Apache Iceberg — consumes these deltas and computes the SCD Type 2 transitions natively. This decouples the extraction logic from your storage modeling, ensuring that if a pipeline hiccups, your historical timeline remains immutable and easily recalculable.

SCD Delivery Payload

Metadata appended to a scraped record to facilitate downstream SCD merging.

record.hash a7f8b9...

scrape.timestamp 2026-05-19T08:14:22Z

mutation.detected true

changed_fields ["price", "stock_status"]

delivery.format parquet

merge.strategy append_only

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about modeling scraped data, choosing SCD types, and handling historical state.

Ask us directly →

What is the difference between SCD Type 1 and Type 2? +

Type 1 overwrites the existing record with new data — history is lost, but the table stays small. Type 2 creates a new record with a new surrogate key and effective dates, preserving the old record. Use Type 1 for corrections (fixing a typo); use Type 2 for actual state changes (price updates, category shifts).

Should I use SCD Type 2 for rapidly changing data like stock prices? +

No. Rapidly changing data belongs in a Fact table, not a Dimension table. If a value changes every minute, it's a metric or an event. Dimensions should be reserved for attributes that change slowly — like a company's CEO, a product's category, or a store's physical address.

How do you handle a target site changing their internal IDs? +

This is the hardest problem in scraped data modeling. If the natural key (the target's ID) changes, the SCD lineage breaks. We mitigate this by generating a composite natural key based on immutable attributes (e.g., URL slug + brand + MPN) to maintain continuity even if the target's database ID shifts.

What happens if a scrape fails and we miss a state change? +

The next successful scrape will detect the new state and close out the old record. The valid_to date of the old record will simply extend until the new scrape occurs. You lose the exact timestamp of the transition, but the current state self-corrects and the timeline remains contiguous.

Does DataFlirt maintain the SCD tables for me? +

We can, but we recommend delivering raw deltas to your cloud storage (S3/GCS) and letting your data engineering team manage the SCD logic via dbt, Snowflake, or BigQuery. This gives you complete control over the business logic of what constitutes a "meaningful" change.

How do you handle late-arriving data in an SCD Type 2 model? +

Late-arriving data requires recalculating the valid_from and valid_to dates for the affected timeline. Modern table formats like Apache Iceberg or Delta Lake make this easier with MERGE operations, but it's computationally expensive. It's best to ensure scrapes are processed in strict chronological order.

$ dataflirt scope --new-project --target=slowly-changing-dimension READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

What is Slowly Changing Dimension?

Time-travelfor records.

TL;DR

How to modelstate changes.

A product categorychanges over time.

Where SCD pipelinesbreak down.

Natural key instability

Late-arriving data

Unbounded table growth

Overlapping date ranges

Schema drift

Deliver the delta,

SCD Delivery Payload

Stay ahead of the pipeline

Data engineeringintel, weekly.

Commonquestions.

Tell us whatto extract.We do the rest.

Related glossary terms

Dimension Table

Fact Table

Change Data Capture (CDC)

dbt (Data Build Tool)