← Glossary / Dimension Table

What is Dimension Table?

Dimension tables store the descriptive attributes—the "who, what, where, and when"—that provide context to the quantitative metrics found in fact tables. In a scraping pipeline, while the fact table records every price observation or stock change, the dimension table holds the stable product metadata, category hierarchies, and seller details. Designing them correctly prevents massive data duplication and makes downstream analytics queries fast and intuitive.

Data ModelingStar SchemaSCDData Warehousedbt
// 02 — definitions

Context for
the facts.

Without dimensions, your scraped data is just a meaningless stream of numbers and timestamps. Dimensions give those numbers a name, a category, and a history.

Ask a DataFlirt engineer →

TL;DR

A dimension table contains the descriptive text fields that analysts use to filter, group, and label data. In a star schema, they surround the central fact table. Handling how these attributes change over time—using Slowly Changing Dimensions (SCD)—is the hardest part of modeling scraped data.

01Definition & structure
A dimension table is a structural component of a data warehouse that stores the descriptive attributes related to a business entity. While fact tables store numbers (prices, quantities, clicks), dimension tables store text (names, categories, locations, dates). They are the tables you use to GROUP BY and WHERE in your SQL queries. Every dimension table must have a unique primary key—ideally a surrogate key—that the fact table uses as a foreign key.
02Fact vs. Dimension
The easiest way to separate them is grammar: facts are verbs (events that happen), dimensions are nouns (the entities involved). A scrape event is a verb—it yields a price observation (fact). The product being scraped, the seller offering it, and the date it was scraped are all nouns (dimensions). Keeping them separate ensures that when a seller changes their name, you update one row in the seller dimension, not ten million rows in the fact table.
03Slowly Changing Dimensions (SCD)
Web data is not static. E-commerce sites re-categorize products; companies change their headquarters. When a dimension attribute changes, you have a choice. You can overwrite the old value (SCD Type 1), which rewrites history. Or, you can insert a new row with the updated attribute and a new surrogate key, using timestamp columns to track when each version was active (SCD Type 2). For compliance and accurate historical reporting, SCD Type 2 is mandatory.
04How DataFlirt models scraped dimensions
We don't just dump raw JSON into your warehouse. Our delivery pipelines automatically separate scraped payloads into facts and dimensions. We generate deterministic surrogate keys based on the natural key and the schema version, apply SCD Type 2 logic to track site taxonomy changes, and handle late-arriving dimensions by injecting skeleton records. Your analysts get clean, query-ready star schemas from day one.
05The late-arriving dimension problem
In distributed scraping, you might discover a new product ID on a category listing page and immediately scrape its price (a fact), but the crawler hasn't yet visited the product detail page to get the brand and description (the dimension). If you load the fact without the dimension, inner joins will silently drop the data. The standard fix is to insert a "stub" or "skeleton" dimension row with the ID and NULL attributes, which gets updated once the detail page is finally scraped.
// 03 — dimension math

Sizing up
your dimensions.

Dimension tables are typically wide (many columns) but shallow (fewer rows than fact tables). However, SCD Type 2 tracking can cause row counts to multiply over time if volatile fields are incorrectly modeled as dimensions.

SCD Type 2 Row Growth = Rtotal = Entities × (1 + Update_Rate × Time)
Every attribute change creates a new row. High update rates cause dimension explosion. Kimball Dimensional Modeling
Join Cardinality = CostFact_Rows × log(Dim_Rows)
Hash join complexity in modern columnar warehouses like Snowflake or BigQuery. Query Optimizer Heuristics
Surrogate Key Space = Bits = log2(Max_Expected_Rows)
Always use 64-bit integers (BIGINT) for surrogate keys to prevent overflow. Data Engineering Best Practices
// 04 — dbt execution

Merging a slowly
changing dimension.

A trace of a dbt run updating a product dimension table with fresh scraped metadata, applying SCD Type 2 logic to track category changes over time.

dbt-coreSCD Type 2Snowflake
edge.dataflirt.io — live
CAPTURED
// execution context
command: dbt snapshot --select dim_product
target: snowflake_prod

// snapshot logic
strategy: check
check_cols: ['category', 'brand', 'title']
unique_key: "scraped_product_id"

// merge operation
records.scanned: 1,420,550
records.unchanged: 1,418,100
records.changed: 2,450 // attributes drifted
records.new: 12,040

// applying SCD Type 2
action.update: set valid_to = CURRENT_TIMESTAMP // close old records
action.insert: insert new versions with valid_from = CURRENT_TIMESTAMP

status: SUCCESS in 14.2s
// 05 — modeling pitfalls

Where dimension
models break.

Ranked by frequency of occurrence in downstream analytics pipelines. Poor dimension modeling usually manifests as silent query errors or exploding warehouse costs.

PIPELINES MONITORED ·   300+ active
PRIMARY FAILURE ·  ·  ·   Natural key reuse
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Missing surrogate keys

Data integrity · Relying on scraped IDs that the target site eventually reuses.
02

SCD Type 2 explosion

Cost / Storage · Tracking rapidly changing fields (like stock status) in a dimension.
03

Late-arriving dimensions

Data loss · Dropping facts because the dimension record hasn't been scraped yet.
04

Over-normalization

Performance · Snowflaking too much, causing join hell for analysts.
05

Inconsistent granularity

Logic error · Mixing product-level and SKU-level attributes in one table.
// 06 — pipeline architecture

Separate the nouns,

from the verbs.

When scraping e-commerce catalogs, the temptation is to write one massive flat table containing both the product name and the daily price. This is an anti-pattern. We split the pipeline: a daily fact table captures the price and stock observations, while a dimension table tracks the product metadata. This reduces storage costs, speeds up analytical queries, and allows us to track attribute changes over time without duplicating millions of rows.

dim_product_scd2

Schema for a Slowly Changing Dimension (Type 2) tracking scraped products.

product_sk BIGINTsurrogate key
scraped_id VARCHARnatural key
title VARCHAR
category VARCHAR
valid_from TIMESTAMP
valid_to TIMESTAMPnull if current
is_current BOOLEANpartition key

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About dimension modeling, surrogate keys, slowly changing dimensions, and how DataFlirt structures delivered datasets.

Ask us directly →
What is a surrogate key? +
A surrogate key is an artificially generated primary key (usually an auto-incrementing integer or a UUID) used to uniquely identify a row in a dimension table. You should never use the scraped natural key (like a product ID) as the primary key, because target sites often reuse IDs, or you may need to store multiple versions of the same product over time (SCD Type 2).
Why not just use a wide flat table for everything? +
Flat tables (One Big Table or OBT) are fine for small datasets or final BI presentation layers. But for raw storage and pipeline processing, they create massive data duplication. If you scrape 10 million prices a day, storing the 500-character product description on every single price row wastes storage, slows down queries, and makes it impossible to update a category name without rewriting millions of historical rows.
What is a Slowly Changing Dimension (SCD)? +
SCD is a methodology for handling updates to dimension attributes. Type 1 overwrites the old value (losing history). Type 2 adds a new row with valid_from and valid_to timestamps, preserving history. Type 3 adds a new column for the previous value. In scraping pipelines, Type 2 is the gold standard for tracking how target sites change their taxonomies or product details over time.
How do you handle a dimension that changes every day? +
If an attribute changes every day (like price, stock status, or rank), it is not a dimension—it is a fact. Move it to the fact table. Dimension tables should only contain attributes that are relatively stable. Putting volatile fields in an SCD Type 2 dimension table will cause a row-count explosion.
What happens if a scraped fact arrives before its dimension data? +
This is the "late-arriving dimension" problem. If you scrape a price for a new product ID, but the crawler hasn't scraped the product details page yet, a standard inner join will drop the price fact. The solution is to insert a skeleton row into the dimension table with the new natural key and "Unknown" for all attributes, which gets updated once the full dimension data is scraped.
How does DataFlirt deliver dimension tables? +
We deliver fully modeled star schemas. Our extraction layer separates volatile observations (facts) from stable metadata (dimensions). We handle the surrogate key generation, apply SCD Type 2 logic to track site changes, and deliver the modeled tables directly to your Snowflake, BigQuery, or S3 bucket, ready for immediate analytical querying.
$ dataflirt scope --new-project --target=dimension-table READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h