← Glossary / Star Schema

What is Star Schema?

Star schema is a dimensional modeling technique where a central fact table containing measurable, quantitative data is surrounded by denormalized dimension tables containing descriptive attributes. For data engineering teams building scraping pipelines, it's the target architecture for transforming raw, nested JSON payloads into query-optimized analytical datasets. It trades storage efficiency for read performance, ensuring downstream BI tools don't choke on complex joins.

Data ModelingOLAPFact TablesDimension TablesDenormalization
// 02 — definitions

Optimize for
the read.

Why analytical databases prefer redundant data over complex joins, and how scraped records map to this architecture.

Ask a DataFlirt engineer →

TL;DR

A star schema separates business events (facts) from the context surrounding them (dimensions). When scraping e-commerce data, the price observation is the fact, while the product details, seller info, and scrape timestamp are dimensions. It's the standard output format for DataFlirt's structured delivery pipelines targeting Snowflake or BigQuery.

01Definition & structure

A star schema is a relational database schema designed for analytical processing (OLAP). It consists of a central fact table containing quantitative data (metrics, measurements, observations), surrounded by multiple dimension tables containing descriptive attributes.

The fact table holds foreign keys that map to the primary keys of the dimension tables. Visually, this creates a star shape. The design intentionally denormalizes data in the dimension tables to minimize the number of joins required at query time, optimizing for read speed over storage efficiency.

02How it works in practice

When a scraping pipeline extracts a complex JSON object from a website, that object contains a mix of facts and dimensions. For example, a real estate scrape yields the price (fact), the property details (dimension), and the agent details (dimension).

The transformation layer splits this payload. It hashes the property and agent details to generate surrogate keys, upserts those records into dim_property and dim_agent, and then inserts a single row into fact_listing containing the price and the surrogate keys. Analysts can then sum, average, or group the facts by joining to the dimensions.

03Facts vs. Dimensions

The easiest way to separate the two is grammatical: facts are verbs, dimensions are nouns.

  • Facts: A product was priced, a flight was scheduled, a review was posted. These tables are deep (billions of rows) and narrow (mostly numbers and keys).
  • Dimensions: The product, the airline, the author, the date. These tables are shallow (thousands of rows) and wide (dozens of text columns).
04How DataFlirt handles it

We don't just deliver raw JSON dumps. Our delivery pipelines include an embedded transformation engine that maps scraped fields directly to your warehouse's star schema. We handle the surrogate key generation, the SCD Type 2 historical tracking, and the fact table inserts.

This means your data engineering team doesn't have to write and maintain complex dbt models just to parse our output. The data lands in your Snowflake or BigQuery instance ready for immediate use by BI tools.

05The storage trade-off

A star schema intentionally duplicates data. If 10,000 products belong to the "Electronics" category, the string "Electronics" is stored 10,000 times in the dim_product table, rather than being normalized into a separate category table.

In the 1990s, this was a problem because disk space was expensive. Today, cloud storage is practically free, while warehouse compute credits are expensive. Denormalizing dimensions saves compute by eliminating joins, making the star schema the optimal architecture for modern data stacks.

// 03 — query cost

Why joins
matter.

The mathematical justification for dimensional modeling. Reducing the number of joins exponentially decreases the computational complexity of analytical queries, saving compute credits in cloud data warehouses.

Join Complexity (Baseline) = O(N) = R · S
Nested loop join cost between two normalized tables. Relational Algebra
Star Schema Query Time = T = Tscan_fact + Σ Thash_dim
Hash joins dominate. Fact table is scanned once, dimensions are hashed in memory. OLAP Execution Models
Storage Overhead = Sstar1.4 · S3NF
Denormalization penalty. Storage is cheap; compute is expensive. Kimball Group Estimates
// 04 — pipeline transformation

JSON payload to
dimensional model.

Trace of a dbt model transforming a raw scraped e-commerce payload into a fact and dimension record before loading into the warehouse.

dbt runupsertsurrogate keys
edge.dataflirt.io — live
CAPTURED
// 1. raw ingestion
source.record: {"sku":"A12", "price":49.99, "category":"Tools"}

// 2. dimension upsert (dim_product)
dbt.run: MERGE INTO dim_product USING raw_data
action: INSERT
row: product_sk="sk_98f2", sku="A12", category="Tools"

// 3. dimension upsert (dim_date)
row: date_sk="20260519", date="2026-05-19"

// 4. fact insert (fact_price_observation)
dbt.run: INSERT INTO fact_price_observation
keys: product_sk="sk_98f2", date_sk="20260519"
metrics: price=49.99, discount_pct=0.00
status: SUCCESS
query_time: 142ms
// 05 — schema design

Where dimensional
models fail.

Common anti-patterns when mapping scraped data to a star schema, ranked by frequency of occurrence in client data warehouses.

PIPELINES AUDITED ·  ·    140+ active
PRIMARY SINK ·  ·  ·  ·   Snowflake / BigQuery
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Fact table bloat

92% of failures · Putting descriptive text directly in the fact table
02

SCD mismanagement

78% of failures · Overwriting history instead of tracking changes
03

Over-normalization

65% of failures · Snowflaking dimensions unnecessarily
04

Surrogate key collisions

41% of failures · Poor hashing logic on natural keys
05

Granularity mismatches

34% of failures · Mixing daily and hourly grains in one fact
// 06 — delivery architecture

Flatten the nested data,

before it hits the warehouse.

Scraping naturally produces highly nested, hierarchical data structures. Dumping this directly into a data warehouse forces analysts to use expensive JSON-parsing functions at query time. DataFlirt's delivery layer applies dimensional modeling in-flight, flattening arrays and resolving surrogate keys so the data lands in your S3 bucket or Snowflake instance as a ready-to-query star schema.

Delivery schema validation

Live status of a dimensional transform job on a retail pricing pipeline.

job.id transform-retail-042
records.processed 84,192
dim_product.upsert 1,204 new82,988 unchanged
fact_price.insert 84,192 rows
surrogate.collisions 0
schema.grain daily_observation
output.destination snowflake://df_prod/star

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About dimensional modeling, handling nested scraped data, and how DataFlirt structures data for analytical workloads.

Ask us directly →
What is the difference between a star schema and a snowflake schema? +
In a star schema, dimension tables are fully denormalized (e.g., a single dim_product table contains category and brand names). In a snowflake schema, dimensions are normalized into sub-dimensions (e.g., dim_product links to dim_category). Star schemas are generally preferred for modern columnar databases because they require fewer joins.
Why not just use One Big Table (OBT)? +
One Big Table (OBT) is increasingly popular in columnar databases like BigQuery, but it struggles with tracking historical changes to attributes. If a product changes its category, an OBT requires updating every historical row for that product, or accepting inconsistent data. A star schema handles this elegantly using Slowly Changing Dimensions (SCD Type 2).
How do you handle scraped data that changes over time? +
We use Slowly Changing Dimensions (SCD Type 2). When a scraped attribute (like a product's title) changes, we don't overwrite the old record. We expire the old dimension row by setting an end_date, and insert a new row with a new surrogate key and a start_date. The fact table always links to the dimension version that was active at the time of the scrape.
What is a surrogate key and why use it? +
A surrogate key is an artificially generated primary key (usually an integer or hash) used in the data warehouse, rather than relying on the source website's natural key (like an SKU). It insulates your warehouse if the target site recycles IDs, changes their format, or if you need to track multiple historical versions of the same natural key.
Can DataFlirt deliver directly into our star schema? +
Yes. We can deliver raw JSON to an S3 bucket, but our enterprise pipelines typically handle the ELT directly. We map the scraped fields to your target schema, generate the surrogate keys, and execute the upserts directly into your Snowflake, BigQuery, or Redshift instance.
How do you handle nested arrays like product reviews in a star schema? +
Nested arrays represent a different grain of data. A product page scrape might yield one price (product grain) and fifty reviews (review grain). We split these into separate fact tables: fact_price_observation and fact_review, both linking back to the same dim_product and dim_date tables.
$ dataflirt scope --new-project --target=star-schema READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h