← Glossary / Materialized View

What is Materialized View?

Materialized view is a database object that contains the results of a query, physically stored on disk rather than computed on the fly. In scraping pipelines, they act as the critical bridge between raw, noisy extraction tables and clean, aggregated delivery datasets. By pre-computing complex joins and aggregations, they shift the compute cost from read-time to write-time, ensuring downstream consumers and APIs experience sub-second latency regardless of the underlying dataset size.

DatabasesQuery OptimizationPre-computationData EngineeringCaching
// 02 — definitions

Compute once,
read infinitely.

The physical caching layer that prevents your analytical queries from scanning terabytes of raw scraped HTML payloads every time a dashboard loads.

Ask a DataFlirt engineer →

TL;DR

A materialized view physically stores the result set of a complex query. Unlike standard views which run the underlying SQL every time they are queried, materialized views trade storage space and refresh latency for massive read-time performance gains. They are essential for serving aggregated scraping metrics to client dashboards without melting the database.

01Definition & structure
A materialized view is a database object that stores the physical result of a query. Unlike a standard view, which is essentially a macro that runs the underlying SQL on demand, a materialized view computes the data once and writes it to disk. This means querying a materialized view is as fast as querying a standard table, regardless of how complex the underlying joins, filters, or aggregations are.
02Standard views vs. Materialized views
If you have a dashboard showing the average price of 50,000 products over 365 days, a standard view will scan 18.2 million rows every time a user loads the page. A materialized view scans those 18.2 million rows once, saves the 50,000 averages to disk, and serves that pre-computed table to the dashboard. The trade-off is staleness: the materialized view only knows about data up to its last refresh point.
03Refresh strategies
Because the data is physically stored, it must be updated when the underlying base tables change.
  • Full Refresh: Drops the existing data and re-runs the entire query. Safe, but slow and resource-intensive.
  • Incremental (Fast) Refresh: Uses logs or CDC to identify exactly which rows changed in the base tables, and only updates those specific rows in the view. Highly efficient for large scraping datasets.
04How DataFlirt uses it
We use materialized views as the strict boundary between our scraping infrastructure and our client delivery APIs. Our crawler fleet writes raw, messy HTML extractions into ingestion tables. We then use dbt and incremental materialized views to clean, deduplicate, and aggregate that data. When a client queries our API, they are hitting the materialized view — ensuring sub-50ms response times and zero impact on the active crawl jobs.
05The locking trap (PostgreSQL)
A common failure mode in PostgreSQL is running REFRESH MATERIALIZED VIEW without the CONCURRENTLY keyword. A standard refresh takes an exclusive lock on the view, meaning any API or dashboard trying to read the data will hang until the refresh finishes. Using CONCURRENTLY allows reads to continue against the old data while the new data is built in the background, requiring a unique index but preventing pipeline downtime.
// 03 — the cost model

When to
materialize?

Materialization is a strict trade-off between storage costs and compute savings. DataFlirt's pipeline orchestrator automatically promotes standard views to materialized views when the read-to-write ratio crosses the threshold.

Query Speedup = S = Tstandard / Tmaterialized
Often 100x+ for heavy aggregations over scraped datasets. Database Optimization Basics
Break-even Read Rate = R > (Cstorage + Crefresh) / Ccompute_per_read
If you read the data more often than this threshold, materialize it. DataFlirt infrastructure heuristics
Maximum Data Staleness = Δt = Trefresh_interval + Trefresh_duration
The maximum age of the data before the next refresh completes. SLA definition
// 04 — execution trace

Refreshing a 50M
record aggregation.

A trace of an incremental refresh on a materialized view tracking daily price changes across an e-commerce dataset. Only the changed rows are computed and merged.

PostgreSQLpg_cronincremental refresh
edge.dataflirt.io — live
CAPTURED
-- Triggering scheduled refresh job
job.id: "refresh_mv_daily_price_variance"
strategy: "CONCURRENTLY" // non-blocking read

-- Identifying delta via CDC
cdc.scanned_rows: 142,850 -- new scrapes since last run
cdc.matched_products: 89,102

-- Executing aggregation
query.plan: "Hash Join -> HashAggregate"
query.execution_time: 14.2s

-- Merging into materialized view
rows.inserted: 1,204
rows.updated: 87,898
rows.deleted: 0

-- Maintenance
vacuum.status: complete
refresh.total_time: 18.5s
view.status: ONLINE
// 05 — failure modes

Where materialized
views break down.

Ranked by frequency of incidents in analytical pipelines. The most common failures stem from misunderstanding the refresh lifecycle and locking behavior.

PIPELINES MONITORED ·   300+ active
REFRESH CADENCE ·  ·  ·   15m median
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Refresh lock contention

% of incidents · Standard refreshes block reads; requires CONCURRENTLY
02

Stale data delivery

% of incidents · Refresh job silently fails or times out
03

Storage bloat

% of incidents · Unused views consuming expensive SSD space
04

Cascading refresh failures

% of incidents · Dependency chain breaks when base table alters
05

Incremental logic drift

% of incidents · CDC misses edge cases, requiring full rebuilds
// 06 — DataFlirt's data layer

Aggregated at the edge,

delivered without the compute tax.

We don't expose raw extraction tables to client APIs. Every DataFlirt delivery endpoint is backed by a materialized view that is incrementally refreshed via CDC from the raw ingestion layer. This isolates our analytical workloads from our scraping workloads. Your API requests never compete for resources with our crawler fleet.

mv_daily_price_variance

Live status of a client-facing materialized view in our Snowflake cluster.

schema.base_tables raw_scrapes, dim_products
refresh.strategy incremental · 15m
last_refresh 2 mins ago
rows.total 48,291,004
storage.size 4.2 GB
query.p99_latency 42ms
status serving traffic

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About view materialization, refresh strategies, data staleness, and how DataFlirt serves high-volume datasets.

Ask us directly →
What is the difference between a view and a materialized view? +
A standard view is just a saved SQL query — every time you query it, the database executes the underlying logic against the base tables. A materialized view executes the query once and saves the actual result set to disk. Views cost compute on every read; materialized views cost storage and require periodic refreshing.
When should I NOT use a materialized view? +
Don't use them for highly volatile data that requires strict real-time consistency (e.g., live inventory checks during checkout). If the business requirement dictates that the data cannot be even 1 minute stale, a materialized view will fail you. They are designed for analytical workloads, dashboards, and batch data delivery.
How do you handle GDPR/CCPA deletions in materialized views? +
When a deletion request hits the base table, the materialized view must reflect it. If the view is fully refreshed, it happens automatically on the next run. For incrementally refreshed views, the CDC (Change Data Capture) pipeline must explicitly capture the DELETE operation and propagate it to the materialized view. We enforce this at the pipeline orchestration layer.
What is an incremental refresh? +
Instead of dropping the entire view and re-computing millions of rows from scratch (a full refresh), an incremental refresh only computes the delta — the rows that have changed in the base tables since the last refresh — and merges them into the view. It reduces refresh time from minutes to milliseconds.
How does DataFlirt ensure data isn't stale? +
We decouple ingestion from delivery. Scrapers write to raw tables continuously. A background orchestrator triggers concurrent refreshes on the materialized views based on the client's SLA (e.g., every 15 minutes). Monitoring alerts us if a refresh job exceeds its time budget, ensuring clients never pull data older than the agreed threshold.
Can materialized views scale to billions of scraped records? +
Yes, and they are the only way to make billions of records queryable. In modern cloud data warehouses like Snowflake or BigQuery, materialized views over massive datasets are clustered and partitioned. Querying a 10-billion row raw table might take 40 seconds and cost $5; querying the aggregated materialized view takes 200ms and costs fractions of a cent.
$ dataflirt scope --new-project --target=materialized-view READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h