← Glossary / Query Optimization

What is Query Optimization?

Query optimization is the process of structuring database queries, indexes, and schemas to minimize the compute and I/O required to retrieve data. For scraping pipelines, where datasets often involve hundreds of millions of deeply nested JSON records, unoptimized queries don't just run slow — they lock tables, exhaust memory, and inflate cloud warehouse costs. It's the difference between a dashboard that loads in 200ms and one that times out after five minutes.

DatabasesExecution PlanIndexesJSONBCost Optimization
// 02 — definitions

Stop scanning
every row.

The mechanics of translating a SQL request into an execution plan that retrieves scraped data without reading the entire disk.

Ask a DataFlirt engineer →

TL;DR

Query optimization relies on the database's query planner to pick the cheapest path to your data. In scraping contexts, this usually means avoiding sequential scans on massive raw tables by using materialized views, partition pruning by scrape date, and specialized indexes (like GIN) on JSON payloads.

01Definition & structure

Query optimization is the process by which a database engine determines the most efficient way to execute a SQL statement. When you submit a query, the database's query planner generates multiple potential execution plans. It estimates the cost of each plan based on table statistics, available indexes, and hardware parameters, then executes the cheapest one.

For data engineers managing scraping pipelines, optimization involves designing schemas and indexes so the planner has good options to choose from. Without proper indexes, the planner is forced to use a sequential scan — reading every single row in the table from disk.

02How it works in practice

When you query a table of 50 million scraped products for a specific SKU, an unoptimized database reads all 50 million rows into memory to check the SKU field. An optimized database uses a B-tree index: it traverses a tree structure to find the exact disk location of that SKU in logarithmic time, reading only a handful of pages.

Optimization also involves rewriting queries. For example, replacing correlated subqueries with JOINs, or using window functions (like ROW_NUMBER()) instead of self-joins to deduplicate records based on the most recent scrape timestamp.

03The JSONB problem in scraping

Scraping pipelines frequently dump data into JSONB columns because target website schemas change unpredictably. While flexible for ingestion, JSONB is hostile to query performance if not handled correctly. Filtering on a nested JSON key requires the database to decompress and parse the JSON object for every row.

To optimize this, engineers use GIN (Generalized Inverted Index) indexes in PostgreSQL, which index the keys and values inside the JSON document, allowing the planner to find matching rows without parsing the entire table.

04How DataFlirt handles it

We design our delivery schemas to be query-optimized by default. Instead of delivering monolithic JSON blobs, our extraction layer promotes highly queried fields (domain, category, price, timestamp) to top-level, strictly typed columns. We partition delivery tables by scrape_date and cluster by target domain.

This means when your analysts query DataFlirt datasets in Snowflake or BigQuery, the query planner automatically prunes irrelevant partitions and utilizes columnar compression, keeping your compute costs low and dashboard latency under 500ms.

05Did you know?

Adding an index doesn't guarantee the query planner will use it. If the planner's statistics indicate that your query will return 40% or more of the table's rows, it will often ignore the index and perform a sequential scan anyway. This is because reading sequentially from disk is faster than the random I/O required to bounce back and forth between the index and the table heap for a massive number of rows.

// 03 — the cost model

How the planner
calculates cost.

Before executing a query, the database planner evaluates multiple execution paths and assigns a cost to each based on disk I/O and CPU cycles. It executes the path with the lowest score.

Total query cost (PostgreSQL) = C = (pages_read × seq_page_cost) + (cpu_tuples × cpu_tuple_cost)
Disk I/O is historically the bottleneck, though JSON parsing shifts this to CPU. PostgreSQL Cost Planner
Index selectivity = S = distinct_values / total_rows
S close to 1 means the index is highly selective and efficient to use. Database Theory
DataFlirt warehouse SLO = Tquery < 500ms
95th percentile latency for standard aggregations on 30-day trailing data. Internal SLO
// 04 — execution plan trace

EXPLAIN ANALYZE
on a scraped dataset.

Comparing the execution plan of a query filtering 12 million scraped product records by a JSON key, before and after applying a GIN index.

PostgreSQL 15EXPLAIN ANALYZEJSONB
edge.dataflirt.io — live
CAPTURED
-- BEFORE OPTIMIZATION: Filtering on raw JSONB
QUERY: SELECT url FROM scraped_items WHERE payload->>'status' = 'active';
Seq Scan on scraped_items (cost=0.00..345821.50 rows=1200000 width=45)
Filter: ((payload ->> 'status'::text) = 'active'::text)
Rows Removed by Filter: 10800000
Execution Time: 4152.33 ms

-- APPLYING OPTIMIZATION
ACTION: CREATE INDEX idx_status ON scraped_items USING GIN ((payload->'status'));

-- AFTER OPTIMIZATION
Bitmap Heap Scan on scraped_items (cost=124.50..18452.10 rows=1200000 width=45)
Recheck Cond: ((payload -> 'status'::text) = '"active"'::jsonb)
-> Bitmap Index Scan on idx_status (cost=0.00..121.20 rows=1200000 width=0)
Execution Time: 42.18 ms -- 98x faster
// 05 — performance killers

Where query time
actually goes.

The most common reasons queries on scraped datasets time out or consume excessive compute credits, ranked by frequency across our client warehouses.

DATASETS ANALYZED ·  ·    850+ schemas
QUERY VOLUME ·  ·  ·  ·   14M/day
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Sequential scans on large tables

I/O bound · Missing indexes force the DB to read every row on disk
02

Runtime JSON unpacking

CPU bound · Extracting nested text from JSONB on the fly
03

Spilling sorts to disk

Memory bound · ORDER BY exceeds work_mem, writing temp files
04

Cross-partition joins

I/O bound · Joining tables without utilizing partition keys
05

Stale table statistics

Planner error · ANALYZE hasn't run; planner chooses the wrong path
// 06 — schema architecture

Extract on write,

not on read.

DataFlirt's delivery layer doesn't just dump raw JSON into your warehouse. We extract high-cardinality fields — prices, SKUs, timestamps, and categories — into native columnar formats during the write phase. This allows your query planner to use standard B-tree indexes, columnar compression, and partition pruning, bypassing the massive CPU overhead of parsing JSON at query time. Compute is expensive; storage is cheap. We optimize for the former.

Query Optimization Profile

Metrics from a daily aggregation query running on a DataFlirt-managed Snowflake schema.

table.size 4.2 TB
partitions.scanned 1 of 365pruned
bytes.scanned 1.8 GBcolumnar
json.parsing 0 bytespre-extracted
spill.to_disk false
query.latency 312 ms
compute.cost $0.004 per run

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about indexing scraped data, handling JSON payloads, and keeping warehouse costs under control.

Ask us directly →
Why are queries on my scraped JSON data so slow? +
Because parsing JSON is a CPU-intensive operation. If you run a WHERE payload->>'price' > 100 query without an index, the database must load every row from disk, parse the JSON string into memory, extract the price, cast it to an integer, and then evaluate it. Do this on 50 million rows and your query will time out.
Should I just index every field in my scraped dataset? +
No. Indexes speed up reads but slow down writes (write amplification). Every time a scraper inserts a record, the database must update every index attached to that table. For high-throughput scraping pipelines, over-indexing will cause your ingestion queue to back up and lock the table.
What is partition pruning and why does it matter for scraping? +
Scraped data is inherently time-series data. If you partition your tables by scrape_date, a query for "yesterday's prices" allows the database planner to completely ignore the files containing the last three years of data. Partition pruning is the single most effective optimization for historical web data.
How do materialized views help with scraped data? +
Scraping often produces multiple records for the same entity over time. Instead of running a complex GROUP BY with window functions to find the "latest price" every time a dashboard loads, a materialized view pre-computes and stores that result. You query the view in milliseconds, and refresh it asynchronously in the background.
How does DataFlirt optimize data for BigQuery or Snowflake? +
We deliver data in columnar formats (Parquet or native warehouse inserts) rather than row-based JSON. We cluster tables by the fields you filter on most (e.g., domain, category) and partition by ingestion date. This ensures your queries only scan the exact micro-partitions they need, drastically reducing your compute bill.
Can I just throw more compute at slow queries? +
Yes, but it scales linearly in cost and sub-linearly in performance. Doubling your warehouse size doubles your cost but rarely halves your query time if the underlying issue is a missing index or a bad join. Optimization fixes the root cause; scaling compute just masks it temporarily.
$ dataflirt scope --new-project --target=query-optimization READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h