← Glossary / BigQuery

What is BigQuery?

BigQuery is Google Cloud's fully managed, serverless enterprise data warehouse designed to query petabytes of data in seconds. For scraping pipelines, it serves as the ultimate analytical sink — where raw, nested JSON payloads from the web are flattened, deduplicated, and transformed into structured tables. Because compute and storage are decoupled, you can dump terabytes of daily scrape logs into it for pennies, only paying for the compute when your downstream analysts actually run their aggregations.

Data WarehouseServerlessColumnar StorageGCPOLAP
// 02 — definitions

Petabytes in,
seconds out.

The mechanics of how a serverless columnar database ingests massive web scraping payloads and makes them instantly queryable.

Ask a DataFlirt engineer →

TL;DR

BigQuery separates storage from compute, meaning you can store years of raw scraped HTML and JSON at roughly the cost of standard cloud storage, while leveraging Google's massive distributed infrastructure to run complex analytical queries across billions of rows in seconds. It's the default destination for high-volume data pipelines.

01Definition & structure
BigQuery is a serverless, highly scalable data warehouse. Under the hood, it relies on two core technologies: Colossus (Google's distributed file system) for storage, and Dremel (a distributed query engine) for compute. Data is stored in a proprietary columnar format called Capacitor. This separation of storage and compute means you can store massive amounts of scraped data cheaply, and Google dynamically allocates thousands of CPUs to execute your queries only when you run them.
02How it works in practice
In a scraping context, data usually lands in a cloud storage bucket (like GCS) as JSONL or Parquet files. A load job is then triggered to ingest these files into BigQuery. Once loaded, data engineers and analysts use standard SQL to query the data. Because it's columnar, a query that calculates the average price of a product across 10 billion rows will only read the "price" column from disk, completing in seconds and costing a fraction of a full table scan.
03Handling nested scraped data
Web data is inherently hierarchical (e.g., a real estate listing with multiple images, amenities, and historical price changes). Traditional relational databases force you to normalize this into multiple tables. BigQuery supports REPEATED and RECORD fields, allowing you to store the entire nested JSON structure in a single row. You can then use the UNNEST() function in SQL to flatten the arrays on the fly at query time.
04How DataFlirt handles it
We treat BigQuery as a first-class delivery destination. Instead of handing you raw JSON files, our pipelines can write directly to your GCP project. We handle the schema mapping, enforce strict data typing, and configure table partitioning by ingestion date. For high-frequency pipelines, we use micro-batching via GCS to ensure data freshness without incurring the heavy costs of the streaming insert API.
05Did you know?
The most expensive mistake you can make in BigQuery is running SELECT * on a large table just to preview the data. Because BigQuery charges by bytes scanned, a SELECT * LIMIT 10 on a 5 TB table will still scan the entire 5 TB, costing you roughly $30 for a single query. Always use the free "Preview" tab in the console, or specify only the columns you need.
// 03 — the cost model

How BigQuery
bills your pipeline.

BigQuery pricing is notoriously spiky if you don't understand the underlying architecture. DataFlirt optimizes delivery to minimize both storage and query costs for our clients.

On-demand query cost = C = (bytes_scanned / 1012) × $6.25
Columnar storage means selecting fewer columns drastically reduces cost. GCP Pricing, 2026
Active storage cost = S = GB_stored × $0.02
Drops by 50% automatically if data isn't modified for 90 days. GCP Pricing, 2026
Streaming insert cost = I = (MB_streamed / 1024) × $0.05
Batch loading via GCS is free; streaming costs money but offers real-time availability. GCP Pricing, 2026
// 04 — pipeline delivery

Loading 10M records
into the warehouse.

A trace of DataFlirt's delivery worker executing a batch load job from a Google Cloud Storage bucket into a partitioned BigQuery table.

bq loadJSONLpartitioned
edge.dataflirt.io — live
CAPTURED
// init load job
job.id: "df_load_20260519_88a1"
source.uri: "gs://df-client-042/raw/2026-05-19/*.jsonl"
target.table: "client_042.raw_scrapes.product_listings"

// schema validation
schema.autodetect: false
schema.provided: "gs://df-client-042/schemas/v7.json"
validation: ok

// execution
state: RUNNING
bytes_processed: 4.2 GB
rows_inserted: 10,421,099
bad_records: 0

// completion
state: DONE
duration: 14.2s
cost: $0.00 // batch loads are free
// 05 — cost drivers

Where the budget
actually leaks.

Ranked by share of unexpected BigQuery costs across client data warehouses. Unoptimized queries and poor partitioning are the primary culprits.

PIPELINES DELIVERED ·   150+ to BQ
AVG TABLE SIZE ·  ·  ·    4.2 TB
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

SELECT * queries

100% scan · scans every column, negating columnar benefits
02

Missing partition filters

full table scan · querying without date bounds on time-series data
03

Streaming inserts

$0.05 / GB · using streaming API when batch loading would suffice
04

Unnested JSON parsing

CPU heavy · parsing JSON strings at query time instead of load time
05

Active storage bloat

$0.02 / GB · failing to set table expiration policies for raw logs
// 06 — delivery architecture

Partitioned by default,

clustered for performance.

When DataFlirt delivers scraped datasets into a client's BigQuery environment, we don't just dump flat files. We structure the destination tables to minimize your downstream query costs. Every table is partitioned by ingestion date and clustered by high-cardinality lookup keys like domain or category. This ensures that when your analysts query yesterday's pricing data, they only scan yesterday's partition — cutting query costs by up to 99% compared to a naive flat-table setup.

BQ Table Configuration

Standard DDL parameters for a DataFlirt delivery table.

table.name raw_pricing_events
partition_by DATE(ingested_at)
cluster_by [domain, category]
require_partition_filter true
expiration_days 730
nested_fields attributes (STRUCT)
delivery.method Batch Load (GCS)

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About BigQuery architecture, cost optimization, schema management, and how DataFlirt delivers scraped data at scale.

Ask us directly →
What's the difference between BigQuery and PostgreSQL? +
PostgreSQL is an OLTP (Online Transaction Processing) database designed for fast row-level inserts and updates. BigQuery is an OLAP (Online Analytical Processing) data warehouse designed for massive aggregations across billions of rows. You shouldn't use BigQuery to back a live web app, and you shouldn't use Postgres to run analytics on 5 years of scraped pricing data.
Should I store raw scraped HTML in BigQuery? +
Yes, because storage is cheap ($0.02/GB). However, you must store the raw HTML in a separate column — or ideally a separate table — from your extracted structured data. Because BigQuery is columnar, as long as your analysts don't include the HTML column in their SELECT statements, storing it costs you nothing in query fees.
How does DataFlirt handle schema drift in BigQuery? +
We use explicit schema evolution. When a target site adds a new field, we update our extraction schema and append the new column to your BigQuery table as a nullable field. We never use BigQuery's autodetect feature in production, as it can lead to catastrophic type coercion failures (e.g., inferring an integer for a column that later receives a string).
Is the streaming insert API better than batch loading? +
Only if you need sub-minute latency. For 95% of scraping pipelines, 15-minute micro-batches via Google Cloud Storage are superior. Batch loads are completely free, whereas the streaming API charges $0.05 per GB. At scale, streaming inserts become a massive, unnecessary line item on your GCP bill.
How do I avoid massive unexpected query bills? +
Two rules: First, enforce require_partition_filter = true on all large tables so analysts are forced to restrict their queries to specific date ranges. Second, never run SELECT * in production. Specify only the columns you actually need to aggregate.
Can BigQuery handle highly nested scraped data? +
Exceptionally well. BigQuery has native support for ARRAY and STRUCT data types. This means you can load complex, nested JSON payloads (like a product with an array of reviews, each containing a struct of author details) directly into the database without having to flatten it into multiple relational tables first.
$ dataflirt scope --new-project --target=bigquery READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h