← Glossary / Columnar Storage

What is Columnar Storage?

Columnar storage is a database architecture and file format paradigm where data is stored physically by column rather than by row. For scraping pipelines delivering wide datasets — like e-commerce catalogs with 80+ attributes — columnar formats like Parquet drastically reduce storage costs and accelerate analytical queries by allowing downstream engines to read only the specific fields they need, skipping the rest entirely.

Data EngineeringParquetOLAPCompressionData Lake
// 02 — definitions

Read what
you need.

Why storing data by column is the only sane way to handle the massive, wide datasets generated by modern scraping pipelines.

Ask a DataFlirt engineer →

TL;DR

Columnar storage writes all values of a single column contiguously on disk. This enables massive compression ratios because data in a single column is highly uniform. It also allows analytical engines like Snowflake or BigQuery to perform projection pushdown — reading only the columns requested in a query and ignoring the rest, cutting I/O costs by up to 90%.

01Definition & structure

Columnar storage organizes data on disk by column rather than by row. In a row-based system, reading the price of 10,000 products requires loading the entire row for all 10,000 products into memory. In a columnar system, the price values are stored contiguously, allowing the disk to read only the requested data.

This architecture is the foundation of modern analytical databases (OLAP) and data lakes. It is specifically designed for workloads that scan massive numbers of rows but only care about a subset of columns.

02How it works in practice

When a scraping pipeline delivers a dataset in a columnar format like Parquet, the file is divided into "row groups" (e.g., 100,000 records per group). Within each row group, the data is pivoted: all values for column A are written, followed by all values for column B.

The file footer contains metadata detailing the exact byte offsets for each column chunk, along with statistics like min/max values and null counts. When a query engine like Athena or BigQuery reads the file, it fetches the footer first, determines exactly which byte ranges contain the requested columns, and issues targeted HTTP GET requests to S3 to fetch only those bytes.

03Compression and encoding

Because columnar storage groups identical data types together, it achieves extreme compression. A column of boolean values (in_stock) can be stored as a bitset. A column of repeated strings (category) can use dictionary encoding, storing the string "Electronics" once and referencing it with a 1-byte integer.

This type-specific encoding, combined with general-purpose compression algorithms like Snappy or Zstandard, routinely reduces the physical footprint of scraped data by 70–90% compared to raw JSON.

04How DataFlirt handles it

We treat columnar delivery as a first-class requirement for enterprise pipelines. While our extraction workers process data row-by-row, our delivery layer buffers these records and writes them to S3 or GCS as strongly-typed Parquet files.

We enforce schema contracts before writing, ensuring that a column defined as an integer never accidentally contains a string, which would break columnar encoding. This guarantees that when our clients point their data warehouse at our delivery buckets, the ingestion is fast, cheap, and error-free.

05The JSON trap

Many data teams start by dumping scraped JSON directly into a data lake. This is the "JSON trap." Because JSON is row-based and schema-on-read, every analytical query forces the engine to scan the entire dataset, parse the text, and extract the relevant fields on the fly.

As the dataset grows to terabytes, query costs skyrocket and performance plummets. Converting the pipeline delivery format from JSON to a columnar format is usually the single highest-ROI optimization a data engineering team can make.

// 03 — the efficiency math

Why columnar
beats row-based.

The fundamental math behind why DataFlirt delivers all bulk datasets in Parquet by default. The I/O savings scale linearly with the width of the scraped schema.

I/O Cost (Columnar) = I/O = Rows × Σ Size(Requested_Columns)
Row-based reads the whole row; columnar reads only requested columns. Database Systems Architecture
Compression Ratio = C = Sizeraw / Sizecompressed
Columnar achieves 5x-10x better compression due to type uniformity. Apache Parquet benchmarks
DataFlirt Delivery Efficiency = E = 1 − (Parquet_Bytes / JSON_Bytes)
Typical payload reduction when moving from JSON lines to Parquet. DataFlirt Pipeline Metrics
// 04 — file structure trace

Inside a Parquet
row group.

Inspecting a scraped product catalog delivered as a Parquet file. Notice how the schema is embedded and data is chunked by column.

parquet-toolssnappy compressiondictionary encoding
edge.dataflirt.io — live
CAPTURED
// inspect metadata
$ parquet-tools meta s3://df-client-042/catalog_20260519.parquet
file schema: schema
----------------------------------------------------------------
sku: OPTIONAL BINARY O:UTF8 R:0 D:1
price: OPTIONAL DOUBLE R:0 D:1
in_stock: OPTIONAL BOOLEAN R:0 D:1
category: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:100000 TS:4820194 OFFSET:4
----------------------------------------------------------------
sku: SNAPPY DO:0 FPO:4 SZ:840212/1204112 VC:100000
price: SNAPPY DO:0 FPO:840216 SZ:400124/800000 VC:100000
in_stock: SNAPPY DO:0 FPO:1240340 SZ:12514/12500 VC:100000
category: SNAPPY DO:0 FPO:1252854 SZ:4512/840000 VC:100000 // DICT encoded

// Query: SELECT AVG(price) FROM catalog WHERE category = 'Electronics'
io.bytes_read: 404,636 // Only read price and category columns
io.bytes_skipped: 852,726 // Ignored sku and in_stock
// 05 — performance drivers

Where the speed
comes from.

The architectural features of columnar formats that make them mandatory for analytical workloads on scraped data.

FORMAT ·  ·  ·  ·  ·  ·   Apache Parquet
COMPRESSION ·  ·  ·  ·    Snappy / Zstd
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Projection Pushdown

I/O reduction · Skip reading unrequested columns entirely
02

Dictionary Encoding

compression · Map repeated strings to small integers
03

Run-Length Encoding

compression · Store repeated values as (value, count)
04

Predicate Pushdown

I/O reduction · Skip row groups based on min/max stats
05

Vectorized Processing

CPU efficiency · Process batches of column data in CPU cache
// 06 — delivery architecture

JSON is for debugging,

Parquet is for production.

While JSON is human-readable and great for API responses, it is a disaster for analytical storage. It repeats schema keys for every record and forces full-scan reads. DataFlirt's delivery layer automatically transforms nested JSON extraction outputs into flattened, strongly-typed Parquet files. This reduces our clients' AWS S3 storage costs and makes their Snowflake ingestion pipelines significantly faster.

Delivery Format Comparison

Metrics from a 5M record real-estate scrape delivered to S3.

format JSON LinesParquet
file.size 4.2 GB680 MB
schema.type ImplicitEmbedded
query.scan_price 4.2 GB scanned45 MB scanned
ingestion.time 14m 20s1m 45s
type.safety NoneStrict

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About columnar formats, analytical workloads, and how DataFlirt structures scraped data for delivery.

Ask us directly →
What is the difference between row-oriented and columnar storage? +
Row-oriented databases (like PostgreSQL or MySQL) store all fields of a single record together on disk. This is great for writing new records or fetching a single complete profile (OLTP). Columnar storage (like Parquet or Snowflake) stores all values of a single field together. This is vastly superior for analytics (OLAP) where you want to aggregate a few columns across millions of records without reading the entire dataset.
Why not just use CSV or JSON for scraped data? +
CSV and JSON are row-based text formats. They don't support projection pushdown (you have to read the whole file to get one column), they lack strict typing, and they compress poorly compared to binary formats. They are fine for small datasets or debugging, but at scale, they waste compute, I/O, and storage money.
When should I NOT use columnar storage? +
Don't use it for transactional systems where you frequently update or delete single records. Columnar formats are designed for append-only or batch-update workloads. If your application needs to constantly update the price of a single SKU in real-time, a row-based database is better.
How does DataFlirt handle nested scraped data in Parquet? +
Scraped data is often deeply nested (e.g., a product with an array of reviews). Parquet supports nested data structures natively. However, for maximum compatibility with downstream data warehouses, DataFlirt's delivery layer can optionally flatten nested arrays into separate relational tables or stringified JSONB columns depending on your ingestion requirements.
What is predicate pushdown? +
It's a performance optimization where the query engine uses metadata stored in the file (like min/max values for a column in a specific chunk) to skip reading data that can't possibly match the query filter. If you query WHERE price > 100 and a chunk's metadata says its max price is 50, the engine skips that chunk entirely.
Does columnar storage affect how I write my scraping extractors? +
No. Data extraction is inherently a row-based operation — you parse a single HTML page and extract a single record. The transformation to columnar storage happens at the delivery layer. DataFlirt buffers the extracted row-based records in memory or Kafka, and flushes them to S3 as columnar Parquet files in batches.
$ dataflirt scope --new-project --target=columnar-storage READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h