
What is Apache Parquet?

Apache Parquet is an open-source, column-oriented data file format designed for highly efficient data storage and retrieval. Unlike row-based formats like CSV or JSON, Parquet stores data by column, enabling aggressive compression and predicate pushdown. For scraping pipelines delivering millions of records, it's the difference between a 50 GB daily payload that chokes a downstream warehouse and a 4 GB file that queries in milliseconds.

Data Engineering · Columnar Storage · Compression · Predicate Pushdown · S3 Delivery
// 02 — definitions

Columns over rows.

Why writing data the way an analytical database reads it fundamentally changes pipeline economics and query speed.


TL;DR

Apache Parquet organizes data by column rather than by row. Values of the same type are stored adjacently, which can make files up to 10x smaller than the equivalent CSV. Each file embeds its schema and metadata, allowing analytical engines like Snowflake or BigQuery to skip irrelevant chunks entirely during reads.

01 Definition & structure
Apache Parquet is a binary, columnar storage format. In a row-based format like CSV, all data for a single record is stored contiguously. In Parquet, all values for a specific column are stored together. The file is divided into Row Groups, and within each Row Group, the data is stored column by column. This structure allows analytical queries to read only the specific columns they need, drastically reducing disk I/O.
02 Compression mechanics
Because Parquet stores values of the same type adjacently, it can apply highly specialized compression techniques. It uses Dictionary Encoding to replace repeated values with small integers, and Run-Length Encoding (RLE) to compress sequences of identical values. On top of this, it applies block-level compression algorithms like Snappy or ZSTD. The result is a file that is often 10% to 25% the size of the equivalent uncompressed CSV, depending on column cardinality.
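To make the two encodings concrete, here is a toy model in plain Python. It illustrates the idea only: Parquet's actual on-disk encoding is a bit-packed hybrid, and the function names below are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


categories = ["Apparel"] * 4 + ["Tools"] * 3 + ["Apparel"] * 2
dictionary, indices = dictionary_encode(categories)
runs = run_length_encode(indices)

print(dictionary)  # ['Apparel', 'Tools']
print(indices)     # [0, 0, 0, 0, 1, 1, 1, 0, 0]
print(runs)        # [(0, 4), (1, 3), (0, 2)]
```

Nine strings shrink to a two-entry dictionary plus three runs, which is why low-cardinality columns such as category or brand compress so well.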
03 Predicate pushdown
Parquet files contain embedded metadata at the file, row group, and page levels. This metadata includes the minimum and maximum values for every column chunk. When a query engine executes a filter (e.g., WHERE price > 100), it reads the metadata first. If a row group's max price is 50, no row in it can match, so the engine skips reading that entire chunk of data from disk. This is called predicate pushdown, and it is a key reason selective queries stay fast on terabyte-scale datasets.
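The skipping logic can be modelled in a few lines of plain Python. This is a simplified sketch, not how any particular engine implements it: the RowGroup class and scan function are hypothetical, and the statistics are computed once at construction to mimic values stored at write time.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy row group: price values plus min/max statistics fixed at 'write' time."""
    prices: list
    min_price: float = field(init=False)
    max_price: float = field(init=False)

    def __post_init__(self):
        self.min_price = min(self.prices)
        self.max_price = max(self.prices)


def scan_price_greater_than(row_groups, threshold):
    """Return matching prices and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Statistics check: if even the max fails the filter, no row can match.
        if group.max_price <= threshold:
            continue
        groups_read += 1
        matches.extend(p for p in group.prices if p > threshold)
    return matches, groups_read


row_groups = [
    RowGroup([4.99, 12.50, 50.00]),    # max 50: skipped for price > 100
    RowGroup([80.00, 120.00, 99.99]),  # max 120: must be read
    RowGroup([150.00, 2499.00]),       # max 2499: must be read
]

matches, groups_read = scan_price_greater_than(row_groups, 100)
print(matches, groups_read)  # [120.0, 150.0, 2499.0] 2
```

Skipping works best when the filtered column is sorted or clustered, because each row group then covers a narrow min/max range.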
04 How DataFlirt handles it
We treat Parquet as the default delivery format for all enterprise pipelines. Our extraction workers serialize data directly to Parquet using Rust-based Arrow implementations, enforcing strict schema validation before a single byte hits your S3 bucket. We optimize row group sizes (typically 128 MB to 256 MB) to balance memory consumption during writing with optimal read performance in Snowflake and BigQuery.
05 The schema evolution trap
Unlike JSON, Parquet is strictly typed. If a target website suddenly changes a price field from a float to a string (e.g., "Call for price"), a naive scraper will crash the Parquet writer or corrupt the downstream table. Handling schema evolution requires explicit rules: casting the string to null, quarantining the record, or using a table format like Apache Iceberg on top of the Parquet files to manage the schema transition gracefully.
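The first two rules (cast to null, or quarantine) might look like the sketch below. It is an illustration under stated assumptions, not DataFlirt's validation layer: the helper names, the on_error switch, and the sample records are all hypothetical.

```python
import math


def coerce_price(raw):
    """Return (price, ok). Unparseable input yields (None, False) instead of raising."""
    if raw is None:
        return None, True
    if isinstance(raw, bool):  # bool is a subclass of int; treat it as invalid
        return None, False
    if isinstance(raw, (int, float)):
        value = float(raw)
    else:
        try:
            value = float(str(raw).replace("$", "").replace(",", "").strip())
        except ValueError:
            return None, False
    if not math.isfinite(value):
        return None, False
    return value, True


def split_batch(records, on_error="quarantine"):
    """Apply one explicit rule: 'quarantine' bad records, or cast bad prices to 'null'."""
    clean, quarantined = [], []
    for record in records:
        price, ok = coerce_price(record.get("price_usd"))
        if ok or on_error == "null":
            clean.append({**record, "price_usd": price})
        else:
            quarantined.append(record)
    return clean, quarantined


batch = [
    {"product_id": "a", "price_usd": 19.99},
    {"product_id": "b", "price_usd": "Call for price"},
    {"product_id": "c", "price_usd": "$1,299.00"},
]

clean, quarantined = split_batch(batch)
print(len(clean), len(quarantined))  # 2 1
```

Either way, the decision is made before the writer sees the batch, so the DOUBLE column only ever receives floats or nulls.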
// 03 — storage economics

How much space does it save?

Columnar formats exploit data redundancy. The math below dictates how DataFlirt calculates storage and egress costs for enterprise data feeds, and why we default to Parquet for high-volume pipelines.

Compression Ratio: $C_r = \mathrm{Size}_{\mathrm{CSV}} / \mathrm{Size}_{\mathrm{Parquet}}$
Typically 4:1 to 10:1 depending on column cardinality and data types. (Storage baseline)
Query I/O Cost: $\mathrm{Cost} = \mathrm{BytesScanned} \times \mathrm{Rate}_{\mathrm{Cloud}}$
Parquet minimizes BytesScanned via column projection and row group skipping. (BigQuery / Athena pricing models)
DataFlirt Egress Efficiency: $E = (\mathrm{Records} \times \mathrm{AvgRowBytes}) / \mathrm{ParquetFileSize}$
E > 8.5 across our e-commerce catalog pipelines as of v2026.5. (Internal SLO)
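As a worked example, the three formulas can be evaluated directly. The 50 GB and 4 GB sizes echo the payload comparison at the top of this page; the scan rate, the 1 GB scanned by the Parquet query, and the record figures are hypothetical inputs chosen for illustration, not published prices or measured DataFlirt results.

```python
MB = 1_000_000
GB = 1_000_000_000
TB = 1_000_000_000_000


def compression_ratio(size_csv, size_parquet):
    """C_r = Size_CSV / Size_Parquet"""
    return size_csv / size_parquet


def query_io_cost(bytes_scanned, rate_per_tb):
    """Cost = BytesScanned x Rate, with the rate quoted per terabyte scanned."""
    return bytes_scanned / TB * rate_per_tb


def egress_efficiency(records, avg_row_bytes, parquet_file_size):
    """E = (Records x AvgRowBytes) / ParquetFileSize"""
    return records * avg_row_bytes / parquet_file_size


RATE_PER_TB = 5.00  # hypothetical on-demand scan rate, in dollars

cr = compression_ratio(50 * GB, 4 * GB)
cost_csv = query_io_cost(50 * GB, RATE_PER_TB)    # CSV: the whole file is scanned
cost_parquet = query_io_cost(1 * GB, RATE_PER_TB)  # assumed: projection and skipping leave 1 GB
e = egress_efficiency(1_000_000, 900, 100 * MB)

print(f"C_r = {cr:.1f}")
print(f"CSV scan = ${cost_csv:.3f}, Parquet scan = ${cost_parquet:.3f}")
print(f"E = {e:.1f}")
```

Under these assumptions the per-query cost falls by the same factor as the bytes scanned, which is why BytesScanned is the term worth minimizing.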
// 04 — file inspection

Inside a 10M record Parquet payload.

Using parquet-tools to inspect a daily delivery of scraped product data before it hits the client's Snowflake staging bucket. Notice the embedded schema and block-level metadata.

parquet-tools · snappy compression · schema metadata
$ parquet-tools meta s3://df-client-042/catalog/2026-05-19/part-000.parquet
file schema: schema
creator: parquet-rs version 51.0.0 (build 2026)

// Schema definition embedded in the file
message schema {
OPTIONAL BYTE_ARRAY product_id (STRING);
OPTIONAL DOUBLE price_usd;
OPTIONAL BOOLEAN in_stock;
OPTIONAL BYTE_ARRAY category (STRING);
}

// Row group metadata enables predicate pushdown
row group 1: RC:1000000 TS:45812000 OFFSET:4
--------------------------------------------------------------------------------
price_usd: DOUBLE SNAPPY DO:0 FPO:4 SZ:800000/1200000/1.50 VC:1000000
min: 4.99, max: 2499.00, nulls: 12 // stats used for skipping
category: BYTE_ARRAY SNAPPY DO:0 FPO:800004 SZ:150000/4000000/26.66 VC:1000000
min: "Apparel", max: "Tools", nulls: 0

// Compression efficiency
total_uncompressed: 4.2 GB
total_compressed: 412 MB // 10.1x ratio
// 05 — performance drivers

Where the speed actually comes from.

The architectural features of Parquet that reduce I/O and compute costs when querying scraped datasets. Ranked by impact on downstream warehouse performance.

FORMAT: Binary Columnar
COMPRESSION: Snappy / ZSTD
READ PATTERN: Sequential I/O
01 Column Projection
I/O reduction · Only read the specific columns requested in the SELECT clause
02 Predicate Pushdown
Compute savings · Skip entire row groups using embedded min/max statistics
03 Dictionary Encoding
Storage efficiency · Compresses repeated strings (e.g., categories, brands) into integers
04 Run-Length Encoding
Storage efficiency · Compresses sequential identical values into a single value and count
05 Block-Level Compression
Network speed · Snappy or ZSTD applied to homogeneous column chunks
// 06 — our stack

Typed at the edge, queried in the warehouse.

Writing Parquet isn't just about saving disk space; it's about enforcing a data contract. Because Parquet requires a strict schema, type coercion failures are caught at the extraction layer, not in your downstream dbt models. DataFlirt writes Parquet natively from our extraction workers, partitioning by date and target domain, so your data engineering team spends zero compute cycles parsing JSON strings.

Parquet Write Job

Live metrics from an extraction worker serializing a batch of scraped records.

job.id pq-writer-099
schema.version v4.2 (enforced)
records.input 2,500,000
type_errors 0
codec ZSTD
file.size 184 MB (ratio: 8.4x)
s3.upload success


// 07 — FAQ

Common questions.

About columnar storage, schema evolution, compression codecs, and how DataFlirt delivers production-grade Parquet files.

Why should I use Parquet instead of JSON or CSV for scraped data?
If you are loading data into a warehouse (Snowflake, BigQuery, Redshift) or querying it with Athena/Presto, Parquet is vastly superior. It enforces data types, can compress to roughly a tenth of the CSV size, and allows the query engine to scan only the columns you ask for. JSON and CSV require full-file scans and expensive string parsing on every query.
Can I append new scraped records to an existing Parquet file?
No. Parquet files are effectively immutable. The footer, which holds the schema and the row group metadata and statistics, is written at the end of the file, so you cannot simply append rows to a finished file. The standard pattern is to write new data as new Parquet files in a partitioned directory structure (e.g., /year=2026/month=05/) and let the query engine read the directory as a single table.
How does DataFlirt handle schema drift when writing Parquet?
Schema drift is the enemy of Parquet. If a target site adds a new field, we don't blindly write it and break your downstream tables. Our extraction layer validates against a versioned schema contract. New fields are quarantined or mapped to an overflow JSON column until the schema contract is explicitly bumped and communicated to your data team.
Which compression codec should I use: Snappy, GZIP, or ZSTD?
ZSTD is the modern default. It offers compression ratios close to GZIP but with decompression speeds rivaling Snappy. DataFlirt defaults to ZSTD level 3 for all Parquet deliveries unless a client specifically requests Snappy for legacy Hadoop compatibility.
Is Parquet human-readable?
No, it is a binary format. You cannot open it in a text editor or Excel. To inspect a Parquet file, you need tools like parquet-tools, Python's pandas/pyarrow, or a local analytical engine like DuckDB.
How does partitioning work with Parquet deliveries?
Partitioning organizes Parquet files into a directory hierarchy based on column values (e.g., date, country, category). When you query WHERE country = 'US', the engine skips the 'UK' and 'IN' directories entirely. DataFlirt works with your data team to define the optimal partition keys based on your most frequent query patterns.
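Directory pruning can be modelled without opening a single file, because the partition values live in the path. This plain-Python sketch uses hypothetical paths and helper names to show the idea; real engines do the equivalent against an object store listing or a catalog.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Parse Hive-style key=value directory segments from a file path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(k) == v for k, v in wanted.items())
    ]


paths = [
    "catalog/country=US/date=2026-05-19/part-000.parquet",
    "catalog/country=UK/date=2026-05-19/part-000.parquet",
    "catalog/country=IN/date=2026-05-19/part-000.parquet",
    "catalog/country=US/date=2026-05-20/part-000.parquet",
]

us_files = prune(paths, country="US")
print(us_files)
```

Choosing partition keys that match the most common WHERE clauses is what makes this pruning pay off; too many small partitions has its own cost in file count.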

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h