← Glossary / Data Compression

What is Data Compression?

Data compression is the algorithmic reduction of dataset size before storage or network transit. In scraping pipelines, where daily yields routinely exceed terabytes of raw JSON or HTML, compression isn't just a storage optimization—it's the primary lever for controlling egress costs and accelerating downstream ingestion. Choosing the right codec dictates whether your data warehouse spends its compute cycles parsing text or querying structured blocks.

Data EngineeringStorage CostZstd / GzipEgressParquet
// 02 — definitions

Shrink the
payload.

How encoding algorithms trade CPU cycles for network bandwidth, and why raw JSON is the enemy of scale.

Ask a DataFlirt engineer →

TL;DR

Data compression reduces the byte footprint of scraped records using algorithms like Zstandard, Gzip, or Snappy. For high-volume pipelines, compressing JSON into columnar formats like Parquet with Zstd encoding cuts storage by up to 90% and slashes cloud egress fees, making it mandatory for production data delivery.

01Definition & structure
Data compression involves encoding information using fewer bits than the original representation. In data engineering, this is typically lossless compression—meaning the exact original data can be perfectly reconstructed. It relies on finding and eliminating statistical redundancy in the dataset. Common codecs include:
  • Zstandard (Zstd) — Modern, fast, high ratio.
  • Gzip — The ubiquitous legacy standard.
  • Snappy — Extremely fast, but lower compression ratio.
  • Brotli — Optimized for web text and streams.
02How it works in practice
In a scraping pipeline, data is extracted into memory as structured objects, serialized to a format like JSON or CSV, and then passed through a compression algorithm before being written to disk or sent over the network. The algorithm builds a dictionary of repeated byte sequences (like the string "price_currency":"USD") and replaces them with short pointers. The receiver reverses the process using the same algorithm.
03The columnar advantage
Compression algorithms thrive on uniformity. Row-based formats like JSON interleave different data types (strings, numbers, booleans), which limits compression efficiency. Columnar formats like Parquet group identical data types together. A column of 10,000 boolean true values can be compressed to just a few bytes using run-length encoding before a general-purpose codec like Zstd is even applied.
04How DataFlirt handles it
We treat compression as a core component of the delivery contract. By default, all bulk data deliveries are shipped as Zstandard-compressed Parquet files. For clients requiring NDJSON or CSV, we enforce Gzip or Zstd compression at the bucket level. This standard reduces our AWS egress costs and ensures your ingestion pipelines aren't bottlenecked by network I/O.
05The micro-batch problem
Standard compression is inefficient on very small files (under 100KB) because the algorithm doesn't have enough data to build a good dictionary. If you are streaming scraped records individually, compressing each payload independently yields almost no benefit. The solution is either micro-batching (grouping records into larger chunks before compressing) or using pre-trained dictionaries.
// 03 — the math

How much does
it save?

Compression ratio is the baseline metric, but DataFlirt optimizes for the total cost of delivery—balancing compute time to compress against the network cost to egress.

Compression Ratio = Uncompressed Size / Compressed Size
Higher is better. JSON to Zstd Parquet often hits 8:1. Standard metric
Space Savings = 1 − (Compressed Size / Uncompressed Size)
Percentage of storage and transit volume eliminated. Standard metric
Egress Cost = E = (Records × Bytes/Record) / Cr × $0.09/GB
The actual dollar impact of your codec choice on AWS/GCP. DataFlirt delivery model
// 04 — pipeline delivery trace

Compressing 10M
records for S3.

A live trace of a daily catalog scrape being transformed from raw NDJSON into Zstandard-compressed Parquet before client delivery.

ZstandardParquetS3 Egress
edge.dataflirt.io — live
CAPTURED
// input stage
job.id: "export-catalog-IN-042"
format.in: "ndjson"
size.raw: 14.2 GB
records: 10,485,760

// compression phase
codec: "zstd"
level: 3 // balanced cpu/ratio
format.out: "parquet"
dict.training: ok // 128MB sample used

// output metrics
size.compressed: 1.8 GB
ratio: 7.88x
savings: 87.3%
time.cpu: 42.1s

// delivery
s3.upload: ok // s3://df-client-042/catalog/2026-05-19/
egress.cost: $0.16 // vs $1.27 uncompressed
// 05 — codec selection

Where the CPU
cycles go.

Different compression algorithms optimize for different constraints. Here is how the most common codecs rank by adoption across DataFlirt's enterprise delivery pipelines.

PIPELINES ·  ·  ·  ·  ·   400+ active
WINDOW ·  ·  ·  ·  ·  ·   30d trailing
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Zstandard (Zstd)

Default choice · High ratio, fast decompression
02

Gzip

Legacy standard · Universal support, slow compression
03

Snappy

Speed optimized · Low CPU overhead, poor ratio
04

Brotli

Stream optimized · Fast, moderate ratio
05

LZ4

Database internal · Extremely fast decompression
// 06 — delivery architecture

Don't ship text,

ship compressed columns.

Sending gigabytes of raw JSON over the wire is an amateur mistake that punishes both the sender and the receiver. DataFlirt defaults to delivering data in Zstandard-compressed Parquet. By organizing scraped data into columns before compressing, the algorithm can exploit type similarity—compressing a column of identical currency strings down to almost nothing. This reduces our egress costs and allows your data warehouse to query the files directly without expensive ETL unpacking.

Delivery Job Profile

Metrics from a high-volume real estate pipeline delivery.

pipeline.id re-listings-us
records.count 4.2M
raw.size 8.4 GB
compressed.size 940 MB
codec zstd_level_3
format parquet
egress.savings $0.67 per run

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About compression algorithms, columnar formats, dictionary training, and how DataFlirt optimizes data delivery.

Ask us directly →
Why use Zstandard instead of Gzip? +
Zstandard (Zstd) offers a Pareto improvement over Gzip. At the same compression ratio, it compresses faster and decompresses significantly faster. For data pipelines, this means lower CPU costs on our end and faster ingestion times on your end. Gzip is only used when legacy downstream systems demand it.
Does compression slow down the scraping pipeline? +
No. Compression happens asynchronously during the delivery phase, not during the fetch or extraction phases. The CPU time spent compressing the payload is almost always less than the network time saved by transmitting a smaller file.
Why combine Parquet with compression? +
Parquet is a columnar format. When you compress data column by column, the compression algorithm sees highly uniform data (e.g., a column of boolean flags or repeated category strings). This uniformity allows dictionary encoders to achieve massive compression ratios—often 80-90% smaller than compressed row-based JSON.
What is dictionary training in compression? +
For small, highly repetitive payloads, standard compression algorithms struggle because they build their reference dictionary on the fly. Dictionary training pre-computes a shared dictionary based on a sample of your scraped data. DataFlirt uses this for high-frequency micro-batch deliveries to maintain high compression ratios even on 10KB payloads.
Do I need to decompress the data before loading it into Snowflake or BigQuery? +
Usually, no. Modern cloud data warehouses natively support ingesting compressed files (like Gzip CSVs or Zstd Parquet). They decompress the data on the fly during the COPY INTO or external table query process, saving you an intermediate ETL step.
How does DataFlirt handle compression for real-time webhook delivery? +
For streaming or webhook deliveries, we use HTTP payload compression (typically Content-Encoding: gzip or br). The client must send an Accept-Encoding header indicating support. This reduces transit latency for large JSON payloads without requiring specialized file formats.
$ dataflirt scope --new-project --target=data-compression READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h