
What is Data Compaction?

Data compaction is the background process of merging thousands of small, incrementally written files into fewer, larger files optimized for read performance. In scraping pipelines that stream data continuously, the "small file problem" destroys query speed and inflates cloud storage API costs. Compaction reorganizes these fragmented writes into contiguous columnar blocks, trading background compute for massive downstream read efficiency.

Storage · Parquet · Small File Problem · LSM Trees · Data Engineering
// 02 — definitions

Fixing the small
file problem.

Why streaming scraped data directly into a data lake creates a fragmented mess, and how compaction restores query performance.


TL;DR

Scraping pipelines write data as it arrives, creating thousands of tiny files. Query engines like Athena, BigQuery, or Snowflake choke on the metadata overhead of reading them. Data compaction runs asynchronously to merge these micro-batches into 128 MB–1 GB Parquet files, usually managed through a table format such as Iceberg, reducing query times from minutes to seconds and slashing S3 GET request costs.

01 Definition & structure

Data compaction is an asynchronous data engineering process that reads multiple small files from a storage layer, merges their contents, and writes them back as fewer, optimally sized files. It is a critical maintenance task for data lakes and columnar databases.

A typical compaction job performs three tasks simultaneously:

  • Bin-packing: Combining small files to reach a target size (usually 128MB to 1GB).
  • Garbage collection: Removing records that were marked for deletion (tombstones) in previous micro-batches.
  • Data clustering: Sorting or Z-ordering the records to improve data skipping during future queries.
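
The bin-packing step can be sketched in a few lines. This is a minimal illustration using a first-fit-decreasing heuristic, not the planner of any particular engine; the `DataFile` type, the `plan_bins` function, and the file names are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_bins(files, target_bytes=256 * MB):
    """Group files into bins whose total size stays at or below target_bytes.

    First-fit decreasing: sort largest first, place each file in the first
    bin with room, and open a new bin when none fits.
    """
    bins = []  # each entry is [remaining_capacity, [files]]
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        for b in bins:
            if f.size_bytes <= b[0]:
                b[0] -= f.size_bytes
                b[1].append(f)
                break
        else:
            # No existing bin has room (or the file alone exceeds the target).
            bins.append([max(target_bytes - f.size_bytes, 0), [f]])
    return [b[1] for b in bins]


# Hypothetical input: 10,000 micro-batches of 100 KB each (about 977 MB).
files = [DataFile(f"batch-{i:05d}.parquet", 100 * 1024) for i in range(10_000)]
groups = plan_bins(files)
print(len(files), "input files ->", len(groups), "output files")
```

Each resulting group would then be read, merged, and rewritten as one output file. Garbage collection and clustering would happen during that rewrite and are left out of this sketch.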

02 The small file problem

Scraping pipelines naturally produce fragmented data. If a scraper extracts 50 records per minute and writes them to S3 immediately to prevent data loss, it creates 1,440 tiny files per day per worker. If you have 100 workers, that's 144,000 files a day.

When an analyst runs a SQL query against that bucket using Athena or Snowflake, the query engine must make an HTTP GET request to read the metadata of every single file before it can process the data. The network overhead dwarfs the actual data processing time, turning a 3-second query into a 5-minute query and racking up massive cloud API charges.

03 LSM Trees and Write Amplification

Compaction is the core mechanism behind Log-Structured Merge (LSM) trees, the architecture used by storage engines such as RocksDB and Cassandra, and it plays the same role in ClickHouse's MergeTree tables. These systems accept writes quickly because they never update data in place: new records are buffered and flushed as small, immutable, sorted files. In the background, they continuously merge those files into larger sorted runs.

This introduces a tradeoff known as write amplification: a single scraped record might be written to disk multiple times as it gets merged into progressively larger files. You are intentionally spending background disk I/O and CPU to ensure that when a user eventually queries the data, the read path is as fast as possible.
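
To make the tradeoff concrete, here is a toy simulation of size-tiered merging. It is a simplified model under stated assumptions (equal-sized batches, a fixed fanout, no deletes), not the compaction policy of any specific database; the function name and parameters are hypothetical.

```python
def simulate_write_amplification(num_batches, batch_bytes, fanout=4):
    """Whenever `fanout` files share a tier, merge them into one file on the
    next tier. Returns (write_amplification, remaining_file_count)."""
    tiers = {}   # tier number -> list of file sizes in that tier
    written = 0  # every byte written to disk, including rewrites
    for _ in range(num_batches):
        written += batch_bytes  # the initial write of the micro-batch
        tiers.setdefault(0, []).append(batch_bytes)
        tier = 0
        while len(tiers.get(tier, [])) >= fanout:
            merged = sum(tiers[tier])
            tiers[tier] = []
            written += merged  # the merge rewrites every byte it touches
            tiers.setdefault(tier + 1, []).append(merged)
            tier += 1
    ingested = num_batches * batch_bytes
    remaining = sum(len(sizes) for sizes in tiers.values())
    return written / ingested, remaining


amplification, file_count = simulate_write_amplification(64, 1_000, fanout=4)
print(f"write amplification: {amplification:.1f}x, files left: {file_count}")
```

In this model, 64 batches with a fanout of 4 pass through three merge levels, so each byte is written four times in total and a single file remains.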

04 How DataFlirt handles it

We decouple scraping from storage optimization. Our scrapers write raw NDJSON micro-batches to an internal ingestion zone. We then use Apache Spark and Iceberg to run scheduled compaction jobs that transform this raw feed into the final client-facing dataset.

For high-throughput pipelines, we employ a two-tier compaction strategy: a fast, lightweight bin-packing job runs every hour to keep file counts under control, and a heavy, compute-intensive sort job runs nightly to globally order the data by the client's most frequently queried dimensions.
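
As an illustration of how such a two-tier schedule might be expressed, the sketch below builds calls to Iceberg's Spark `rewrite_data_files` procedure. It is an assumption about one possible setup, not DataFlirt's actual job definition: the helper function, the catalog name `lake`, the table name `scrapes.prices`, and the Z-order columns are all hypothetical.

```python
def rewrite_data_files_sql(catalog, table, strategy="binpack",
                           sort_order=None,
                           target_file_size_bytes=256 * 1024 * 1024):
    """Build a CALL statement for Iceberg's rewrite_data_files Spark procedure."""
    args = [f"table => '{table}'", f"strategy => '{strategy}'"]
    if sort_order:
        args.append(f"sort_order => '{sort_order}'")
    args.append(
        f"options => map('target-file-size-bytes', '{target_file_size_bytes}')"
    )
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hourly tier: cheap bin-packing to keep file counts down.
hourly = rewrite_data_files_sql("lake", "scrapes.prices")

# Nightly tier: a full sort, here Z-ordered on two hypothetical columns.
nightly = rewrite_data_files_sql(
    "lake", "scrapes.prices",
    strategy="sort", sort_order="zorder(category, brand)",
)

print(hourly)
print(nightly)
# With a configured SparkSession, each statement would run via spark.sql(...).
```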

05 Did you know?

Compaction doesn't just improve read speed; it actually reduces total storage size. Columnar formats like Parquet use dictionary encoding and run-length encoding to compress data. These compression algorithms are highly effective on large blocks of data but nearly useless on tiny 10KB files. Merging 1,000 small files into one large file often shrinks the total byte size by 30–50% simply because the compression algorithms finally have enough context to work efficiently.
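
The effect is easy to reproduce in miniature. The sketch below uses `zlib` on JSON lines as a stand-in for Parquet's encodings, so the numbers it prints say nothing about the 30–50% figure above; it only shows that the same records compress better as one block than as many small ones. The record fields are made up.

```python
import json
import zlib

# Hypothetical scraped records with highly repetitive values.
records = [
    {
        "sku": f"SKU-{i % 50:04d}",
        "brand": ["acme", "globex", "initech"][i % 3],
        "currency": "USD",
        "price": 100 + i % 7,
    }
    for i in range(20_000)
]

# Split into 1,000 "small files" of 20 records each.
chunks = [records[i:i + 20] for i in range(0, len(records), 20)]


def encode(rows):
    """Serialize rows as newline-delimited JSON bytes."""
    return "\n".join(json.dumps(r) for r in rows).encode()


separate = sum(len(zlib.compress(encode(chunk))) for chunk in chunks)
merged = len(zlib.compress(encode(records)))

print(f"compressed separately: {separate} bytes")
print(f"compressed as one block: {merged} bytes")
```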

// 03 — the storage math

The cost of
fragmentation.

Query engines spend more time opening files than reading data when files are too small. DataFlirt monitors file size distributions to trigger compaction jobs automatically before downstream SLAs are impacted.

Fragmented query time: $T_{\text{query}} \approx (N \times T_{\text{open}}) + \frac{S}{B}$
When $N$ (the file count) is large, the time spent opening files, $N \times T_{\text{open}}$, dominates the transfer time $S / B$, where $S$ is the total data size and $B$ is the read bandwidth. (Data Lake tuning basics)
Optimal file size target: $128\ \text{MB} \le S_{\text{file}} \le 1\ \text{GB}$
The sweet spot for Parquet/ORC row group reading and HDFS block alignment. (Apache Iceberg guidelines)
Compaction ratio: $C = N_{\text{before}} / N_{\text{after}}$
A 100:1 ratio means 100 micro-batches were merged into a single optimized file. (DataFlirt storage SLO)
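
A quick worked example of the query-time and compaction-ratio formulas, as a rough sketch. The file counts match the job trace in the next section; the per-file open time and the bandwidth are assumed values chosen for illustration. The model also assumes files are opened one after another, whereas real engines parallelize, and it ignores the compression gain from compaction.

```python
def query_time_seconds(n_files, total_bytes, t_open_s, bandwidth_bytes_per_s):
    """T_query ~ N * T_open + S / B, with files opened serially."""
    return n_files * t_open_s + total_bytes / bandwidth_bytes_per_s


def compaction_ratio(n_before, n_after):
    """C = N_before / N_after."""
    return n_before / n_after


# Assumed values, not measurements.
T_OPEN_S = 0.05                  # 50 ms to open one object
BANDWIDTH = 100 * 1024 * 1024    # 100 MB/s sustained read
total_bytes = 4_218 * 42 * 1024  # 4,218 files of 42 KB each

fragmented = query_time_seconds(4_218, total_bytes, T_OPEN_S, BANDWIDTH)
compacted = query_time_seconds(3, total_bytes, T_OPEN_S, BANDWIDTH)
ratio = compaction_ratio(4_218, 3)

print(f"fragmented: {fragmented:.1f} s")
print(f"compacted:  {compacted:.1f} s")
print(f"compaction ratio: {ratio:.0f}:1")
```

Under these assumptions the transfer term is identical in both cases; the entire difference comes from the per-file open cost.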
// 04 — compaction job trace

Merging 4,200 micro-batches
into 3 Parquet files.

A scheduled compaction run on an S3 bucket receiving continuous pricing updates from a distributed scraping fleet.

Apache Iceberg · S3 · Z-Ordering
// scan target partition
partition: "date=2026-05-19"
files_found: 4,218
avg_file_size: "42 KB" // severe fragmentation

// execute compaction
action: "rewrite_data_files"
strategy: "binpack"
target_file_size: "256 MB"
files_written: 3

// commit and cleanup
manifest_updated: true
orphaned_files_deleted: 4,218
storage_reclaimed: "18.4 MB" // compression efficiency gained
query_latency_est: -94%
// 05 — compaction triggers

When to rewrite
the data.

Compaction consumes compute. Running it too often wastes money; running it too rarely kills query performance. These are the primary triggers used to schedule compaction jobs.

AVG COMPACTION RUN: every 4 hours
TARGET SIZE: 256 MB
UPDATED: 2026-05-19

01 File count threshold
> 1,000 files · Absolute limit before metadata operations bottleneck the query planner

02 Average file size
< 10 MB · Triggers bin-packing to reach the 128 MB+ optimal threshold

03 Tombstone / delete ratio
> 15% deleted · For CDC pipelines, purges soft-deleted records to reclaim space

04 Time-based schedule
cron trigger · Nightly or hourly runs during off-peak compute windows

05 Query latency degradation
SLO breach · Reactive trigger when downstream read performance drops
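
The three metric-based triggers above can be combined into a simple check. This is a minimal sketch using the thresholds listed here; the `PartitionStats` type and the `compaction_reasons` function are hypothetical, and the cron and SLO triggers are left to the scheduler and monitoring system.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class PartitionStats:
    file_count: int
    total_bytes: int
    deleted_rows: int
    total_rows: int


def compaction_reasons(stats, max_files=1_000, min_avg_bytes=10 * MB,
                       max_delete_ratio=0.15):
    """Return the names of the thresholds this partition currently breaches."""
    reasons = []
    if stats.file_count > max_files:
        reasons.append("file_count")
    # A single small file cannot be bin-packed with anything, so skip it.
    if stats.file_count > 1 and stats.total_bytes / stats.file_count < min_avg_bytes:
        reasons.append("avg_file_size")
    if stats.total_rows and stats.deleted_rows / stats.total_rows > max_delete_ratio:
        reasons.append("delete_ratio")
    return reasons


fragmented = PartitionStats(4_218, 4_218 * 42 * 1024, 0, 1_000_000)
healthy = PartitionStats(3, 3 * 200 * MB, 10, 1_000_000)
cdc_heavy = PartitionStats(40, 40 * 200 * MB, 200_000, 1_000_000)

for name, stats in [("fragmented", fragmented), ("healthy", healthy),
                    ("cdc_heavy", cdc_heavy)]:
    print(name, compaction_reasons(stats))
```

A scheduler could run a compaction job whenever the returned list is non-empty, and pick bin-packing or a full rewrite depending on which reasons fired.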
// 06 — storage architecture

Write fast,

read faster.

DataFlirt separates the write path from the read path. Our scraping workers stream raw JSON lines into an ingestion bucket to minimize latency and avoid memory bloat on the worker nodes. A background compaction engine then asynchronously converts these micro-batches into heavily compressed, Z-ordered Parquet files in the delivery bucket. Clients only ever query the optimized read path, completely insulated from the chaos of the raw ingestion stream.

compaction_job.status

Live metrics from a background compaction worker on a retail dataset.

job.id cmp-rt-099
input.files 12,405 (fragmented)
output.files 14 (optimized)
format Parquet · Snappy
z_order_keys [category, brand]
compute.cost $0.14
status committed


// 07 — FAQ

Common
questions.

About the small file problem, table formats, read amplification, and how DataFlirt manages storage optimization for high-volume pipelines.

What exactly is the 'small file problem'?
When you store data in thousands of tiny files (e.g., 10KB each) instead of a few large ones, query engines spend more time reading file metadata and establishing network connections than actually processing data. In cloud storage like S3, you also pay per GET request, so reading 10,000 small files costs roughly 10,000x more in request fees than fetching the same data as a single object.
Why not just write large files directly from the scraper?
Because scraping is inherently streaming and unpredictable. To write a 256MB file, a scraper would have to buffer gigabytes of raw HTML in memory for hours, risking massive data loss if the worker crashes or gets blocked. Writing micro-batches ensures data is safely persisted immediately; compaction cleans it up later.
Does compaction cause downtime for data consumers?
No. Modern table formats like Apache Iceberg and Delta Lake give readers snapshot isolation, a form of Multi-Version Concurrency Control (MVCC). The compaction job writes new files in the background and then commits them atomically as a new table snapshot. Readers that started before the commit keep reading the old files; readers that start after it see the new ones. There are no locks or blocked reads.
How does compaction handle deduplication?
In pipelines using Change Data Capture (CDC) or upserts, compaction is when deduplication actually happens on disk. The engine reads the micro-batches, identifies multiple versions of the same record (based on a primary key), and writes only the latest version to the new compacted file. Superseded versions are discarded, along with any records whose latest version is a delete marker (tombstone).
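
The keep-latest logic can be sketched as follows. This is an in-memory illustration only; a real engine would do the equivalent with a sorted merge or a join over files. The field names `id`, `updated_at`, and `is_deleted` are assumptions about the record layout.

```python
def compact_latest(batches, key="id", version="updated_at", deleted="is_deleted"):
    """Keep the newest version of each record and drop tombstoned keys."""
    latest = {}
    for batch in batches:
        for record in batch:
            current = latest.get(record[key])
            # Later batches win ties, matching arrival order.
            if current is None or record[version] >= current[version]:
                latest[record[key]] = record
    return [r for r in latest.values() if not r.get(deleted, False)]


# Hypothetical micro-batches: record 1 is updated, record 2 is deleted.
batches = [
    [{"id": 1, "price": 10, "updated_at": 1},
     {"id": 2, "price": 5, "updated_at": 1}],
    [{"id": 1, "price": 12, "updated_at": 2}],
    [{"id": 2, "updated_at": 3, "is_deleted": True}],
]

compacted = compact_latest(batches)
print(compacted)
```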
What is Z-ordering in the context of compaction?
Z-ordering is a technique used during compaction to sort the data across multiple columns simultaneously. If you frequently query a dataset by both category and brand, Z-ordering clusters related records together in the Parquet files. This allows the query engine to skip reading entire files or row groups that don't match the filter, drastically speeding up queries.
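A minimal sketch of the idea behind Z-ordering: interleave the bits of two column values into one sort key (a Morton code), then sort by that key. It assumes the column values have already been mapped to small non-negative integers, which real engines handle internally; `z_value` and the 4 × 4 grid are illustrative only.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical integer ranks for (category, brand) on a 4 x 4 grid.
points = [(category, brand) for category in range(4) for brand in range(4)]
ordered = sorted(points, key=lambda p: z_value(p[0], p[1]))

print(ordered[:4])
```

The first four rows in Z-order form a 2 × 2 block covering both low categories and both low brands, so a filter on either column can skip most of the remaining files or row groups. A plain sort on `category` alone would cluster only that one column.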
How often does DataFlirt compact delivered datasets?
It depends on the pipeline's delivery SLA. For near-real-time feeds, we run minor compactions every 15–60 minutes to keep file counts manageable, and a major compaction nightly to enforce Z-ordering and optimal file sizes. For daily batch deliveries, compaction runs once at the end of the crawl before the data is pushed to the client.