
What is Apache Iceberg?

Apache Iceberg is an open table format for very large analytic datasets. It brings warehouse-style reliability (ACID transactions, schema evolution, and time travel) to raw data lakes on object storage such as S3. For scraping pipelines, it provides the tooling to manage the "many small files" problem and lets writers ingest concurrently without blocking downstream readers, turning a messy bucket of scraped files into a queryable, versioned table.

Table Format · ACID · Data Lakehouse · S3 Storage · Schema Evolution
// 02 — definitions

SQL reliability
on object storage.

How modern data pipelines decouple compute from storage while maintaining transactional guarantees across petabytes of scraped data.


TL;DR

Iceberg is a metadata layer that sits between your compute engine (Spark, Trino, Snowflake) and your storage (S3, GCS). It tracks individual data files rather than directories, enabling atomic commits, schema evolution without rewriting data, and hidden partitioning. It is the backbone of the modern data lakehouse.

01 Definition & structure
Apache Iceberg is not a storage engine or a query engine. It is a metadata format. It maintains a tree of metadata that defines the state of a table at a specific point in time. The structure consists of:
  • Catalog — stores the pointer to the table's current metadata file.
  • Metadata file — JSON file defining table schema, partition spec, and snapshots.
  • Manifest list — Avro file tracking all manifest files for a snapshot.
  • Manifest file — Avro file tracking individual Parquet data files and their min/max stats.
  • Data files — the actual Parquet or ORC files on S3.
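The layers above can be modelled as a small tree. The following is a minimal sketch, not Iceberg's actual classes: the names `DataFile`, `Manifest`, `Snapshot`, `TableMetadata`, and `live_data_files` are hypothetical, and the catalog is assumed to be a plain dictionary from table name to current metadata.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str                # Parquet or ORC object on S3
    record_count: int


@dataclass
class Manifest:
    path: str                # Avro file that lists data files
    data_files: list


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list_path: str  # Avro file that lists manifests
    manifests: list


@dataclass
class TableMetadata:
    path: str                # JSON metadata file
    schema: dict
    current_snapshot_id: int
    snapshots: list

    def current_snapshot(self):
        # The table state is whatever the current snapshot references.
        return next(s for s in self.snapshots
                    if s.snapshot_id == self.current_snapshot_id)


def live_data_files(catalog, table_name):
    """Walk catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = catalog[table_name]
    snapshot = metadata.current_snapshot()
    return [f.path for m in snapshot.manifests for f in m.data_files]
```

Pointing `current_snapshot_id` at an older snapshot changes which files are visible without touching any data file, which is the property the rest of this page relies on.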
02 How it works in practice
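When a query engine like Trino reads an Iceberg table, it does not list the S3 bucket. It asks the catalog for the current metadata file, reads the manifest list for the current snapshot, uses the partition summaries stored there to skip manifests that cannot match the query filter, and then uses the per-file min/max statistics in the remaining manifests to read only the Parquet files required. Planning cost then depends on the number of manifests that match the filter rather than on the total number of files in the table, which replaces a slow, expensive O(N) directory listing with a handful of metadata reads.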
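The pruning step can be sketched as two nested range checks. This is a simplified illustration under stated assumptions: manifests and files are plain dictionaries, each carries a hypothetical `bounds` mapping of column name to `(min, max)`, and the filter is a closed range on one column. Real engines evaluate richer expressions against the Avro metadata.

```python
def overlaps(lo, hi, q_lo, q_hi):
    """True if the stored range [lo, hi] can contain a value in [q_lo, q_hi]."""
    return lo <= q_hi and hi >= q_lo


def plan_scan(manifests, column, q_lo, q_hi):
    """Return only the data file paths whose statistics can match the filter."""
    selected = []
    for manifest in manifests:
        m_lo, m_hi = manifest["bounds"][column]
        if not overlaps(m_lo, m_hi, q_lo, q_hi):
            continue  # skip the whole manifest without opening it
        for data_file in manifest["files"]:
            f_lo, f_hi = data_file["bounds"][column]
            if overlaps(f_lo, f_hi, q_lo, q_hi):
                selected.append(data_file["path"])
    return selected
```

Files whose ranges overlap the filter are still read in full, so pruning works best when data is sorted or clustered on the filtered column.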
03 Hidden partitioning
In older formats like Hive, partitioning was explicit in the directory structure (e.g., /year=2026/month=05/). If you queried by a timestamp column without explicitly filtering the partition column, you triggered a full table scan. Iceberg uses hidden partitioning: the metadata maps the timestamp column to the partition layout automatically. Analysts can query the timestamp naturally, and Iceberg prunes the partitions behind the scenes.
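As a rough illustration of the idea, the sketch below applies a day-granularity transform to a timestamp filter and keeps only the matching partitions. The function names and the dictionary layout (day ordinal to list of file names) are assumptions made for this example, not Iceberg's API.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Map a UTC timestamp to a partition value: whole days since the epoch."""
    return (ts - EPOCH).days


def prune_partitions(partitions, ts_from, ts_to):
    """Keep partitions whose day value falls inside the timestamp filter.

    The caller filters on the timestamp column only; the partition value
    is derived here, so no separate partition column appears in the query.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return {day: files for day, files in partitions.items() if lo <= day <= hi}
```

Because the transform lives in table metadata, the partition granularity can be changed later without rewriting the queries that filter on the timestamp.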
04 How DataFlirt handles it
We treat Iceberg as a first-class delivery sink. Instead of dropping CSVs into an SFTP server, our extraction workers write Parquet files directly to your S3 bucket and commit the metadata to your AWS Glue or REST catalog. We handle the background maintenance — compacting small files, expiring old snapshots, and evolving the schema when target websites change — so your data engineering team doesn't have to build ingestion pipelines.
05 Did you know: Time travel
Because Iceberg creates a new snapshot for every commit and does not immediately delete old data files, you can query the table exactly as it looked at a specific timestamp in the past, for as long as the corresponding snapshot has not been expired. This is invaluable for machine learning reproducibility, auditing scraped data anomalies, or rolling back a bad ingestion job with a single SQL command.
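Resolving a timestamp to a snapshot is a lookup in the snapshot log. A minimal sketch, assuming the log is a list of `(timestamp_ms, snapshot_id)` pairs sorted by time; `snapshot_as_of` is a hypothetical helper, not a library call.

```python
from bisect import bisect_right


def snapshot_as_of(snapshot_log, ts_ms):
    """Return the id of the latest snapshot committed at or before ts_ms.

    snapshot_log: list of (timestamp_ms, snapshot_id), sorted ascending.
    """
    times = [t for t, _ in snapshot_log]
    index = bisect_right(times, ts_ms)
    if index == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return snapshot_log[index - 1][1]
```

A rollback is the same lookup followed by pointing the table back at the returned snapshot id.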
// 03 — storage math

Why file tracking
beats directory listing.

Iceberg's performance advantage comes from eliminating O(N) directory listing operations on object storage. DataFlirt's delivery layer uses this to commit millions of scraped records per hour without degrading query speed.

Query Planning Time = Tplan ≈ O(M) metadata reads, where M = manifests matching the filter
Hive relies on O(N) directory listings. Iceberg reads a single manifest list, then only the manifests that can match the query. Iceberg Spec v2
Storage Footprint = S = Data + Manifests + Snapshots
Snapshots retain deleted or overwritten data files until the snapshots are expired and the unreferenced files are removed. Storage Architecture
DataFlirt Delivery Latency = L = Write Parquet + Atomic Commit
Typically < 2 seconds for a 100k record batch to become queryable. Internal SLO
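The first two relations can be turned into back-of-the-envelope arithmetic. This sketch assumes S3's limit of 1,000 keys per list request and counts one read each for the metadata file and the manifest list; the function names and the example numbers are illustrative only.

```python
import math

S3_LIST_PAGE_SIZE = 1000  # a single S3 list request returns at most 1,000 keys


def directory_listing_requests(total_files):
    """Requests needed to discover every file by listing the prefix."""
    return math.ceil(total_files / S3_LIST_PAGE_SIZE)


def iceberg_planning_reads(matching_manifests):
    """One metadata file + one manifest list + each manifest that survives pruning."""
    return 1 + 1 + matching_manifests


def storage_footprint(data_bytes, manifest_bytes, snapshot_retained_bytes):
    """S = Data + Manifests + Snapshots, all in bytes."""
    return data_bytes + manifest_bytes + snapshot_retained_bytes
```

For a table with one million files, `directory_listing_requests(1_000_000)` gives 1,000 sequential list calls, while a filter that matches 12 manifests needs 14 metadata reads regardless of table size.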
// 04 — iceberg commit trace

Atomic ingestion of
scraped records.

Trace of a DataFlirt pipeline committing a batch of 250,000 scraped product records to an Iceberg table on S3. Downstream readers see nothing until the final catalog swap.

Spark 3.4 · Iceberg 1.4 · S3a
// write data files (invisible to readers)
write.parquet: "s3://df-lake/products/data/0001.parquet"
write.parquet: "s3://df-lake/products/data/0002.parquet"

// generate metadata
manifest.create: "s3://df-lake/products/metadata/snap-123-m1.avro"
manifest_list.create: "s3://df-lake/products/metadata/snap-123.avro"

// atomic commit
catalog.swap: "v14.metadata.json" -> "v15.metadata.json"
commit.status: OK

// table metrics updated
records.added: 250,000
files.added: 2
snapshot.id: 849201938471
downstream.visibility: LIVE
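The catalog swap in the trace is a compare-and-swap on the metadata pointer. The sketch below is an in-memory stand-in under stated assumptions: `InMemoryCatalog` and `commit` are hypothetical names, and a real catalog (Glue, REST, Hive) provides the atomic swap through its own mechanism rather than a Python lock.

```python
import threading


class InMemoryCatalog:
    """Holds one metadata pointer and swaps it atomically."""

    def __init__(self, initial_pointer):
        self._pointer = initial_pointer
        self._lock = threading.Lock()

    def current(self):
        return self._pointer

    def swap(self, expected, new):
        """Replace the pointer only if nobody else committed in the meantime."""
        with self._lock:
            if self._pointer != expected:
                return False
            self._pointer = new
            return True


def commit(catalog, build_metadata, max_retries=3):
    """Optimistic commit: rebuild metadata on the latest base and retry on conflict."""
    for _ in range(max_retries):
        base = catalog.current()
        new_pointer = build_metadata(base)  # data files are already written
        if catalog.swap(base, new_pointer):
            return new_pointer
    raise RuntimeError("commit failed after retries")
```

Readers that loaded `v14.metadata.json` before the swap keep a consistent view of the old snapshot; readers that arrive after it see all 250,000 records at once.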
// 05 — performance bottlenecks

Where Iceberg
tables degrade.

Iceberg is highly optimized, but poor write patterns can degrade read performance. Ranked by frequency of occurrence in client data lakes.

TABLES MONITORED · 1,200+
COMPACTION · Daily
UPDATED · 2026-05-19
01 Small file problem
I/O bound · Streaming inserts create thousands of tiny Parquet files
02 Unoptimized metadata
Plan bound · Manifest lists grow too large without rewriting
03 Missing sort orders
Scan bound · Data skipping fails because min/max stats overlap
04 Orphan files
Cost bound · Failed jobs leave unreferenced data files in S3
05 Catalog bottlenecks
Lock bound · Glue or Hive metastore throttling atomic commits
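Orphan files, for example, can be found by comparing what is in the bucket against what the table metadata references. A minimal sketch with hypothetical inputs: `listed` maps each object path to its last-modified time in milliseconds, and `referenced` is the set of paths reachable from any retained snapshot.

```python
def find_orphans(listed, referenced, now_ms, min_age_ms):
    """Return paths that no snapshot references and that are old enough to delete.

    The age guard matters: a file written by a job that has not committed
    yet is also unreferenced, and deleting it would break that commit.
    """
    return sorted(
        path
        for path, modified_ms in listed.items()
        if path not in referenced and now_ms - modified_ms >= min_age_ms
    )
```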
// 06 — delivery architecture

Zero-ETL delivery,
straight to your data warehouse.

Traditional scraping pipelines deliver CSV or JSON files, forcing the data engineering team to build brittle ETL pipelines to ingest them. DataFlirt delivers directly into your Iceberg catalog. We handle the compaction, the schema evolution, and the atomic commits. Your analysts just run SELECT * FROM dataflirt.scraped_products in Snowflake or Athena and get the freshest data instantly.

Iceberg Delivery Sync

Live metrics from a continuous product pricing feed.

catalog.type AWS Glue
table.name df_retail.pricing_live
commit.cadence 15 minutes
schema.version v12 (evolved)
compaction.status optimal
orphan.files 0
last.snapshot 849201938471

// 07 — FAQ

Common
questions.

About table formats, schema evolution, compute engines, and how DataFlirt manages Iceberg delivery at scale.

What is the difference between Iceberg and Delta Lake?
Both solve the same problem: bringing ACID transactions to data lakes. Delta Lake is also open source but is most closely associated with Databricks, and it records changes in a JSON transaction log with Parquet checkpoints. Iceberg keeps table metadata in JSON files, tracks data files through Avro manifests, and is often preferred for multi-engine environments (Trino, Snowflake, Athena). We support both, but Iceberg is our default for vendor-neutral delivery.
Can I query Iceberg tables with Snowflake or BigQuery?
Yes. Both Snowflake and BigQuery can read Iceberg tables that live outside their internal storage. You point them at the Iceberg catalog (or metadata file), and they query the S3/GCS data directly without copying it. This avoids paying to store a second copy of the data while keeping the compute power of the warehouse.
How does schema evolution work without rewriting data?
Iceberg tracks columns by unique IDs, not by name. If you rename a column, the ID stays the same. If you drop a column and add a new one with the same name, it gets a new ID. Old data files simply return null for new columns, and new data files omit dropped columns. No data rewriting is required.
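The ID-based resolution can be sketched as a projection from a file's schema onto the table's current schema. The schemas here are hypothetical dictionaries of field ID to column name, and `read_row` is an illustrative helper rather than a library function.

```python
def read_row(file_row, file_schema, table_schema):
    """Project a row from an old data file onto the current table schema.

    file_schema / table_schema: dict of field_id -> column name.
    file_row: dict keyed by the column names used when the file was written.
    """
    values_by_id = {fid: file_row.get(name) for fid, name in file_schema.items()}
    # Columns are matched by ID; IDs missing from the file resolve to None.
    return {name: values_by_id.get(fid) for fid, name in table_schema.items()}
```

With a file written as `{1: "sku", 2: "cost"}`, renaming field 2 to `price` still returns the old values under the new name, while dropping `cost` and re-adding it as field 4 returns `None`, because the old file holds no data for ID 4.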
How does DataFlirt handle GDPR deletions in Iceberg?
Iceberg supports row-level deletes using merge-on-read (MOR) or copy-on-write (COW). When a deletion request is processed, we write a delete file that masks the record from readers immediately. During our nightly compaction jobs, the underlying Parquet files are rewritten without the deleted rows. The original files stay reachable through older snapshots until those snapshots are expired, so physical removal is complete only after snapshot expiry. The live pipeline keeps running throughout.
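Merge-on-read can be illustrated with position deletes: a delete file lists `(data file path, row position)` pairs, and readers filter those rows out at scan time. The sketch below is a simplified model with hypothetical function names, with rows held as an in-memory list.

```python
def read_with_deletes(rows, data_path, position_deletes):
    """Merge-on-read: hide deleted positions while the data file stays unchanged."""
    deleted = {pos for path, pos in position_deletes if path == data_path}
    return [row for index, row in enumerate(rows) if index not in deleted]


def compact(rows, data_path, position_deletes):
    """Compaction: materialize the surviving rows as the contents of a new file.

    After this rewrite, the delete entries for data_path are no longer needed.
    """
    return read_with_deletes(rows, data_path, position_deletes)
```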
Do I need Spark to use Iceberg?
No. While Spark is the most common engine for writing to Iceberg, you can write to it using Flink, Trino, or even pure Python via PyIceberg. For reading, almost every modern query engine supports Iceberg natively.
How do you handle the small file problem from continuous scraping?
Continuous scraping generates thousands of small files, which kills query performance. DataFlirt runs asynchronous compaction jobs. We ingest data rapidly in small batches, and a background process merges those small Parquet files into optimized 128 MB chunks and commits the result as a new snapshot, all without locking the table for readers.
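Planning which files to merge is a bin-packing problem. The following is a simple greedy sketch, not DataFlirt's or Iceberg's actual planner: `plan_compaction` is a hypothetical function, `file_sizes` maps path to size in bytes, and the 128 MB target comes from the answer above.

```python
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group small files so each group's total stays at or under the target.

    Files already at or above the target are left alone, and a group of
    one file is not worth rewriting, so it is dropped from the plan.
    """
    groups, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items(), key=lambda item: item[1], reverse=True):
        if size >= target:
            continue
        if current and current_size + size > target:
            if len(current) > 1:
                groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups
```

Each group is then rewritten into one larger file and committed as a single snapshot, so readers switch from the small files to the merged file atomically.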