What is a Data Lakehouse?

A data lakehouse is a modern data architecture that merges the cheap, scalable storage of a data lake with the ACID transactions and schema enforcement of a data warehouse. For scraping pipelines, it solves the fundamental tension between storing massive volumes of raw, semi-structured JSON payloads and delivering clean, queryable SQL tables to downstream consumers without maintaining two separate systems.

Data Engineering · Delta Lake · Apache Iceberg · ACID · Medallion Architecture

// 02 — definitions

Best of
both worlds.

Why modern data teams stopped copying data between S3 buckets and Snowflake, and started running SQL directly on their raw storage.

TL;DR

A data lakehouse uses open table formats like Apache Iceberg, Delta Lake, or Apache Hudi to bring warehouse-like reliability to cloud object storage. It eliminates the ETL step of moving scraped data from a data lake into a proprietary warehouse, allowing analysts to query raw extraction outputs minutes after the scraper writes them.

01 Definition & structure

A data lakehouse is an architecture that implements data management features—ACID transactions, schema enforcement, and data governance—directly on top of low-cost cloud object storage (like AWS S3 or Google Cloud Storage).

It relies on open table formats (Apache Iceberg, Delta Lake, Apache Hudi), which maintain a metadata layer that tracks which underlying Parquet files belong to which version of a table. This allows compute engines (Spark, Trino, Athena) to query the data lake as if it were a traditional relational database.
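
As a rough illustration of that metadata layer, the sketch below models a table as a mapping from snapshot id to the set of data files visible in that version. It is a toy model, not the Iceberg or Delta metadata format; `TableMetadata`, `commit_append`, and `files_for` are hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Toy metadata layer: each snapshot id maps to the data files visible in it."""

    # Snapshot 0 is the empty table before any commit.
    snapshots: dict[int, frozenset[str]] = field(
        default_factory=lambda: {0: frozenset()}
    )
    current_snapshot_id: int = 0

    def commit_append(self, new_files: list[str]) -> int:
        """Create a new snapshot that adds files; earlier snapshots stay untouched."""
        previous = self.snapshots[self.current_snapshot_id]
        next_id = self.current_snapshot_id + 1
        self.snapshots[next_id] = previous | frozenset(new_files)
        self.current_snapshot_id = next_id
        return next_id

    def files_for(self, snapshot_id: int | None = None) -> frozenset[str]:
        """Return the files a query engine would scan for a given table version."""
        if snapshot_id is None:
            snapshot_id = self.current_snapshot_id
        return self.snapshots[snapshot_id]


table = TableMetadata()
v1 = table.commit_append(["part-0001.parquet"])
v2 = table.commit_append(["part-0002.parquet", "part-0003.parquet"])

print(sorted(table.files_for(v1)))  # files visible in the first version
print(sorted(table.files_for()))    # files visible in the current version
```

Because a commit only adds a new entry to the snapshot map, readers that started on an older snapshot keep seeing a consistent file list while a writer commits.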

02 The Medallion Architecture

Lakehouses typically organize data into three progressive layers of quality:

Bronze: Raw data exactly as ingested. For scraping, this is the raw HTML or JSON payloads.
Silver: Filtered, cleaned, and augmented data. Types are cast, duplicates are merged, and schemas are enforced.
Gold: Business-level aggregates ready for BI tools and reporting.

Because all three layers live on the same cheap storage, you don't pay a premium to keep the raw Bronze data around for debugging or backfilling.
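
The three layers can be sketched in a few lines of plain Python. This is a minimal illustration under assumed data, not DataFlirt's pipeline: the sample records, the `(sku, scraped_on)` deduplication key, and the `to_silver` / `to_gold` helpers are all hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean

# Bronze: raw payloads exactly as scraped (hypothetical sample records).
bronze = [
    '{"sku": "A1", "category": "shoes", "price": "50.00", "scraped_on": "2026-01-01"}',
    '{"sku": "A1", "category": "shoes", "price": "50.00", "scraped_on": "2026-01-01"}',
    '{"sku": "B2", "category": "shoes", "price": "70.00", "scraped_on": "2026-01-01"}',
    '{"sku": "C3", "category": "bags", "price": "not available", "scraped_on": "2026-01-01"}',
]


def to_silver(raw_rows):
    """Parse, cast types, skip unparseable rows, and deduplicate on (sku, scraped_on)."""
    silver = {}
    for raw in raw_rows:
        row = json.loads(raw)
        try:
            row["price"] = float(row["price"])
        except ValueError:
            continue  # a real pipeline would likely quarantine the row instead
        silver[(row["sku"], row["scraped_on"])] = row
    return list(silver.values())


def to_gold(silver_rows):
    """Aggregate Silver rows to a daily average price per category."""
    prices = defaultdict(list)
    for row in silver_rows:
        prices[(row["scraped_on"], row["category"])].append(row["price"])
    return {key: round(mean(values), 2) for key, values in prices.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Since Bronze is kept unchanged, fixing a bug in `to_silver` only requires re-running it over the same raw rows.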

03 Time Travel and Auditing

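Because lakehouse table formats use immutable data files and track changes via metadata snapshots, they natively support "time travel." You can add a time-travel clause such as FOR SYSTEM_TIME AS OF '2026-01-01' to your SQL query to see exactly what the scraped data looked like on that date. The exact keyword depends on the query engine: Trino, for example, uses FOR TIMESTAMP AS OF, and Spark SQL uses TIMESTAMP AS OF.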

This is critical for scraping pipelines where target sites frequently change their layouts. If a bug in the extraction logic corrupts the Silver layer, you can easily roll back the table state to before the bad pipeline run and re-process the raw Bronze data.
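
Conceptually, both operations are lookups in the snapshot log. The sketch below assumes a hypothetical log of `(commit time, snapshot id)` pairs; real table formats expose this through their own metadata tables and rollback procedures, which are not shown here.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
snapshot_log = [
    (datetime(2025, 12, 30, 6, 0), 101),
    (datetime(2025, 12, 31, 6, 0), 102),
    (datetime(2026, 1, 2, 6, 0), 103),  # suppose this run wrote bad Silver rows
]


def snapshot_as_of(log, as_of):
    """Return the id of the latest snapshot committed at or before `as_of`."""
    commit_times = [committed_at for committed_at, _ in log]
    index = bisect_right(commit_times, as_of)
    if index == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return log[index - 1][1]


def rollback_target(log, bad_snapshot_id):
    """Return the snapshot committed immediately before the bad run."""
    ids = [snapshot_id for _, snapshot_id in log]
    position = ids.index(bad_snapshot_id)
    if position == 0:
        raise LookupError("the bad snapshot has no predecessor")
    return ids[position - 1]


as_of_new_year = snapshot_as_of(snapshot_log, datetime(2026, 1, 1))
restore_to = rollback_target(snapshot_log, 103)
print(as_of_new_year, restore_to)
```

After pointing the table back at the restored snapshot, the Silver layer can be rebuilt from Bronze with the corrected extraction logic.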

04 How DataFlirt handles it

We treat your lakehouse as a first-class delivery destination. Instead of dropping CSVs into an SFTP server, DataFlirt's delivery workers connect directly to your AWS Glue Data Catalog or Hive Metastore. We write Parquet files to your S3 bucket and commit the ACID transactions to your Iceberg or Delta tables.

We handle the MERGE logic to ensure you never get duplicate records, and we safely evolve the schema if we detect new fields on the target site. Your data engineers don't have to write a single line of ingestion code.

05 The end of vendor lock-in

In a traditional data warehouse, your data is stored in a proprietary format. If you want to switch vendors, you have to export everything. In a lakehouse, your data is stored in open-source Parquet files in your own S3 bucket. You can query the exact same table using Databricks today, Snowflake tomorrow, and open-source Trino the next day, without moving a single byte of data.

// 03 — the economics

Why lakehouses
win on cost.

By decoupling compute from storage and eliminating redundant data copies, lakehouses drastically reduce the total cost of ownership for high-volume scraping pipelines. You pay S3 prices for storage, and only pay for compute when you actually query.

Storage cost: C_storage = Volume_TB × $23 per TB per month

Standard S3 pricing. A fraction of the cost of proprietary warehouse storage. (Source: AWS Pricing, 2026)

Compute cost: C_compute = Query_Time × Engine_Rate

Compute is ephemeral. Spin up Trino/Athena, run the query, spin down. (Source: Decoupled Architecture Model)

Data freshness: T_fresh = T_scrape + T_merge

Zero ETL load time. Data is queryable the moment the ACID transaction commits. (Source: DataFlirt Delivery SLO)
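
To make the three formulas concrete, here is a small worked example. The $23 per TB per month figure is the one quoted above; the workload numbers (40 TB stored, 10 query-hours at $5 per hour, a 240-second scrape and a 60-second merge) are made-up assumptions for illustration only.

```python
S3_USD_PER_TB_MONTH = 23.0  # storage figure quoted in the text above


def storage_cost(volume_tb):
    """C_storage = Volume_TB x $23 per TB per month."""
    return volume_tb * S3_USD_PER_TB_MONTH


def compute_cost(query_hours, engine_rate_usd_per_hour):
    """C_compute = Query_Time x Engine_Rate."""
    return query_hours * engine_rate_usd_per_hour


def freshness_seconds(t_scrape, t_merge):
    """T_fresh = T_scrape + T_merge; there is no separate ETL load term."""
    return t_scrape + t_merge


# Hypothetical monthly workload.
monthly_storage = storage_cost(40)        # 40 TB of scraped data
monthly_compute = compute_cost(10, 5.0)   # 10 query-hours at an assumed $5/hour
monthly_total = monthly_storage + monthly_compute
fresh = freshness_seconds(240, 60)

print(monthly_storage, monthly_compute, monthly_total, fresh)
```

In this model the compute term drops to zero in a month with no queries, while the storage term depends only on volume.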

// 04 — lakehouse delivery

Writing scraped records
to an Iceberg table.

A DataFlirt delivery worker appending 50,000 newly scraped product records to a client's Apache Iceberg table. The transaction handles schema evolution on the fly when a new field is detected.

Apache Iceberg · S3 · Schema Evolution

edge.dataflirt.io — live

// init delivery job
target.catalog: "glue_catalog"
target.table: "bronze.ecommerce_products"
records.count: 50,000

// schema validation
schema.current_version: 24
schema.detected_drift: true // new field 'eco_rating' found
iceberg.alter_table: ADD COLUMN eco_rating string
schema.new_version: 25

// write & commit
write.data_files: 4 (parquet)
write.bytes: 14.2 MB
iceberg.manifest_list: "snap-892347129.avro"
transaction.status: COMMITTED

// downstream availability
athena.queryable: true
latency.end_to_end: 1.2s

// 05 — architecture shifts

Why pipelines
are migrating.

The primary drivers for moving scraping workloads from traditional data warehouses to lakehouse architectures, ranked by impact on data engineering teams.

Lakehouse adoption: 78% of new pipelines

ETL reduction: ~40% less code

Updated: 2026-05-19

Storage cost reduction (S3 vs warehouse): Storing petabytes of raw HTML/JSON is unviable in Snowflake.
ACID guarantees on S3 (data reliability): No more dirty reads or partial file writes during scraper crashes.
Schema evolution (flexibility): Target sites change; lakehouses adapt schemas without rewriting data.
Time travel / auditing (reproducibility): Query the table exactly as it looked 30 days ago.
Vendor lock-in avoidance (open formats): Parquet + Iceberg/Delta means you own your data format.

// 06 — DataFlirt delivery

Direct to your lakehouse,
no ETL required.

We don't just dump raw JSON into an S3 bucket and leave the parsing to you. DataFlirt writes directly to your Delta Lake or Apache Iceberg tables. We handle the ACID merges, schema evolution, and deduplication at the edge. When our pipeline finishes a run, your analysts can immediately query the new data using Athena, Trino, or Databricks, with no intermediate ETL steps. You own the storage; we manage the compute that keeps it clean.

Lakehouse Delivery Job

Live trace of a DataFlirt pipeline writing to a client's Delta Lake.

job.id: lakehouse-sync-092
target.format: Delta Lake
target.storage: s3://df-client-prod/gold/
operation: MERGE INTO (upsert)
records.inserted: 12,400
records.updated: 3,102
schema.drift: none
commit.status: SUCCESS

// 07 — FAQ

Common
questions.

Common questions about lakehouse architectures, open table formats, and how DataFlirt integrates with modern data stacks.

What is the difference between a data lake, a data warehouse, and a data lakehouse?

A data lake stores raw, unstructured data cheaply but lacks transactions and schema enforcement. A data warehouse provides strict schemas, fast SQL, and ACID transactions, but is expensive and struggles with unstructured data. A data lakehouse puts a transactional metadata layer (like Iceberg or Delta) on top of a data lake, giving you warehouse features at data lake prices.

Do I need Databricks to have a data lakehouse?

No. While Databricks coined the term and created Delta Lake, the architecture is now entirely open. You can build a lakehouse using Amazon S3, Apache Iceberg, and Amazon Athena without ever touching Databricks. The defining feature is the open table format, not the vendor.

How does DataFlirt handle deduplication in a lakehouse?

We use the MERGE INTO capabilities of Delta Lake and Iceberg. Instead of appending duplicate records and forcing you to deduplicate downstream, our delivery workers perform an ACID upsert. If a scraped product's price changed, we update the existing row; if it's new, we insert it. Your table always reflects the current state.
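
The update-or-insert behaviour can be sketched on an in-memory table. This only illustrates MERGE-style semantics and is not the Delta Lake or Iceberg API; `merge_upsert` and the `sku` key are hypothetical. Note that some engines reject a source batch in which several rows match the same target row, so a real pipeline would typically deduplicate the batch first; this sketch simply tolerates it.

```python
def merge_upsert(target, incoming, key="sku"):
    """Apply MERGE-style upsert semantics to an in-memory table.

    `target` maps key -> row. A row in `incoming` updates the row with the
    same key, or is inserted when no match exists. Returns (inserted, updated).
    """
    inserted = updated = 0
    merged = dict(target)  # build the new state first, then swap it in at the end
    for row in incoming:
        row_key = row[key]
        if row_key in merged:
            if merged[row_key] != row:
                merged[row_key] = row
                updated += 1
        else:
            merged[row_key] = row
            inserted += 1
    target.clear()
    target.update(merged)
    return inserted, updated


table = {"A1": {"sku": "A1", "price": 49.0}}
scraped = [
    {"sku": "A1", "price": 45.0},  # price changed: update the existing row
    {"sku": "B2", "price": 70.0},  # new product: insert
    {"sku": "B2", "price": 70.0},  # duplicate within the batch: no extra row
]
counts = merge_upsert(table, scraped)
print(counts, table)
```

Running the same batch a second time changes nothing, which is the property that makes retries after a failed run safe.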

What is the Medallion architecture in scraping?

It's a data design pattern for lakehouses. Bronze holds the raw, unparsed JSON/HTML exactly as scraped. Silver holds cleaned, typed, and deduplicated records. Gold holds business-level aggregates (e.g., daily average price per category). DataFlirt typically delivers directly to your Silver or Gold layer, bypassing the need for you to build the parsing logic.

How does schema evolution work when a target site changes?

When a target site adds a new field, our extraction layer detects it. Because open table formats support safe schema evolution, our delivery worker issues an ALTER TABLE ADD COLUMN command natively during the write transaction. The new column is added, old records return null for that column, and the pipeline never breaks.
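
A minimal sketch of that behaviour, using the `eco_rating` field from the delivery trace earlier on this page: the schema gains a column, and rows written before the change read back as null (None) for it. The helpers `evolve_schema` and `read_row` are hypothetical and stand in for what the table format does in its metadata.

```python
def evolve_schema(schema, record):
    """Return a new schema with any unseen fields from `record` appended as columns."""
    new_columns = [name for name in record if name not in schema]
    return schema + new_columns


def read_row(schema, stored_row):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {column: stored_row.get(column) for column in schema}


schema_before = ["sku", "price"]
old_row = {"sku": "A1", "price": 49.0}  # written before the site added the field
new_row = {"sku": "B2", "price": 70.0, "eco_rating": "A"}

schema_after = evolve_schema(schema_before, new_row)
print(schema_after)
print(read_row(schema_after, old_row))
print(read_row(schema_after, new_row))
```

The old row is never rewritten; only the schema changes, which is why adding a column does not require touching existing data files.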

Can I query the lakehouse in real-time?

Lakehouses are optimized for micro-batching (e.g., 1-minute to 15-minute intervals), not true millisecond real-time streaming. For 99% of scraping use cases—pricing intelligence, catalog monitoring, alternative data—a 5-minute latency from scrape to queryable lakehouse table is more than sufficient.
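
Micro-batching here means grouping records into fixed time windows and committing each window as one transaction. A rough sketch, assuming 5-minute windows and hypothetical `(timestamp, payload)` records; `window_start` only works for window sizes that divide an hour evenly.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts, minutes=5):
    """Floor a timestamp to the start of its micro-batch window."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def micro_batches(records, minutes=5):
    """Group (timestamp, payload) records by window; each group is one commit."""
    batches = defaultdict(list)
    for scraped_at, payload in records:
        batches[window_start(scraped_at, minutes)].append(payload)
    return dict(batches)


records = [
    (datetime(2026, 1, 1, 12, 1, 30), "a"),
    (datetime(2026, 1, 1, 12, 4, 59), "b"),
    (datetime(2026, 1, 1, 12, 5, 0), "c"),
]
batches = micro_batches(records)
print(batches)
```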

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

Related glossary terms

Apache Iceberg

Delta Lake

Data Lake

Data Warehouse