
What is a Data Lake?

A data lake is a centralized storage repository that holds vast amounts of raw, unstructured, and semi-structured data in its native format until it is needed. For scraping pipelines, it acts as the immutable landing zone for raw HTML, JSON payloads, and binary assets before any extraction or schema enforcement occurs. If you bypass the lake and write scraped data directly to a warehouse, then when your extraction logic inevitably fails, the underlying source data is lost for good.

Storage · Raw Data · S3 / GCS · Schema-on-Read · Data Engineering

// 02 — definitions

Store everything,
parse later.

The architectural pattern that decouples data ingestion from data extraction, ensuring you never lose the raw source material when downstream schemas change.


TL;DR

A data lake stores raw scraped payloads (HTML, JSON, PDFs) exactly as they were fetched. It uses cheap object storage (like AWS S3 or MinIO) and a schema-on-read approach, allowing data engineers to re-run extraction logic on historical data months after the target website has changed.

01 Definition & structure

A data lake is a flat, scalable storage architecture designed to hold vast amounts of raw data in its native format. Unlike a database that requires data to be structured before it is saved, a data lake accepts anything: raw HTML strings, nested JSON API responses, images, and PDFs. It relies on object storage (like AWS S3, Google Cloud Storage, or Azure Blob) and uses a directory-like prefix structure (partitioning) to organize files by date, source, or job ID.

02 The role of the lake in scraping

In a resilient scraping pipeline, the data lake sits immediately after the fetch layer and before the extraction layer. When a crawler downloads a page, the raw bytes are immediately compressed and written to the lake. Only after the write is confirmed does the pipeline attempt to parse the HTML and extract structured fields. This decoupling ensures that if the target site changes its DOM structure and breaks your CSS selectors, you haven't lost the data — you just update the selectors and re-process the raw files sitting in the lake.
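The write-then-parse ordering described above can be sketched in a few lines. This is a minimal illustration, not DataFlirt's implementation: the in-memory `lake` dict stands in for an object store, gzip stands in for whatever compressor the pipeline uses, and `persist_raw`, `extract_title`, and `ingest` are hypothetical names.

```python
import gzip
import hashlib
from typing import Dict, Optional, Tuple


def persist_raw(lake: Dict[str, bytes], url: str, body: bytes) -> str:
    """Compress the raw bytes and store them under a key derived from the URL."""
    key = "raw/" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html.gz"
    lake[key] = gzip.compress(body)
    return key


def extract_title(html: str) -> str:
    """Toy extraction step: raises ValueError when the expected markup is missing."""
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>", start)
    return html[start:end]


def ingest(lake: Dict[str, bytes], url: str, body: bytes) -> Tuple[str, Optional[dict]]:
    """Persist first, parse second. A parse failure never loses the payload."""
    key = persist_raw(lake, url, body)  # the write happens before any parsing
    try:
        html = body.decode("utf-8", errors="replace")
        record = {"url": url, "title": extract_title(html)}
    except ValueError:
        record = None  # extraction broke; the raw bytes are still in the lake
    return key, record
```

When `ingest` returns `None` for the record, the key still points at the compressed payload, so a fixed extractor can be run over it later.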

03 Partitioning strategy

A lake without partitions is just a digital dumping ground. Because object storage doesn't have indexes like a relational database, you must organize files so that extraction workers can find them efficiently. The standard approach is Hive-style partitioning: s3://bucket/raw/domain=example.com/year=2026/month=05/day=19/. This allows a worker to pull exactly one day's worth of data for a specific target without scanning millions of irrelevant objects.
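As a rough sketch of how a worker might build and use that prefix, assuming the `raw/domain=.../year=.../month=.../day=...` layout shown above (the function names are illustrative, and the key list stands in for an object-store listing call):

```python
from datetime import date
from typing import Iterable, List


def partition_prefix(bucket: str, domain: str, day: date) -> str:
    """Build the Hive-style prefix for one target and one day."""
    return (
        f"s3://{bucket}/raw/domain={domain}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def keys_in_partition(keys: Iterable[str], prefix: str) -> List[str]:
    """Select only the objects under one partition prefix.

    A real object store applies this filter server-side when you list by
    prefix; here we filter a plain list to show the effect.
    """
    return [k for k in keys if k.startswith(prefix)]
```

Zero-padding the month and day keeps prefixes sortable as plain strings, which is why the layout uses `month=05` rather than `month=5`.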

04 How DataFlirt handles it

We mandate a raw data lake for every production pipeline. To avoid the S3 small-file performance penalty, our fetch workers don't write individual HTML files. Instead, they stream raw payloads into an in-memory buffer, compress them using zstd, and flush them to the lake as 10MB+ NDJSON blocks. Each line in the block contains the raw HTML, the target URL, the HTTP status code, and the exact fetch timestamp. This makes historical replays incredibly fast and cost-efficient.
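The buffering idea can be illustrated with a small sketch. It is not our production code: gzip from the standard library stands in for zstd so the example has no third-party dependency, the `sink` callable stands in for the object-store write, and the class and field names are assumptions made for illustration.

```python
import gzip
import json
from typing import Callable, List


class BatchBuffer:
    """Accumulate fetch records as NDJSON lines and flush them as one compressed block."""

    def __init__(self, flush_bytes: int, sink: Callable[[bytes], None]) -> None:
        self.flush_bytes = flush_bytes  # uncompressed size that triggers a flush
        self.sink = sink                # e.g. a function that uploads one object
        self._lines: List[bytes] = []
        self._size = 0

    def add(self, url: str, status: int, fetched_at: str, html: str) -> None:
        """Append one record; each line carries the payload plus its fetch metadata."""
        line = json.dumps(
            {"url": url, "status": status, "fetched_at": fetched_at, "html": html}
        ).encode("utf-8") + b"\n"
        self._lines.append(line)
        self._size += len(line)
        if self._size >= self.flush_bytes:
            self.flush()

    def flush(self) -> None:
        """Compress everything buffered so far and hand it to the sink as one block."""
        if not self._lines:
            return
        self.sink(gzip.compress(b"".join(self._lines)))
        self._lines = []
        self._size = 0
```

In a real worker you would also call `flush()` on shutdown and on a timer, so a quiet crawl does not leave payloads sitting in memory.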

05 Schema-on-read vs Schema-on-write

Data lakes operate on schema-on-read. You don't need to know what fields you want to extract when you save the HTML; you only apply the schema (the extraction logic) when you read the data out of the lake. Warehouses operate on schema-on-write, meaning the data must be perfectly formatted before it can be inserted. Scraping is inherently chaotic, making schema-on-read the only viable approach for the ingestion layer.
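A small sketch makes the distinction concrete. The stored records below never change; only the function applied at read time does. The sample markup, the regex-based extractors, and the names `schema_v1`, `schema_v2`, and `read_with` are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Raw records as they were saved: no fields were chosen at write time.
RAW_LAKE: List[Dict[str, str]] = [
    {
        "url": "https://example.com/p/1",
        "html": '<h1>Widget</h1><span class="price">19.99</span>',
    },
]


def schema_v1(html: str) -> dict:
    """First version of the schema: only the product name."""
    return {"name": re.search(r"<h1>(.*?)</h1>", html).group(1)}


def schema_v2(html: str) -> dict:
    """Later version: adds a price field, read from the same stored HTML."""
    price = re.search(r'class="price">([\d.]+)<', html).group(1)
    return {**schema_v1(html), "price": float(price)}


def read_with(schema: Callable[[str], dict], records: List[Dict[str, str]]) -> List[dict]:
    """Apply a schema while reading; the lake contents are untouched."""
    return [{"url": r["url"], **schema(r["html"])} for r in records]
```

Switching from `schema_v1` to `schema_v2` requires no re-fetch and no migration of stored data, which is the practical payoff of schema-on-read.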

// 03 — the economics

Why raw storage
is practically free.

The cost of storing raw HTML is negligible compared to the cost of re-scraping a target or losing historical data. DataFlirt models lake storage as a baseline operational insurance policy.

Storage Cost Efficiency: C_lake = V_gb · $0.023

S3 Standard list price per GB per month. Compressing HTML (gzip/zstd) reduces the stored volume, and therefore this cost, by ~80%. (AWS Pricing Model)
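To put numbers on the formula, here is a back-of-the-envelope calculator. It uses only the two figures quoted above ($0.023 per GB and a ~80% compression reduction) and assumes a flat rate with no request, transfer, or tiering charges; the function name is illustrative.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # rate quoted in the text above


def monthly_lake_cost(raw_gb: float, compression_reduction: float = 0.80) -> float:
    """Estimate monthly storage cost for raw_gb of HTML after compression.

    compression_reduction is the fraction of volume removed by gzip/zstd
    (0.80 means the stored size is 20% of the raw size).
    """
    if not 0.0 <= compression_reduction < 1.0:
        raise ValueError("compression_reduction must be in [0, 1)")
    stored_gb = raw_gb * (1.0 - compression_reduction)
    return stored_gb * S3_STANDARD_USD_PER_GB_MONTH
```

Under these assumptions, 1 TB (1,000 GB) of raw HTML costs about $23 a month uncompressed and about $4.60 a month once compressed.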

Data Recovery Rate: R = re_extracted_records / failed_records

Approaches 1.0 only if the raw payload was persisted in the lake before parsing. (Pipeline Reliability Metric)
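The metric is easy to compute once you know which failed records still have a raw payload in the lake. The sketch below treats "present in the lake" as "re-extractable", which is an optimistic assumption (a fixed extractor could still fail on some payloads), and it returns 1.0 when nothing failed, which is a convention chosen here rather than part of the definition.

```python
from typing import Dict, Iterable


def recovery_rate(failed_ids: Iterable[str], lake: Dict[str, bytes]) -> float:
    """R = re_extracted_records / failed_records.

    A failed record counts as re-extractable only if its raw payload was
    persisted, i.e. its id is a key in the lake.
    """
    failed = list(failed_ids)
    if not failed:
        return 1.0  # convention: nothing failed, so nothing was lost
    recoverable = [record_id for record_id in failed if record_id in lake]
    return len(recoverable) / len(failed)
```

Any record that was parsed on the wire and never written shows up as a missing key and drags R below 1.0.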

DataFlirt Ingestion Latency: L = t_fetch + t_compress + t_write

Time from HTTP 200 to persisted raw object. P99 < 120ms across our fleet. (Internal SLO)

// 04 — lake ingestion trace

Persisting the raw
payload to S3.

A live trace of a DataFlirt worker fetching a retail product page and writing the raw HTML to the data lake before any extraction logic is triggered.

S3 PutObject · zstd compression · event trigger

edge.dataflirt.io — live

CAPTURED

// inbound fetch payload
job.id: "scrape-in-retail-092"
payload.type: "text/html"
payload.size_raw: 2.4 MB

// compression & partitioning
action: "compress"
format: "zstd" // size reduced to 310 KB
partition_key: "target=example.com/year=2026/month=05/day=19/"

// write to lake
destination: "s3://df-raw-lake-prod/retail/"
s3.put_status: 200 OK
write_latency: 42ms

// downstream trigger
event: "s3:ObjectCreated:Put"
trigger: "extraction_worker_pool"
pipeline.state: raw_persisted

// 05 — lake failure modes

Where data lakes
turn into swamps.

Ranked by frequency of occurrence in poorly managed scraping infrastructure. A data lake without strict partitioning and lifecycle rules quickly becomes an unqueryable cost center.

LAKES MONITORED · · · 140+ buckets

AVG OBJECT SIZE · · · 12 MB (batched)

UPDATED · · · · · · 2026-05-19

01

The small file problem

I/O bottleneck · Millions of 50KB HTML files killing S3 GET performance

02

Missing partition keys

scan overhead · Forcing full-bucket scans to find one day's data

03

Absent fetch metadata

context loss · Raw files saved without HTTP status or timestamp tags

04

Storage cost bloat

financial · Failing to transition old raw data to Glacier/Cold

05

Zombie data ingestion

operational · Data fetched and stored but never extracted or used

// 06 — our architecture

Never parse on the wire,
persist the raw bytes first.

DataFlirt's ingestion architecture treats the data lake as the ultimate source of truth. Every HTTP response body we fetch is compressed and written to an S3 bucket partitioned by target and timestamp before any extraction logic touches it. If a client realizes they needed an additional field from a site we scraped six months ago, we don't need a time machine. We just update the schema and replay the extraction workers over the historical raw HTML in the lake.
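A replay of this kind might look like the following sketch. It assumes blocks of compressed NDJSON where each line holds `url`, `status`, `fetched_at`, and `html` fields; those field names, the gzip stand-in for zstd, and the `replay` function itself are assumptions for illustration, not a description of our internal tooling.

```python
import gzip
import json
from typing import Callable, Iterable, Iterator


def replay(blocks: Iterable[bytes], extract: Callable[[str], dict]) -> Iterator[dict]:
    """Run a (new) extractor over historical raw blocks.

    Each block is a compressed NDJSON payload. Records whose fetch did not
    return HTTP 200 are skipped using the stored status code.
    """
    for block in blocks:
        for line in gzip.decompress(block).splitlines():
            record = json.loads(line)
            if record["status"] != 200:
                continue  # stored metadata lets us skip error pages
            yield {
                "url": record["url"],
                "fetched_at": record["fetched_at"],
                **extract(record["html"]),
            }
```

Because the fetch timestamp is stored next to the payload, replayed rows keep their original capture time instead of the time of the replay.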

lake_write_operation

Metadata for a batched write of raw HTML payloads to the DataFlirt raw zone.

bucket df-client-raw-zone

partition target=b2b_mfg/dt=2026-05-19

compression zstd · ratio: 8.1

file_size 14.2 MB · batched

metadata.status 200 OK

lifecycle_rule 90_days_to_glacier

state persisted


// 07 — FAQ

Common
questions.

About data lake architecture, the small file problem, storage economics, and how DataFlirt manages raw scraping payloads at scale.


What is the difference between a data lake and a data warehouse?

A data lake stores raw, unstructured data (like HTML or JSON) using a schema-on-read approach — you figure out the structure when you query it. A data warehouse stores highly structured, tabular data using schema-on-write — the data must fit the table definition before it can be saved. Scraping pipelines need both: the lake for raw payloads, the warehouse for the extracted output.

Why not just extract the data in memory and discard the HTML?

Because selectors break and schemas evolve. If you extract in memory and discard the raw HTML, any parsing error means that data is gone forever. If you persist the HTML to a data lake first, a broken selector just means you fix the code and replay the extraction step over the saved files. It turns catastrophic data loss into a minor processing delay.

How do you handle the 'small file problem' in S3?

Writing millions of 50KB HTML files to S3 destroys read performance and inflates PUT request costs. We use an ingestion buffer that batches raw payloads into larger (10MB–100MB) NDJSON or Parquet blocks before writing them to the lake. Each block contains the raw HTML alongside fetch metadata (URL, timestamp, status code).

Does DataFlirt provide clients access to the raw data lake?

Yes. For enterprise pipelines, we can replicate the raw HTML/JSON payloads directly to your own S3 or GCS buckets alongside the structured delivery feeds. This gives your internal data science teams the ability to run their own custom NLP or extraction models on the source material.

How do you manage storage costs for high-volume scraping?

Strict lifecycle policies. Raw HTML is highly compressible (often 80%+ reduction with zstd). We keep the compressed raw payloads in S3 Standard for 7 to 14 days to support immediate extraction replays and QA. After that, it automatically transitions to S3 Glacier or is deleted, depending on the client's retention requirements.
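A policy like this can be expressed as a lifecycle configuration document. The sketch below only builds the dictionary and makes no AWS call; the `Rules` / `Filter` / `Transitions` / `Expiration` layout follows the shape the S3 lifecycle API accepts, but check it against the current AWS documentation before applying it. The function name, rule ID, and example prefix are hypothetical.

```python
from typing import Optional


def raw_zone_lifecycle(
    prefix: str,
    glacier_after_days: Optional[int] = 14,
    delete_after_days: Optional[int] = None,
) -> dict:
    """Build a lifecycle configuration for raw payloads under one prefix.

    glacier_after_days: move objects to Glacier after this many days (None to skip).
    delete_after_days: expire objects after this many days (None to keep them).
    """
    if glacier_after_days is None and delete_after_days is None:
        raise ValueError("rule needs a transition, an expiration, or both")
    rule: dict = {
        "ID": "raw-zone-retention",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if glacier_after_days is not None:
        rule["Transitions"] = [{"Days": glacier_after_days, "StorageClass": "GLACIER"}]
    if delete_after_days is not None:
        rule["Expiration"] = {"Days": delete_after_days}
    return {"Rules": [rule]}
```

For a client with a delete-only retention policy, `raw_zone_lifecycle("raw/", glacier_after_days=None, delete_after_days=14)` yields a rule with an expiration and no transition.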

What is a Data Lakehouse?

A lakehouse is a modern architecture that adds warehouse-like features (ACID transactions, schema enforcement, time travel) directly on top of cheap data lake storage. It uses open table formats like Apache Iceberg, Delta Lake, or Apache Hudi. It allows you to query the lake with high-performance SQL without moving the data into a separate warehouse.
