
What is a Raw Data Zone?

A raw data zone is the foundational storage layer in a data lake where scraped payloads (HTML, JSON, XML, or binary) are stored exactly as they were fetched, before any parsing or transformation occurs. It acts as an immutable, append-only record of what the target actually returned. If a downstream extraction schema breaks or a business requirement changes, the raw zone lets you replay the pipeline without re-fetching the target, which saves bandwidth and compute and avoids fresh anti-bot exposure.

Data Lake · Bronze Layer · Immutable Storage · ETL · Data Ingestion
// 02 — definitions

Store first,
parse later.

Why dropping raw HTTP responses into object storage before touching them is the only way to build a resilient scraping pipeline.


TL;DR

The raw data zone (often called the bronze layer) stores the exact bytes returned by the target server. It decouples the fetch operation from the extraction logic. When selectors break (and they will), you can fix the parser and backfill missing fields by re-parsing the stored payloads, rather than re-crawling millions of pages and risking IP bans.

01 Definition & structure
The raw data zone is the first landing area in a data pipeline. It is an object storage bucket (like AWS S3 or GCS) where fetched payloads are saved exactly as the target server returned them. It is strictly append-only and immutable. Files are typically partitioned by target domain, year, month, and day to allow for efficient prefix scanning during backfill operations.
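The partitioning scheme described above can be sketched in a few lines. This is a minimal illustration, not a prescribed layout: the function names `raw_key` and `backfill_prefixes`, the target label, and the file extension are assumptions made for the example.

```python
from datetime import date, datetime, timedelta, timezone


def raw_key(target: str, doc_id: str, fetched_at: datetime, ext: str = "html.zst") -> str:
    """Build a key of the form <target>/YYYY/MM/DD/<doc_id>_<epoch>.<ext>."""
    ts = fetched_at.astimezone(timezone.utc)
    return f"{target}/{ts:%Y/%m/%d}/{doc_id}_{int(ts.timestamp())}.{ext}"


def backfill_prefixes(target: str, start: date, end: date) -> list[str]:
    """Return one day-level prefix per day in [start, end] for prefix scans."""
    days = (end - start).days
    return [f"{target}/{start + timedelta(days=i):%Y/%m/%d}/" for i in range(days + 1)]


fetched = datetime(2026, 5, 19, 8, 12, 44, tzinfo=timezone.utc)
print(raw_key("shop_example", "item_9921", fetched))
print(backfill_prefixes("shop_example", date(2026, 5, 18), date(2026, 5, 20)))
```

Because the date sits near the front of the key, a backfill only has to list the prefixes for the days it cares about instead of scanning the whole bucket.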
02 Decoupling fetch from extract
In amateur scripts, fetching a page and parsing it happen in the same function. If the parser fails, the script crashes, and the fetched data is lost. By writing to a raw data zone first, you decouple the expensive, risky operation (fetching via proxies) from the cheap, deterministic operation (parsing the DOM). If extraction fails, the raw HTML is safely in object storage, ready to be re-processed once the selector is fixed.
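A minimal sketch of that decoupling, assuming an in-memory dictionary in place of the bucket and a plain list in place of the queue. The `fetch` stub, both parsers, and the class names in the HTML are hypothetical and exist only to show a parser failing and then being replayed.

```python
import re

raw_zone: dict[str, bytes] = {}  # stand-in for an object storage bucket
extract_queue: list[str] = []    # stand-in for a message queue


def fetch(url: str) -> bytes:
    """Hypothetical fetcher; a real one would go out through proxies."""
    return b'<html><span class="cost">19.99</span></html>'


def ingest(url: str, key: str) -> None:
    """Store the raw bytes first, then hand only the key to extraction."""
    raw_zone[key] = fetch(url)
    extract_queue.append(key)


def parse_v1(html: bytes) -> dict:
    """Outdated parser that still expects the old class name."""
    match = re.search(rb'class="price">([\d.]+)<', html)
    if match is None:
        raise ValueError("selector 'price' not found")
    return {"price": float(match.group(1).decode())}


def parse_v2(html: bytes) -> dict:
    """Fixed parser, written after the site renamed the class."""
    match = re.search(rb'class="cost">([\d.]+)<', html)
    if match is None:
        raise ValueError("selector 'cost' not found")
    return {"price": float(match.group(1).decode())}


ingest("https://shop.example.com/item/9921", "shop_example/2026/05/19/item_9921.html")

failed = []
for key in extract_queue:
    try:
        parse_v1(raw_zone[key])
    except ValueError:
        failed.append(key)  # the payload is still sitting in the raw zone

# Replay the failed keys with the fixed parser; no network call is made.
records = [parse_v2(raw_zone[key]) for key in failed]
print(records)
```

The parser failure costs nothing but a retry, because the only thing extraction ever receives is a pointer to bytes that are already stored.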
03 Storage formats and compression
Raw zones rarely store uncompressed text. HTML and JSON are highly repetitive and compress exceptionally well, so modern pipelines compress payloads with zstd or gzip before writing them to object storage. Some architectures also batch many small page responses into WARC (Web ARChive) or NDJSON (Newline Delimited JSON) files of 100 MB or more, which reduces S3 PUT request costs.
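To make the compression and batching ideas concrete, here is a small sketch that uses gzip from the Python standard library as a stand-in for zstd. The record layout (`url`, `body_b64`) and the helper names are assumptions for illustration; base64 is used only so that arbitrary response bytes survive the trip through JSON.

```python
import base64
import gzip
import json


def compress(payload: bytes) -> bytes:
    """gzip from the standard library, standing in for zstd."""
    return gzip.compress(payload, compresslevel=9)


def to_ndjson_batch(pages: list[dict]) -> bytes:
    """Pack many small responses into one compressed NDJSON object."""
    lines = []
    for page in pages:
        record = {
            "url": page["url"],
            # base64 keeps arbitrary response bytes JSON-safe
            "body_b64": base64.b64encode(page["body"]).decode("ascii"),
        }
        lines.append(json.dumps(record))
    return compress(("\n".join(lines) + "\n").encode("utf-8"))


def from_ndjson_batch(blob: bytes) -> list[dict]:
    """Inverse of to_ndjson_batch: recover the exact original bytes."""
    out = []
    for line in gzip.decompress(blob).decode("utf-8").splitlines():
        record = json.loads(line)
        out.append({"url": record["url"], "body": base64.b64decode(record["body_b64"])})
    return out


html = b"<ul>" + b'<li class="product-card"><a href="/item">Item</a></li>' * 500 + b"</ul>"
packed = compress(html)
print(f"{len(html)} -> {len(packed)} bytes ({1 - len(packed) / len(html):.0%} smaller)")

pages = [{"url": f"https://shop.example.com/item/{i}", "body": html} for i in range(3)]
batch = to_ndjson_batch(pages)
assert from_ndjson_batch(batch) == pages  # round trip is lossless
```

The synthetic page here is far more repetitive than real HTML, so the printed ratio is only a demonstration that markup compresses well, not a benchmark. One batch object replaces three separate PUT requests in this example.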
04 How DataFlirt handles it
We treat the raw zone as the ultimate source of truth. Every successful HTTP 200 response is compressed and written to S3 before the extraction worker is even notified. We append rich metadata to the S3 object tags—including the proxy exit IP, the exact timestamp, and the target URL—so that if a client questions a data point weeks later, we can produce the exact HTML document that generated it.
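As a rough sketch of how fetch context can travel with the payload, the helper below builds the keyword arguments for an S3 `PutObject` call. It is an illustration under stated assumptions, not DataFlirt's implementation: the function name and tag names are hypothetical, and placing the URL in user-defined metadata rather than a tag is a choice made here because S3 caps each object at 10 tags, with tag keys up to 128 characters and values up to 256, which a long URL can exceed.

```python
from urllib.parse import urlencode

MAX_TAGS = 10              # S3 limit on tags per object
MAX_TAG_KEY_CHARS = 128    # S3 limit on tag key length
MAX_TAG_VALUE_CHARS = 256  # S3 limit on tag value length


def build_put_kwargs(
    bucket: str,
    key: str,
    body: bytes,
    *,
    fetch_ts: str,
    proxy_ip: str,
    target_url: str,
) -> dict:
    """Assemble PutObject arguments with fetch context attached."""
    tags = {"fetch_ts": fetch_ts, "proxy_ip": proxy_ip}
    if len(tags) > MAX_TAGS:
        raise ValueError("too many tags for one S3 object")
    for name, value in tags.items():
        if len(name) > MAX_TAG_KEY_CHARS or len(value) > MAX_TAG_VALUE_CHARS:
            raise ValueError(f"tag {name!r} exceeds S3 size limits")
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "Tagging": urlencode(tags),              # tags are sent URL-encoded
        "Metadata": {"target-url": target_url},  # assumption: URL kept in metadata
    }


kwargs = build_put_kwargs(
    "df-raw-zone-eu-west-1",
    "shop_example/2026/05/19/item_9921.html.zst",
    b"...compressed bytes...",
    fetch_ts="2026-05-19T08:12:44Z",
    proxy_ip="203.0.113.42",
    target_url="https://shop.example.com/item/9921",
)
print(kwargs["Tagging"])
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed as `s3_client.put_object(**kwargs)`.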
05 The cost of skipping the raw zone
Pipelines without a raw zone suffer from "schema debt." If you only save the fields you thought you needed on day one, you have no way to recover fields you realise you need on day ninety. You are forced to re-scrape the target, which costs money, burns proxy reputation, and only gives you the current state of the target, permanently losing the historical data you originally fetched.
// 03 — storage economics

The cost of
keeping everything.

Storing raw HTML for millions of pages sounds expensive until you compare it to the cost of residential proxy bandwidth required to re-scrape them. DataFlirt models retention based on this ratio.

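Raw storage cost = C_storage = (bytes_per_page ÷ 10^9) × pages × (days ÷ 30) × $0.023 per GB-month
Standard S3 pricing is billed per GB-month, so bytes are converted to GB and days to months. Aggressive zstd compression typically reduces HTML bytes by about 85%. Source: AWS pricing model
Re-scrape cost = C_fetch = pages × proxy_cost_per_req + compute
Fetching a page is typically far more expensive than storing it, especially on premium residential networks. Source: DataFlirt infrastructure economics
Retention ROI = R = (C_fetch × failure_rate) ÷ C_storage
Here failure_rate is the probability that schema drift forces a re-scrape during the retention window. If R > 1, the expected re-scrape cost exceeds the storage cost, so storing the raw payload is cheaper. Source: DataFlirt pipeline planner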
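A worked example helps to see the orders of magnitude. The sketch below treats the S3 price as a per-GB-month rate (so days are converted to months) and computes R as the expected re-scrape cost, meaning fetch cost times the probability of needing a re-scrape, divided by the storage cost. The proxy price of $0.001 per request and the 5% failure rate are hypothetical inputs chosen only for illustration.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def storage_cost(bytes_per_page: float, pages: int, days: float,
                 usd_per_gb_month: float = S3_STANDARD_USD_PER_GB_MONTH) -> float:
    """Cost of keeping `pages` compressed payloads for `days` days."""
    gigabytes = bytes_per_page * pages / 1e9
    return gigabytes * (days / 30) * usd_per_gb_month


def fetch_cost(pages: int, proxy_cost_per_req: float, compute: float = 0.0) -> float:
    """Cost of fetching every page again through proxies."""
    return pages * proxy_cost_per_req + compute


def retention_roi(c_fetch: float, c_storage: float, failure_rate: float) -> float:
    """Expected re-scrape cost divided by storage cost."""
    return c_fetch * failure_rate / c_storage


# One million pages at 31,405 compressed bytes each, retained for 90 days.
c_storage = storage_cost(bytes_per_page=31_405, pages=1_000_000, days=90)
# Hypothetical proxy price: $0.001 per request, compute ignored.
c_fetch = fetch_cost(pages=1_000_000, proxy_cost_per_req=0.001)
# Hypothetical 5% chance that schema drift forces a re-scrape.
r = retention_roi(c_fetch, c_storage, failure_rate=0.05)

print(f"storage: ${c_storage:.2f}  re-scrape: ${c_fetch:.2f}  R = {r:.1f}")
```

Under these assumed inputs, storage costs a little over two dollars while a full re-scrape costs a thousand, so R stays well above 1 even at a low failure rate.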
// 04 — raw ingestion trace

Landing a payload
in the bronze layer.

A scraper worker successfully fetches a product page, compresses the raw HTML, and writes it to the raw data zone before passing the pointer to the extraction queue.

S3 PutObject · zstd compression · append-only
// fetch complete
target.url: "https://shop.example.com/item/9921"
response.status: 200 OK
response.bytes_raw: 248,102

// prepare for raw zone
compress.algo: "zstd"
compress.bytes_out: 31,405 // 87% reduction
metadata.fetch_ts: "2026-05-19T08:12:44Z"
metadata.proxy_ip: "203.0.113.42"

// write to object storage
s3.bucket: "df-raw-zone-eu-west-1"
s3.key: "shop_example/2026/05/19/item_9921_1779178364.html.zst"
s3.put: SUCCESS

// trigger downstream
sqs.publish: "extract_queue" payload: s3.key
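The order of operations in the trace above (compress, write once, then publish the pointer) can be sketched with the standard library alone. This is an illustration under assumptions: a temporary local directory stands in for the bucket, `queue.Queue` stands in for the extraction queue, gzip stands in for zstd, and the function name `land` is hypothetical.

```python
import gzip
import queue
import tempfile
from pathlib import Path


def land(bucket_dir: Path, key: str, raw: bytes, extract_queue: queue.Queue) -> int:
    """Compress and write the payload, then publish its key downstream."""
    blob = gzip.compress(raw)
    path = bucket_dir / key
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" refuses to overwrite an existing file, which keeps the
    # zone append-only: a re-fetch must use a new key.
    with open(path, "xb") as fh:
        fh.write(blob)
    # Only after the write succeeds is extraction told about the payload.
    extract_queue.put(key)
    return len(blob)


with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp)
    q: queue.Queue = queue.Queue()
    key = "shop_example/2026/05/19/item_9921.html.gz"
    page = b"<html><body>" + b"<div class='row'>item</div>" * 200 + b"</body></html>"

    size = land(bucket, key, page, q)
    stored = gzip.decompress((bucket / key).read_bytes())

    try:
        land(bucket, key, b"tampered", q)
        overwritten = True
    except FileExistsError:
        overwritten = False

    queued = q.get_nowait()
    remaining = q.qsize()

print(size, overwritten, queued)
```

The second call fails before anything is queued, so the extraction worker only ever sees keys whose payloads were written exactly once.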
// 05 — raw zone architecture

What makes a
good raw zone.

The raw data zone must be treated as an immutable ledger. If you modify records here, you destroy the pipeline's audit trail. These are the architectural non-negotiables.

AVG COMPRESSION · 85% (zstd)
TYPICAL TTL · 30–90 days
UPDATED · 2026-05-19
01 Immutability
strict append-only · Never update a raw file. Fetch again? Write a new file.
02 Date-based partitioning
s3://.../YYYY/MM/DD/ · Crucial for efficient backfilling and lifecycle policies.
03 Aggressive compression
zstd or gzip · Raw HTML is highly compressible; uncompressed storage is waste.
04 Rich metadata tagging
headers, IPs, TS · Store the context of the fetch alongside the payload.
05 Automated lifecycle
TTL to Glacier · Move to cold storage after 30 days to control AWS bills.
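The automated lifecycle point can be expressed as an S3 lifecycle configuration. The sketch below builds the rule as a Python dictionary in the shape S3 expects; the rule ID, the prefix, and the 120-day expiry are hypothetical values, and only the 30-day move to cold storage comes from the list above.

```python
import json


def lifecycle_rules(prefix: str, cold_after_days: int = 30,
                    expire_after_days: int = 120) -> dict:
    """Build a lifecycle configuration: move to Glacier, then expire."""
    if expire_after_days <= cold_after_days:
        raise ValueError("expiry must come after the cold-storage transition")
    return {
        "Rules": [
            {
                "ID": "raw-zone-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": cold_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = lifecycle_rules("shop_example/")
print(json.dumps(config, indent=2))
```

With boto3, a dictionary of this shape could be supplied as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Once the rule is in place, retention is enforced by the storage service rather than by a cron job someone has to remember.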
// 06 — DataFlirt's ingestion

Never throw away
what you might need to parse tomorrow.

At DataFlirt, every pipeline writes to a dedicated raw data zone before the extraction workers even wake up. We compress raw HTML and JSON payloads using zstd, tag them with the exact proxy exit IP, JA3 fingerprint, and timestamp used for the fetch, and partition them by target and day. When a client asks for a new field to be added to their dataset three months into a contract, we don't start from zero. We point the updated extraction schema at the raw zone and re-run it over the stored history.

Raw Object Metadata

S3 object tags and metadata for a single raw HTML payload.

s3.uri s3://df-raw/tgt_88/2026/05/19/doc.zst
size.compressed 31.4 KB
fetch.status 200 OK
fetch.proxy_asn AS7922 · Comcast
fetch.latency 1,204 ms
pii.redacted false
lifecycle.tier Standard · TTL: 30d


// 07 — FAQ

Common
questions.

About data lake architecture, storage costs, compliance, and why decoupling fetch from extraction is mandatory for scale.

What is the difference between a raw data zone and a bronze layer?
They are synonyms. "Bronze layer" is the terminology used in the Medallion Architecture (popularised by Databricks), while "raw data zone" is the traditional data lake term. Both refer to the exact same concept: the landing area for untransformed, immutable source data.
Should we store the raw HTML, or just the JSON we extract from it?
Store the raw HTML. If you only store the extracted JSON, you have permanently lost any data on the page that wasn't in your schema at the exact moment of the fetch. If your schema was wrong, or if business requirements change, you will have to re-scrape the target.
How long should we retain data in the raw zone?
Typically 30 to 90 days in hot storage (S3 Standard), followed by a transition to cold storage (Glacier) or deletion. The retention window should match your longest expected debugging or backfill cycle. Storing raw HTML forever is rarely cost-effective unless you are building an internet archive.
Does storing raw data violate GDPR or CCPA?
It can, if the raw HTML contains Personally Identifiable Information (PII) and you do not have a lawful basis to process it. Because the raw zone is schema-on-read, you cannot easily filter PII before it lands. If scraping targets with mixed public/private data, you must implement strict TTLs (Time To Live) or pre-storage redaction proxies.
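A pre-storage redaction step might look like the sketch below. It is deliberately simplistic and rests on assumptions: it handles only email addresses, the pattern will miss obfuscated or unusual addresses, and the placeholder text and function name are hypothetical. Note that a redacted payload is no longer the exact bytes the server returned, so the fact that redaction happened should be recorded alongside the object.

```python
import re

# Simplistic email pattern; real PII detection needs far more than this.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(payload: bytes) -> tuple[bytes, int]:
    """Replace email addresses and report how many were replaced."""
    return EMAIL.subn(b"[REDACTED_EMAIL]", payload)


html = b'<a href="mailto:jane@example.com">jane@example.com</a>'
clean, hits = redact_emails(html)

# Record the outcome so downstream consumers know the payload was altered.
metadata = {"pii.redacted": "true" if hits else "false"}
print(clean, hits, metadata)
```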
How does DataFlirt handle the storage costs of millions of raw pages?
Aggressive compression and strict lifecycle policies. We use zstd dictionary compression, which routinely achieves 85–90% reduction on HTML payloads. After 14 days, payloads transition to S3 Standard-IA, and after 30 days, they are deleted unless the client specifically contracts for historical archive access.
Can we query the raw data zone directly for analytics?
No. The raw zone is schema-on-read and heavily compressed. Querying it directly with tools like Athena or Presto requires scanning massive amounts of unstructured text, which is slow and expensive. The raw zone exists solely to feed the ETL pipeline that populates your structured (Silver/Gold) tables.