
What is the Bronze Layer?

The bronze layer is the raw, append-only landing zone in a medallion data architecture where fetched content is stored exactly as it was received. In scraping pipelines, this means saving the raw HTML, JSON, or XML payloads alongside fetch metadata before any extraction or parsing occurs. It acts as an immutable historical record, allowing you to replay extraction logic when selectors break without needing to re-fetch the target.

Medallion Architecture · Data Lake · Raw Data · Immutability · Replayability
// 02 — definitions

Store first,
parse later.

Why dropping raw HTTP responses into an immutable object store is the only way to build a resilient scraping pipeline.


TL;DR

The bronze layer is the foundation of the medallion architecture (Bronze, Silver, Gold). For web scraping, it stores the exact bytes returned by the target server. If your downstream extraction schema drifts or a CSS selector fails, the bronze layer guarantees you can backfill the missing data without executing another HTTP request.

01 Definition & structure
In a medallion data architecture, the bronze layer is the initial landing zone for all ingested data. It is strictly append-only and immutable. In the context of web scraping, this means storing the raw HTTP response body (HTML, JSON, XML) and its associated metadata (headers, timestamp, target URL) exactly as it was received from the target server, before any parsing or schema validation takes place.
02 How it works in practice
When a crawler fetches a page, it does not extract data immediately. Instead, it serializes the raw response into a record (for example, a JSON envelope containing a Base64-encoded HTML string), compresses it, and writes it to an object store such as Amazon S3. A separate asynchronous process then reads the bronze record, applies CSS selectors or JSON paths to extract the desired fields, and writes the structured output to the silver layer.
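The write path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DataFlirt's implementation: it writes to a local directory instead of S3, uses the standard library's gzip as a stand-in for zstd, and the function names and the `brz_` ID scheme are hypothetical.

```python
import base64
import gzip
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def build_bronze_record(url: str, status: int, headers: dict, body: bytes) -> dict:
    """Wrap a raw response in a JSON-serializable envelope without parsing it."""
    return {
        "record_id": f"brz_{uuid.uuid4().hex[:8]}",  # hypothetical ID scheme
        "fetch_timestamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "response_headers": headers,
        # Base64 keeps arbitrary bytes intact inside a JSON string.
        "raw_payload": base64.b64encode(body).decode("ascii"),
    }


def write_bronze_record(record: dict, root: Path) -> Path:
    """Compress the envelope and write it once; never overwrite (append-only)."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{record['record_id']}.json.gz"
    data = gzip.compress(json.dumps(record).encode("utf-8"))
    with open(path, "xb") as fh:  # "x" mode raises FileExistsError on a rewrite
        fh.write(data)
    return path


def read_bronze_body(path: Path) -> bytes:
    """Return the exact bytes that were originally fetched."""
    record = json.loads(gzip.decompress(path.read_bytes()))
    return base64.b64decode(record["raw_payload"])
```

The extraction worker would call something like `read_bronze_body` and parse the result on its own schedule. Opening the file in exclusive-create mode is one simple way to approximate append-only semantics locally; an object store would typically enforce this with versioning or write-once policies instead.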
03 The replayability advantage
Websites change constantly. If a target updates its DOM structure, your extraction logic will fail, resulting in missing data. If you only store the extracted fields, that data is lost forever. By maintaining a bronze layer, you can update your extraction script and simply "replay" the raw HTML from the bronze storage. You recover the missing data without having to send new HTTP requests to the target.
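A replay can be illustrated with a toy example. The records, markup, and extractor functions below are hypothetical, and regular expressions stand in for the CSS selectors a real pipeline would use; the point is only that the second extractor runs against stored HTML, with no network access.

```python
import re

# Hypothetical bronze records: raw HTML captured before and after a redesign.
BRONZE = [
    {"record_id": "brz_001", "html": '<span class="price">499</span>'},
    {"record_id": "brz_002", "html": '<div data-price="549"></div>'},
]


def extract_v1(html: str):
    """Original extractor: only understands the old markup."""
    m = re.search(r'<span class="price">(\d+)</span>', html)
    return int(m.group(1)) if m else None


def extract_v2(html: str):
    """Updated extractor: falls back to the new data-price attribute."""
    price = extract_v1(html)
    if price is not None:
        return price
    m = re.search(r'data-price="(\d+)"', html)
    return int(m.group(1)) if m else None


def replay(records, extractor):
    """Re-run an extractor over stored HTML; no HTTP requests are made."""
    return {r["record_id"]: extractor(r["html"]) for r in records}


before = replay(BRONZE, extract_v1)  # brz_002 is missing (None)
after = replay(BRONZE, extract_v2)   # brz_002 is recovered from storage
```

Without the stored HTML, the `None` for `brz_002` would be permanent: the page as it looked at fetch time could not be recovered later.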
04 How DataFlirt handles it
We decouple fetching from extraction entirely. Our fetch fleet writes exclusively to the bronze layer using high-throughput, compressed streams. Our extraction workers then consume these streams. This architecture allows us to scale fetching and extraction independently, guarantees zero data loss during schema drift, and drastically reduces our proxy bandwidth costs by eliminating the need to re-crawl targets during pipeline debugging.
05 Cost vs. value of raw storage
A common misconception is that storing raw HTML for millions of pages is prohibitively expensive. In reality, cloud object storage is cheap (S3 Standard is priced at roughly $0.023 per GB-month), and compression typically shrinks raw HTML by 80-90%. The cost of storing a month's worth of compressed bronze data is almost always lower than the proxy bandwidth and compute costs required to re-scrape a target after an extraction failure.
// 03 — the storage model

Calculating the cost
of raw retention.

Storing raw HTML is cheap; re-fetching it is expensive. DataFlirt models bronze layer retention based on the proxy and compute cost of a cache miss versus object storage costs.

Bronze storage cost: C_storage = (bytes_per_record × records ÷ 10^9) × retention_months × $0.023
S3 Standard pricing per GB-month. Compression typically reduces the stored volume by 85%. Source: AWS Pricing Model
Re-fetch penalty: C_refetch = proxy_cost + compute_cost + block_risk
The cost incurred if you have to hit the target again because you didn't save the raw HTML. Source: DataFlirt Pipeline Economics
DataFlirt retention ratio: R = bronze_volume / gold_volume
Typically 100:1 in HTML scraping: 2 MB of raw DOM yields 20 KB of structured JSON. Source: Internal Telemetry
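As a worked example, the three quantities can be written as small Python functions. This is a sketch under assumptions: it uses decimal gigabytes, applies the 85% compression saving before pricing, and the workload of 10 million pages at 2 MB each is a hypothetical input, not a DataFlirt figure.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # rate quoted above
BYTES_PER_GB = 10**9                  # assumption: decimal gigabytes


def bronze_storage_cost(bytes_per_record, records, retention_months,
                        compression_saving=0.85):
    """Storage cost in USD for compressed bronze data held for N months."""
    if not 0 <= compression_saving < 1:
        raise ValueError("compression_saving must be in [0, 1)")
    stored_gb = bytes_per_record * records * (1 - compression_saving) / BYTES_PER_GB
    return stored_gb * retention_months * S3_STANDARD_USD_PER_GB_MONTH


def refetch_penalty(proxy_cost, compute_cost, block_risk):
    """Cost in USD of hitting the target again; all three inputs are estimates."""
    return proxy_cost + compute_cost + block_risk


def retention_ratio(bronze_bytes, gold_bytes):
    """How many bytes of raw data sit behind each byte of structured output."""
    return bronze_bytes / gold_bytes


# Hypothetical workload: 10 million pages of 2 MB each, kept for one month.
monthly = bronze_storage_cost(2_000_000, 10_000_000, retention_months=1)
ratio = retention_ratio(2_000_000, 20_000)
```

Under these assumptions, 20,000 GB of raw HTML compresses to about 3,000 GB, which costs roughly $69 per month to store. The re-fetch penalty for the same pages depends on proxy pricing and block risk, which vary too much by target to give a meaningful default here.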
// 04 — bronze ingestion

Writing a raw payload
to the data lake.

A scraper worker successfully fetches a product page and writes the raw HTML and fetch metadata to the bronze layer in S3 before passing the reference to the extraction queue.

S3 PutObject · zstd compression · append-only
edge.dataflirt.io — live
CAPTURED
// worker-042 fetch complete
target.url: "https://target-ecommerce.in/p/12345"
fetch.status: 200 OK
fetch.latency_ms: 842

// prepare bronze record
record.id: "brz_8f9a2b1c"
record.timestamp: "2026-05-19T08:14:22Z"
payload.raw_bytes: 2,410,112
payload.compressed_bytes: 341,020 // zstd level 3

// write to object store
s3.bucket: "df-lakehouse-bronze-ap-south-1"
s3.key: "inbound/ecommerce/2026/05/19/brz_8f9a2b1c.json.zst"
s3.put_status: 200 OK

// trigger downstream
event.publish: "bronze_record_created"
extraction_queue: ACKNOWLEDGED
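The object key in the log follows a date-partitioned layout, and the two payload sizes imply a compression saving. Both can be reproduced with a short sketch; the function names and the `vertical` parameter are hypothetical, inferred from the key shown above rather than from any documented DataFlirt convention.

```python
from datetime import datetime, timezone


def bronze_key(vertical: str, record_id: str, fetched_at: datetime) -> str:
    """Build a date-partitioned key such as inbound/<vertical>/YYYY/MM/DD/<id>.json.zst."""
    ts = fetched_at.astimezone(timezone.utc)  # partition by UTC date
    return f"inbound/{vertical}/{ts:%Y/%m/%d}/{record_id}.json.zst"


def compression_saving(raw_bytes: int, compressed_bytes: int) -> float:
    """Fraction of the raw size removed by compression."""
    return 1 - compressed_bytes / raw_bytes


key = bronze_key(
    "ecommerce",
    "brz_8f9a2b1c",
    datetime(2026, 5, 19, 8, 14, 22, tzinfo=timezone.utc),
)
saving = compression_saving(2_410_112, 341_020)  # values from the log above
```

With the values from the log, `key` matches the `s3.key` line and `saving` comes out just under 86%, in line with the 80-90% range usually quoted for HTML. Partitioning by date keeps replays of a specific time window cheap, because a worker can list a single prefix instead of scanning the whole bucket.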
// 05 — bronze layer metadata

What gets saved
alongside the bytes.

A bronze record isn't just the HTML. To replay a scrape accurately or debug a block, you need the exact context of the fetch. These are the metadata fields attached to every bronze payload.

FORMAT: JSON Lines / Parquet
COMPRESSION: Zstandard (zstd)
MUTABILITY: Append-only
01 Raw Response Body
The actual payload · HTML, JSON, or XML exactly as returned by the server.

02 Fetch Timestamp
Temporal context · Crucial for point-in-time replays and time-series ordering.

03 Target URL & Method
Request context · The exact endpoint hit, including query parameters.

04 Response Headers
Server metadata · Captures caching directives, set-cookies, and server types.

05 Proxy & Exit Node IP
Network context · Used to debug geo-blocks or proxy-specific cloaking.
// 06 — our architecture

Never scrape the same page twice,

unless the data has actually changed.

DataFlirt treats the bronze layer as the ultimate source of truth. Our extraction workers don't talk to the internet; they read from the bronze layer. When a client requests a new field that wasn't in the original schema, we don't need to launch a new crawl. We just deploy the updated parser and replay the last 30 days of bronze records. Storage is cheap; residential proxy bandwidth and target goodwill are not.
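Selecting "the last 30 days" for a replay is straightforward when keys are partitioned by date. The sketch below assumes a key layout of `inbound/<vertical>/YYYY/MM/DD/`, as in the ingestion example earlier on this page; the function name is hypothetical.

```python
from datetime import date, timedelta


def replay_prefixes(vertical: str, end: date, days: int = 30) -> list:
    """List the date-partition prefixes covering the replay window, newest first."""
    return [
        f"inbound/{vertical}/{(end - timedelta(days=i)):%Y/%m/%d}/"
        for i in range(days)
    ]


prefixes = replay_prefixes("ecommerce", date(2026, 5, 19))
```

A replay job would list the objects under each prefix and feed them to the updated parser. The target site is never contacted.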

Bronze Record Schema

The standard envelope for a raw fetch event in our data lake.

record_id        UUIDv7
job_id           crawl_batch_992
fetch_timestamp  ISO 8601 UTC
request_meta     URL, Method, Headers
response_meta    Status, Headers, Latency
raw_payload      Base64 Encoded Bytes
compression      zstd
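One way to model this envelope in code is a frozen dataclass, which mirrors the immutability of the bronze layer at the object level. The field names come from the schema above; the class name, its methods, and the sample values are assumptions for illustration. The sketch takes `record_id` as an input rather than generating a UUIDv7, and it does not itself compress anything.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class BronzeRecord:
    record_id: str        # UUIDv7 in the schema; supplied by the caller here
    job_id: str
    fetch_timestamp: str  # ISO 8601, UTC
    request_meta: dict    # URL, method, headers
    response_meta: dict   # status, headers, latency
    raw_payload: str      # Base64-encoded bytes
    compression: str = "zstd"  # codec applied when the record is stored

    @classmethod
    def from_response(cls, record_id, job_id, fetch_timestamp,
                      request_meta, response_meta, body: bytes):
        """Encode the raw body so it survives JSON serialization unchanged."""
        payload = base64.b64encode(body).decode("ascii")
        return cls(record_id, job_id, fetch_timestamp,
                   request_meta, response_meta, payload)

    def to_json_line(self) -> str:
        """Serialize as a single JSON Lines row."""
        return json.dumps(asdict(self), separators=(",", ":"))

    def body(self) -> bytes:
        """Recover the exact bytes that were fetched."""
        return base64.b64decode(self.raw_payload)


record = BronzeRecord.from_response(
    "example-record-id",  # placeholder, not a real UUIDv7
    "crawl_batch_992",
    "2026-05-19T08:14:22Z",
    {"url": "https://target-ecommerce.in/p/12345", "method": "GET", "headers": {}},
    {"status": 200, "headers": {}, "latency_ms": 842},
    b"<html><body>raw page</body></html>",
)
line = record.to_json_line()
```

Base64 inflates the payload by about a third before compression, which is the price of keeping arbitrary bytes inside a JSON string. A Parquet writer with a binary column would avoid that overhead.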


// 07 — FAQ

Common
questions.

Common questions about managing raw data, storage costs, and medallion architecture in scraping pipelines.

What is the difference between Bronze, Silver, and Gold layers?
Bronze is the raw, unparsed data exactly as fetched. Silver is the cleaned, parsed, and normalized data (e.g., HTML converted to structured JSON with correct data types). Gold is the business-level aggregation (e.g., daily average price per SKU, ready for analytics).
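The silver-to-gold step can be shown with the "daily average price per SKU" example from the answer above. The rows below are made-up silver records and the function name is hypothetical; a production pipeline would more likely run this as a SQL or dataframe aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical silver rows: parsed, typed, one row per fetched page.
silver = [
    {"sku": "A1", "date": "2026-05-19", "price": 499.0},
    {"sku": "A1", "date": "2026-05-19", "price": 501.0},
    {"sku": "B2", "date": "2026-05-19", "price": 120.0},
]


def to_gold(rows):
    """Aggregate silver rows into a daily average price per SKU."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["sku"], row["date"])].append(row["price"])
    return {key: mean(prices) for key, prices in grouped.items()}


gold = to_gold(silver)
```

Here the two observations for SKU `A1` collapse into a single gold value of 500.0, while the bronze layer still holds both original pages.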
How long should you retain bronze data?
It depends on your storage budget and compliance needs. We typically recommend 30 to 90 days for active pipelines. This provides a sufficient window to catch silent extraction failures, fix the selectors, and replay the raw data to backfill the missing structured records.
Should bronze data be compressed?
Absolutely. Raw HTML is highly compressible. Using algorithms like Zstandard (zstd) or Gzip can reduce storage footprints by 80-90% with minimal CPU overhead during ingestion. We write compressed JSON Lines or Parquet files directly to S3.
How does DataFlirt handle PII in the bronze layer?
If a target is known to contain Personally Identifiable Information (PII) that we are not contracted to process, we apply redaction at the edge before the payload hits the bronze bucket. However, for public catalog data, the bronze layer remains an exact, unredacted mirror of the public web.
Do you store images and media in the bronze layer?
Generally, no. Storing binary media inflates bronze layer costs substantially, because image and video files are far larger than HTML and are already compressed. We extract the image URLs and store those in the bronze metadata. If the client requires the actual image files, they are fetched and routed to a dedicated media bucket, not the primary text-based bronze lake.
What happens when the bronze layer gets too large?
We use cloud lifecycle policies to manage scale. After the active replay window (e.g., 30 days), bronze records are automatically transitioned to colder, cheaper storage tiers like S3 Glacier. After a set compliance period (e.g., 1 year), they are permanently deleted.
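A lifecycle policy of this kind can be expressed as a plain configuration document. The sketch below builds one in the shape that S3 lifecycle rules use; the rule ID, prefix, and function name are hypothetical, and the 30-day and 365-day values are the example figures from the answer above rather than recommendations.

```python
def bronze_lifecycle(prefix: str = "inbound/", replay_days: int = 30,
                     delete_after_days: int = 365) -> dict:
    """Build an S3-style lifecycle configuration for bronze records."""
    if delete_after_days <= replay_days:
        raise ValueError("records must expire after they transition")
    return {
        "Rules": [
            {
                "ID": "bronze-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move to cold storage once the active replay window closes.
                "Transitions": [{"Days": replay_days, "StorageClass": "GLACIER"}],
                # Delete permanently after the compliance period.
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


policy = bronze_lifecycle()
```

With boto3, a dictionary like this would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Retrieval from cold tiers takes longer and carries its own fees, so the transition point should sit after the window in which replays are likely.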