
What is Blob Storage?

Blob storage is a highly scalable, flat storage architecture designed for unstructured data like raw HTML payloads, JSON responses, and media files. In a scraping pipeline, it acts as the immutable raw data zone — decoupling the expensive, risky fetch layer from the extraction layer. By saving the raw response before parsing, you guarantee that schema drift or selector failures can be fixed retroactively without re-requesting the target.

Object Storage · Data Lake · S3 · Raw Data Zone · Decoupling
// 02 — definitions

The immutable
raw zone.

Why storing the raw bytes before you parse them is the single most important architectural decision in a scraping pipeline.


TL;DR

Blob storage (like AWS S3 or Google Cloud Storage) stores data as objects in a flat namespace rather than rows in a table. For data engineering teams, it serves as the foundational 'bronze layer': a cheap, elastically scalable sink for raw HTTP responses, so that fetched data is not lost to a parsing error.

01 Definition & structure
Blob storage (Binary Large Object storage) is a system for storing massive amounts of unstructured data. Unlike file systems with hierarchical directories or databases with tables, blob storage uses a flat namespace. Each file (the blob) is stored alongside its metadata and accessed via a unique identifier (the key). Major implementations include Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage.
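To make the flat namespace concrete, here is a minimal in-memory sketch. `FlatStore` and `Blob` are hypothetical names invented for illustration, not part of any cloud SDK; the point is only that a key is one opaque string and that "folders" are a prefix filter over those strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Blob:
    """One stored object: opaque bytes plus user-defined metadata."""
    body: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class FlatStore:
    """Toy stand-in for a bucket (hypothetical, not a real SDK class)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Blob] = {}

    def put(self, key: str, body: bytes, metadata: Optional[Dict[str, str]] = None) -> None:
        # The whole key is a single string; "/" has no special meaning here.
        self._objects[key] = Blob(body, dict(metadata or {}))

    def get(self, key: str) -> Blob:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        # "Directories" are just a string-prefix filter over the flat key space.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatStore()
store.put("target_id=884/date=2026-05-19/a.html", b"<html>a</html>",
          {"url": "https://example.com/a"})
store.put("target_id=884/date=2026-05-20/b.html", b"<html>b</html>")
store.put("target_id=901/date=2026-05-19/c.html", b"<html>c</html>")

keys_for_target = store.list_keys("target_id=884/")
print(keys_for_target)
```

Real object stores expose roughly the same three operations (put, get, list by prefix), which is why key design matters more than directory layout.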
02 The fetch-extract decoupling
In a naive scraping script, fetching the page and extracting the data happen in the same function. If the extraction fails, the script crashes, and the fetched HTML is lost. By introducing blob storage, you decouple these steps. The fetch worker simply gets the HTML and saves it to S3. A separate extraction worker reads from S3. This means you never waste proxy bandwidth or risk rate limits just to retry a broken CSS selector.
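The split between the two workers can be sketched as follows. This is an illustration under stated assumptions, not a production design: a dict stands in for the bucket, a deque for the queue, `fake_fetch` for the HTTP client, and a regular expression for a real CSS selector. All of those names are hypothetical.

```python
import re
from collections import deque

raw_zone = {}            # stand-in for the bucket: key -> raw bytes
extract_queue = deque()  # stand-in for the message queue
fetch_calls = 0          # counts how often the target is actually hit


def fake_fetch(url):
    """Hypothetical fetcher; a real one would make an HTTP request."""
    global fetch_calls
    fetch_calls += 1
    return b'<html><h1 class="product-title">Blue Mug</h1></html>'


def fetch_worker(url, key):
    """Fetch once, persist the raw bytes, then enqueue a reference."""
    raw_zone[key] = fake_fetch(url)  # saved before any parsing happens
    extract_queue.append(key)


def extract_worker(pattern):
    """Drain the queue; keys that fail to parse are returned for replay."""
    results, failed = {}, []
    while extract_queue:
        key = extract_queue.popleft()
        match = re.search(pattern, raw_zone[key].decode("utf-8"))
        if match:
            results[key] = match.group(1)
        else:
            failed.append(key)
    return results, failed


fetch_worker("https://example.com/product/123", "req_id=a7f9b2.html")

# First pass: the pattern targets a class name the page no longer uses.
results, failed = extract_worker(r'<h1 class="title">(.*?)</h1>')
first_pass_failed = list(failed)

# Fix the pattern and replay from the raw zone; no second fetch is needed.
extract_queue.extend(failed)
results, failed = extract_worker(r'<h1 class="product-title">(.*?)</h1>')
print(results, "fetches:", fetch_calls)
```

The extraction failure costs one replay from storage, while `fetch_calls` stays at one.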
03 Cost economics and batching
While storing data in S3 is incredibly cheap (often around $0.023 per GB per month), cloud providers also charge for API requests, specifically PUT operations. If you scrape an API returning 1KB JSON files and write each one individually to S3, your request costs will dwarf your storage costs. Production pipelines buffer these small payloads and write them to blob storage in larger, batched chunks (e.g., 50MB NDJSON files).
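One way to buffer small payloads is a size-bounded NDJSON batcher. The sketch below is an assumption about how such a buffer could look, not DataFlirt's implementation; `NdjsonBatcher` and `flush_fn` are hypothetical names, and `flush_fn` is where a real pipeline would issue a single PUT per batch.

```python
import json


class NdjsonBatcher:
    """Buffers small JSON records and flushes them as one NDJSON payload."""

    def __init__(self, flush_fn, max_bytes=50 * 1024 * 1024):
        self._flush_fn = flush_fn    # called with one bytes payload per batch
        self._max_bytes = max_bytes  # defaults to a 50 MB batch
        self._lines = []
        self._size = 0

    def add(self, record):
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        # Flush first if this record would push the batch over the limit.
        if self._lines and self._size + len(line) > self._max_bytes:
            self.flush()
        self._lines.append(line)
        self._size += len(line)

    def flush(self):
        if not self._lines:
            return
        self._flush_fn(b"".join(self._lines))
        self._lines, self._size = [], 0


batches = []
# A tiny 64-byte limit so the demo produces several batches.
batcher = NdjsonBatcher(batches.append, max_bytes=64)
for i in range(5):
    batcher.add({"id": i, "price": 9.99})
batcher.flush()  # always flush the tail on shutdown

total_lines = sum(payload.count(b"\n") for payload in batches)
print(len(batches), "batches,", total_lines, "records")
```

A production version would also flush on a timer, so a slow trickle of records does not sit in memory indefinitely.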
04 How DataFlirt handles it
We treat blob storage as the source of truth for all pipeline operations. Every response our fleet receives is compressed (usually via Brotli or Gzip) and written to an S3 bucket with extensive metadata attached: the proxy ASN used, the target URL, the timestamp, and the TLS fingerprint. This allows us to run forensic audits on our own crawl data and replay historical HTML through new extraction schemas without touching the target site.
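A compress-then-tag write might be assembled like this. The sketch only builds the keyword arguments in the shape boto3's S3 `put_object` accepts (`Bucket`, `Key`, `Body`, `ContentType`, `ContentEncoding`, `Metadata`), so it runs without AWS credentials. It uses gzip from the standard library in place of Brotli, and the metadata key names and the `build_put_request` helper are hypothetical.

```python
import gzip
from datetime import datetime, timezone


def build_put_request(bucket, key, raw_html, *, url, proxy_asn,
                      tls_fingerprint, fetched_at):
    """Return keyword arguments shaped for boto3's S3 put_object call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": gzip.compress(raw_html),  # gzip here; Brotli needs a third-party package
        "ContentType": "text/html",
        "ContentEncoding": "gzip",
        # S3 user metadata is a flat map of strings (sent as x-amz-meta-* headers).
        "Metadata": {
            "url": url,
            "proxy-asn": proxy_asn,
            "tls-fingerprint": tls_fingerprint,
            "fetched-at": fetched_at.isoformat(),
        },
    }


raw = b"<html><body>" + b"<p>product row</p>" * 200 + b"</body></html>"
request = build_put_request(
    "df-raw-zone-eu-west-1",
    "target_id=884/date=2026-05-19/req_id=a7f9b2.html.gz",
    raw,
    url="https://target.com/product/123",
    proxy_asn="AS7922",
    tls_fingerprint="t13d1516h2_8daaf6152771",
    fetched_at=datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc),
)
print(len(raw), "->", len(request["Body"]), "bytes")

# With credentials configured, the write itself would be:
#   import boto3
#   boto3.client("s3").put_object(**request)
```

Keeping the metadata on the object means an audit can read it with a HEAD request, without downloading or decompressing the body.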
05 The direct-to-database anti-pattern
A common mistake among junior data engineers is writing raw HTML payloads directly into a PostgreSQL TEXT or JSONB column. Relational databases are not designed for massive, unstructured text blobs. Doing this rapidly bloats the database, degrades query performance, and consumes expensive block storage (like AWS EBS). Always write the raw payload to blob storage, and only write the structured, extracted fields to the database.
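The recommended split can be sketched as a table that holds the extracted fields plus a pointer to the blob. SQLite from the standard library stands in for PostgreSQL here so the example runs anywhere; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    """
    CREATE TABLE products (
        product_id  TEXT PRIMARY KEY,
        title       TEXT NOT NULL,
        price_cents INTEGER NOT NULL,
        raw_key     TEXT NOT NULL  -- pointer to the blob, not the blob itself
    )
    """
)

# Only the small, structured fields go into the database row.
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    ("123", "Blue Mug", 999,
     "target_id=884/date=2026-05-19/req_id=a7f9b2.html.br"),
)
conn.commit()

row = conn.execute(
    "SELECT title, raw_key FROM products WHERE product_id = ?", ("123",)
).fetchone()
print(row)
```

The `raw_key` column keeps lineage: any row can be traced back to the exact payload it was extracted from, while the table itself stays small.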
// 03 — storage economics

How much does
raw storage cost?

Blob storage is cheap, but at scraping scale, API requests (PUT/GET) and egress bandwidth often eclipse the raw storage cost. DataFlirt models these to optimize batching.

Total Storage Cost: C = (GB_stored × rate) + (PUTs / 1000 × req_rate)
S3 charges per 1,000 PUT requests, and rate is the monthly price per GB. Batching small JSONs is critical. Source: AWS S3 pricing model
Compression Ratio: R = size_raw / size_gzip
Raw HTML typically shrinks by ~85% under gzip (R ≈ 6.7). Always gzip before the PUT operation. Source: DataFlirt infrastructure standards
Lifecycle Cost Optimization: C_arch = Standard (7d) → Glacier (365d)
Automated tiering drops retention costs by ~90% after the active re-parse window. Source: DataFlirt retention policy
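A worked example of the first two formulas, as a rough sketch. The storage rate is the $0.023 per GB quoted on this page; the PUT price of $0.005 per 1,000 requests is an assumed list price that varies by region and over time, so check current pricing before relying on it. The workload (100 million responses of 1 KB each) is hypothetical, and the byte counts for R are taken from the write-path example in section 04.

```python
RATE_PER_GB_MONTH = 0.023  # USD, figure quoted in the text
PUT_RATE_PER_1000 = 0.005  # USD, assumed list price; verify for your region


def total_storage_cost(gb_stored, puts, rate=RATE_PER_GB_MONTH,
                       req_rate=PUT_RATE_PER_1000):
    """C = (GB_stored x rate) + (PUTs / 1000 x req_rate)."""
    return gb_stored * rate + (puts / 1000) * req_rate


def compression_ratio(size_raw, size_compressed):
    """R = size_raw / size_compressed."""
    return size_raw / size_compressed


records = 100_000_000                # hypothetical workload: 1 KB per record
gb_stored = records * 1_000 / 1e9    # 100 GB in decimal units

unbatched = total_storage_cost(gb_stored, puts=records)
batched_puts = int(gb_stored * 1_000 / 50)  # 50 MB objects
batched = total_storage_cost(gb_stored, puts=batched_puts)

ratio = compression_ratio(248_102, 31_405)  # byte counts from the write path
reduction = 1 - 1 / ratio

print(f"unbatched: ${unbatched:,.2f}  batched: ${batched:,.2f}")
print(f"R = {ratio:.2f}, reduction = {reduction:.1%}")
```

Under these assumptions the storage term is the same $2.30 in both cases; the PUT term is about $500 unbatched and about one cent batched.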
// 04 — the write path

Flushing raw responses
to the data lake.

A worker node successfully fetches a product page, compresses the raw HTML, and writes it to an S3 bucket before passing the reference to the extraction queue.

AWS S3 · brotli compression · metadata tagging
edge.dataflirt.io — live
CAPTURED
// 1. fetch successful
http.status: 200 OK
payload.size_raw: 248,102 bytes

// 2. compress payload
action: "compress_brotli"
payload.size_br: 31,405 bytes // 87% reduction

// 3. generate object key
s3.bucket: "df-raw-zone-eu-west-1"
s3.key: "target_id=884/date=2026-05-19/req_id=a7f9b2.html.br"

// 4. put object with metadata
x-amz-meta-url: "https://target.com/product/123"
x-amz-meta-proxy: "res_eu_pool_4"
x-amz-meta-ja4: "t13d1516h2_8daaf6152771"
s3.put_status: 200 OK

// 5. trigger extraction
sqs.publish: "extract_queue" ref: "s3://.../req_id=a7f9b2.html.br"
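Step 3 of the trace builds a partition-style key. A minimal sketch of a builder and its inverse is shown below; the function names are hypothetical, and the layout simply mirrors the `target_id=.../date=.../req_id=...` key in the log above.

```python
from datetime import date


def build_raw_key(target_id, fetched_on, req_id, ext="html.br"):
    """Compose a key with target and date partitions ahead of the request id."""
    return f"target_id={target_id}/date={fetched_on.isoformat()}/req_id={req_id}.{ext}"


def parse_raw_key(key):
    """Recover the partition values from a key built by build_raw_key."""
    target_part, date_part, file_part = key.split("/")
    req_part, _, ext = file_part.partition(".")
    return {
        "target_id": int(target_part.split("=", 1)[1]),
        "date": date.fromisoformat(date_part.split("=", 1)[1]),
        "req_id": req_part.split("=", 1)[1],
        "ext": ext,
    }


key = build_raw_key(884, date(2026, 5, 19), "a7f9b2")
parts = parse_raw_key(key)
print(key)
print(parts)
```

Putting the target and date first means a replay for one site and one day becomes a single prefix listing.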
// 05 — architectural benefits

Why blob storage
beats databases.

Ranked by operational impact across DataFlirt's pipeline fleet. Storing raw responses in S3 instead of a relational database solves multiple scaling bottlenecks at once.

RAW DATA STORED: 14.2 PB
DAILY PUTS: 850M+
UPDATED: 2026-05-19
01 Re-parsing capability: decouples fetch/extract · Fix broken selectors without re-fetching
02 Storage cost per GB: $0.023/GB standard · Pennies compared to RDS or managed Postgres
03 Write concurrency: infinitely scalable · No connection pooling limits or deadlocks
04 Schema flexibility: schema-on-read · Accepts HTML, JSON, PDFs without migration
05 Lifecycle automation: native tiering · Auto-moves to Glacier after 7 days
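The lifecycle automation in item 05 is usually expressed as a bucket-level rule. The sketch below builds a configuration in the shape boto3's `put_bucket_lifecycle_configuration` expects; the helper name and the prefix are hypothetical, and the 7-day threshold is the one stated above.

```python
import json


def glacier_transition_rule(prefix, days=7):
    """Build a lifecycle configuration that tiers objects under prefix to Glacier."""
    return {
        "Rules": [
            {
                "ID": f"glacier_transition_{days}d",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_transition_rule("target_id=884/")
print(json.dumps(config, indent=2))

# With credentials configured, applying it would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="df-raw-zone-eu-west-1", LifecycleConfiguration=config)
```

Once the rule is attached, the storage service performs the transition itself; no worker has to scan or move objects.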
// 06 — our architecture

Never fetch the same page
twice for a parsing error.

At DataFlirt, every successful HTTP response is compressed and written to an S3-backed raw data zone before the extraction layer even sees it. If a target site deploys a new layout and our selectors break, we don't lose the data and we don't burn proxy bandwidth re-crawling. We simply update the schema contract, point the extraction workers at the blob storage prefix, and replay the raw HTML. Storage is cheap; residential proxy traffic and target rate limits are expensive.

s3-put-object.log

Metadata attached to a raw HTML blob in our bronze layer.

s3.bucket: df-raw-zone-us-east
s3.key: target=amz/date=2026-05-19/req=99a.br
size.bytes: 31,405 (compressed)
meta.status: 200 (valid)
meta.proxy_asn: AS7922
lifecycle.rule: glacier_transition_7d
extract.status: pending


// 07 — FAQ

Common
questions.

About object storage, data lakes, compression, and how DataFlirt manages petabytes of raw scraping data.

What is the difference between blob storage and a database?
A database (like Postgres) stores structured data in rows and columns, optimized for complex queries and joins. Blob storage (like S3) stores unstructured data as flat "objects" identified by a key. You can't run a SQL query directly against raw HTML in blob storage, but it scales without capacity planning and is far cheaper per GB for storing raw payloads.
Should I store raw HTML or just the extracted JSON?
Always store the raw HTML. Extraction logic is brittle — CSS selectors break when sites update. If you only store the extracted JSON, a broken selector means lost data. If you store the raw HTML in blob storage, you can fix the selector and re-parse the historical data without having to re-scrape the target.
How do you handle millions of tiny JSON files in S3?
Writing millions of 2KB files to S3 is an anti-pattern that results in massive PUT request costs. For high-volume API scraping, we buffer the JSON responses in memory or a fast message queue, then flush them to blob storage in batched NDJSON or Parquet files (e.g., 100MB chunks).
How does DataFlirt deliver data to clients?
Blob storage backs our internal raw zone, and we also use it for delivery. We typically push the final, cleaned datasets to a dedicated S3 or GCS bucket in the client's cloud environment, formatted as Parquet or CSV. We also support Snowflake external stages directly mapped to our delivery buckets.
What happens when the raw data zone gets too large?
We use automated lifecycle policies. Raw HTML is kept in standard, fast-access storage for 7 to 14 days — the window during which we might need to re-parse it due to schema drift. After that, it automatically transitions to cold storage (like S3 Glacier), dropping the storage cost by ~90% while retaining the data for compliance or historical audits.
Is blob storage secure for scraped data?
Yes, provided it is configured correctly. We enforce strict IAM roles, disable public access at the bucket level, and use AES-256 encryption at rest. Because blob storage separates compute from storage, it allows us to grant extraction workers read-only access to the raw data zone, minimizing the blast radius of any compromised worker.
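The read-only grant for extraction workers can be sketched as an IAM policy document. The helper below is hypothetical and only builds the JSON; the bucket name is the one from the write-path example, and a real deployment would attach the result to the workers' role.

```python
import json


def read_only_policy(bucket, prefix=""):
    """Build an IAM policy allowing reads and listing, with no write actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadRawObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object-level actions apply to keys inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "ListRawZone",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # Listing is a bucket-level action, so the resource is the bucket.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


policy = read_only_policy("df-raw-zone-eu-west-1")
allowed_actions = {a for s in policy["Statement"] for a in s["Action"]}
print(json.dumps(policy, indent=2))
```

Because the policy never mentions `s3:PutObject` or `s3:DeleteObject`, a compromised extraction worker can read raw payloads but cannot alter or remove them.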

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h