
What is Object Storage (S3)?

Object Storage (S3) is the foundational persistence layer for modern data pipelines, storing unstructured or semi-structured data as discrete objects rather than blocks or files. For scraping infrastructure, it serves as the ultimate raw data sink—holding everything from raw HTML payloads and JSON API responses to extracted Parquet files. It decouples storage from compute, allowing massive parallel writes without the locking contention that cripples traditional relational databases at scale.

Data Lake · Blob Storage · AWS S3 · Data Engineering · Persistence
// 02 — definitions

Infinite scale, zero locks.

Why scraping pipelines dump raw payloads into object storage before attempting any structured extraction or transformation.


TL;DR

Object storage treats data as immutable blobs with metadata, accessed via HTTP APIs. It lets thousands of concurrent scraping workers write raw HTML simultaneously without hitting connection-pool limits or schema constraints. AWS S3 is the industry standard, and GCS, Azure Blob, and MinIO follow the same object model.

01 Definition & structure
Object Storage is a data storage architecture that manages data as objects, as opposed to file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data itself, a variable amount of metadata, and an identifier (the key) that is unique within its bucket. AWS S3 is the most prominent implementation, but the paradigm is universal across modern cloud providers.
02 The raw data zone
In a scraping context, S3 is used as the "raw data zone" or "bronze layer." Instead of parsing HTML in memory and discarding it, pipelines write the raw HTTP response body to S3. This decouples the fragile extraction logic from the expensive fetching logic. If an extraction job fails due to a site layout change, the raw data is safely persisted in S3 and can be re-processed without sending new HTTP requests to the target.
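A minimal sketch of the write-raw-first step, under stated assumptions: `persist_raw`, `InMemoryClient`, the bucket name, and the key are hypothetical, and gzip plus a SHA-256 metadata entry are illustrative choices, not something the pipeline above is known to use. The `put_object` keyword arguments mirror boto3's S3 client, so a real client could be passed in place of the stand-in.

```python
import gzip
import hashlib


def persist_raw(client, bucket: str, key: str, body: bytes) -> dict:
    """Gzip a raw HTTP response body and store it before any parsing runs."""
    compressed = gzip.compress(body)
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentEncoding="gzip",
        # Checksum of the uncompressed body, handy when replaying later.
        Metadata={"sha256": hashlib.sha256(body).hexdigest()},
    )
    return {"key": key, "raw_bytes": len(body), "stored_bytes": len(compressed)}


class InMemoryClient:
    """Hypothetical stand-in for an S3 client so the sketch runs offline."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body, **kwargs):
        self.objects[(Bucket, Key)] = Body
        return {}


client = InMemoryClient()
receipt = persist_raw(
    client,
    "example-raw-zone",            # assumed bucket name
    "raw/example/page-1.html.gz",  # assumed key
    b"<html>...</html>",
)
```

Extraction can then run as a separate job that reads these objects, so a parser failure never requires re-fetching the page.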
03 Partitioning strategies
S3 does not have directories; it has a flat namespace of keys. However, using slashes in keys creates a logical hierarchy. Proper partitioning is critical for cost control. A standard scraping partition scheme looks like s3://bucket/raw/target_name/year=2026/month=05/day=19/. This allows downstream ETL jobs to read only the data for a specific day, rather than scanning the entire petabyte-scale bucket.
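The partition scheme above can be sketched as a small key builder. The function names, the `batch_id` argument, and the `.json.gz` suffix are assumptions for illustration; only the `raw/target_name/year=/month=/day=` layout comes from the text.

```python
from datetime import datetime, timezone


def day_prefix(target: str, day: datetime) -> str:
    """Prefix that a downstream job would list to read one day of one target."""
    return f"raw/{target}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def build_raw_key(target: str, fetched_at: datetime, batch_id: str) -> str:
    """Build a date-partitioned key; expects a timezone-aware timestamp."""
    utc = fetched_at.astimezone(timezone.utc)
    return f"{day_prefix(target, utc)}{batch_id}.json.gz"


fetched = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
key = build_raw_key("target_name", fetched, "batch_042")
```

Normalizing to UTC before formatting keeps a batch from landing in two different day partitions depending on the worker's local time zone.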
04 How DataFlirt handles it
We treat S3 as the immutable source of truth for all pipelines. Our distributed workers buffer raw responses in memory and flush them to S3 in 128MB compressed batches to avoid API cost penalties. Once written, S3 event notifications trigger our extraction workers, which pull the raw batches, apply the schema, and write structured Parquet files to the delivery bucket. We manage over 40PB of raw scraping artifacts across our fleet.
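The buffer-and-flush pattern described above might look roughly like the sketch below. It is not DataFlirt's implementation: `BatchBuffer` and its `flush` callback are hypothetical, the threshold is measured on uncompressed bytes for simplicity (the text describes 128MB compressed batches), and the payload is gzip-compressed newline-delimited JSON.

```python
import gzip
import json


class BatchBuffer:
    """Accumulate records in memory and hand them off as one compressed blob."""

    def __init__(self, flush, max_raw_bytes: int = 128 * 1024 * 1024):
        self._flush = flush        # callable(payload: bytes, record_count: int)
        self._max = max_raw_bytes  # assumed threshold on uncompressed size
        self._lines = []
        self._size = 0

    def add(self, record: dict) -> None:
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        self._lines.append(line)
        self._size += len(line)
        if self._size >= self._max:
            self.flush()

    def flush(self) -> None:
        if not self._lines:
            return
        payload = gzip.compress(b"".join(self._lines))
        self._flush(payload, len(self._lines))
        self._lines, self._size = [], 0


# Tiny threshold so the demo flushes several times; a real sink would call
# an S3 client's put_object here instead of appending to a list.
batches = []
buffer = BatchBuffer(lambda payload, n: batches.append((payload, n)), max_raw_bytes=100)
for i in range(10):
    buffer.add({"url": f"https://example.com/{i}", "status": 200})
buffer.flush()  # flush the remainder on shutdown
```

Each call to the sink corresponds to one PUT, so the number of requests is driven by the threshold, not by the number of scraped pages.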
05 Did you know?
Historically, S3 was "eventually consistent" for some operations, meaning a read or listing issued immediately after a write could return stale data or a 404. In late 2020, AWS updated S3 to deliver strong read-after-write consistency. However, if you are building multi-cloud scrapers, be aware that some alternative object storage providers still exhibit eventual consistency under heavy load.
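For stores that may still lag, one defensive option is to retry a read that comes back "not found" with exponential backoff. The sketch below is an assumption-laden illustration: `read_with_retry`, `ObjectNotFound`, and `LaggingStore` are hypothetical names, and with boto3 the error to catch would be the client's `NoSuchKey` exception.

```python
import time


class ObjectNotFound(Exception):
    """Hypothetical error type raised by the store when a key is not visible."""


def read_with_retry(get, key, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call get(key), backing off exponentially while the key is not visible."""
    for attempt in range(attempts):
        try:
            return get(key)
        except ObjectNotFound:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


class LaggingStore:
    """Simulates a store where a new object appears only after a few reads."""

    def __init__(self, visible_after: int):
        self.calls = 0
        self.visible_after = visible_after

    def get(self, key):
        self.calls += 1
        if self.calls <= self.visible_after:
            raise ObjectNotFound(key)
        return b"payload"


delays = []  # record the requested sleeps instead of actually sleeping
store = LaggingStore(visible_after=2)
body = read_with_retry(store.get, "raw/example/a.json", sleep=delays.append)
```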
// 03 — storage economics

What does persistence cost?

Object storage is cheap per gigabyte, but API calls (PUT/GET) and egress bandwidth are not. DataFlirt optimizes pipeline economics by batching small payloads into larger objects before writing to S3.

Total Storage Cost: C = (GB × Rate_GB) + (PUTs / 1000 × Rate_PUT)
PUT charges grow in direct proportion to object count, so millions of small files can push the request bill well past the storage bill. Batching is mandatory. (Standard Cloud Pricing Model)
Batching Efficiency: E = 1 − (Batched_PUTs / Raw_Requests)
Target E > 0.95 for high-frequency scraping to avoid API cost blowouts. (DataFlirt Pipeline SLO)
DataFlirt S3 Egress: Cost = 0
We route delivery through intra-region VPC endpoints to eliminate egress fees for AWS clients. (DataFlirt Delivery Architecture)
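A worked example of the two formulas above, as a sketch. The workload (100 million pages of 50 KB each) and both rates are assumptions chosen for illustration, so check current pricing for your region before relying on the numbers. Compression is ignored to keep the arithmetic simple.

```python
import math


def storage_cost(gb, puts, rate_gb, rate_put_per_1000):
    """C = (GB x Rate_GB) + (PUTs / 1000 x Rate_PUT)."""
    return gb * rate_gb + (puts / 1000) * rate_put_per_1000


def batching_efficiency(batched_puts, raw_requests):
    """E = 1 - (Batched_PUTs / Raw_Requests)."""
    return 1 - batched_puts / raw_requests


# Assumed workload: 100 million pages of 50 KB each (decimal units).
PAGES = 100_000_000
TOTAL_GB = PAGES * 50 / 1_000_000  # 5,000 GB

# Assumed illustrative rates, not quoted prices.
RATE_GB = 0.023   # USD per GB-month
RATE_PUT = 0.005  # USD per 1,000 PUT requests

# One object per page.
unbatched = storage_cost(TOTAL_GB, PAGES, RATE_GB, RATE_PUT)

# The same bytes packed into 128 MB objects.
batched_puts = math.ceil(TOTAL_GB / 0.128)
batched = storage_cost(TOTAL_GB, batched_puts, RATE_GB, RATE_PUT)

efficiency = batching_efficiency(batched_puts, PAGES)
```

Under these assumed numbers the request charge falls from roughly $500 to well under $1, while the storage term stays at about $115 per month, and E comes out far above the 0.95 target.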
// 04 — the write path

Flushing 10,000 records to S3.

A trace of a DataFlirt worker batching raw JSON responses and writing a compressed Parquet object to a client's S3 bucket.

AWS S3 · Parquet · boto3
// buffer status
buffer.records: 10000
buffer.size_raw: "14.2 MB"
buffer.state: "FLUSH_TRIGGERED"

// transform & compress
task: "convert_to_parquet"
compression: "snappy"
buffer.size_compressed: "2.8 MB"

// s3 put operation
s3.bucket: "df-client-prod-data"
s3.key: "raw/amazon/2026/05/19/batch_042.parquet"
s3.auth: "IAM_ROLE_ASSUMED"
s3.upload_id: "x89f...2a1"
s3.status: 200 OK

// metadata tagging
meta.pipeline_id: "amz-pricing-v4"
meta.record_count: 10000
s3.tagging: SUCCESS
// 05 — failure modes

Where S3 sinks break down.

Object storage is highly durable, but pipeline architectures built on top of it often fail due to poor key design, small-file proliferation, or IAM misconfigurations.

PIPELINES MONITORED: 300+ active
AVG BATCH SIZE: 128 MB
UPDATED: 2026-05-19
01 Small file proliferation (cost & performance): Millions of 10KB files cripple downstream Athena/Spark queries.
02 IAM / Bucket Policy errors (access denied): Cross-account role assumption failures during delivery.
03 S3 503 Slow Down (throttling): Hot-spotting a single prefix with >3,500 PUTs per second.
04 Inefficient partitioning (scan costs): Failing to partition by date/target forces full bucket scans.
05 Orphaned multipart uploads (hidden costs): Failed large uploads consuming storage without showing in LIST.
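For the orphaned multipart uploads in the list above, a common mitigation is a bucket lifecycle rule that aborts incomplete uploads after a few days. The sketch below only builds the configuration dictionary: the helper name, the rule ID, and the 7-day default are assumptions, while the `AbortIncompleteMultipartUpload` and `DaysAfterInitiation` fields are part of the S3 lifecycle configuration format.

```python
def abort_incomplete_uploads_rule(days: int = 7, prefix: str = "") -> dict:
    """Lifecycle configuration that cleans up multipart uploads never completed."""
    return {
        "Rules": [
            {
                "ID": f"abort-incomplete-multipart-{days}d",  # hypothetical ID
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


config = abort_incomplete_uploads_rule()

# With boto3 (not imported here) this would be applied roughly as:
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # assumed bucket name
#       LifecycleConfiguration=config,
#   )
```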
// 06 — our architecture

Write raw, extract later: the only way to survive schema drift.

If you extract data in memory and only save the structured output, a broken CSS selector means permanent data loss. DataFlirt's architecture writes the raw HTTP response body directly to S3 as an immutable artifact. If a target site changes its layout, we update our extraction schema and replay the raw S3 objects through the new parser. You never lose data because you never throw away the source truth.
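The replay idea can be illustrated with a toy example. Everything here is hypothetical: the keys, the HTML snippets, and the regex-based parsers stand in for a real raw zone and real extraction schemas, and a dictionary stands in for S3.

```python
import re

# Hypothetical raw zone: key -> raw HTML captured at fetch time.
RAW_OBJECTS = {
    "raw/shop/day=18/a.html": '<span class="price">19.99</span>',
    "raw/shop/day=19/b.html": '<div data-price="21.50"></div>',  # layout changed
}


def parse_v1(html):
    """Original extractor: only understands the old markup."""
    match = re.search(r'<span class="price">([^<]+)</span>', html)
    return {"price": match.group(1)} if match else None


def parse_v2(html):
    """Updated extractor: tries the new markup, then falls back to the old."""
    match = re.search(r'<div data-price="([^"]+)"', html)
    return {"price": match.group(1)} if match else parse_v1(html)


def replay(raw_objects, parser):
    """Run a parser over stored raw objects without re-fetching anything."""
    extracted, failed = {}, []
    for key, html in sorted(raw_objects.items()):
        record = parser(html)
        if record is None:
            failed.append(key)
        else:
            extracted[key] = record
    return extracted, failed


_, failed_before = replay(RAW_OBJECTS, parse_v1)
extracted_after, failed_after = replay(RAW_OBJECTS, parse_v2)
```

The old parser fails on the object captured after the layout change, and the updated parser recovers it from the stored bytes with no new request to the target.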

S3 Sink Telemetry

Live metrics from a high-throughput e-commerce pipeline writing to a raw data lake.

bucket.name: df-raw-zone-us-east-1
write.throughput: 412 MB/s
put.latency_p99: 42ms
batch.efficiency: 0.98
throttled_requests: 0.01%
cross_account.iam: assumed
delivery.status: active


// 07 — FAQ

Common questions.

About object storage architecture, pipeline economics, data delivery, and how DataFlirt manages petabyte-scale scraping sinks.

What is the difference between block storage and object storage?
Block storage (like AWS EBS) acts like a hard drive attached to a single server: it is fast but doesn't scale across thousands of concurrent writers. Object storage (S3) stores data as discrete objects with metadata, accessed via an HTTP API. It scales horizontally with no practical capacity ceiling and allows thousands of scraping workers to write simultaneously without locking.
Why not just write scraped data directly to PostgreSQL or MySQL?
Relational databases require connection pooling, enforce strict schemas, and handle concurrent writes poorly at high volume. If you have 5,000 scraping workers trying to INSERT rows simultaneously, you will quickly exhaust your database connections. S3 absorbs the massive write concurrency; you then batch-load that data into your database asynchronously.
What is the 'small file problem' in S3?
Writing every scraped HTML page as its own 50KB object in S3 is an anti-pattern. It incurs massive API costs (you pay per 1,000 PUT requests) and destroys read performance for downstream analytics tools like Athena or Spark, which spend more time opening files than reading data. You must buffer and batch into larger files (e.g., 128MB Parquet).
How do you handle GDPR/CCPA if you store raw HTML in S3?
Raw HTML can contain PII. We enforce strict lifecycle policies on our raw data zones—typically automatically deleting raw S3 objects after 7 to 30 days. The structured, extracted data (which is stripped of PII during the parsing phase) is persisted long-term. S3 lifecycle rules make this compliance automated and auditable.
How does DataFlirt deliver data to my company's S3 bucket?
We use cross-account IAM role assumption. You create a role in your AWS account granting our delivery worker write-only access to a specific bucket prefix. We assume that role to write the Parquet or JSON files. No access keys are exchanged, and you retain complete ownership of the destination bucket.
Can S3 handle 100,000 requests per second from a massive crawl?
Yes, but not out of the box on a single prefix. S3 supports at least 3,500 PUTs and 5,500 GETs per second per prefix. If you write everything to s3://bucket/data/ at a higher rate, you will get 503 Slow Down errors. Spread your keys across multiple prefixes (e.g., by hashing the target ID into the key) to distribute the load across S3's internal partitions.
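One way to spread writes across prefixes, as the last answer suggests, is to derive a short shard segment from a hash of the target ID. This is a sketch only: `sharded_key`, the `data/` root, and the default of 16 shards are assumptions for illustration.

```python
import hashlib


def sharded_key(target_id: str, object_name: str, shards: int = 16) -> str:
    """Place each target under a stable, hash-derived prefix."""
    digest = hashlib.sha256(target_id.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"data/{shard:02x}/{target_id}/{object_name}"


key = sharded_key("target-123", "batch_001.json.gz")

# Many targets fan out over the available shard prefixes.
prefixes = {sharded_key(f"target-{i}", "x").split("/")[1] for i in range(1000)}
```

Because the shard is a pure function of the target ID, readers can recompute the prefix for a given target without a lookup table. The trade-off is that listing all targets for one day now means listing every shard prefix.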