
What is Idempotent Scraping?

Idempotent scraping is a pipeline design pattern where executing the same extraction job multiple times yields the exact same final dataset state, without duplicating records or corrupting downstream tables. It decouples the act of fetching from the act of state mutation. When a worker crashes mid-run, an idempotent architecture allows you to simply restart the job from the top, knowing the delivery layer will safely overwrite or ignore previously processed records.

Infrastructure · State Management · ETL · Fault Tolerance · Upserts
// 02 — definitions

Run it once.
Run it twice.

Why the ability to safely retry a failed job without manual cleanup is the dividing line between a script and a production pipeline.


TL;DR

Idempotent scraping ensures that retrying a failed or interrupted job doesn't result in duplicate data. It relies on stable primary keys derived from the source data (like a product SKU or URL hash) and upsert operations at the database layer, rather than naive appends. It makes pipeline orchestration vastly simpler because error recovery is just a blind retry.

01 Definition & structure

An idempotent operation is one that produces the same result whether executed once or multiple times. In scraping, this means if a pipeline crashes at 99% completion, you don't need to write a custom script to figure out what's missing. You just run the whole pipeline again.

This requires three structural components:

  • Stable Primary Keys: Every extracted record must have a deterministic ID.
  • Stateless Workers: The scraper itself holds no memory of what it has done.
  • Upsert Sinks: The database uses an upsert (INSERT ... ON CONFLICT ... DO UPDATE in PostgreSQL) rather than naive appends.

02 How it works in practice

When a scraper extracts a product, it generates a payload. Instead of just pushing JSON to a file, it assigns a unique key (e.g., store_123_sku_456). When this payload hits the database, the DB checks the index. If the key doesn't exist, it inserts the row. If the key does exist, it updates the row with the new data.

If the network drops and the scraper retries the request, it generates the same key and, provided the source page has not changed in the meantime, the same payload. The database sees the existing key, overwrites the row with identical data, and the final dataset contains exactly one row for that product.
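
The insert-or-update flow described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than a production sink: the table name, column names, and the `upsert_product` helper are assumptions made for the example, and the upsert clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (record_key TEXT PRIMARY KEY, title TEXT, price REAL)"
)

def upsert_product(conn, store_id, sku, title, price):
    """Insert the row, or overwrite it if the key already exists."""
    # Deterministic key in the store_123_sku_456 style used in the text.
    record_key = f"store_{store_id}_sku_{sku}"
    with conn:  # commits on success, rolls back if the statement raises
        conn.execute(
            """
            INSERT INTO products (record_key, title, price)
            VALUES (?, ?, ?)
            ON CONFLICT(record_key) DO UPDATE SET
                title = excluded.title,
                price = excluded.price
            """,
            (record_key, title, price),
        )
    return record_key

upsert_product(conn, 123, 456, "Widget", 9.99)
upsert_product(conn, 123, 456, "Widget", 9.99)  # simulated retry of the same record
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

After the simulated retry the table still holds a single row, because the second write lands on the existing key instead of appending.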

03 The primary key problem

Idempotency breaks when primary keys are unstable. If a target website generates a dynamic session token in the URL, and you use the URL as your primary key, every retry will look like a brand new record. Your dataset will inflate with duplicates.

To fix this, you must strip volatile parameters from URLs before hashing, or better yet, derive the key entirely from the content itself (e.g., a composite of the product title and the manufacturer part number).
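
Stripping volatile parameters before hashing might look like the sketch below. The parameter names in `VOLATILE_PARAMS` are illustrative assumptions; the real list depends on the target site. Query parameters are also sorted so that parameter order does not change the key.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters that change between requests for the same page.
VOLATILE_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign", "ref"}

def normalize_url(url):
    """Return the URL without volatile query parameters or fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in VOLATILE_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        urlencode(kept),
        "",  # fragments never reach the server, so they are dropped
    ))

def url_key(url):
    """Stable primary key: SHA-256 of the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()

a = url_key("https://shop.example.com/p/456?color=red&sessionid=abc123")
b = url_key("https://shop.example.com/p/456?sessionid=zzz999&color=red")
```

Here `a` and `b` are equal even though the session token and parameter order differ, so a retry maps to the same record.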

04 How DataFlirt handles it

We enforce idempotency at the delivery boundary. Every record extracted by our fleet is passed through a schema validation layer that generates a deterministic SHA-256 fingerprint based strictly on the defined primary keys. Volatile fields like scraped_at are explicitly excluded from the hash.

Because our delivery sinks (PostgreSQL, Snowflake, BigQuery) are configured for atomic upserts based on these fingerprints, our orchestrator can aggressively kill and restart stuck worker nodes without ever risking duplicate data delivery to the client.
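
As a rough illustration of fingerprinting on primary keys only, the sketch below hashes a canonical JSON serialization of the key fields. It is not DataFlirt's actual implementation; the `fingerprint` function, the field names, and the sample records are assumptions for the example.

```python
import hashlib
import json

def fingerprint(record, key_fields):
    """SHA-256 over only the primary-key fields, serialized canonically."""
    missing = [field for field in key_fields if field not in record]
    if missing:
        raise KeyError(f"record is missing key fields: {missing}")
    key_part = {field: record[field] for field in key_fields}
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(
        key_part, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two scrapes of the same product: price and scraped_at differ, keys do not.
first = {"store_id": 123, "sku": "456", "price": 10.0, "scraped_at": "2026-05-19T10:00:00Z"}
retry = {"store_id": 123, "sku": "456", "price": 12.0, "scraped_at": "2026-05-19T10:07:31Z"}

same = fingerprint(first, ["store_id", "sku"]) == fingerprint(retry, ["store_id", "sku"])
```

Because `scraped_at` and `price` never enter the hash, both scrapes resolve to the same fingerprint and the sink can treat the second one as an update.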

05 Did you know?

Naive append-only scrapers (like scripts that just write to a CSV) often inflate datasets by 15-30%. This happens because retries are usually invisible to the code that writes the output. HTTP clients and task queues commonly retry failed work under the hood: if a worker appends an item to the CSV and then crashes or times out before the task is acknowledged, the URL is processed again and the script writes the same item to the CSV twice.

// 03 — the math

Measuring pipeline
determinism.

Idempotency is measured by the divergence in output state after forced retries. DataFlirt's CI/CD pipeline runs chaos tests on every scraper to ensure zero state drift under failure conditions.

Idempotency factor: I = 1 − (Δ_records / total_records)
I = 1 means perfectly idempotent. Retries produce zero net-new duplicate rows. (Source: Data Engineering standard)

Deduplication ratio: D = upserts_triggered / total_writes
High D during a retry indicates the idempotency layer is actively catching overlaps. (Source: Database metrics)

DataFlirt state drift SLO: ΔS = 0
Guaranteed zero duplicates across all managed delivery sinks, regardless of worker crashes. (Source: Internal SLO)

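
The first two formulas can be worked through with a few lines of Python. This sketch assumes Δ_records means the number of net-new duplicate rows created by retries and that upserts_triggered counts writes that landed on an existing key; the sample figures describe a 10,000-write retry in which 4,512 writes hit rows that already existed.

```python
def idempotency_factor(duplicate_rows, total_records):
    """I = 1 - (delta_records / total_records)."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1 - duplicate_rows / total_records

def deduplication_ratio(upserts_triggered, total_writes):
    """D = upserts_triggered / total_writes."""
    if total_writes <= 0:
        raise ValueError("total_writes must be positive")
    return upserts_triggered / total_writes

i = idempotency_factor(duplicate_rows=0, total_records=10_000)        # 1.0
d = deduplication_ratio(upserts_triggered=4_512, total_writes=10_000)  # 0.4512
```

An I of 1.0 alongside a D of roughly 0.45 is the healthy pattern for a retry: almost half the writes overlapped with earlier work, and none of them produced a new row.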
// 04 — execution trace

A mid-run crash
and a clean restart.

Trace of a worker node dying halfway through a 10,000-page catalog crawl, and the subsequent idempotent recovery run. Notice how the database handles the overlap.

Kubernetes · PostgreSQL Upsert · Kafka
// Run 1: Initial execution
worker.status: running
records.extracted: 4,512
db.operation: INSERT (4,512 new)
worker.error: OOMKilled // node ran out of memory
pipeline.state: FAILED

// Run 2: Automated retry triggered
worker.status: restarting_job
records.extracted: 10,000
db.operation: UPSERT ON CONFLICT (sku)
db.metrics: 4,512 updated, 5,488 inserted
pipeline.state: SUCCESS

// Validation
dataset.total_rows: 10,000
dataset.duplicates: 0
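
The same crash-and-restart sequence can be reproduced in miniature. The sketch below uses an in-memory dictionary as a stand-in for the upsert sink; `UpsertSink`, `WorkerCrash`, and `crawl` are hypothetical names for this illustration and do not model Kubernetes, Kafka, or PostgreSQL.

```python
class UpsertSink:
    """In-memory stand-in for a keyed table with upsert semantics."""

    def __init__(self):
        self.rows = {}
        self.inserted = 0
        self.updated = 0

    def upsert(self, key, record):
        if key in self.rows:
            self.updated += 1
        else:
            self.inserted += 1
        self.rows[key] = record

class WorkerCrash(Exception):
    """Simulates the worker being killed mid-run."""

def crawl(sink, skus, crash_after=None):
    """Write every SKU to the sink, optionally dying after N records."""
    for n, sku in enumerate(skus, start=1):
        if crash_after is not None and n > crash_after:
            raise WorkerCrash(f"killed after {crash_after} records")
        sink.upsert(sku, {"sku": sku})

catalog = [f"sku_{i}" for i in range(10_000)]
sink = UpsertSink()

# Run 1: the worker dies after 4,512 records.
try:
    crawl(sink, catalog, crash_after=4_512)
except WorkerCrash:
    pass
run1 = (sink.inserted, sink.updated)

# Run 2: blind retry from the top, no cleanup in between.
sink.inserted = sink.updated = 0
crawl(sink, catalog)
run2 = (sink.inserted, sink.updated)
total_rows = len(sink.rows)
```

Run 2 re-sends all 10,000 records, yet the sink ends with exactly 10,000 rows: the 4,512 records from the failed run are counted as updates and only the remaining 5,488 as inserts.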
// 05 — failure modes

Where idempotency
breaks down.

The most common reasons a supposedly idempotent scraper ends up writing duplicate records to the data warehouse. It almost always comes down to unstable key generation.

Pipelines analysed: 850+
Common cause: Dynamic IDs
Updated: 2026-05-19

  • 01 Unstable primary keys (dynamic generation): Target site generates new internal IDs on every page load
  • 02 Timestamp inclusion (hash corruption): Including the scrape timestamp in the deduplication hash
  • 03 URL parameter noise (session IDs): Tracking tokens in the URL bypassing URL-based deduplication
  • 04 Pagination drift (item shifting): Items move between pages during the crawl, causing double-fetches
  • 05 Non-transactional writes (partial commits): Batch inserts fail halfway, leaving ghost records behind

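
For the last failure mode, wrapping each batch in a transaction means a failed batch leaves nothing behind. A minimal sketch with Python's sqlite3 module follows; the schema and the `write_batch` helper are assumptions for the example, and the NULL price is only there to force a mid-batch failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL NOT NULL)")

def write_batch(conn, batch):
    """Upsert a whole batch atomically: either every row lands or none do."""
    with conn:  # commits on success, rolls back the open transaction on error
        conn.executemany(
            "INSERT INTO products (sku, price) VALUES (?, ?) "
            "ON CONFLICT(sku) DO UPDATE SET price = excluded.price",
            batch,
        )

# The second row violates NOT NULL, so the batch fails halfway through.
bad_batch = [("sku_1", 9.99), ("sku_2", None), ("sku_3", 4.50)]
try:
    write_batch(conn, bad_batch)
except sqlite3.IntegrityError:
    pass
rows_after_failure = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]

# Retrying the corrected batch is safe because nothing was half-written.
write_batch(conn, [("sku_1", 9.99), ("sku_2", 19.99), ("sku_3", 4.50)])
rows_after_retry = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

The failed batch leaves zero rows rather than one ghost record, and the retry then lands all three.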
// 06 — our architecture

Stateless workers,
stateful sinks.

DataFlirt's extraction workers are entirely stateless. They don't know if they are running for the first time or the fifth time. They simply extract records, generate a deterministic SHA-256 hash based on the schema's defined primary keys, and push to the delivery queue. The idempotency is enforced at the database boundary using atomic upserts. If a node burns down, we just spin up another one and replay the URL queue. The sink handles the overlap silently.

Idempotent Delivery Config

Delivery layer configuration for a high-volume product catalog pipeline.

sink.type            PostgreSQL
write.mode           UPSERT
conflict.target      composite: [store_id, sku]
hash.strategy        deterministic_sha256 (stable)
exclude_from_hash    ['scraped_at', 'session_id']
retry.policy         infinite_replay (safe)
duplicate.rate       0.00%


// 07 — FAQ

Common
questions.

Common questions about state management, deduplication, and building fault-tolerant scraping infrastructure.

What is the difference between deduplication and idempotency?
Deduplication is a filtering process — removing duplicates that have already been created. Idempotency is a structural guarantee — ensuring the operation itself cannot create duplicates, no matter how many times it runs. Idempotency prevents the mess; deduplication cleans it up.
How do you handle sites that don't expose unique IDs?
We derive a synthetic primary key. If a product lacks a SKU, we hash a combination of stable attributes: hash(brand + product_name + normalized_url). The key must be deterministic — it must generate the exact same hash tomorrow as it did today, ignoring volatile fields like price or stock status.
Does idempotent scraping slow down the pipeline?
At the database layer, an upsert (INSERT ... ON CONFLICT) is marginally slower than a bulk append because the database must check the index for every row. However, the operational time saved by never having to manually clean up a failed run or write custom recovery scripts vastly outweighs that small per-row penalty.
How does DataFlirt handle idempotency for API delivery (webhooks)?
For webhook sinks, we send a unique Idempotency-Key header with every payload, derived from the record's hash. The client's receiving server uses this key to safely ignore duplicate deliveries if our system retries a webhook due to a network timeout.
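
On the receiving side, honoring that header can be as simple as remembering which keys have already been processed. The sketch below is a hypothetical receiver, not DataFlirt's delivery code: `build_delivery` and `WebhookReceiver` are illustrative names, the key is derived from assumed primary-key fields, and a real service would persist seen keys in a durable store instead of a Python set.

```python
import hashlib
import json

def build_delivery(record, key_fields):
    """Sender side: derive the Idempotency-Key from the record's key fields."""
    key_material = json.dumps({f: record[f] for f in key_fields}, sort_keys=True)
    headers = {"Idempotency-Key": hashlib.sha256(key_material.encode("utf-8")).hexdigest()}
    return headers, json.dumps(record)

class WebhookReceiver:
    """Receiver side: process each Idempotency-Key at most once."""

    def __init__(self):
        self.seen_keys = set()
        self.processed = []

    def handle(self, headers, body):
        key = headers.get("Idempotency-Key")
        if key is None:
            return 400  # reject deliveries that cannot be deduplicated
        if key in self.seen_keys:
            return 200  # duplicate delivery: acknowledge without reprocessing
        self.processed.append(json.loads(body))
        self.seen_keys.add(key)
        return 200

receiver = WebhookReceiver()
headers, body = build_delivery({"store_id": 123, "sku": "456", "price": 10.0}, ["store_id", "sku"])
first_status = receiver.handle(headers, body)
retry_status = receiver.handle(headers, body)  # sender retried after a timeout
```

Both deliveries are acknowledged with 200, but the record is processed only once.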
Is it legal to scrape the same page multiple times if a job fails?
Yes, but it consumes target bandwidth. This is why we implement idempotency at the URL queue level as well. If a worker crashes, the retry queue only re-fetches URLs that didn't successfully commit to the sink, minimizing unnecessary load on the target server.
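
Queue-level idempotency of this kind reduces to a set difference between the planned URLs and the URLs already committed. The function below is a simplified sketch with a hypothetical name and example URLs; it assumes the sink can report which URLs have committed.

```python
def urls_to_retry(all_urls, committed_urls):
    """Return the URLs that still need fetching, preserving crawl order."""
    committed = set(committed_urls)
    return [url for url in all_urls if url not in committed]

planned = [f"https://shop.example.com/catalog?page={n}" for n in range(1, 6)]
committed = planned[:3]  # pages 1-3 reached the sink before the crash

pending = urls_to_retry(planned, committed)
```

Only pages 4 and 5 are fetched again, so the target server sees no repeat traffic for pages that already landed.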
What happens if the target site updates a record between the first run and the retry?
The upsert handles it. If the first run scraped a price of $10, crashed, and the retry scrapes $12, the database sees the primary key conflict and updates the row to $12. The final state reflects the most recent truth, which is exactly what a data consumer wants.