
What is Incremental Scraping?

Incremental scraping is the practice of re-fetching only the URLs or records that have changed since the last run, rather than re-crawling an entire target on every cycle. It depends on a change-detection mechanism (hash comparison, Last-Modified headers, sitemap diffing, or feed polling) to identify what is stale. Without one, a 10-million-URL catalog gets re-fetched daily at full cost even when 95% of the data is unchanged.

Infrastructure · Data · Change Detection · Delta · Scheduling

// 02 — definitions

Only fetch
what changed.

The discipline of re-crawling only the subset of pages with fresh data — and the mechanisms that tell you which subset that is.


TL;DR

Incremental scraping uses change signals (ETags, content hashes, sitemap lastmod, feed deltas) to limit re-fetches to pages with new data. At scale it reduces compute and bandwidth by 80–95% versus full re-crawls. DataFlirt runs incremental diffing across all long-running pipelines as the default cadence model.

01 Definition & structure

Incremental scraping separates the crawl into two phases: a lightweight change detection pass across the full URL corpus, followed by a full fetch of only the URLs identified as changed. The architecture requires a persistent state store — the hash index — that records the last-seen content fingerprint for every URL in the corpus.

url corpus — full set of known URLs for the target, maintained in a persistent store
change detector — HEAD requests, ETag comparison, or content hash diff to identify stale URLs
fetch queue — only changed URLs enter the full scrape pipeline
hash index — updated after each successful fetch; the source of truth for what's current
delta output — downstream sink receives only inserts, updates, and deletes — not a full re-dump

02 How it works in practice

On each scheduled run, the pipeline issues a HEAD request to every URL in the corpus and compares the response ETag or Last-Modified header against the stored value. Where headers are unreliable, which is the case for most e-commerce sites, it fetches a lightweight, unrendered version of the page, normalizes the HTML (stripping dynamic noise like ad slots and CSRF tokens), and computes a SHA-256 hash. Only URLs whose hash has changed enter the full Playwright render queue. The result is a fetch queue that's typically 5–10% of the corpus, keeping render costs proportional to the actual content change rate rather than catalog size.
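The hash-based step above can be sketched roughly as follows. This is a minimal illustration, not DataFlirt's implementation: the noise patterns (`csrf_token`, `ad-slot`) and the function names are hypothetical, and real targets need per-site normalization rules.

```python
import hashlib
import re

# Hypothetical noise patterns (assumption); real sites need their own rules.
NOISE_PATTERNS = [
    re.compile(r'<input[^>]*name="csrf_token"[^>]*>', re.IGNORECASE),
    re.compile(r'<div[^>]*class="ad-slot"[^>]*>.*?</div>', re.IGNORECASE | re.DOTALL),
]


def normalize_html(html: str) -> str:
    """Strip known dynamic regions, then collapse runs of whitespace."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    return re.sub(r"\s+", " ", html).strip()


def content_hash(html: str) -> str:
    """SHA-256 hex digest of the normalized HTML."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()


def build_fetch_queue(pages: dict[str, str], index: dict[str, str]) -> list[str]:
    """Return URLs whose hash differs from the stored one, or that are new."""
    return [url for url, html in pages.items() if index.get(url) != content_hash(html)]
```

Two copies of a page that differ only in the CSRF token value or in whitespace hash to the same digest here, so they stay out of the queue; a changed price does not.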

03 The hash index: why it's the critical state

The hash index is the operational core of any incremental pipeline. It maps each URL to its last-seen content hash, the timestamp of the last fetch, and the last-seen structured field values. Without a reliable hash index, every run is effectively a full re-crawl. The index must be consistent (no partial writes), recoverable (snapshottable for disaster recovery), and fast enough to handle millions of lookups within the detection window. DynamoDB with on-demand capacity and a 90-day TTL is the pattern DataFlirt uses: it handles the read-heavy detection pass cheaply and writes only on actual content changes.
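To make the lookup-and-update contract concrete, here is a small sketch that uses SQLite from the Python standard library as a stand-in for the index backend. The table name, column names, and function names are assumptions for illustration, and the per-field values mentioned above are left out to keep it short.

```python
import sqlite3
import time
from typing import Optional


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hash index; SQLite is only a stand-in backend here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_index ("
        "url TEXT PRIMARY KEY, content_hash TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn


def has_changed(conn: sqlite3.Connection, url: str, new_hash: str) -> bool:
    """True if the URL is unknown or its stored hash differs from new_hash."""
    row = conn.execute(
        "SELECT content_hash FROM hash_index WHERE url = ?", (url,)
    ).fetchone()
    return row is None or row[0] != new_hash


def record_fetch(conn: sqlite3.Connection, url: str, new_hash: str,
                 fetched_at: Optional[float] = None) -> None:
    """Store the latest hash; the transaction avoids partial writes."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO hash_index (url, content_hash, fetched_at) "
            "VALUES (?, ?, ?)",
            (url, new_hash, time.time() if fetched_at is None else fetched_at),
        )
```

The index is written only after a successful fetch, so a failed run leaves the previous fingerprint in place and the URL is re-checked on the next pass.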

04 How DataFlirt handles it

We run incremental by default on every pipeline with a corpus above 10,000 URLs. Our hash index runs on DynamoDB with per-field granularity — we track changes at the field level, not just the page level, so a price update triggers a targeted re-fetch of that product's price field without re-rendering the entire page DOM. Clients get a structured delta stream: each message contains the changed fields, previous and current values, and the fetch timestamp. We publish a full snapshot weekly as a consistency backstop.
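A field-level delta message of the kind described above might be built like this. The message keys (`url`, `changed_fields`, `fetched_at`) are a hypothetical schema chosen for the example, not a documented DataFlirt format.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def field_delta(url: str, previous: dict[str, Any], current: dict[str, Any],
                fetched_at: Optional[datetime] = None) -> Optional[dict[str, Any]]:
    """Return a delta message for fields whose value changed, or None if none did."""
    changed = {
        name: {"previous": previous.get(name), "current": current.get(name)}
        for name in sorted(set(previous) | set(current))
        if previous.get(name) != current.get(name)
    }
    if not changed:
        return None  # nothing to emit for this URL
    ts = fetched_at or datetime.now(timezone.utc)
    return {"url": url, "changed_fields": changed, "fetched_at": ts.isoformat()}
```

With a previous record of `{"price": 100, "title": "Phone"}` and a current one of `{"price": 90, "title": "Phone"}`, only `price` appears in `changed_fields`.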

05 Common misconception: incremental scraping is just scheduling

Running your scraper every hour instead of every day is not incremental scraping — it's just a faster full re-crawl. True incremental scraping requires state: a persistent record of what you already have so you can compute a diff. Without the hash index, you're paying full fetch cost on every URL every time, just more frequently. The efficiency gains — typically 80–95% cost reduction — only materialise when the change detection pass is cheaper than the full fetch and the index is accurate enough to avoid missed changes.

// 03 — the efficiency model

How much work
you actually skip.

Incremental efficiency is the fraction of the corpus that does not change between runs, and therefore the share of full fetches you skip. DataFlirt's pipeline planner uses this to set crawl budgets and schedule cadences: low-churn catalogs run weekly full sweeps, high-churn price feeds run minute-level incremental checks.

Incremental efficiency = E = 1 − (N_changed / N_total)

A catalog with 5% daily churn has E = 0.95: 95% of fetches are skipped each day. (Source: Internal SLO)

Change detection cost = C_detect = N_total × t_head + N_changed × t_fetch

HEAD request overhead is the fixed cost; full fetches are paid only on changed pages. (Source: RFC 7232, Conditional Requests)

Staleness bound = S_max = T_cadence × (1 − P_detect)

The staleness bound is the cadence interval × the miss rate (1 − P_detect) of your change detector. (Source: DataFlirt freshness SLA model)
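The three formulas in this section can be evaluated directly. The sketch below is a plain transcription of them; the per-request timings (0.05 s per HEAD, 2.0 s per full fetch) and the one-million-URL corpus are assumed values picked only to make the arithmetic visible.

```python
def incremental_efficiency(n_changed: int, n_total: int) -> float:
    """E = 1 - (N_changed / N_total)."""
    return 1 - n_changed / n_total


def detection_cost(n_total: int, n_changed: int, t_head: float, t_fetch: float) -> float:
    """C_detect = N_total * t_head + N_changed * t_fetch (same time unit as inputs)."""
    return n_total * t_head + n_changed * t_fetch


def staleness_bound(t_cadence: float, p_detect: float) -> float:
    """S_max = T_cadence * (1 - P_detect)."""
    return t_cadence * (1 - p_detect)


# Assumed example values, not measurements.
n_total, n_changed = 1_000_000, 50_000   # 5% churn
t_head, t_fetch = 0.05, 2.0              # seconds per request

efficiency = incremental_efficiency(n_changed, n_total)              # about 0.95
incremental_s = detection_cost(n_total, n_changed, t_head, t_fetch)  # about 150,000 s
full_crawl_s = n_total * t_fetch                                     # 2,000,000 s
bound_h = staleness_bound(24, 0.97)                                  # about 0.72 h
```

Under these assumptions the incremental run costs roughly 7.5% of the request time of a full crawl, even though every URL still receives a HEAD request.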

// 04 — incremental run diff

240,000 URLs checked.
11,400 re-fetched.

A nightly incremental run across a large Indian e-commerce electronics catalog. Hash-based change detection against the previous snapshot determines the fetch queue.

240k URLs · ETag + hash diff · 4.75% churn


// run metadata
pipeline: "electronics-catalog-IN" cadence: "nightly"
corpus_size: 240,182 URLs

// change detection pass
method: "etag + content-hash"
head_requests_sent: 240,182 duration_s: 412
etag_mismatches: 8,940
hash_mismatches: 2,460 // ETag unchanged on 2,460 pages whose content changed
new_urls_discovered: 312 // via sitemap diff

// fetch queue
pages_to_fetch: 11,400 // 4.75% of corpus
skipped: 228,782 // 95.25% — no fetch needed

// delivery
records_upserted: 11,400 records_deleted: 28
pipeline.status: complete total_runtime_s: 1,847
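The figures in this run can be cross-checked with a few lines of arithmetic. The sketch assumes the fetch queue is the sum of the two mismatch counts, which matches the reported total; the variable names mirror the log fields.

```python
# Values copied from the run log above.
corpus_size = 240_182
etag_mismatches = 8_940
hash_mismatches = 2_460
detect_duration_s = 412

# Assumption: the fetch queue is the sum of both mismatch counts.
pages_to_fetch = etag_mismatches + hash_mismatches
skipped = corpus_size - pages_to_fetch
churn_pct = round(100 * pages_to_fetch / corpus_size, 2)
efficiency = 1 - pages_to_fetch / corpus_size

# Average HEAD throughput implied by the detection pass duration.
head_requests_per_s = corpus_size / detect_duration_s
```

This reproduces the 11,400 queued pages, the 228,782 skipped pages, and the 4.75% churn, and implies an average of roughly 583 HEAD requests per second during the detection pass.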

// 05 — change signals

How you know
what changed.

Incremental scraping is only as good as its change detector. These are the signals DataFlirt pipelines use, ranked by reliability across a corpus of 200+ active targets.

Avg corpus churn: 4–8% / day
Detection accuracy: 97.3% (30d)
Updated: 2026-05-19

01 Content hash diff: most reliable · compares SHA-256 of normalized HTML, so it doesn't trust server headers
02 ETag / Last-Modified: fast but imperfect · supported by ~60% of targets; ETags sometimes change without content change
03 Sitemap lastmod: catalog-level · XML sitemaps with lastmod tell you which URLs were updated by the publisher
04 Feed / API polling: near-real-time · RSS, Atom, and JSON feeds give sub-minute change signals on supported targets
05 Structured data delta: field-level · compare JSON-LD or microdata values directly; catches only content changes, not markup or layout changes
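As an example of the sitemap signal in this list, the sketch below parses two sitemap snapshots with the standard library and diffs their `lastmod` values. The function names are illustrative, and `lastmod` values are compared as plain strings rather than parsed dates, which is an assumption that holds only if the publisher keeps a consistent format.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> dict[str, str]:
    """Map each <loc> to its <lastmod> (empty string if lastmod is missing)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:
            entries[loc] = lastmod
    return entries


def sitemap_diff(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify URLs as new, updated (lastmod differs), or removed."""
    return {
        "new": sorted(u for u in current if u not in previous),
        "updated": sorted(u for u in current if u in previous and current[u] != previous[u]),
        "removed": sorted(u for u in previous if u not in current),
    }
```

URLs in the `new` and `updated` lists would feed the fetch queue, while `removed` URLs are candidates for delete records after confirmation.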

// 06 — our approach

Delta pipelines,

not daily re-dumps — freshness without the compute bill.

Every DataFlirt long-running pipeline runs incremental by default. We maintain a hash index of the last-seen value per field per URL and emit only diffs to the delivery sink. Clients receive a clean changelog — inserts, updates, and deletes — rather than a full re-dump they have to diff themselves. The hash index is the core IP; rebuilding it from scratch after a gap is the most expensive thing a scraping pipeline can do.

incremental-pipeline.config.json

Change-detection configuration for a high-volume catalog pipeline.

detection.method: content-hash + etag · fallback: sitemap
hash.algorithm: SHA-256 · normalized HTML
index.backend: DynamoDB · on-demand · TTL: 90d
churn.p50: 4.8% / day · within SLA
delivery.format: delta (insert/update/delete) · full snapshot: weekly
staleness.max: < 6 h for price fields
pipeline.status: active · 99.7% uptime


// 07 — FAQ

Common
questions.

About incremental scraping, change detection reliability, staleness guarantees, and how DataFlirt manages delta delivery at scale.


What happens if the change detector misses a real change?

A detection miss means you deliver stale data. Our defense is a layered detector: ETag first (fast), content hash second (authoritative), sitemap lastmod third (catalog-level). The combination gets to 97%+ detection accuracy across most targets. For high-stakes fields like pricing, we add a scheduled full sweep every 24 hours as a backstop.
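One way to read the layered order described in this answer is as a decision function over whichever signals are available. This is an interpretation for illustration only: the `Signals` type, the reason labels, and the exact precedence rules are assumptions, not the production logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Signals:
    """Change signals for one URL; any of them may be unavailable (None)."""
    etag: Optional[str] = None
    content_hash: Optional[str] = None
    sitemap_lastmod: Optional[str] = None


def detect_change(stored: Signals, observed: Signals) -> Tuple[bool, str]:
    """Return (changed, deciding_signal), checking the cheapest signal first."""
    # Layer 1: an ETag mismatch is cheap to observe; a false positive costs one fetch.
    if stored.etag and observed.etag and stored.etag != observed.etag:
        return True, "etag"
    # Layer 2: the content hash, when computed, is treated as authoritative.
    if observed.content_hash is not None:
        return observed.content_hash != stored.content_hash, "content-hash"
    # Layer 3: fall back to the publisher's sitemap lastmod.
    if (stored.sitemap_lastmod and observed.sitemap_lastmod
            and stored.sitemap_lastmod != observed.sitemap_lastmod):
        return True, "sitemap-lastmod"
    return False, "none"
```

A matching ETag does not end the check in this sketch: if a content hash is available it still decides, which covers pages whose ETag stays the same while the content changes.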

Can you do incremental scraping on a site that doesn't support ETags?

Yes — ETag is a convenience, not a requirement. Content hashing works on any site regardless of what HTTP headers it sends. We normalize the HTML before hashing (stripping timestamps, CSRF tokens, ad slots) to avoid false positives from dynamic page regions that change on every load.

How do you handle newly added URLs — won't incremental scraping miss them?

URL discovery runs as a separate process: sitemap diffs, category page crawls, and internal link traversal on each full sweep. New URLs enter the hash index with no prior record, triggering a full fetch. The incremental savings come from the 90%+ of the corpus that's stable between runs.

What does the delivery output look like for an incremental pipeline?

A delta feed: each record is tagged as INSERT, UPDATE, or DELETE with the changed fields, the previous value, and a timestamp. Clients can apply it as an upsert against their own store rather than re-ingesting the full dataset. For teams who prefer full snapshots, we also publish a weekly reconstituted full export.
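On the consuming side, applying such a feed can be as simple as the sketch below. The record layout (`op`, `key`, `fields`) is a hypothetical shape chosen for the example, and a plain dictionary stands in for the client's store.

```python
from typing import Any


def apply_delta(store: dict[str, dict[str, Any]],
                records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Apply INSERT / UPDATE / DELETE records to an in-memory store, in order."""
    for rec in records:
        op, key = rec["op"], rec["key"]
        if op == "DELETE":
            store.pop(key, None)  # deleting a missing key is a no-op
        elif op in ("INSERT", "UPDATE"):
            # Upsert: create the record if needed, then overwrite changed fields only.
            store.setdefault(key, {}).update(rec["fields"])
        else:
            raise ValueError(f"unknown op: {op}")
    return store
```

Because INSERT and UPDATE are both handled as upserts, replaying the same batch twice leaves the store in the same state, which makes retries after a failed delivery safe in this sketch.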

How do you decide whether to run a full re-crawl versus incremental?

We use the corpus churn rate. If a target is above 40% daily churn, incremental overhead (HEAD requests + hash computation) starts to approach the cost of a full fetch, and a full re-crawl is simpler to operate. Below 20% churn, incremental always wins. Between 20% and 40%, it depends on page render time.
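The thresholds in this answer can be expressed as a small decision rule. The 20% and 40% cut-offs come from the text; the per-URL cost comparison used for the middle band, and the parameter names `t_detect` and `t_fetch`, are assumptions about how "depends on page render time" might be made concrete.

```python
def choose_mode(churn: float, t_detect: float, t_fetch: float) -> str:
    """Pick "full" or "incremental" for a target.

    churn    -- fraction of the corpus that changes per day (0.0 to 1.0)
    t_detect -- assumed per-URL cost of the detection pass (HEAD or light fetch + hash)
    t_fetch  -- assumed per-URL cost of a full fetch, including rendering
    """
    if churn > 0.40:
        return "full"
    if churn < 0.20:
        return "incremental"
    # Middle band: incremental pays detection on every URL plus a full
    # fetch on the changed share; a full crawl pays t_fetch on every URL.
    incremental_per_url = t_detect + churn * t_fetch
    return "incremental" if incremental_per_url < t_fetch else "full"
```

At 30% churn with a 0.5 s detection cost, a page that takes 3 s to render favours incremental (1.4 s versus 3 s per URL), while a page that fetches in 0.6 s favours a full crawl (0.68 s versus 0.6 s).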

What's the staleness guarantee on an incremental pipeline?

It depends on the cadence and the detection miss rate. For a nightly run with a 97% detection rate, worst-case staleness is 24 hours on the 3% of missed changes. For price-sensitive feeds we run shorter cadences — hourly or 15-minute incremental checks — with staleness guarantees written into the delivery SLA.


hello@dataflirt.com · Bengaluru · IST · typical reply < 4h