What is Backfilling?

Backfilling is the process of retroactively processing or re-extracting historical data to populate a dataset. In scraping pipelines, this usually happens when a new field is added to the schema, a pipeline suffers an outage, or a client requests historical context. Because fetching from the live web is rate-limited and historical states disappear, production-grade backfills rely on parsing archived raw payloads rather than re-issuing millions of HTTP requests.

Data Engineering · ETL · Historical Data · Idempotency · Schema Evolution
// 02 — definitions

Filling the gaps.

Why pipelines inevitably need to look backward, and the mechanics of retroactively populating datasets without breaking current production flows.


TL;DR

Backfilling lets data teams recover missing records or extract newly defined fields from historical data. It works best when fetching is separated from extracting: with a raw data zone (such as an S3 bucket of HTML payloads), engineers can run large backfill jobs as fast as their compute and storage reads allow, without triggering target-site rate limits.

01 Definition & triggers

Backfilling is the execution of data pipeline logic against historical data. In web scraping, it is almost always triggered by one of three events:

  • Schema expansion: A business user requests a new field (e.g., "shipping weight") that was present on the page but previously ignored by the extractor.
  • Outage recovery: A bug in the extraction logic caused a week of data to be written as null, and the records need to be re-processed.
  • New client onboarding: A new customer buys a data feed but requires two years of historical context to train their models.

02 The fetch vs. extract divide

A naive scraping script fetches a URL, parses the HTML, saves the JSON, and throws the HTML away. If you need to backfill, you have to re-fetch the URL. This is disastrous at scale: you will hit rate limits, incur massive proxy costs, and likely find that the historical data is no longer on the live site.

Production pipelines decouple fetching from extraction. The fetcher writes raw HTML to a cheap object store (the raw data zone). The extractor reads from that store. Backfilling simply means pointing the extractor at older folders in the object store.
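
The split described above can be sketched in a few lines. This is a minimal illustration, not DataFlirt's implementation: a local directory stands in for the object store, and the function names, the folder layout, and the `data-weight-kg` attribute are all assumptions made for the example.

```python
import gzip
import hashlib
import re
from datetime import date
from pathlib import Path


def partition_dir(root: Path, target: str, day: date) -> Path:
    """Folder holding every payload fetched for `target` on `day` (assumed layout)."""
    return root / target / day.strftime("%Y/%m/%d")


def archive_payload(root: Path, target: str, day: date, url: str, html: str) -> Path:
    """Fetch side: persist the raw HTML before any parsing happens."""
    folder = partition_dir(root, target, day)
    folder.mkdir(parents=True, exist_ok=True)
    # Hashing the URL gives a stable file name, so archiving the same URL
    # twice on one day overwrites the file instead of duplicating it.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html.gz"
    path = folder / name
    path.write_bytes(gzip.compress(html.encode("utf-8")))
    return path


def extract_partition(root: Path, target: str, day: date, extractor) -> list:
    """Extract side: parse archived payloads only; no HTTP requests are made."""
    folder = partition_dir(root, target, day)
    records = []
    for path in sorted(folder.glob("*.html.gz")):
        html = gzip.decompress(path.read_bytes()).decode("utf-8")
        records.append(extractor(html))
    return records


def extract_v1(html: str) -> dict:
    """Original (hypothetical) extractor: title only."""
    title = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": title.group(1) if title else None}


def extract_v2(html: str) -> dict:
    """Updated (hypothetical) extractor: also pulls the previously ignored weight."""
    record = extract_v1(html)
    weight = re.search(r'data-weight-kg="([\d.]+)"', html)
    record["shipping_weight_kg"] = float(weight.group(1)) if weight else None
    return record
```

Backfilling a new field then amounts to calling `extract_partition` again over older partitions with `extract_v2` in place of `extract_v1`; the fetcher is never involved.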

03 The idempotency requirement

Backfills operate on massive datasets and take hours or days to run. They will occasionally fail due to node crashes or network blips. If your pipeline uses simple INSERT statements, restarting a failed backfill will result in duplicate records.

Backfill pipelines must be idempotent. They should write with UPSERT (INSERT ... ON CONFLICT) or MERGE statements keyed on a unique identifier, such as a product ID plus the capture timestamp of the archived payload (not the time the backfill ran). This ensures that running a backfill over the same date range multiple times produces exactly the same final database state.
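
As a small illustration of the idempotent write, the sketch below uses Python's built-in `sqlite3` module, whose SQL dialect supports `INSERT ... ON CONFLICT ... DO UPDATE`. The table name, columns, and composite key are assumptions chosen for the example; a warehouse would typically use its own MERGE syntax instead.

```python
import sqlite3

# Upsert keyed on (product_id, captured_at): replaying a row updates it in place.
UPSERT_SQL = """
INSERT INTO catalog_history (product_id, captured_at, price, shipping_weight)
VALUES (:product_id, :captured_at, :price, :shipping_weight)
ON CONFLICT (product_id, captured_at) DO UPDATE SET
    price = excluded.price,
    shipping_weight = excluded.shipping_weight
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with a composite primary key (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE catalog_history (
            product_id      TEXT NOT NULL,
            captured_at     TEXT NOT NULL,
            price           REAL,
            shipping_weight REAL,
            PRIMARY KEY (product_id, captured_at)
        )
        """
    )
    return conn


def write_batch(conn: sqlite3.Connection, rows: list) -> None:
    """Write one batch in a single transaction; safe to replay after a crash."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(UPSERT_SQL, rows)


if __name__ == "__main__":
    conn = open_db()
    batch = [
        {"product_id": "p1", "captured_at": "2026-04-01", "price": 9.99, "shipping_weight": 1.5},
        {"product_id": "p2", "captured_at": "2026-04-01", "price": 4.50, "shipping_weight": None},
    ]
    write_batch(conn, batch)
    write_batch(conn, batch)  # simulated restart: same batch written again
    print(conn.execute("SELECT COUNT(*) FROM catalog_history").fetchone()[0])
```

With a plain `INSERT`, the second `write_batch` call would either fail on the primary key or, without a key, double the row count.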

04 How DataFlirt handles it

We treat raw HTML as an immutable ledger. Every successful fetch across our fleet is compressed and written to S3. When a client requests a backfill, we spin up a dedicated Apache Spark or Ray cluster that reads directly from S3, applies the updated extraction schema, and writes the structured output to a staging table.

Because this process never touches the public internet, it runs at the speed of cloud networking—often processing millions of historical pages in minutes without spending a single cent on proxy bandwidth.

05 The schema drift problem

The hardest part of backfilling isn't infrastructure; it's time travel. If you run today's CSS selectors against HTML from two years ago, they will likely fail because the target website redesigned its layout.

To successfully backfill deep history, you must maintain a versioned registry of extractors. The backfill orchestrator must inspect the timestamp of the archived payload and route it to the specific version of the extraction logic that was valid on that date.
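
One way to picture the routing step is a registry sorted by the date each extractor became valid, searched with `bisect`. The two extractors, their selectors, and the cut-over dates below are hypothetical; the point is only the lookup by payload timestamp.

```python
import bisect
import re
from datetime import date


def extract_layout_a(html: str) -> dict:
    """Hypothetical older layout: price inside <span class="price">."""
    match = re.search(r'<span class="price">([\d.]+)</span>', html)
    return {"price": float(match.group(1)) if match else None}


def extract_layout_b(html: str) -> dict:
    """Hypothetical redesigned layout: price in a data-price attribute."""
    match = re.search(r'data-price="([\d.]+)"', html)
    return {"price": float(match.group(1)) if match else None}


# (valid_from, extractor) pairs, kept sorted by valid_from. Each extractor is
# assumed to stay valid until the next entry's start date.
REGISTRY = [
    (date(2023, 1, 1), extract_layout_a),
    (date(2025, 3, 1), extract_layout_b),
]
_VALID_FROM = [valid_from for valid_from, _ in REGISTRY]


def extractor_for(captured_on: date):
    """Return the extractor that was valid on the payload's capture date."""
    index = bisect.bisect_right(_VALID_FROM, captured_on) - 1
    if index < 0:
        raise LookupError(f"no extractor registered for {captured_on}")
    return REGISTRY[index][1]


def backfill_extract(captured_on: date, html: str) -> dict:
    """Route one archived payload to the matching extractor version."""
    return extractor_for(captured_on)(html)
```

Payloads older than the first registered version raise an error rather than silently returning nulls, which keeps gaps in the registry visible.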

// 03 — the math

Calculating backfill cost.

Backfilling isn't free. It requires compute, storage I/O, and sometimes network egress. DataFlirt models backfill operations to ensure they don't starve live extraction jobs or blow up cloud budgets.

Backfill duration: T = records / (workers × throughput)
Throughput is records per worker per unit time. Time to completion falls in inverse proportion to the number of workers, until storage read limits become the bottleneck. (Source: standard ETL sizing)

Storage I/O cost: C = (archive_size_gb × $0.02) + compute
Reading from cold storage (e.g., S3 Standard-IA) incurs per-GB retrieval fees; $0.02 is the per-GB rate assumed in this model, and actual fees vary by storage class and region. (Source: AWS pricing models)

Completeness yield: Y = successful_extracts / total_archived_payloads
Yield drops on older data due to historical schema drift. (Source: DataFlirt pipeline SLOs)
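
The three formulas above translate directly into code. The sketch below is only a calculator for those expressions; the sample inputs in the demo (worker count, per-worker throughput, archive size, compute cost) are made-up figures for illustration, not DataFlirt measurements.

```python
def backfill_duration(records: int, workers: int, throughput: float) -> float:
    """T = records / (workers * throughput); throughput is records per worker per second."""
    if workers <= 0 or throughput <= 0:
        raise ValueError("workers and throughput must be positive")
    return records / (workers * throughput)


def storage_io_cost(archive_size_gb: float, compute_cost: float,
                    rate_per_gb: float = 0.02) -> float:
    """C = (archive_size_gb * rate_per_gb) + compute, in dollars."""
    return archive_size_gb * rate_per_gb + compute_cost


def completeness_yield(successful_extracts: int, total_archived_payloads: int) -> float:
    """Y = successful_extracts / total_archived_payloads."""
    if total_archived_payloads <= 0:
        raise ValueError("total_archived_payloads must be positive")
    return successful_extracts / total_archived_payloads


if __name__ == "__main__":
    # Assumed inputs: 1M records, 100 workers, 50 records/s per worker.
    seconds = backfill_duration(1_000_000, workers=100, throughput=50.0)
    # Assumed inputs: 1,000 GB archive, $15 of compute.
    dollars = storage_io_cost(1_000, compute_cost=15.0)
    ratio = completeness_yield(97, 100)
    print(f"duration: {seconds:.0f} s, cost: ${dollars:.2f}, yield: {ratio:.1%}")
```

Doubling `workers` halves the estimated duration in this model, which only holds until storage reads saturate.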
// 04 — backfill execution trace

Re-extracting 30 days of history.

A trace of a backfill job triggered by a schema update. The pipeline re-processes archived raw HTML to extract a newly requested 'manufacturer_part_number' field without hitting the target site.

S3 raw archive · schema v8 · batch processing
// job initialization
job.id: "bf-mfg-parts-202605"
source.archive: "s3://df-raw-zone/mfg-catalog/2026-04/"
target.schema: "v8.1" // added manufacturer_part_number

// execution phase
workers.allocated: 120
payloads.discovered: 4,192,000
read.throughput: "1.2 GB/s"

// extraction & validation
records.processed: 4,192,000
field.extracted: 4,081,500 // 97.4% yield
schema.drift_errors: 110,500 // old DOM structure in week 1

// sink
write.mode: "upsert" // idempotent merge
destination: "snowflake://mfg_db/catalog_history"
status: COMPLETED
duration: "14m 22s"
// 05 — failure modes

Where backfills break down.

Ranked by frequency across DataFlirt's historical data operations. The biggest risk isn't compute capacity—it's state corruption and schema drift in the archived data.

Backfills run: 1,200+ monthly
Average yield: 96.4%
Updated: 2026-05-19

01 Historical schema drift: current selectors fail on 6-month-old HTML.
02 Non-idempotent writes: job restarts cause duplicate rows in the warehouse.
03 Missing raw archives: gaps in the raw data zone prevent recovery.
04 Compute starvation: massive backfills steal resources from live jobs.
05 Type coercion failures: old data formats violate new strict schema types.
// 06 — our architecture

Archive everything, re-extract anything.

DataFlirt strictly separates the fetch layer from the extraction layer. We store raw HTML and JSON payloads in a cold S3 archive before any parsing happens. When a client requests a new field or needs to recover from a downstream pipeline failure, we don't re-scrape the target site. We spin up a distributed backfill job against the archive. This guarantees zero additional rate-limit risk, prevents IP bans, and allows us to deliver years of historical data in hours.

Backfill job status

Live metrics from a historical extraction run on a real estate dataset.

job.id: bf-re-hist-099
source.archive: s3://df-raw/re-listings/
records.processed: 18.4M (100%)
drift.errors: 12,400 (quarantined)
idempotency.check: merge on listing_id (safe)
output.sink: delta_lake_prod
status: completed

// 07 — FAQ

Common questions.

Common questions about historical data, idempotency, archive storage, and how DataFlirt executes massive backfills.

What is the difference between backfilling and historical scraping?
Backfilling usually implies processing existing raw data that you already collected and stored in an archive. Historical scraping means trying to find old data on the live web (e.g., navigating to page 500 of a blog). Historical scraping is often impossible because targets delete or overwrite old state; backfilling is deterministic because you own the archive.
How do you handle schema changes in a 3-year backfill?
With versioned selectors. A target site's DOM in 2023 is different from its DOM in 2026. If you run a 2026 extractor against 2023 HTML, it will return nulls. You must maintain a registry of historical selectors mapped to date ranges, so the backfill job applies the correct extraction logic to the corresponding payload.
Does backfilling impact live pipeline performance?
It shouldn't, provided your architecture decouples compute and storage. At DataFlirt, backfill jobs run on dedicated worker node pools that scale independently from the live scraping fleet. The only shared resource is the destination data warehouse, which we protect using rate-limited bulk inserts or staging tables.
How does DataFlirt store raw data for backfills?
We store raw payloads (HTML, JSON, XML) in S3, compressed as Zstandard (zstd) or Gzip, and partitioned by target, year, month, and day. This allows backfill jobs to efficiently read only the specific time slices required, minimizing S3 GET requests and data transfer costs.
What is idempotency and why does it matter here?
Idempotency means an operation produces the same result whether it's run once or a thousand times. If a backfill job fails halfway through and you restart it, a non-idempotent pipeline will insert duplicate rows. We use UPSERTs (merge on primary key) to ensure backfills safely overwrite existing records without duplicating them.
Can I backfill data if I wasn't saving the raw HTML?
No. You can only extract what you saved. If your pipeline only saves the parsed JSON and discards the raw HTML, any unparsed fields are gone forever. This is why saving raw payloads to a cheap cold-storage tier is the cheapest insurance policy a data engineering team can buy.
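
The storage answer above describes partitioning by target, year, month, and day. As a rough sketch of why that helps, the helper below lists only the prefixes a backfill would need for a given date range. The exact key layout (`target/YYYY/MM/DD/`), the bucket name, and the function name are assumptions for illustration, not DataFlirt's actual scheme.

```python
from datetime import date, timedelta


def partition_prefixes(bucket: str, target: str, start: date, end: date) -> list:
    """List one prefix per day in [start, end], inclusive (assumed key layout)."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(f"s3://{bucket}/{target}/{day.strftime('%Y/%m/%d')}/")
        day += timedelta(days=1)
    return prefixes


if __name__ == "__main__":
    # Hypothetical bucket and target names.
    for prefix in partition_prefixes("example-raw-zone", "shop",
                                     date(2026, 2, 27), date(2026, 3, 1)):
        print(prefix)
```

A job that reads only these prefixes never has to list or open objects outside the requested time slice.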