
What is Data Reconciliation?

Data reconciliation is the automated process of verifying that the dataset produced by a pipeline exactly matches the source of truth. In web scraping, it bridges the gap between "the scraper ran successfully" and "the data is actually complete and correct." Without it, schema drift and pagination failures manifest as silent data loss, corrupting downstream analytics before anyone notices.

Data Engineering · Data Quality · Validation · ETL · Completeness
// 02 — definitions

Trust, but verify.

The mathematical and operational checks that ensure your extracted records haven't drifted from the target's reality.


TL;DR

Data reconciliation compares output datasets against expected baselines using row counts, aggregate checksums, and spot-check sampling. It's the final gate in a production scraping pipeline. If a target site claims 50,000 products in a category but the pipeline only delivers 42,000, reconciliation blocks the delivery and flags the run for engineering review.

01 Definition & structure

Data reconciliation is the phase of a data pipeline where the output is mathematically compared against an expected baseline. In web scraping, it proves that the extraction process didn't silently drop records or mangle values.

A robust reconciliation layer checks three things:

  • Completeness: Did we extract the expected number of rows?
  • Accuracy: Do the extracted values match reality (via spot checks)?
  • Consistency: Are the aggregate metrics (like average price or category distribution) stable compared to previous runs?

02 How it works in practice

Reconciliation happens after extraction but before delivery. The raw data is loaded into a staging environment where automated tests run. These tests compare the new dataset against historical metadata (e.g., "yesterday we had 10k rows, today we have 10.1k") and against deterministic targets scraped from the site itself (e.g., extracting the "Total Results: 5,432" badge and asserting that the final row count equals 5,432).
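The deterministic-target check described above can be sketched roughly as follows. This is a minimal illustration, not DataFlirt's implementation: the function names `parse_total_badge` and `reconcile_row_count` are hypothetical, and the badge format is assumed to match the "Total Results: 5,432" example.

```python
import re


def parse_total_badge(text: str) -> int:
    """Pull the integer out of a badge such as 'Total Results: 5,432'.

    The badge wording is an assumption taken from the example in the text;
    a real site will need its own pattern.
    """
    match = re.search(r"Total Results:\s*(\d[\d,]*)", text)
    if match is None:
        raise ValueError(f"no total found in badge text: {text!r}")
    return int(match.group(1).replace(",", ""))


def reconcile_row_count(extracted_rows: int, badge_text: str) -> bool:
    """Return True only when the extracted row count equals the on-page total."""
    return extracted_rows == parse_total_badge(badge_text)


print(reconcile_row_count(5432, "Total Results: 5,432"))  # True
print(reconcile_row_count(5100, "Total Results: 5,432"))  # False
```

Raising on a missing badge, rather than returning a default, keeps a layout change on the badge itself from silently disabling the check.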

03 Macro vs Micro reconciliation

Macro reconciliation looks at the forest: total row counts, file sizes, and aggregate sums. It catches catastrophic failures like a broken pagination loop that stops after page 2. Micro reconciliation looks at the trees: row-by-row comparisons, null-rate checks on specific columns, and statistical distribution of values. It catches subtle failures, like a CSS selector change that causes all product descriptions to be extracted as empty strings.
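To make the distinction concrete, here is a small sketch with one macro check and one micro check. The helper names, the 2% tolerance, and the sample rows are all illustrative assumptions. It shows a dataset that passes the macro row count while every description is empty, which is the kind of failure only a micro check catches.

```python
def macro_row_count_ok(rows, expected_rows, tolerance=0.02):
    """Macro check: is the total row count within tolerance of the expectation?"""
    return abs(len(rows) - expected_rows) <= tolerance * expected_rows


def micro_empty_rate(rows, column):
    """Micro check: fraction of rows where `column` is missing, None, or blank."""
    if not rows:
        return 1.0

    def is_empty(value):
        return value is None or (isinstance(value, str) and not value.strip())

    return sum(is_empty(row.get(column)) for row in rows) / len(rows)


# Hypothetical rows after a CSS selector change blanked every description.
rows = [
    {"sku": "A1", "description": ""},
    {"sku": "A2", "description": "   "},
    {"sku": "A3", "description": None},
    {"sku": "A4"},
]

print(macro_row_count_ok(rows, expected_rows=4))  # True: the forest looks fine
print(micro_empty_rate(rows, "description"))      # 1.0: every tree is bare
```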

04 How DataFlirt handles it

We treat reconciliation as a hard gate. Every pipeline run is evaluated against a strict set of statistical bounds. If a dataset fails reconciliation, it is automatically quarantined. It never reaches your S3 bucket or Snowflake instance. Our engineering team is alerted, we diagnose the root cause (usually a site layout change), patch the scraper, and re-run the job. You only consume verified data.
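A hard gate of this kind can be sketched in a few lines. The `CheckResult` type, the `gate` function, and the status strings are hypothetical names chosen for illustration; the source does not describe the actual engine's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool


def gate(results):
    """Decide whether a run may be delivered.

    Returns ("DELIVER", []) only if at least one check ran and all passed;
    otherwise ("QUARANTINE", [names of the failing checks]).
    """
    if not results:
        # Assumption: a run with no checks is treated as unverified, not as clean.
        return ("QUARANTINE", ["no_checks_ran"])
    failed = [r.name for r in results if not r.passed]
    return ("QUARANTINE", failed) if failed else ("DELIVER", [])


decision = gate([
    CheckResult("row_count", False),
    CheckResult("price_null_rate", True),
])
print(decision)  # ('QUARANTINE', ['row_count'])
```

Treating an empty check list as a failure is a design choice: a gate that passes when nothing was tested is not a gate.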

05 The silent failure problem

The most dangerous scraping failures don't throw HTTP 500s or crash the script. They return HTTP 200 OK and perfectly valid JSON — it's just missing 40% of the data because a "Load More" button changed its class name. Without a reconciliation layer, this truncated dataset overwrites your production tables, and your business intelligence dashboards start reporting a massive, fictitious drop in competitor inventory.

// 03 — the math

How do you measure data fidelity?

Reconciliation relies on statistical bounds and historical baselines. DataFlirt uses these formulas to automatically quarantine anomalous pipeline runs before they reach client S3 buckets.

Completeness Ratio: C = records_extracted / expected_baseline
Triggers an alert if C < 0.99 or C > 1.05. The baseline is derived from historical runs or on-page metadata. (Source: DataFlirt pipeline SLO)

Aggregate Drift: Δ = |μ_current − μ_historical| / σ_historical
Z-score for numeric fields (e.g., average price) to catch currency parsing errors or massive catalog shifts. (Source: statistical process control)

Spot-Check Accuracy: A = 1 − (failed_assertions / sampled_records)
Minimum 99.9% required for financial datasets. Evaluates specific fields against known-good values. (Source: data quality standards)
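The three formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the alert bounds are the 0.99 and 1.05 figures quoted above, and no drift threshold is applied because the text does not state one.

```python
def completeness_ratio(records_extracted, expected_baseline):
    """C = records_extracted / expected_baseline."""
    if expected_baseline <= 0:
        raise ValueError("expected_baseline must be positive")
    return records_extracted / expected_baseline


def completeness_alert(c, low=0.99, high=1.05):
    """True when C falls outside the [low, high] band."""
    return c < low or c > high


def aggregate_drift(mu_current, mu_historical, sigma_historical):
    """Delta = |mu_current - mu_historical| / sigma_historical."""
    if sigma_historical <= 0:
        raise ValueError("sigma_historical must be positive")
    return abs(mu_current - mu_historical) / sigma_historical


def spot_check_accuracy(failed_assertions, sampled_records):
    """A = 1 - failed_assertions / sampled_records."""
    if sampled_records <= 0:
        raise ValueError("sampled_records must be positive")
    return 1 - failed_assertions / sampled_records


# The 50,000 vs 42,000 example from the TL;DR.
c = completeness_ratio(42_000, 50_000)
print(round(c, 2), completeness_alert(c))  # 0.84 True
```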
// 04 — pipeline execution

A failed reconciliation job in real time.

A daily e-commerce pricing run completes, but the reconciliation layer catches a pagination failure before the corrupted dataset is delivered.

dbt test · schema validation · quarantine
// phase 1: extraction complete
job.status: 200 OK
records.extracted: 142,850

// phase 2: macro reconciliation
baseline.expected: 155,000 ± 2%
check.row_count: FAIL (deviation: -7.8%)

// phase 3: micro reconciliation (spot checks)
check.price_null_rate: PASS (0.01%)
check.category_distribution: FAIL
↳ category 'Electronics' missing 90% of expected volume

// phase 4: action
delivery.s3: ABORTED
dataset.status: QUARANTINED
alert.pagerduty: "Pagination selector drift suspected on /category/electronics"
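
The two failing checks in this log can be reproduced with a little arithmetic. In the sketch below, the totals (142,850 extracted against a 155,000 baseline) come from the log; the per-category counts are hypothetical values chosen only so that they sum to those totals, and the function names and the 50% drop threshold are assumptions.

```python
def deviation(extracted, expected):
    """Signed relative deviation of the extracted count from the baseline."""
    return (extracted - expected) / expected


def category_shortfalls(current, baseline, max_drop=0.5):
    """Return {category: fraction_missing} for categories that lost more than
    `max_drop` of their baseline volume."""
    flagged = {}
    for category, expected in baseline.items():
        if expected <= 0:
            continue
        missing = 1 - current.get(category, 0) / expected
        if missing > max_drop:
            flagged[category] = missing
    return flagged


dev = deviation(142_850, 155_000)
print(f"{dev:+.1%}", "PASS" if abs(dev) <= 0.02 else "FAIL")  # -7.8% FAIL

# Hypothetical category split consistent with the totals in the log.
baseline = {"Electronics": 13_500, "Everything else": 141_500}
current = {"Electronics": 1_350, "Everything else": 141_500}
print(category_shortfalls(current, baseline))
```

Under these assumed counts, the whole 12,150-record gap sits in one category, which is why the category distribution check points at a single broken listing rather than a site-wide problem.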
// 05 — failure modes

Why datasets fail reconciliation.

The most common root causes for data mismatch in web scraping pipelines, ranked by frequency across DataFlirt's managed infrastructure.

PIPELINES MONITORED · 300+ active
WINDOW · 30d trailing
UPDATED · 2026-05-19
01 Pagination logic breaks · Silent truncation of lists due to UI changes
02 Conditional DOM rendering · Missing fields on specific product variants
03 Anti-bot soft blocks · Target returns 200 OK with empty or fake data
04 Type coercion errors · Prices parsed as null due to new currency symbols
05 Target-side data deletion · The site actually removed 10% of its catalog
// 06 — our architecture

Never deliver bad data, even when the target site changes.

DataFlirt's reconciliation engine sits between the extraction workers and the client delivery sink. Every dataset is evaluated against historical baselines, schema contracts, and deterministic target metrics (like category count badges). If a run fails reconciliation, it is quarantined. We investigate, patch the scraper, and backfill the data. You only receive data that you can mathematically trust.

Reconciliation Gate

Live validation metrics for a B2B pricing feed.

pipeline.id: feed-b2b-pricing-09
records.total: 84,211
baseline.deviation: +0.4%
schema.compliance: 100%
anomaly.price_var: 0.02σ
gate.status: PASSED
delivery.target: s3://client-bucket/prod/


// 07 — FAQ

Common questions.

About data validation, anomaly detection, handling false positives, and how DataFlirt ensures dataset fidelity at scale.

What is the difference between data validation and data reconciliation?
Validation checks if the data is formatted correctly (e.g., "is the price a number?"). Reconciliation checks if it is the right data compared to the source (e.g., "did we get all 500 products, and is the average price consistent with yesterday?"). A dataset can be perfectly valid but completely unreconciled if the scraper missed half the pages.
How do you reconcile when the target site's total size is unknown?
We use historical baselines and rolling averages. If a pipeline extracted 10,000 records yesterday, 10,100 today is expected variance. If it extracts 4,000 today, that's a massive red flag. We also extract on-page metadata — like "Showing 1-20 of 4,592 results" — and use that number as a deterministic reconciliation target for the run.
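A rolling baseline of this kind might look like the sketch below. The function names and the 7-run window are assumptions; the acceptance band reuses the 0.99 and 1.05 completeness bounds quoted earlier on this page.

```python
from statistics import mean


def rolling_baseline(history, window=7):
    """Mean record count over the most recent `window` runs."""
    recent = history[-window:]
    if not recent:
        raise ValueError("no previous runs to build a baseline from")
    return mean(recent)


def is_expected_variance(today, baseline, low=0.99, high=1.05):
    """True when today's count sits inside the accepted band around the baseline."""
    return low <= today / baseline <= high


history = [9_950, 10_020, 10_000]      # hypothetical previous runs
baseline = rolling_baseline(history)   # 9990

print(is_expected_variance(10_100, baseline))  # True
print(is_expected_variance(4_000, baseline))   # False
```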
What is macro vs micro reconciliation?
Macro reconciliation looks at the dataset as a whole: total row counts, file sizes, and aggregate sums. Micro reconciliation looks at row-level fidelity: spot-checking specific fields, verifying checksums, and ensuring referential integrity between related tables.
How does DataFlirt handle false positives in reconciliation?
If a target legitimately deletes half of its catalog, our engine still flags the run and quarantines the data. An on-call engineer manually verifies the site change, updates the historical baseline to reflect the new reality, and releases the dataset from quarantine. We prefer a delayed delivery to a corrupted one.
Can reconciliation detect anti-bot honeypots?
Yes. If a scraper hits a tarpit that returns fake data, the schema validation might pass, but value distributions will fail statistical reconciliation. For example, if a honeypot returns $9.99 for every product, the standard deviation of the price column drops to zero, immediately triggering a quarantine.
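The zero-variance signal is simple to test for. The following is a minimal sketch, assuming a hypothetical helper name and a small tolerance in place of an exact comparison with zero:

```python
from statistics import pstdev


def is_suspiciously_flat(values, min_stdev=1e-9):
    """True when a numeric column has (effectively) no spread at all."""
    if len(values) < 2:
        raise ValueError("need at least two values to measure spread")
    return pstdev(values) <= min_stdev


print(is_suspiciously_flat([9.99] * 50))          # True
print(is_suspiciously_flat([9.99, 19.50, 4.25]))  # False
```

A flat column is only a hint. A category that really does sell everything at one price would trip the same check, which is one reason quarantined runs get a human review.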
At what scale does manual spot-checking become impossible?
Past about 10,000 records, human review is purely performative. You must rely on automated assertions, dbt tests, and statistical anomaly detection. Manual spot-checking should only be used to verify the automated rules, not to verify the dataset itself.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h