
What is a Referential Integrity Check?

A referential integrity check is a validation process in data engineering that ensures relationships between tables remain consistent after an extraction run. In scraping pipelines, it verifies that every child record (like a product review) maps to a valid parent record (like a product ID) that actually exists in the dataset. Without it, partial scrapes and pagination failures silently create orphaned data, breaking downstream joins and corrupting analytics dashboards.

Data Quality · Validation · Foreign Keys · Orphaned Records · dbt
// 02 — definitions

No orphans
allowed.

Why scraping relational data from the web requires strict foreign key validation before hitting the warehouse.


TL;DR

A referential integrity check guarantees that foreign keys in your scraped data point to existing primary keys. It's the primary defense against partial extraction failures — like scraping 10,000 reviews for products that were skipped due to a timeout. If the check fails, the pipeline halts or quarantines the orphans before they pollute the gold layer.

01 Definition & structure
A referential integrity check is a data quality test that ensures relationships between different entities in a dataset are valid. In a relational database, this is enforced by foreign key constraints. In a scraping pipeline, it is a post-extraction validation step that confirms every child record (e.g., a comment, a review, an employee) has a corresponding parent record (e.g., a post, a product, a company) in the final dataset.
02 How it works in practice
Scraping relational data usually requires multiple jobs: one to scrape the index (parents) and another to scrape the details (children). Before the data is delivered to the client or merged into the production warehouse, a validation script (often using tools like dbt or Great Expectations) runs an anti-join between the child and parent tables: `LEFT ANTI JOIN` in Spark SQL, or `LEFT JOIN ... WHERE parent_key IS NULL` / `NOT EXISTS` in dialects without that keyword. Any child record that doesn't match a parent is flagged as an orphan.
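The anti-join described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production test: the table names `stg_products` and `stg_reviews` echo the dbt trace on this page, but the rows are made up, and the query uses the portable `LEFT JOIN ... IS NULL` form because SQLite has no `LEFT ANTI JOIN` keyword.

```python
import sqlite3

# Hypothetical staging tables; rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_products (product_id TEXT PRIMARY KEY);
    CREATE TABLE stg_reviews  (review_id INTEGER PRIMARY KEY, product_id TEXT);
    INSERT INTO stg_products VALUES ('P1'), ('P2');
    INSERT INTO stg_reviews  VALUES (1, 'P1'), (2, 'P2'), (3, 'P3'), (4, 'P3');
""")

# Anti-join: keep child rows whose foreign key matches no parent row.
ORPHAN_QUERY = """
    SELECT r.review_id, r.product_id
    FROM stg_reviews AS r
    LEFT JOIN stg_products AS p ON p.product_id = r.product_id
    WHERE p.product_id IS NULL
    ORDER BY r.review_id
"""

orphans = conn.execute(ORPHAN_QUERY).fetchall()
print(orphans)  # reviews 3 and 4 point at 'P3', which was never scraped
```

An empty result means every child has a parent; any returned row is an orphan to quarantine.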
03 Common failure modes
Integrity breaks when pipelines fail silently. If a parent scraping job hits a CAPTCHA on page 50 and stops, but the child job continues using cached URLs, you will extract thousands of child records with no parent. Similarly, if the target website deletes a product while your crawler is traversing its reviews, the resulting dataset will contain orphaned reviews.
04 How DataFlirt handles it
We enforce strict data contracts. Our extraction DAGs are structured so that child jobs dynamically inherit their target URLs directly from the successful output of the parent job. During the delivery phase, automated dbt tests verify foreign key relationships. Any orphaned records are automatically routed to a dead-letter queue, which triggers a targeted micro-scrape to fetch the missing parent data before the client sync completes.
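As a rough sketch of that flow, the snippet below derives child URLs only from committed parent rows, then splits scraped children into deliverable records and a dead-letter list, collecting the parent IDs a backfill would need. All names here (`child_targets`, `validate_children`, `ValidationResult`, the example URL template) are hypothetical and are not DataFlirt's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    deliverable: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)
    backfill_parent_ids: set = field(default_factory=set)


def child_targets(parent_rows, url_template):
    """Build child URLs only from parents the parent job actually committed."""
    return [url_template.format(**row) for row in parent_rows]


def validate_children(child_rows, parent_rows, fk="product_id"):
    """Route children without a matching parent to the dead-letter list."""
    parent_ids = {row[fk] for row in parent_rows}
    result = ValidationResult()
    for row in child_rows:
        if row[fk] in parent_ids:
            result.deliverable.append(row)
        else:
            result.dead_letter.append(row)
            result.backfill_parent_ids.add(row[fk])  # parent to re-scrape
    return result


parents = [{"product_id": "P1"}, {"product_id": "P2"}]
targets = child_targets(parents, "https://example.com/products/{product_id}/reviews")

reviews = [
    {"review_id": 1, "product_id": "P1"},
    {"review_id": 2, "product_id": "P9"},  # parent missing from this run
]
result = validate_children(reviews, parents)
print(targets)
print(result.backfill_parent_ids)
```

Keeping the orphaned rows, rather than dropping them, is what makes the targeted backfill possible.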
05 The cost of ignoring integrity
Without referential integrity checks, orphaned data silently corrupts downstream analytics. An `INNER JOIN` in a BI tool will drop the orphaned records entirely, leading to underreported metrics. A `LEFT JOIN` will include them with null parent dimensions, breaking aggregations and dashboards. Finding and fixing these errors post-delivery is far more expensive than quarantining them at the edge.
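The two join behaviours are easy to reproduce on a toy dataset. The sketch below uses an in-memory SQLite database with hypothetical `products` and `reviews` tables, where one of three reviews references a product that was never scraped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id TEXT PRIMARY KEY, category TEXT);
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, product_id TEXT);
    INSERT INTO products VALUES ('P1', 'audio');
    INSERT INTO reviews  VALUES (1, 'P1'), (2, 'P1'), (3, 'P404');
""")

total_reviews = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]

# INNER JOIN silently drops the orphaned review, so the count is too low.
inner_count = conn.execute("""
    SELECT COUNT(*) FROM reviews AS r
    INNER JOIN products AS p ON p.product_id = r.product_id
""").fetchone()[0]

# LEFT JOIN keeps it, but its parent dimension (category) is NULL.
by_category = conn.execute("""
    SELECT p.category, COUNT(*) FROM reviews AS r
    LEFT JOIN products AS p ON p.product_id = r.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(total_reviews, inner_count)  # 3 2
print(by_category)                 # [(None, 1), ('audio', 2)]
```

Neither query raises an error, which is why the problem tends to surface only as a mismatch between dashboard totals and source counts.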
// 03 — the validation math

How many records
are orphaned?

Referential integrity is measured as a strict boolean at the database level, but in scraping pipelines, we track the orphan rate to detect partial extraction failures.

Orphan Rate: O = 1 − (matched_child_records / total_child_records)
O > 0 triggers a pipeline quarantine. Every child must have a parent. (Data Quality SLO)
Parent Completeness: C = distinct_foreign_keys / expected_parent_ids
Detects if the parent scraping job failed silently while the child job succeeded. (Pipeline Observability)
DataFlirt Quarantine Threshold: Q = (O > 0.001)
99.9% strict integrity required for production delivery. Orphans are backfilled. (Internal Delivery Standard)
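These formulas translate directly into a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the three-way `delivery_action` gate is one assumed reading of the thresholds (any orphan is quarantined for backfill, and delivery halts only above 0.001), not a documented DataFlirt rule. The example figures are the child and orphan counts from the B2B status panel on this page.

```python
# Assumption: O > 0 quarantines orphans; O > 0.001 blocks delivery entirely.
QUARANTINE_THRESHOLD = 0.001


def orphan_rate(matched_child_records: int, total_child_records: int) -> float:
    """O = 1 - matched / total; an empty child table counts as zero orphans."""
    if total_child_records == 0:
        return 0.0
    return 1 - matched_child_records / total_child_records


def parent_completeness(distinct_foreign_keys: int, expected_parent_ids: int) -> float:
    """C = distinct foreign keys seen / parent IDs expected."""
    if expected_parent_ids == 0:
        raise ValueError("expected_parent_ids must be positive")
    return distinct_foreign_keys / expected_parent_ids


def delivery_action(o: float) -> str:
    """Hypothetical gate combining the two thresholds."""
    if o == 0:
        return "deliver"
    if o <= QUARANTINE_THRESHOLD:
        return "quarantine_and_backfill"
    return "halt"


# 142,850 child records with 14 orphans.
o = orphan_rate(142_850 - 14, 142_850)
print(f"{o:.6f}", delivery_action(o))
```

With 14 orphans out of 142,850 children the rate is roughly 0.0001, below the 0.001 threshold, so under this reading the orphans are quarantined and backfilled rather than halting the run.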
// 04 — pipeline validation trace

Catching orphans
before the warehouse.

A dbt test execution on a scraped e-commerce dataset, validating that all extracted reviews map to a valid product ID in the dimension table.

dbt test · foreign key check · quarantine
// running dbt tests on staging layer
$ dbt test --select models/staging/stg_reviews.sql

test_not_null_stg_reviews_review_id: PASS
test_unique_stg_reviews_review_id: PASS
test_relationships_stg_reviews_product_id__stg_products_product_id: FAIL

// failure analysis
orphaned_records: 412
missing_parent_ids: ["B08F7PTF53", "B08F7PTF54", ...]
root_cause: "stg_products job timed out on page 42"

// quarantine protocol
action: moving 412 records to dead_letter_queue
trigger: parent_backfill_job
pipeline_status: WARN
// 05 — failure modes

Why scraped data
loses integrity.

Unlike transactional databases, web scraping pulls data from distributed, eventually consistent systems. These are the most common reasons foreign key relationships break during extraction.

PIPELINES MONITORED · 300+ active
VALIDATION · per run
UPDATED · 2026-05-19
01 Partial parent scrape · 89% of failures · Parent job blocked or timed out, child job succeeded
02 Asynchronous site updates · 72% of failures · Parent deleted by target while scraping children
03 Pagination limits · 64% of failures · Target caps parent list at 1000, exposes all children
04 ID format changes · 41% of failures · A/B testing different SKU formats breaks joins
05 Cache staleness · 28% of failures · CDN serves old parent page, new child page
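For the ID format case, one common mitigation is to canonicalize keys on both sides before joining. The sketch below assumes the variants differ only in case, whitespace, hyphens, and underscores; real format changes may need site-specific rules. `normalize_sku` is a hypothetical helper and the SKU values are illustrative.

```python
import re


def normalize_sku(raw: str) -> str:
    """Canonical form: no whitespace, hyphens, or underscores; upper case."""
    return re.sub(r"[\s_-]+", "", raw).upper()


parent_keys = {normalize_sku(s) for s in ["B08F7PTF53"]}
child_keys = ["b08-f7ptf53", " B08F7PTF53 ", "B08F7PTF54"]

# Only keys that still fail to match after normalization are true orphans.
orphans = [k for k in child_keys if normalize_sku(k) not in parent_keys]
print(orphans)
```

Normalizing before the integrity check avoids flagging records whose parent exists under a differently formatted ID.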
// 06 — DataFlirt's validation layer

Enforce at the edge,

quarantine in the warehouse.

DataFlirt treats referential integrity as a strict data contract. When scraping hierarchical data — like company profiles and their associated employees — we don't just dump flat files. Our delivery layer runs cross-table validation before the sync completes. If a child record lacks a parent, it is routed to a quarantine table for backfilling, ensuring your production tables never violate foreign key constraints.

Integrity Check Status

Live validation of a B2B company-employee extraction job.

job.id extract-b2b-emp-09
parent_table dim_companies ok
child_table fact_employees ok
records.child 142,850
records.orphaned 14
action quarantine_and_backfill
delivery.status clean_sync_complete


// 07 — FAQ

Common
questions.

About relational data scraping, handling orphaned records, and maintaining strict data contracts at scale.

What is the difference between referential integrity and data completeness?
Completeness is about having all the fields or records you expect. Referential integrity is strictly about relationships — ensuring every foreign key points to a valid primary key. You can have 100% integrity but only 10% completeness if you missed 90% of both parents and children. They measure different failure modes.
Why do referential integrity checks fail so often in web scraping?
Because the web isn't an ACID-compliant database. You might scrape a list of products on Monday and their reviews on Tuesday. If a product was deleted on Monday night, Tuesday's reviews will be orphaned. Scraping is inherently eventually consistent.
Should I drop orphaned records or keep them?
Never drop them silently. Quarantine them. An orphaned record often indicates a failure in the parent scraping job. By keeping them in a dead-letter table, you can trigger a targeted backfill to fetch the missing parent data, recovering the full relational structure.
How does DataFlirt handle referential integrity across massive datasets?
We use DAG-based extraction schedules. Child jobs cannot commence until the parent job successfully commits its primary keys to the staging area. Before final delivery, a dbt-based validation layer runs foreign key checks across the entire payload, quarantining any anomalies.
Can I enforce referential integrity if the target site uses dynamic IDs?
Yes, but it requires entity resolution. If the site rotates internal IDs, we generate a stable synthetic key based on immutable attributes (like URL slug or product name) and use that synthetic key to enforce relationships in the delivered dataset.
Does GDPR require referential integrity?
Indirectly, yes. The GDPR accuracy principle requires personal data to be accurate and kept up to date. If a user deletes their account (parent), but your system retains their orphaned comments (children) because of a broken integrity check, you are likely violating data retention and accuracy mandates.
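The synthetic key mentioned in the answer on dynamic IDs can be sketched as a hash over normalized stable attributes. This is an illustrative approach, not DataFlirt's actual key scheme: the function name, the choice of SHA-256, the 16-character truncation, and the normalization rules are all assumptions.

```python
import hashlib
import unicodedata


def _normalize(value: str) -> str:
    """Unicode-normalize, case-fold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(folded.split())


def synthetic_key(url_slug: str, product_name: str) -> str:
    """Derive a stable key from attributes that survive ID rotation."""
    material = "|".join(_normalize(part) for part in (url_slug, product_name))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


# The same entity scraped twice with cosmetic differences gets the same key.
key_a = synthetic_key("acme-widget", "Acme  Widget")
key_b = synthetic_key("ACME-widget", " acme widget ")
key_c = synthetic_key("acme-gadget", "Acme Gadget")
print(key_a == key_b, key_a == key_c)
```

Parent and child records both carry this key, so the foreign key check keeps working even when the site's internal IDs change between runs.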