
What is Data Unit Testing?

Data unit testing is the practice of applying software engineering testing principles to data pipelines. Instead of testing code logic, you test the data payload itself: you assert that fields match expected types, values fall within statistical bounds, and referential integrity holds before the data is merged into production. In scraping pipelines, where the source schema is outside your control, data unit testing is often the last barrier between silent upstream changes and downstream dashboard corruption.

Data Quality · Pipeline Contracts · Schema Validation · ETL · Quarantine
// 02 — definitions

Asserting the payload.

Why testing the scraper's code isn't enough when the target website can change its DOM at any moment.


TL;DR

Data unit testing treats data as code. It runs assertions on every extracted batch—checking for null rates, type coercion, and value distributions. If a test fails, the batch is quarantined, preventing poisoned data from silently corrupting your data warehouse or triggering false alerts in downstream ML models.

01 Definition & structure
Data unit testing is the automated validation of data payloads against a predefined set of rules or a data contract. Unlike traditional software tests that validate logic, data tests validate state. A typical test suite checks for:
  • Completeness: Are mandatory fields (like SKUs or URLs) present?
  • Type conformance: Is the price a float? Is the date in ISO 8601 format?
  • Value distribution: Is the price between $1 and $10,000? Are there negative inventory counts?
  • Uniqueness: Are there duplicate primary keys in the batch?
These tests run on the extracted data before it is loaded into a warehouse or delivered to a client.
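The four checks listed above can be sketched in plain Python. This is a minimal illustration, not DataFlirt's implementation: the field names (`sku`, `url`, `price`, `scraped_on`), the error labels, and the helper names are assumptions made for the example, and the price bounds simply reuse the $1 to $10,000 range mentioned above.

```python
from datetime import date

# Hypothetical mandatory fields for an e-commerce product record.
MANDATORY_FIELDS = ("sku", "url", "price", "scraped_on")


def check_record(record: dict) -> list:
    """Return the assertion failures for one record (empty list means valid)."""
    errors = []

    # Completeness: mandatory fields must be present and non-null.
    for field in MANDATORY_FIELDS:
        if record.get(field) is None:
            errors.append(f"missing:{field}")

    # Type conformance: price must be a float, the date must parse as ISO 8601.
    price = record.get("price")
    if price is not None and not isinstance(price, float):
        errors.append("type:price")
    scraped_on = record.get("scraped_on")
    if scraped_on is not None:
        try:
            date.fromisoformat(scraped_on)
        except (TypeError, ValueError):
            errors.append("type:scraped_on")

    # Value distribution: price must fall inside the expected bounds.
    if isinstance(price, float) and not (1.0 <= price <= 10_000.0):
        errors.append("range:price")

    return errors


def duplicate_keys(batch: list, key: str = "sku") -> set:
    """Uniqueness: return the key values that appear more than once in the batch."""
    seen, dupes = set(), set()
    for record in batch:
        value = record.get(key)
        if value is None:
            continue  # missing keys are reported by the completeness check
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes
```

Per-record checks and batch-level checks are kept separate here because uniqueness can only be judged across the whole batch, while the other three can be evaluated one record at a time.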
02 Why scraping requires data tests
In internal software development, you control the database schema. In web scraping, you are extracting data from a third-party DOM that can change without notice. A site might redesign its pricing block, causing your CSS selector to extract the string "Out of Stock" instead of a numeric price. The scraper code executes flawlessly, the HTTP request returns a 200 OK, but the data is garbage. Data unit tests are the only mechanism to catch this silent failure before it breaks downstream analytics.
03 The quarantine pattern
When a data unit test fails, the pipeline must decide what to do with the offending record. The recommended practice is the quarantine pattern: valid records pass through to the delivery sink, while invalid records are routed to a separate quarantine store (e.g., a dedicated storage bucket or a dead-letter queue). This allows the pipeline to continue delivering value while preserving the broken records for engineering review and replay.
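As a rough sketch of the quarantine pattern, the function below splits a batch into a delivery list and a quarantine list, keeping the failed assertions alongside each quarantined record so it can be reviewed and replayed later. The names `route_batch` and `price_is_positive_number` are hypothetical, and in-memory lists stand in for whatever sink and quarantine store a real pipeline would use.

```python
from typing import Callable, List, Tuple


def route_batch(
    batch: List[dict],
    validate: Callable[[dict], List[str]],
) -> Tuple[List[dict], List[dict]]:
    """Split a batch into (delivered, quarantined) using a validator function.

    `validate` returns a list of error labels; an empty list means the record passed.
    """
    delivered, quarantined = [], []
    for record in batch:
        errors = validate(record)
        if errors:
            # Keep the original record untouched and attach the reasons it failed.
            quarantined.append({"record": record, "errors": errors})
        else:
            delivered.append(record)
    return delivered, quarantined


def price_is_positive_number(record: dict) -> List[str]:
    """Example validator (an assumption for this sketch): price must be a positive number."""
    price = record.get("price")
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    return [] if is_number and price > 0 else ["price_not_positive_number"]


if __name__ == "__main__":
    batch = [{"sku": "A1", "price": 19.99}, {"sku": "B2", "price": "Out of Stock"}]
    delivered, quarantined = route_batch(batch, price_is_positive_number)
    print(len(delivered), "delivered;", len(quarantined), "quarantined")
```

Storing the error labels with the record is a design choice for this example: it lets an engineer see why a record was rejected without re-running the suite.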
04 How DataFlirt handles it
We enforce data contracts at the edge. Every extraction worker runs a lightweight validation schema against the records it parses in memory. If a record fails, it is tagged with the specific assertion error and quarantined. If the failure rate for a specific field exceeds a dynamic threshold (usually 1-5%), the worker halts and alerts our on-call engineers. This guarantees that our clients never receive a dataset corrupted by upstream schema drift.
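The halt-or-continue decision described above might look like the following sketch. It is not DataFlirt's worker code: the function name, the return labels, and the fixed 5% default are assumptions, whereas the text describes a dynamic threshold that is usually somewhere between 1% and 5%.

```python
from collections import Counter
from typing import Dict, List, Tuple


def batch_decision(
    failed_fields_per_record: List[List[str]],
    halt_threshold: float = 0.05,  # assumed fixed value; the real threshold is dynamic
) -> Tuple[str, Dict[str, float]]:
    """Decide what to do with a batch given, per record, the fields that failed.

    Returns (decision, per-field failure rates).
    """
    total = len(failed_fields_per_record)
    if total == 0:
        return "deliver", {}

    # Count each field at most once per record.
    counts = Counter(
        field for failed in failed_fields_per_record for field in set(failed)
    )
    rates = {field: n / total for field, n in counts.items()}

    if any(rate > halt_threshold for rate in rates.values()):
        return "halt", rates  # systemic failure: stop and alert
    if rates:
        return "deliver_partial", rates  # quarantine the few bad records
    return "deliver", rates
```

With 5 failing records out of 10,000 this returns `deliver_partial`; with 4,000 failing records it returns `halt`.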
05 The "Empty String" misconception
A common mistake in data unit testing is treating empty strings ("") and null as the same thing. In scraping, they mean entirely different things. An empty string usually means the selector found the element, but the element had no text content. A null means the selector failed to find the element entirely. Data tests must distinguish between the two, as a spike in nulls indicates a broken scraper, while a spike in empty strings indicates a change in the target's data population behavior.
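One way to keep the two cases apart is to profile each field into three buckets rather than two. The sketch below assumes the scraper emits `None` when a selector matches nothing and a string when it matches an element; treating whitespace-only strings as empty is also an assumption of this example.

```python
from typing import Dict, List, Optional


def classify_text_field(value: Optional[str]) -> str:
    """Classify a scraped text value as 'null', 'empty', or 'present'."""
    if value is None:
        return "null"  # selector found no element
    if isinstance(value, str) and value.strip() == "":
        return "empty"  # element found, but it had no text content
    return "present"


def field_profile(batch: List[dict], field: str) -> Dict[str, float]:
    """Return the share of null, empty, and present values for one field."""
    counts = {"null": 0, "empty": 0, "present": 0}
    for record in batch:
        counts[classify_text_field(record.get(field))] += 1
    total = len(batch) or 1  # avoid division by zero on an empty batch
    return {label: n / total for label, n in counts.items()}
```

Tracking the `null` and `empty` rates as separate time series makes it possible to alert on a broken selector without being distracted by changes in how the target populates its pages.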
// 03 — the assertions

How strict should the tests be?

Test strictness is a balance between data quality and pipeline throughput. DataFlirt tunes these thresholds per field based on downstream consumer requirements, ensuring we catch drift without halting pipelines for trivial anomalies.

Null threshold: T_null = null_count / total_records
Alert if T_null > historical median + 2σ. Catches silent selector drift. Source: DataFlirt extraction SLO.
Type conformance: C_type = valid_cast_count / total_records
Must be 1.0 for primary keys and numeric metrics. Strings masquerading as floats break aggregations. Source: standard data contract.
DataFlirt quarantine rate: Q = failed_tests / total_tests_run
Target Q < 0.05% for stable targets. High Q indicates target site structure changes. Source: internal pipeline metrics.
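The null-rate and type-conformance ratios above are simple to compute. The sketch below uses only the Python standard library; the function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) over a list of past null rates is an assumption about how σ is estimated.

```python
from statistics import median, pstdev
from typing import Callable, List


def null_rate(batch: List[dict], field: str) -> float:
    """null_count / total_records for one field."""
    if not batch:
        return 0.0
    return sum(1 for r in batch if r.get(field) is None) / len(batch)


def type_conformance(batch: List[dict], field: str, cast: Callable = float) -> float:
    """valid_cast_count / total_records for one field."""
    if not batch:
        return 1.0
    valid = 0
    for record in batch:
        try:
            cast(record.get(field))
            valid += 1
        except (TypeError, ValueError):
            pass  # None or an uncastable string such as 'Call for price'
    return valid / len(batch)


def null_rate_alert(current: float, history: List[float], k: float = 2.0) -> bool:
    """Alert when the current null rate exceeds historical median + k * sigma."""
    threshold = median(history) + k * pstdev(history)
    return current > threshold
```

For example, with past null rates hovering around 1%, a batch with a 14% null rate trips the alert while a batch at 1.1% does not.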
// 04 — test execution trace

Catching schema drift before delivery.

A live trace of a data unit test suite running against a freshly extracted batch of e-commerce product records. The tests catch a silent DOM change that pushed the stock_status null rate to 14%, well above the 1% threshold.

Great Expectations · pytest · batch validation
// running data unit tests on batch_id: 8f92a
target: "s3://df-raw-zone/batch_8f92a.parquet"
records_scanned: 14,200

// executing assertions
test_price_is_numeric ................. PASSED
test_currency_code_in_iso4217 ......... PASSED
test_sku_is_unique .................... PASSED
test_stock_status_not_null ............ FAILED

// failure details
field: "stock_status"
expectation: "expect_column_values_to_not_be_null"
observed_null_rate: 0.14 // threshold: 0.01

// pipeline action
action: "quarantine_batch"
destination: "s3://df-quarantine/batch_8f92a/"
status: HALTED // awaiting manual review
// 05 — failure modes

What data unit tests actually catch.

Ranked by frequency across DataFlirt's extraction pipelines. These are the silent failures that traditional code tests miss because the scraper executed perfectly—it just extracted garbage.

Tests run: 12M+ daily
Quarantine rate: 0.04%
Updated: 2026-05-19
01 Unexpected nulls · selector drift · Target site changed a class name; field is now empty
02 Type coercion failure · format change · Price string includes text like 'Call for price'
03 Out-of-bounds values · parsing error · Decimal separator missed; price is 100x higher
04 Referential integrity break · missing parent · Product extracted without its category hierarchy
05 Stale data · cache hit · Timestamp field hasn't advanced since last run
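Of the failure modes above, the referential integrity break is the one that needs a second dataset to test against. A minimal sketch, assuming products carry a `category_id` that should match an `id` in a separately extracted category list (both field names are hypothetical):

```python
from typing import List


def orphaned_records(
    products: List[dict],
    categories: List[dict],
    fk: str = "category_id",
    pk: str = "id",
) -> List[dict]:
    """Return products whose foreign key does not match any known category."""
    known = {category[pk] for category in categories}
    # A missing or null foreign key also counts as a break.
    return [product for product in products if product.get(fk) not in known]
```

Any record returned by this check would be a candidate for quarantine, since loading it would leave a product without its parent hierarchy.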
// 06 — our architecture

Test at the edge, quarantine before the warehouse.

DataFlirt runs data unit tests inline during the extraction phase, not as a post-load batch job. Every record is evaluated against a versioned data contract. If a field fails validation, the specific record is quarantined while the rest of the batch proceeds, or the entire batch is halted if the failure rate exceeds the anomaly threshold. This prevents bad data from ever reaching your Snowflake or BigQuery instances.

test_suite_execution

Live metrics from an inline data test suite on a real estate pipeline.

contract.version v2.4.1 (active)
records.scanned 14,200
tests.executed 85,200
tests.passed 85,192 (99.99%)
tests.failed 8 (type_error)
quarantined 8 records (isolated)
batch.status delivered_partial


// 07 — FAQ

Common questions.

Common questions about implementing data unit tests, handling failures, and defining data contracts for scraping pipelines.

What is the difference between code unit testing and data unit testing?
Code unit testing checks if your scraper logic works (e.g., does the pagination function increment the URL correctly). Data unit testing checks if the output payload is valid (e.g., is the extracted price actually a positive float). In scraping, your code can work perfectly while extracting completely wrong data because the target website changed.
Should a single failed record halt the entire pipeline?
Usually, no. We use a threshold-based approach. If 5 records out of 10,000 fail a type check, we quarantine those 5 records and deliver the 9,995 good ones. If 4,000 records fail, it indicates a systemic selector break, and we halt the batch to prevent delivering a fundamentally flawed dataset.
What tools are used for data unit testing?
Widely used tools include Great Expectations, Soda Core, and dbt tests. For high-throughput scraping pipelines, DataFlirt often uses custom, lightweight Rust or Python validation layers inline during extraction to minimize memory overhead and latency before writing to Parquet.
How do you test for data freshness?
By asserting against timestamp fields. A data unit test can check that the last_updated field is within the expected SLA window (e.g., > NOW() - 24h). If a target site serves a stale cached page, the scraper will succeed, but the data test will catch the outdated timestamp.
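A freshness assertion of this kind can be sketched as follows. The function name and the 24-hour default are illustrative, and treating timestamps without a timezone as UTC is an assumption that a real pipeline would need to confirm per source.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(
    last_updated_iso: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> bool:
    """Return True if last_updated falls within the allowed SLA window."""
    last_updated = datetime.fromisoformat(last_updated_iso)
    if last_updated.tzinfo is None:
        # Assumption for this sketch: naive timestamps are UTC.
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    return now - last_updated <= max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_fresh("2026-05-19T10:00:00+00:00", now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.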
Who is responsible for writing data unit tests?
Data engineers and consumers define the data contract—what the schema should look like and what values are acceptable. The scraping infrastructure team implements the tests to enforce that contract at the edge.
How does DataFlirt handle schema drift caught by these tests?
When a test fails and quarantines a batch, it triggers an automated alert to our engineering team. We inspect the quarantined records, identify the DOM change on the target site, update the extraction selectors, and replay the raw HTML through the fixed parser to recover the data without re-scraping.