
What is Data Validity?

Data validity is the degree to which extracted records conform to the structural, type, and business rules defined by your schema. Unlike accuracy, which measures whether a value is true to the real world, validity only asks whether the value is formatted correctly and falls within acceptable bounds. In a scraping pipeline, failing to enforce validity at the extraction layer means quietly writing poisoned types into your data warehouse, where they break downstream aggregations.

Data Cleaning · Schema Enforcement · Type Coercion · ETL · Data Quality

// 02 — definitions

Format over
fact.

Why a perfectly accurate string extracted from a webpage can still be an invalid record that crashes your downstream analytics.


TL;DR

Data validity ensures that scraped data matches expected formats, types, and ranges before it hits storage. A price field containing "Contact for Quote" might accurately reflect the webpage, but it is invalid if your schema expects a numeric float. Enforcing validity at the edge prevents schema drift from corrupting historical datasets.

01 Definition & structure

Data validity is the enforcement of structural and type rules on extracted data. A valid record is one that matches the expectations of the downstream consumer exactly. Validation typically occurs across three dimensions:

  • Type constraints: Ensuring a price is a float, a date is an ISO-8601 string, and a boolean is true/false.
  • Range constraints: Ensuring numeric values fall within logical bounds (e.g., a percentage between 0 and 100).
  • Format constraints: Ensuring strings match expected patterns (e.g., regex for emails, SKUs, or phone numbers).
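
The three dimensions above can be sketched as a single validation function. This is a minimal illustration, not DataFlirt's validator: the field names (`price`, `scraped_on`, `in_stock`, `discount_pct`, `sku`) and the SKU pattern are assumptions chosen for the example.

```python
import re
from datetime import date

# Hypothetical format rule for the example: three capitals, a dash, four digits.
SKU_PATTERN = re.compile(r"[A-Z]{3}-\d{4}")


def is_number(value) -> bool:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_record(record: dict) -> list:
    """Return a list of violations; an empty list means the record is valid."""
    errors = []

    # Type constraints
    if not is_number(record.get("price")):
        errors.append("price: expected a number")
    try:
        date.fromisoformat(record.get("scraped_on", ""))
    except (TypeError, ValueError):
        errors.append("scraped_on: expected an ISO-8601 date")
    if not isinstance(record.get("in_stock"), bool):
        errors.append("in_stock: expected true/false")

    # Range constraint
    pct = record.get("discount_pct")
    if not is_number(pct) or not 0 <= pct <= 100:
        errors.append("discount_pct: expected a number between 0 and 100")

    # Format constraint
    sku = record.get("sku")
    if not isinstance(sku, str) or not SKU_PATTERN.fullmatch(sku):
        errors.append("sku: expected the pattern AAA-0000")

    return errors


good = {"price": 1299.0, "scraped_on": "2026-05-19", "in_stock": True,
        "discount_pct": 15, "sku": "ABC-1234"}
bad = {"price": "₹1,299", "scraped_on": "19/05/2026", "in_stock": "yes",
       "discount_pct": 140, "sku": "sku-8841"}
print(validate_record(good))  # no violations
print(validate_record(bad))   # one violation per field
```

Returning a list of violations rather than raising on the first one keeps every failure reason available when a record is quarantined.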

02 Accuracy vs. Validity

These two concepts are often conflated but are entirely distinct. A scraper might extract the string "Free" from a pricing page. If the product is indeed free, the data is 100% accurate. However, if your database schema defines the price column as a DECIMAL, the data is invalid. Conversely, if a scraper extracts 99.99 but the actual price is 19.99, the data is valid (it's a number) but inaccurate. Production pipelines must enforce validity automatically, while accuracy usually requires statistical sampling.

03 Cross-field dependencies

Advanced validity checks go beyond single fields to evaluate the logical consistency of the entire record. For example, if a scraper extracts a discount_price and a base_price, a cross-field validity check ensures that the discount is strictly less than the base price. If a target site accidentally swaps the DOM elements for these two values, field-level type checks will pass, but the cross-field logic will catch the error and quarantine the record.
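
A cross-field rule of this kind might look like the sketch below. It assumes both fields have already passed their type checks and arrive as `Decimal` values; the function name is hypothetical.

```python
from decimal import Decimal
from typing import Optional


def cross_field_violation(record: dict) -> Optional[str]:
    """Return a description of the violation, or None if the record is consistent."""
    base = record["base_price"]
    discount = record["discount_price"]
    # The discounted price must be strictly below the base price.
    if discount >= base:
        return f"discount_price {discount} is not below base_price {base}"
    return None


# Both values are valid decimals, so field-level type checks pass,
# but the site has swapped the two DOM elements.
swapped = {"base_price": Decimal("1299"), "discount_price": Decimal("1500")}
normal = {"base_price": Decimal("1500"), "discount_price": Decimal("1299")}

print(cross_field_violation(swapped))  # describes the violation
print(cross_field_violation(normal))   # None
```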

04 How DataFlirt handles it

We treat schema validation as a hard gate at the edge. Every extraction worker runs a compiled Rust validator against the parsed JSON before it is allowed to write to the delivery bucket. If a record fails, we do not attempt to guess or silently nullify the field. The record is routed to a dead-letter queue, and the pipeline's validity score is updated. If the score drops below our SLA threshold, the pipeline halts and pages an engineer to investigate the selector drift.
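
The gate described here can be approximated in a few lines. The production validator is described as compiled Rust; the Python sketch below only illustrates the control flow. The class name, the `min_validity` threshold, and the `min_sample` guard are all assumptions, not DataFlirt's actual values.

```python
from dataclasses import dataclass, field
from typing import Callable


class PipelineHalted(RuntimeError):
    """Raised when the validity score falls below the configured threshold."""


@dataclass
class ValidationGate:
    validate: Callable[[dict], list]  # returns a list of violations
    min_validity: float = 0.95        # hypothetical threshold
    min_sample: int = 100             # do not halt on the first few records
    delivered: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)

    @property
    def validity_score(self) -> float:
        total = len(self.delivered) + len(self.dead_letters)
        return len(self.delivered) / total if total else 1.0

    def process(self, record: dict, raw_html: str) -> None:
        errors = self.validate(record)
        if errors:
            # No guessing and no nulling: the record is kept as extracted,
            # next to the raw payload it came from.
            self.dead_letters.append(
                {"record": record, "errors": errors, "raw_html": raw_html}
            )
        else:
            self.delivered.append(record)

        total = len(self.delivered) + len(self.dead_letters)
        if total >= self.min_sample and self.validity_score < self.min_validity:
            raise PipelineHalted(f"validity score {self.validity_score:.3f}")


def price_is_float(record: dict) -> list:
    return [] if isinstance(record.get("price"), float) else ["price: expected a float"]


gate = ValidationGate(validate=price_is_float, min_sample=4)
gate.process({"price": 19.99}, "<span>19.99</span>")
gate.process({"price": "Free"}, "<span>Free</span>")
print(gate.validity_score)  # 0.5
```

In a real deployment the exception would be replaced by whatever mechanism pauses the pipeline and pages the on-call engineer.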

05 The silent failure of type coercion

The most dangerous scraping bugs don't throw HTTP 500s; they write bad data quietly. Suppose your scraper extracts an empty string "" for a missing numeric field. In JavaScript, Number("") evaluates to 0, and in Python a common shortcut such as float(value or 0) produces the same result, so the missing value reaches your CSV as a 0. Downstream, your analytics team will calculate an average price that is artificially dragged down by thousands of fake zeros. Strict validity enforcement prevents this kind of coercion from masking extraction failures.
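
The effect on an average is easy to reproduce. The sketch below compares a lenient parser, which turns a missing value into zero, with a strict one that keeps it missing. The sample values and function names are made up for the illustration.

```python
from statistics import mean
from typing import Optional

# "" means the selector matched nothing for that product.
raw_prices = ["19.99", "", "24.99", ""]


def lenient_price(raw: str) -> float:
    # An empty string is falsy, so it silently becomes 0.0.
    return float(raw or 0)


def strict_price(raw: str) -> Optional[float]:
    # Missing stays missing; non-numeric text raises ValueError instead of passing.
    text = raw.strip()
    return float(text) if text else None


lenient = [lenient_price(p) for p in raw_prices]
strict = [strict_price(p) for p in raw_prices]
present = [p for p in strict if p is not None]

print(mean(lenient))  # about 11.2, dragged down by the two fake zeros
print(mean(present))  # about 22.5, computed only over real prices
print(strict.count(None), "missing values to investigate")
```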

// 03 — validation metrics

How strict
is your schema?

Validity is measured as a ratio of conforming records to total extracted records. DataFlirt tracks validity per-field and per-pipeline to catch silent selector drift before it pollutes the delivery bucket.

Field-level validity: V_f = valid_records / total_records
A sudden drop in V_f usually indicates a site layout change, not bad data. (Source: standard ETL metric)

Pipeline quarantine rate: Q = failed_records / total_extracted
Records that fail validation are quarantined. Q > 0.05 triggers an on-call alert. (Source: DataFlirt extraction SLO)

DataFlirt delivery SLA: SLO = 1 − (type_errors / delivered_rows)
We guarantee >0.999 validity on delivered datasets via synchronous edge validation. (Source: internal data contract)
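
As a worked example, the first two ratios can be computed directly from counters. The sample counts below are the ones shown in the telemetry panel later on this page (1,042,881 records processed, 2,085 quarantined). Treating "processed" as "total extracted" and returning neutral values for an empty batch are assumptions made for this sketch.

```python
QUARANTINE_ALERT_THRESHOLD = 0.05  # from the text: Q > 0.05 triggers an alert


def validity(valid_records: int, total_records: int) -> float:
    # Assumption: an empty batch counts as fully valid.
    return valid_records / total_records if total_records else 1.0


def quarantine_rate(failed_records: int, total_extracted: int) -> float:
    # Assumption: an empty batch has a quarantine rate of zero.
    return failed_records / total_extracted if total_extracted else 0.0


def should_alert(failed_records: int, total_extracted: int) -> bool:
    return quarantine_rate(failed_records, total_extracted) > QUARANTINE_ALERT_THRESHOLD


total, failed = 1_042_881, 2_085
print(round(validity(total - failed, total), 3))  # 0.998
print(round(quarantine_rate(failed, total), 4))   # 0.002
print(should_alert(failed, total))                # False
```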
// 04 — validation pipeline trace

Catching poisoned types
before they write.

A live trace of a DataFlirt extraction worker processing a scraped JSON payload against a strict schema contract. The record is accurate to the DOM, but invalid to the schema.

Rust validator · type coercion · dead-letter queue
edge.dataflirt.io — live
CAPTURED
// inbound record from extraction worker
record.id: "sku-8841"

// type assertions
check.price_numeric: "₹1,299" -> type mismatch (string)
coerce.price: 1299 -> ok
check.currency: "INR" -> ok

// range bounds
check.stock_count: -5 -> out of bounds [0, MAX_INT]

// regex format
check.sku_format: match(/^[A-Z]{3}-\d{4}$/) -> fail

// cross-field logic
check.discount: 1500 > base_price: 1299 -> logic violation

// outcome
status: INVALID
action: route to dead-letter queue
s3.write: ABORTED
// 05 — invalidation vectors

Why valid selectors
yield invalid data.

The most common reasons extracted data fails schema validation, ranked by frequency across DataFlirt's B2B pipelines. Most of these stem from target sites using presentation logic instead of strict data types.

Pipelines monitored: 300+ active
Quarantined rows: 1.2M / day
Updated: 2026-05-19

01 Type coercion failures: strings in numeric fields (e.g. 'Out of Stock')
02 Out-of-bounds values: negative prices, future birth dates
03 Format / regex mismatches: malformed emails, invalid postal codes
04 Cross-field logic violations: discount price > base price
05 Encoding artifacts: zero-width spaces breaking string matches
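
Encoding artifacts are the least visible of these, because the string looks correct when printed. One possible cleanup step for identifier-like fields, run before format checks, is sketched below. The character list and function name are assumptions. Zero-width joiners carry meaning in some scripts, so this is not appropriate for free text.

```python
import unicodedata

# Zero-width space, non-joiner, joiner, word joiner, and byte-order mark.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_identifier(raw: str) -> str:
    # NFKC folds compatibility forms, e.g. a non-breaking space becomes a plain space.
    normalized = unicodedata.normalize("NFKC", raw)
    # Mapping a code point to None in translate() deletes it.
    return normalized.translate(ZERO_WIDTH).strip()


scraped = "ABC\u200b-1234\u00a0"
print(scraped == "ABC-1234")                    # False, although it prints the same
print(clean_identifier(scraped) == "ABC-1234")  # True
```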
// 06 — validation architecture

Validate at the edge,
quarantine before the warehouse.

DataFlirt enforces data validity synchronously during the extraction phase. We compile your schema into Rust-based validation rules that run directly on the worker node. If a record fails, it doesn't get silently dropped or written as a null — it gets routed to a quarantine queue with the raw HTML attached. This allows our engineers to replay the extraction with patched logic, ensuring zero data loss while maintaining strict warehouse hygiene.

schema-validator.rs

Live telemetry from a validation worker on a high-volume pricing pipeline.

pipeline.id b2b-pricing-in
records.processed 1,042,881 · ok
validity.score 0.998 · within SLO
quarantined.rows 2,085 · review pending
top_failure.field discount_pct
top_failure.reason out_of_bounds (>100)
action.taken alert_oncall · dispatched


// 07 — FAQ

Common
questions.

About data validity, schema enforcement, type coercion, and how DataFlirt prevents bad data from breaking your downstream pipelines.

What is the difference between data accuracy and data validity?
Accuracy measures whether the data reflects reality (e.g., is the price actually $10?). Validity measures whether the data conforms to the schema (e.g., is the price a positive float?). A scraper can extract "Call for Price" accurately from the page, but that string is invalid if your database expects a decimal.
How should I handle legitimate text in a numeric field?
Never change your schema type to a string just to accommodate edge cases like "Out of Stock". Keep the numeric field strict. Extract the text into a separate availability_status string field, and write a null to the numeric field. This preserves validity and keeps your aggregations from crashing.
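
A minimal sketch of that split, assuming prices look like "₹1,299" or "$19.99" (an optional currency symbol, comma thousands separators, an optional decimal part). The regex, the list of symbols, and the function name are illustrative assumptions. Only `availability_status` comes from the answer above.

```python
import re
from decimal import Decimal

# Assumed price shape for the example; real sites will need a wider pattern.
PRICE_RE = re.compile(r"^[₹$€£]?\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?$")


def parse_price_cell(raw: str) -> dict:
    """Split a scraped price cell into a strict numeric field and a text field."""
    text = raw.strip()
    match = PRICE_RE.match(text)
    if match:
        number = match.group(1).replace(",", "") + (match.group(2) or "")
        return {"price": Decimal(number), "availability_status": None}
    # Not a price: keep the numeric field null and preserve the text elsewhere.
    return {"price": None, "availability_status": text or None}


print(parse_price_cell("₹1,299"))
print(parse_price_cell("Out of Stock"))
```

The numeric column stays strictly numeric, and the original text is still available for anyone who needs to know why the price is missing.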
When is the best time to validate scraped data?
At the extraction edge, synchronously. If you wait until the data is in your warehouse to run dbt tests, you've already polluted your raw layer. Validating at the edge allows you to quarantine the record alongside the raw HTML payload, making it trivial to debug and replay the extraction.
How does DataFlirt handle schema drift?
We monitor field-level validity rates in real time. If a site redesign causes a price selector to start picking up a string instead of a number, the validity score drops instantly. This triggers an automated alert, quarantines the affected records, and pauses delivery until an engineer patches the selector.
What happens to records that fail validation?
They are routed to a dead-letter queue (DLQ) along with the raw HTTP response body. They are never silently dropped. Once the extraction logic is fixed, we replay the raw responses from the DLQ through the updated parser, recovering the data without needing to re-scrape the target.
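
The replay step can be pictured as follows. This is a sketch under stated assumptions: the DLQ is modelled as a plain list, each entry stores the response under a hypothetical `raw_body` key, and the "patched" parser reads the price from an invented `data-price` attribute.

```python
import re
from typing import Callable, List, Tuple


def replay_dead_letters(
    dead_letters: List[dict],
    parse: Callable[[str], dict],
    validate: Callable[[dict], list],
) -> Tuple[list, list]:
    """Re-run stored raw responses through an updated parser, without re-scraping."""
    recovered, still_failing = [], []
    for entry in dead_letters:
        record = parse(entry["raw_body"])
        if validate(record):
            still_failing.append(entry)  # stays quarantined for the next fix
        else:
            recovered.append(record)
    return recovered, still_failing


def patched_parser(raw_body: str) -> dict:
    # Hypothetical fix: after a redesign the price lives in a data-price attribute.
    match = re.search(r'data-price="(\d+(?:\.\d+)?)"', raw_body)
    return {"price": float(match.group(1)) if match else None}


def price_is_float(record: dict) -> list:
    return [] if isinstance(record["price"], float) else ["price: expected a float"]


dlq = [
    {"raw_body": '<span data-price="19.99">Sale</span>'},
    {"raw_body": "<span>Contact for Quote</span>"},
]
recovered, still_failing = replay_dead_letters(dlq, patched_parser, price_is_float)
print(recovered)           # [{'price': 19.99}]
print(len(still_failing))  # 1
```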
Can I define custom validation logic?
Yes. Beyond standard type and range checks, DataFlirt supports custom regex patterns, cross-field logic (e.g., end_date must be after start_date), and dictionary lookups (e.g., currency must be in a predefined list of ISO codes).