
What is Data Quality?

Data Quality in a scraping context is the measurable degree to which extracted records match the real-world state of the target site, conform to expected schemas, and arrive on time. It is not a vague feeling of correctness; it is a set of measurable targets, typically written into an SLA, covering completeness, validity, accuracy, freshness, and uniqueness. When data quality degrades silently, downstream machine learning models train on bad inputs and pricing algorithms misfire, turning a data pipeline from an asset into a liability.

Data Cleaning · SLA · Validation · Schema Drift · Anomaly Detection
// 02 — definitions

Trust, but
verify.

The operational framework for ensuring that the bytes you scrape actually represent the business reality you are trying to measure.


TL;DR

Data quality is the difference between a successful HTTP 200 and a usable dataset. It requires continuous schema-contract validation, type checking, and anomaly detection, using tools like Soda Core or Great Expectations. Without automated quality gates, selector rot and target site updates will silently poison your data warehouse.

01 Definition & structure
Data Quality in web scraping is a multi-dimensional metric that evaluates the reliability of extracted information. It is structured around five core pillars: completeness (are all fields present?), validity (do they match the schema?), accuracy (do they reflect the source?), consistency (are formats uniform?), and timeliness (is the data fresh?). Without a formal quality framework, a scraping pipeline is just a random number generator.
02 The dimensions of quality
Quality is not a binary state. A dataset might be 100% complete but entirely inaccurate if an anti-bot system fed the scraper decoy prices. Conversely, it might be perfectly accurate but inconsistent if dates are formatted as MM/DD/YYYY on one page and DD-MM-YYYY on another. True data quality requires measuring and enforcing all dimensions simultaneously.
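The mixed-date example can be made concrete. This is a minimal sketch of a consistency check, assuming the pipeline knows which formats a given target emits. The `KNOWN_FORMATS` list and the `normalize_date` helper are hypothetical names for illustration, not part of any DataFlirt API.

```python
from datetime import datetime
from typing import Optional

# Assumed formats for one target site (hypothetical; configure per source).
KNOWN_FORMATS = ["%m/%d/%Y", "%d-%m-%Y"]

def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # unknown format: a candidate for quarantine, not a guess

print(normalize_date("03/14/2025"))  # 2025-03-14
print(normalize_date("14-03-2025"))  # 2025-03-14
```

Returning `None` instead of guessing keeps an unrecognised format visible as a consistency failure rather than turning it into a wrong date.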
03 Silent failures in scraping
The most dangerous failures in scraping do not throw HTTP 500 errors. They return HTTP 200 OK with slightly altered DOM structures. A CSS class changes from .price-tag to .price-val, and suddenly your pipeline is writing null to the database for 10,000 records. If you do not have automated completeness checks, this silent failure will propagate directly into your analytics dashboards.
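A per-field null-rate check is one simple way to catch this kind of selector rot. The sketch below is illustrative only: `null_rate`, `check_field`, and the 1% threshold are assumptions, not a description of any specific tool.

```python
def null_rate(records: list, field: str) -> float:
    """Fraction of records where `field` is missing or None."""
    if not records:
        return 1.0  # treat an empty batch as fully missing
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def check_field(records: list, field: str, max_null_rate: float = 0.01) -> dict:
    """Compare the observed null rate against an assumed threshold."""
    rate = null_rate(records, field)
    return {"field": field, "null_rate": rate, "passed": rate <= max_null_rate}

# Simulates the renamed-class scenario: every HTTP request succeeded,
# but the price selector matched nothing.
broken_batch = [{"sku": i, "price": None} for i in range(10_000)]
print(check_field(broken_batch, "price")["passed"])  # False
```

Run against every batch, a check like this turns a silent null flood into an explicit failure before anything is written downstream.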
04 How DataFlirt handles it
We enforce data contracts at the edge. Every record extracted by our fleet is evaluated against a versioned schema before delivery. We use statistical anomaly detection to flag sudden shifts in volume or average values. If a batch fails the quality gate, it is routed to a quarantine queue, and our engineers are paged to investigate. We never deliver poisoned data to a client.
05 The cost of bad data
A widely cited IBM estimate from 2016 put the cost of bad data to the US economy at $3.1 trillion per year. In the context of scraping, bad data leads to algorithmic mispricing, flawed machine learning models, and broken competitive intelligence. Fixing data upstream at the extraction layer costs pennies; fixing it downstream in the data warehouse costs thousands of dollars in engineering time and lost revenue.
// 03 — the metrics

How do you
measure quality?

Quality is quantified across multiple dimensions. DataFlirt tracks these metrics per pipeline, per run, automatically quarantining records that fall below the defined threshold before they hit the client's S3 bucket.

Completeness: C = 1 − (null_fields / total_expected_fields)
Measures missing data. A drop in C usually indicates a broken CSS selector. (Data Quality Dimensions)
Validity Rate: V = records_passing_schema / total_extracted_records
Strict type and bounds checking. V must be > 0.99 for production pipelines. (DataFlirt Validation SLO)
Data Downtime: D = time_to_detect + time_to_resolve
The total duration where data is missing, stale, or inaccurate. (Data Engineering standard)
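The three formulas translate directly into code. The sketch below mirrors them term by term; the function names and the sample numbers are illustrative assumptions, and the denominators are assumed to be non-zero.

```python
from datetime import timedelta

def completeness(null_fields: int, total_expected_fields: int) -> float:
    """C = 1 - (null_fields / total_expected_fields)."""
    return 1 - (null_fields / total_expected_fields)

def validity_rate(records_passing_schema: int, total_extracted_records: int) -> float:
    """V = records_passing_schema / total_extracted_records."""
    return records_passing_schema / total_extracted_records

def data_downtime(time_to_detect: timedelta, time_to_resolve: timedelta) -> timedelta:
    """D = time_to_detect + time_to_resolve."""
    return time_to_detect + time_to_resolve

# Hypothetical run: 90 null fields out of 45,000 expected, 44,988 valid records.
c = completeness(90, 45_000)
v = validity_rate(44_988, 45_000)
d = data_downtime(timedelta(hours=2), timedelta(hours=6))
print(round(c, 3), v > 0.99, d)  # 0.998 True 8:00:00
```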
// 04 — validation trace

Catching bad data
before delivery.

A live trace of a DataFlirt validation worker processing a batch of scraped e-commerce records against a strict schema contract.

Soda Core · JSON validation · Quarantine
edge.dataflirt.io — live
CAPTURED
// batch ingestion
batch.id: "b_7892_prod"
records.total: 45,000

// schema validation phase
check.types: passed // all prices are numeric
check.bounds: failed // 12 records with price < 0.01
check.completeness: passed // 0.998 > threshold 0.990

// anomaly detection
metric.avg_price_shift: "-2.4%" // within normal variance
metric.total_volume: "45,000" // matches historical baseline

// routing
records.quarantined: 12 // routed to manual review queue
records.delivered: 44,988
status: DELIVERY SUCCESS
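The type and bounds phases of a trace like this can be approximated in a few lines. This is a simplified sketch, not the actual DataFlirt worker: `route_batch` is a hypothetical helper, and only the `price < 0.01` bound is taken from the trace.

```python
MIN_PRICE = 0.01  # lower bound shown in the trace

def route_batch(records: list) -> tuple:
    """Split a batch into (delivered, quarantined) by type and bounds checks."""
    delivered, quarantined = [], []
    for record in records:
        price = record.get("price")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = isinstance(price, (int, float)) and not isinstance(price, bool)
        if is_numeric and price >= MIN_PRICE:
            delivered.append(record)
        else:
            quarantined.append(record)  # kept for manual review, never dropped
    return delivered, quarantined

batch = [
    {"sku": "a", "price": 12.99},
    {"sku": "b", "price": 0.0},      # fails the bounds check
    {"sku": "c", "price": "12.99"},  # fails the type check
]
delivered, quarantined = route_batch(batch)
print(len(delivered), len(quarantined))  # 1 2
```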
// 05 — failure modes

Where data quality
breaks down.

The most common root causes of data quality degradation in production scraping pipelines, ranked by frequency across our fleet.

PIPELINES MONITORED: 850+
RECORDS/DAY: 140M+
UPDATED: 2026-05-19
01 Schema drift / selector rot
silent failure · Target site changes layout, returning nulls or wrong text
02 Type coercion errors
format shift · Currency symbols or commas breaking numeric parsers
03 Pagination drops
completeness · Infinite scroll or offset logic failing mid-crawl
04 Anti-bot poisoning
accuracy · Fake or decoy prices served to detected bots
05 Stale cache returns
freshness · CDNs serving outdated HTML to the scraper
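Type coercion errors (item 02) are usually the cheapest of these to guard against. The sketch below assumes US-style formatting, where commas are thousands separators and the dot is the decimal separator; `parse_price` is a hypothetical helper. A European-style string such as `1.299,00` would be mis-parsed under that assumption, so locale has to be known per source.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def parse_price(raw: str) -> Optional[Decimal]:
    """Parse strings like '$1,299.00' into a Decimal, or return None."""
    cleaned = re.sub(r"[^\d.,-]", "", raw)  # drop currency symbols and spaces
    cleaned = cleaned.replace(",", "")      # assumption: comma = thousands separator
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # unparseable: flag it instead of writing 0 or NaN

print(parse_price("$1,299.00"))  # 1299.00
print(parse_price("N/A"))        # None
```

`Decimal` is used rather than `float` so that prices survive the round trip without binary rounding error.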
// 06 — our architecture

Validate at the edge,
quarantine the anomalies, deliver the rest.

DataFlirt treats data quality as a continuous integration problem. We do not rely on downstream consumers to find nulls or string-encoded prices. Every record passes through a validation layer that enforces type constraints, bounds checking, and historical anomaly detection. If a target site changes its layout and prices drop by 90%, the pipeline pauses and alerts an engineer. Bad data never silently overwrites good data.
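One simple form of historical anomaly detection compares the batch average against a trailing baseline. The sketch below is an assumption about how such a gate could work, not DataFlirt's implementation: the 25% threshold and the names `avg_price_shift` and `gate_decision` are hypothetical, and the baseline is assumed to be non-zero.

```python
from statistics import mean

def avg_price_shift(batch_prices: list, baseline_avg: float) -> float:
    """Relative change of the batch mean versus a historical baseline."""
    return (mean(batch_prices) - baseline_avg) / baseline_avg

def gate_decision(batch_prices: list, baseline_avg: float, max_shift: float = 0.25) -> str:
    """Return 'pause' when the shift exceeds the assumed tolerance, else 'deliver'."""
    shift = avg_price_shift(batch_prices, baseline_avg)
    return "pause" if abs(shift) > max_shift else "deliver"

# A layout change makes the scraper read a unit price instead of the list price.
print(gate_decision([10.0, 9.0, 11.0], baseline_avg=100.0))   # pause
print(gate_decision([97.0, 98.0, 98.0], baseline_avg=100.0))  # deliver
```

A fixed relative threshold is the bluntest option; seasonal targets usually need a baseline that accounts for expected variance, such as a rolling standard deviation.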

Validation Worker Status

Real-time metrics from a quality gate on a pricing pipeline.

pipeline.id          retail-pricing-eu
schema.version       v4.2.1 · locked
completeness.score   0.997 · passing
type.violations      0 · clean
anomaly.score        0.85 · review
quarantine.queue     42 records pending
delivery.status      writing to s3

// 07 — FAQ

Common
questions.

Common questions about maintaining data quality, handling schema drift, and enforcing strict SLAs in scraping pipelines.

What is the difference between data validity and data accuracy?
Validity means the data matches the expected format (e.g., a price is a positive float). Accuracy means the data matches the real world (e.g., the price is actually $12.99 on the site). A scraper returning $0.00 for every product is perfectly valid, but completely inaccurate. You need both.
How do you handle schema drift without losing data?
By decoupling extraction from delivery. When a selector breaks, the validation layer catches the resulting nulls or type mismatches and routes those records to a quarantine queue. The pipeline alerts our engineers, who fix the selector and replay the raw HTML through the updated extractor. No data is lost, and no bad data is delivered.
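The replay step depends on the quarantine queue storing raw HTML alongside the failed record. The sketch below illustrates the idea with the standard-library HTML parser; the quarantine record shape (`url`, `raw_html`) and the helper names are assumptions, and the class names reuse the `.price-tag` to `.price-val` example.

```python
from html.parser import HTMLParser
from typing import Optional

class PriceExtractor(HTMLParser):
    """Collects the text of the first element carrying `css_class`."""
    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self._capture = False
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.value is None and self.css_class in classes:
            self._capture = True

    def handle_data(self, data):
        if self._capture:
            self.value = data.strip()
            self._capture = False

def extract_price(raw_html: str, css_class: str) -> Optional[str]:
    parser = PriceExtractor(css_class)
    parser.feed(raw_html)
    return parser.value

def replay(quarantine: list, css_class: str) -> list:
    """Re-run extraction over stored raw HTML with the patched selector."""
    return [{"url": q["url"], "price": extract_price(q["raw_html"], css_class)}
            for q in quarantine]

html = '<div><span class="price-val">$12.99</span></div>'
queue = [{"url": "https://example.com/p/1", "raw_html": html}]
print(extract_price(html, "price-tag"))  # None (old selector)
print(replay(queue, "price-val"))        # price recovered as '$12.99'
```

Because the raw HTML is preserved, the fix is a re-parse, not a re-crawl, which matters when the target page has changed again in the meantime.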
What is data downtime?
Data downtime is the period when data is missing, erroneous, or otherwise unusable. In scraping, it usually starts when a target site deploys an update and ends when the scraper is patched and backfilled. Minimizing data downtime requires automated anomaly detection—you cannot wait for a stakeholder to notice a broken dashboard.
How does DataFlirt guarantee completeness?
We define strict completeness thresholds per field in the data contract. If a non-optional field (like a product ID) is missing, the record is rejected. If the overall batch completeness drops below the SLA (e.g., 99%), the entire delivery is halted and flagged for engineering review.
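The two levels described here, per-record rejection and per-batch halting, can be sketched as follows. The field sets and the `evaluate_batch` helper are hypothetical; only the 99% figure comes from the example above.

```python
REQUIRED_FIELDS = {"product_id", "price"}  # hypothetical contract
OPTIONAL_FIELDS = {"rating"}
BATCH_SLA = 0.99                           # example threshold from the text

def evaluate_batch(records: list) -> dict:
    """Reject records missing required fields; halt if batch completeness < SLA."""
    all_fields = REQUIRED_FIELDS | OPTIONAL_FIELDS
    accepted = [r for r in records
                if all(r.get(f) is not None for f in REQUIRED_FIELDS)]
    expected = len(records) * len(all_fields)
    present = sum(1 for r in records for f in all_fields if r.get(f) is not None)
    completeness = present / expected if expected else 0.0
    return {
        "accepted": accepted,
        "rejected": len(records) - len(accepted),
        "completeness": completeness,
        "halt_delivery": completeness < BATCH_SLA,
    }

batch = [{"product_id": i, "price": 9.99, "rating": 4.5} for i in range(100)]
batch[0]["product_id"] = None  # one record violates the contract
result = evaluate_batch(batch)
print(result["rejected"], result["halt_delivery"])  # 1 False
```

A single bad record is rejected without stopping the batch; delivery halts only when missing fields are widespread enough to pull completeness under the threshold.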
Should we drop bad records or quarantine them?
Always quarantine. Dropping records silently destroys your completeness metrics and hides the root cause of the failure. Quarantining preserves the failed state, allowing engineers to inspect the exact HTML that caused the failure, fix the parsing logic, and recover the data.
Does data quality include privacy scrubbing?
Yes. For pipelines operating in GDPR or CCPA jurisdictions, data quality includes ensuring that PII (Personally Identifiable Information) is not accidentally ingested. Our validation layer includes regex-based PII scanners that flag and redact emails, phone numbers, or credit cards before they reach the storage layer.
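A regex-based scanner of this kind can be sketched as follows. The patterns are deliberately simple assumptions, not the ones DataFlirt runs: the phone pattern will also match other long digit runs, and credit card detection is left out because it needs a checksum (Luhn) step to avoid false positives.

```python
import re

# Illustrative patterns only; production scanners need locale-aware rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> tuple:
    """Replace matches with placeholders; return (redacted_text, hit_count)."""
    text, n_email = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[REDACTED_PHONE]", text)
    return text, n_email + n_phone

clean, hits = redact_pii("Contact jane@example.com or +1 415-555-0100")
print(clean)  # Contact [REDACTED_EMAIL] or [REDACTED_PHONE]
print(hits)   # 2
```

Returning the hit count lets the validation layer flag a record for review as well as redact it.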