
What is Data Audit?

Data audit is the systematic evaluation of extracted records against a defined schema, source of truth, and historical baseline to verify accuracy, completeness, and consistency. In scraping pipelines, it acts as the final gatekeeper before data hits the delivery sink. Without continuous auditing, silent failures like type coercion errors or selector drift will corrupt your downstream analytics, turning a high-volume pipeline into a liability rather than an asset.

Data Quality · Schema Validation · Anomaly Detection · ETL · Compliance
// 02 — definitions

Trust, but
verify.

Why fetching the bytes is only half the battle, and how systematic auditing prevents garbage data from poisoning your warehouse.


TL;DR

A data audit evaluates scraped records for schema adherence, statistical anomalies, and source accuracy. It's the difference between knowing your pipeline ran and knowing your pipeline produced valid data. Modern stacks run audits continuously on every micro-batch, quarantining bad records before they merge into the gold layer.

01 Definition & structure
A data audit is an automated process that validates extracted records against predefined rules before they are stored or delivered. It typically consists of three layers: structural validation (checking data types, nulls, and schema adherence), business logic validation (ensuring values fall within expected ranges or match specific enums), and statistical validation (detecting anomalies against historical baselines).
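The three layers can be sketched as three small functions. This is a minimal illustration, not DataFlirt's implementation: the `SCHEMA` mapping, the `ALLOWED_CURRENCIES` set, the positive-price rule, and the 3-sigma threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"sku": str, "price": float, "currency": str}
ALLOWED_CURRENCIES = {"USD", "EUR", "INR"}  # assumed enum, for illustration only


def structural_errors(record):
    """Layer 1: every schema field is present, non-null, and of the expected type."""
    errors = []
    for field, expected in SCHEMA.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing or null")
        elif not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


def business_errors(record):
    """Layer 2: values fall inside expected ranges and enums."""
    errors = []
    price = record.get("price")
    if isinstance(price, float) and price <= 0:
        errors.append("price: must be positive")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency: not in allowed set")
    return errors


def statistical_flag(batch_mean, history, threshold=3.0):
    """Layer 3: flag a batch mean more than `threshold` sigma from the baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return batch_mean != mu
    return abs(batch_mean - mu) / sigma > threshold


good = {"sku": "A-1", "price": 49.99, "currency": "USD"}
bad = {"sku": "A-2", "price": "Out of Stock", "currency": "USD"}
print(structural_errors(good))  # []
print(structural_errors(bad))   # ['price: expected float, got str']
```

Running the structural layer first keeps the later layers simple: range and baseline checks only make sense once the types are known to be right.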
02 How it works in practice
In a modern scraping pipeline, auditing happens continuously. As a micro-batch of data is extracted, it passes through an audit worker. The worker runs a suite of tests (e.g., "is `price` a float?", "is `sku` unique?", "is the null rate for `description` < 5%?"). If the batch passes, it is written to the delivery sink. If it fails, the batch is routed to a quarantine queue, and an alert is fired for engineering review.
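A rough sketch of such a worker follows, using the three example tests from the paragraph above. The function names, the in-memory `sink` and `quarantine` lists, and the `alert` callback are hypothetical stand-ins for a real sink, queue, and paging system.

```python
def audit_batch(batch, max_description_null_rate=0.05):
    """Return the names of failed checks; an empty list means the batch passed."""
    if not batch:
        return ["empty_batch"]
    failures = []
    # Is `price` a float on every record?
    if not all(isinstance(r.get("price"), float) for r in batch):
        failures.append("price_is_float")
    # Is `sku` unique within the batch?
    skus = [r.get("sku") for r in batch]
    if len(set(skus)) != len(skus):
        failures.append("sku_is_unique")
    # Is the null rate for `description` below 5%?
    # Assumption: empty strings count as null here.
    null_descriptions = sum(1 for r in batch if not r.get("description"))
    if null_descriptions / len(batch) >= max_description_null_rate:
        failures.append("description_null_rate")
    return failures


def route(batch, sink, quarantine, alert):
    """Deliver a passing batch; quarantine a failing one and fire an alert."""
    failures = audit_batch(batch)
    if failures:
        quarantine.append(batch)
        alert(f"batch quarantined: {', '.join(failures)}")
        return "QUARANTINED"
    sink.extend(batch)
    return "DELIVERED"


sink, quarantine, alerts = [], [], []
clean = [{"sku": f"S-{i}", "price": 9.99, "description": "ok"} for i in range(20)]
dirty = clean[:19] + [{"sku": "S-0", "price": "N/A", "description": ""}]
print(route(clean, sink, quarantine, alerts.append))
print(route(dirty, sink, quarantine, alerts.append))
```

The whole batch is routed as a unit in this sketch, which matches the behaviour described above; quarantining individual records instead would be a different design choice.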
03 The three pillars of auditing
Effective audits measure three dimensions. Completeness: are all expected fields present? Accuracy: do the extracted values match the source of truth? This is usually verified via spot checks. Consistency: are data types and formats uniform across the entire dataset? A failure in any of these pillars indicates a broken pipeline.
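An accuracy spot check could look something like the sketch below. It assumes a hypothetical `fetch_truth` callable that returns the authoritative price for a SKU (for example, from a manual re-check of the source page); the sample size and the fixed seed are arbitrary choices for reproducibility.

```python
import random


def spot_check_accuracy(records, fetch_truth, sample_size=50, seed=0):
    """Compare a random sample of records with the source of truth.

    Returns the fraction of sampled records whose price matches.
    """
    if not records:
        return 0.0
    rng = random.Random(seed)  # seeded so the same sample can be re-audited
    sample = rng.sample(records, min(sample_size, len(records)))
    matches = sum(1 for r in sample if fetch_truth(r["sku"]) == r["price"])
    return matches / len(sample)


# Hypothetical source of truth, keyed by SKU.
truth = {f"S-{i}": 10.0 + i for i in range(200)}
scraped = [{"sku": sku, "price": price} for sku, price in truth.items()]
print(spot_check_accuracy(scraped, truth.get))
```

A sampled check like this suits accuracy, where each comparison is expensive; completeness and consistency are cheap enough to measure on every record.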
04 How DataFlirt handles it
We treat data contracts as code. Every DataFlirt pipeline has a strictly versioned schema. Our audit layer runs 100% structural validation on every record extracted. If a target site changes its layout and a selector starts returning empty strings instead of prices, our audit engine catches the type coercion failure instantly, quarantines the batch, and pages our on-call engineers. You never receive poisoned data.
05 The silent failure most pipelines miss
Type coercion errors are the most common silent killer in scraping. A site updates its UI, and a price field that used to contain "49.99" now contains "Out of Stock". If your pipeline doesn't audit types, it will either crash downstream or silently insert nulls. Auditing catches this at the edge, turning a silent data corruption event into a loud, actionable alert.
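One way to make this failure loud is to parse prices strictly and count rejections instead of falling back to null. The sketch below uses the standard library's `decimal` module; `parse_price` and `count_type_failures` are illustrative names, and stripping a leading `$` and thousands separators is an assumption about the input format.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Return a Decimal for strings like '49.99' or '$1,299.00'.

    Raises ValueError for anything else, rather than returning None.
    """
    if not isinstance(raw, str):
        raise ValueError(f"expected str, got {type(raw).__name__}")
    cleaned = raw.strip().lstrip("$").replace(",", "")
    try:
        price = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"expected numeric, got string {raw!r}") from None
    # Decimal accepts 'NaN' and 'Infinity'; neither is a valid price.
    if not price.is_finite() or price < 0:
        raise ValueError(f"price out of range: {raw!r}")
    return price


def count_type_failures(raw_prices):
    """Count the values that cannot be coerced to a valid price."""
    failures = 0
    for raw in raw_prices:
        try:
            parse_price(raw)
        except ValueError:
            failures += 1
    return failures


print(parse_price("49.99"))
print(count_type_failures(["49.99", "Out of Stock", "Call for Price", ""]))
```

Raising instead of returning `None` is the point of the design: the caller has to decide what a failure means, so a UI change cannot quietly turn into a column of nulls.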
// 03 — the metrics

How healthy
is the batch?

DataFlirt's audit engine evaluates every micro-batch against historical baselines and schema contracts. These are the core metrics that determine if a batch is delivered or quarantined.

Completeness Score: C = 1 − (null_fields / total_expected_fields)
Drops below 0.99 trigger an immediate selector review. · DataFlirt extraction SLO
Anomaly Z-Score: Z = (x − μ) / σ
Flags sudden spikes in price or volume against a 30-day rolling mean. · Statistical baseline check
Delivery Yield: Y = records_passed / records_extracted
The percentage of data that actually survives the audit layer. · Pipeline health metric
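The three metrics translate almost directly into code. The sketch below is illustrative only: the function names are made up, `history` is assumed to hold the daily values of the 30-day window, and the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def completeness_score(records, expected_fields):
    """C = 1 - (null_fields / total_expected_fields)."""
    total = len(records) * len(expected_fields)
    if total == 0:
        return 0.0
    nulls = sum(1 for r in records for f in expected_fields if r.get(f) is None)
    return 1 - nulls / total


def z_score(x, history):
    """Z = (x - mu) / sigma, with mu and sigma taken from the baseline window."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return (x - mu) / sigma


def delivery_yield(records_passed, records_extracted):
    """Y = records_passed / records_extracted."""
    return records_passed / records_extracted if records_extracted else 0.0


records = [{"sku": "A", "price": 1.0}, {"sku": "B", "price": None}]
print(completeness_score(records, ["sku", "price"]))  # 0.75
print(delivery_yield(990, 1000))                      # 0.99
```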
// 04 — audit trace

Quarantining a
poisoned batch.

A live trace of a DataFlirt audit worker evaluating a 50k-record micro-batch from a retail pipeline. A silent site update caused price strings to parse incorrectly.

Great Expectations · Schema v4 · Quarantine
edge.dataflirt.io — live
CAPTURED
// batch ingestion
batch.id: "audit-ret-092"
records.count: 50,000
schema.target: "retail_product_v4"

// phase 1: structural validation
check.completeness: 0.998 // pass
check.types: 4,102 failures
error.detail: "expected numeric, got string 'Call for Price'"

// phase 2: statistical anomaly detection
metric.avg_price: $0.00 // deviation > 3σ
metric.null_rate_category: 0.01 // pass

// phase 3: routing
status: FAILED
action: QUARANTINE_BATCH
alert: "pagerduty_oncall_data_eng"
delivery.s3: BLOCKED
// 05 — failure modes

What triggers
an audit failure.

Ranked by frequency across DataFlirt's managed pipelines. Structural failures are caught immediately; statistical anomalies require historical context to detect.

PIPELINES MONITORED · 300+ active
CHECKS RUN · 10M+ per day
UPDATED · 2026-05-19
01

Type coercion errors

% of failures · String instead of float, invalid dates
02

Missing mandatory fields

% of failures · Selector drift causing nulls
03

Statistical anomalies

% of failures · Price dropped 90% overnight
04

Referential integrity

% of failures · Orphaned category IDs
05

Stale data

% of failures · Timestamp unchanged for 48h
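The last two failure modes are easy to overlook because every individual record still looks valid. A minimal sketch of both checks follows; the `category_id` field name and the function names are hypothetical, and the 48-hour window comes from the example above.

```python
from datetime import datetime, timedelta, timezone


def orphaned_category_ids(products, known_category_ids):
    """Referential integrity: category IDs used by products but absent from the reference set."""
    return sorted(
        {p["category_id"] for p in products if p["category_id"] not in known_category_ids}
    )


def is_stale(last_changed, now, max_age=timedelta(hours=48)):
    """Stale data: the record's timestamp has not moved for `max_age` or longer."""
    return now - last_changed >= max_age


products = [
    {"sku": "A", "category_id": "c1"},
    {"sku": "B", "category_id": "c9"},
]
print(orphaned_category_ids(products, {"c1", "c2"}))  # ['c9']

now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(hours=50), now))  # True
print(is_stale(now - timedelta(hours=2), now))   # False
```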
// 06 — continuous auditing

Audit every record,

not just the final table.

Batch-level spot checks are insufficient for high-velocity scraping. DataFlirt implements continuous auditing at the micro-batch level. Every record is evaluated against a versioned data contract before it touches the delivery sink. If a target site pushes a silent update that breaks price extraction, the audit layer catches the type mismatch, quarantines the affected records, and alerts our engineers—ensuring your downstream models never ingest poisoned data.

Audit Worker Status

Real-time evaluation of an incoming data stream.

worker.id audit-node-14
schema.version v4.2
records.scanned 1.2M/hr
quarantine.rate 0.04%
anomaly.detection active
last_failure 14h ago
sink.status writing to S3


// 07 — FAQ

Common
questions.

About data auditing, schema validation, anomaly detection, and how DataFlirt ensures data quality at scale.

What is the difference between data cleaning and data auditing?
Data cleaning transforms and fixes data — stripping whitespace, normalising currencies, or imputing missing values. Data auditing measures and verifies the data against strict rules. Auditing tells you if the cleaning worked and if the scraper is functioning correctly. They are distinct pipeline stages.
How much data should I audit?
100% for structural checks like data types, null constraints, and enum matches. Statistical checks (like anomaly detection on price distributions) can be sampled, but modern compute allows full-table scans for most pipelines. Never rely solely on 1% spot checks for structural integrity.
What happens when an audit fails?
The batch is quarantined. It should never be merged into the production table or delivered to the client. Engineers investigate the failure, fix the scraper or parser, and backfill the missing data. Delivering bad data is always worse than delivering late data.
How does DataFlirt handle schema drift during an audit?
We version our schemas. If a site changes and fields go missing, the audit fails, triggering an automated alert. We update the selector, bump the schema version, and re-process the quarantined data. The client's data contract remains intact.
Can I define custom audit rules for my pipeline?
Yes. Beyond standard type and null checks, clients define business-logic rules — e.g., "price must be between $10 and $5000" or "stock status must match one of three specific strings". If the target site violates these rules, we catch it before delivery.
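As an illustration of what such rules might look like in code, the sketch below expresses the two examples as named predicates. This is not DataFlirt's rule format; the `RULES` mapping, the field names, and the three stock-status strings are hypothetical.

```python
# Hypothetical allowed values; the real three strings are defined per client.
STOCK_STATUSES = {"in_stock", "out_of_stock", "preorder"}

# Each rule maps a name to a predicate that returns True when the record is valid.
RULES = {
    "price_in_range": lambda r: isinstance(r.get("price"), (int, float))
    and 10 <= r["price"] <= 5000,
    "stock_status_enum": lambda r: r.get("stock_status") in STOCK_STATUSES,
}


def violated_rules(record, rules=RULES):
    """Return the names of every business rule the record breaks."""
    return [name for name, check in rules.items() if not check(record)]


print(violated_rules({"price": 49.99, "stock_status": "in_stock"}))  # []
print(violated_rules({"price": 7.5, "stock_status": "sold"}))
# ['price_in_range', 'stock_status_enum']
```

Keeping rules as named predicates means a failed audit can report exactly which rule broke, which is what an engineer needs when triaging a quarantined batch.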
Why not just fix the data downstream in the warehouse?
Garbage in, garbage out. Fixing data downstream creates massive technical debt, breaks automated dashboards, and makes root-cause analysis impossible when the scraper was the actual issue. Auditing must happen at the edge, immediately after extraction.