
What is Outlier Detection?

Outlier detection is the automated process of identifying extracted records that deviate significantly from an expected statistical distribution or business logic bounds. In scraping pipelines, it acts as the final defense against silent schema drift, currency conversion errors, and promotional edge cases that bypass standard type validation. If a $1,200 laptop is scraped as $12.00 because a selector grabbed the monthly finance installment instead of the base price, outlier detection quarantines the record before it poisons your data warehouse.

Data Quality · Anomaly Detection · Z-Score · Quarantine · ETL
// 02 — definitions

Catching the quiet failures.

Why type validation isn't enough, and how statistical bounds prevent structurally valid but logically broken data from reaching production.


TL;DR

Outlier detection sits between the extraction and delivery layers. While schema validation ensures a field is a number, outlier detection ensures that number makes sense in context. It uses historical distributions, peer comparisons, and hard bounds to flag anomalies like a 90% price drop or a 10x spike in review counts, quarantining them for review.

01 Definition & structure

Outlier detection is the process of identifying data points that differ significantly from the majority of the dataset. In web scraping, it is a critical post-extraction step used to catch parsing errors, selector drift, and edge-case site logic that produces technically valid but logically incorrect data.

A typical outlier detection system evaluates:

  • Historical variance — does this value deviate from its own 30-day moving average?
  • Peer variance — does this value deviate from similar items in the same scrape batch?
  • Absolute bounds — does this value violate predefined business logic (e.g., age < 0)?
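
The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names, the sample values, and the choice to score both the historical and the peer comparison with a median/MAD-based score (rather than a moving average) are assumptions made for the example.

```python
from statistics import median

def modified_z(value, reference):
    """Median/MAD-based score of `value` against a reference sample."""
    med = median(reference)
    mad = median(abs(v - med) for v in reference)
    if mad == 0:
        # Degenerate sample: any deviation at all counts as extreme.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad

def evaluate(value, history, peers, lower, upper, threshold=3.5):
    """Return the names of the checks that `value` fails (hypothetical API)."""
    flags = []
    if not (lower <= value <= upper):
        flags.append("absolute_bounds")
    if history and abs(modified_z(value, history)) > threshold:
        flags.append("historical_variance")
    if peers and abs(modified_z(value, peers)) > threshold:
        flags.append("peer_variance")
    return flags

history = [84000, 84500, 85000, 83500, 86000]  # made-up 30-day values
peers = [79000, 91000, 84000, 88000, 82000]    # made-up same-batch items
print(evaluate(1499, history, peers, lower=1000, upper=500000))
```

In this example the value 1499 sits inside the hard bounds, so only the two statistical checks fire. That is the case hard limits alone would miss.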

02 Statistical vs. Heuristic bounds

There are two primary ways to catch outliers. Statistical bounds use measures such as Z-scores, IQR, and MAD to calculate what is "normal" dynamically from the data itself. This suits highly variable datasets where hard limits are impossible to define.

Heuristic bounds use hardcoded business rules (e.g., price > 0 and price < 100000). In production scraping, heuristics are often safer and faster. A hybrid approach, with heuristics for absolute limits and statistics for sudden relative changes, generally produces fewer false positives than either method alone.

03 The silent schema drift problem

Most outliers in scraping aren't caused by the target site entering bad data; they are caused by silent schema drift. If an e-commerce site redesigns its product card to show the "Monthly Installment" in the large font previously used for the "Total Price", your CSS selector will happily extract the installment.

Because the extracted string is still a valid currency format, schema validation passes. Without outlier detection, that 90% price drop is written directly to your database, potentially triggering automated repricing algorithms or corrupting market analysis.

04 How DataFlirt handles it

We treat outlier detection as a core component of our delivery SLA. Every numeric field in a DataFlirt pipeline can be configured with MAD-based statistical checks or hard boundary rules. When a record triggers an alert, we don't drop it—we quarantine it in a dead-letter queue.

Our data engineers review the DLQ daily. If the outlier is a genuine extraction error, we patch the pipeline and backfill. If it's a real-world anomaly, we release the record. This human-in-the-loop approach ensures high data integrity without permanently losing valid edge cases.
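
The quarantine-then-review flow can be pictured with a small in-memory sketch. The class, its method names, and the record shape are hypothetical stand-ins chosen for illustration. A real dead-letter queue would sit on a message broker or a database table.

```python
from dataclasses import dataclass, field

@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a dead-letter queue (hypothetical design)."""
    pending: dict = field(default_factory=dict)
    released: list = field(default_factory=list)
    backfill_ids: list = field(default_factory=list)

    def quarantine(self, record, reason):
        # Keep the full record so nothing is lost while it awaits review.
        self.pending[record["id"]] = {**record, "reason": reason}

    def review(self, record_id, is_extraction_error):
        record = self.pending.pop(record_id)
        if is_extraction_error:
            # Bad scrape: schedule re-extraction once the pipeline is patched.
            self.backfill_ids.append(record_id)
        else:
            # Real-world anomaly: release the original record for delivery.
            self.released.append(record)
        return record

dlq = DeadLetterQueue()
dlq.quarantine({"id": "sku_1", "price": 1499.0}, "MAD threshold exceeded")
dlq.quarantine({"id": "sku_2", "price": 42000.0}, "MAD threshold exceeded")
dlq.review("sku_1", is_extraction_error=True)   # reviewer: selector grabbed the wrong element
dlq.review("sku_2", is_extraction_error=False)  # reviewer: genuine flash sale
```

The point of the design is that a quarantined record is never deleted. It either returns to the delivery path unchanged or is replaced by a backfilled value.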

05 The false positive trap

The biggest risk in outlier detection is tuning the sensitivity too high, resulting in false positives where legitimate data is quarantined. Using standard deviation (Z-score) on scraping data is a common mistake because scraping data is rarely normally distributed—it is often heavily skewed (e.g., a few very expensive items in a catalog of cheap items).

This is why robust statistics like Median Absolute Deviation (MAD) are preferred. The median is not dragged upward by extreme values the way an average is, making the baseline much more stable against the very outliers it is trying to detect.
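
A small Python comparison makes the stability argument concrete. The price history is invented for the example, and the ±3 cut-off for the classic Z-score is a common convention assumed here, not something prescribed above.

```python
from statistics import mean, median, pstdev

# Made-up price history for one SKU; the last value is a bad scrape that
# slipped into the baseline.
history = [84000, 84500, 85000, 83500, 86000, 84200, 1499]

avg, sd = mean(history), pstdev(history)
med = median(history)
mad = median(abs(v - med) for v in history)

def z_score(x):
    # Classic Z-score: baseline is the mean, spread is the standard deviation.
    return (x - avg) / sd

def modified_z(x):
    # Robust score: baseline is the median, spread is the MAD.
    return 0.6745 * (x - med) / mad

candidate = 1499  # the same extraction error arrives again
print(f"mean={avg:.0f} median={med}")
print(f"z={z_score(candidate):.2f} modified_z={modified_z(candidate):.2f}")
```

With a single bad record in the baseline, the mean falls to roughly 72,700 while the median stays at 84,200. The classic Z-score for the repeated error comes out around -2.4, inside a ±3 cut-off, so it would be missed. The MAD-based score is roughly -80.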

// 03 — the math

How to flag an anomaly.

DataFlirt uses a combination of robust statistics and rolling historical windows to score incoming records. We prefer Median Absolute Deviation (MAD) over standard Z-scores because scraping data is rarely normally distributed and means are easily skewed by the outliers themselves.

Median Absolute Deviation (MAD)
$$\mathrm{MAD} = \operatorname{median}\bigl(\lvert x_i - \operatorname{median}(X) \rvert\bigr)$$
A robust measure of variability. Less sensitive to extreme outliers than standard deviation. Source: robust statistics.

Modified Z-Score
$$M_i = \frac{0.6745\,\bigl(x_i - \operatorname{median}(X)\bigr)}{\mathrm{MAD}}$$
Records with $\lvert M_i \rvert > 3.5$ are typically flagged as potential outliers. Source: Iglewicz and Hoaglin, 1993.

DataFlirt Quarantine Threshold
$$Q = \bigl(\lvert M_i \rvert > 3.5\bigr) \;\lor\; \bigl(x_i \notin [B_{\min}, B_{\max}]\bigr)$$
Triggers quarantine if statistically anomalous OR outside hard business bounds. Source: DataFlirt delivery pipeline.
// 04 — pipeline trace

Quarantining a price collapse.

A live trace of an extraction worker processing an e-commerce catalog. The selector is valid, the type is correct, but the value is a statistical anomaly due to a site layout change showing EMI instead of full price.

price validation · MAD scoring · quarantine
edge.dataflirt.io — live
CAPTURED
// record extraction
record.id: "sku_99482_IN"
field.price.raw: "₹1,499/mo"
field.price.parsed: 1499.00 // type: float64

// historical context fetch
history.window: "30d"
history.median: 84500.00
history.mad: 1200.00

// outlier evaluation
eval.modified_z: -46.6 // threshold: ±3.5
eval.delta_pct: -98.2%

// routing decision
status: QUARANTINED
reason: "MAD threshold exceeded on field 'price'"
action: "routed to dead-letter queue for manual review"
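
The evaluation fields in this trace can be recomputed from its inputs. The short Python sketch below does that arithmetic. The `score` function and the "DELIVERED" status label are assumptions for illustration; only the three input numbers and the 3.5 threshold come from the trace.

```python
def score(parsed, hist_median, hist_mad, threshold=3.5):
    """Recompute the trace's evaluation fields from its inputs."""
    modified_z = 0.6745 * (parsed - hist_median) / hist_mad
    delta_pct = 100.0 * (parsed - hist_median) / hist_median
    status = "QUARANTINED" if abs(modified_z) > threshold else "DELIVERED"
    return modified_z, delta_pct, status

modified_z, delta_pct, status = score(1499.00, 84500.00, 1200.00)
print(f"eval.modified_z: {modified_z:.2f}")
print(f"eval.delta_pct: {delta_pct:.1f}%")
print(f"status: {status}")
```

The modified Z-score works out to about -46.65, which the trace shows as -46.6, and the change against the 30-day median is about -98.2%. Both are far beyond the ±3.5 threshold.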
// 05 — failure modes

Where the bad numbers come from.

Ranked by frequency across DataFlirt's e-commerce and real estate pipelines. Most outliers aren't random noise — they are structural extraction errors masquerading as valid data.

PIPELINES MONITORED · 300+ active
QUARANTINE RATE · 0.14% of records
UPDATED · 2026-05-19
01 Promotional pricing / EMI text: selector grabs '₹999/mo' instead of '₹89,999'
02 Unit of measure mismatch: price per gram vs price per kilogram
03 Decimal parsing errors: EU comma vs US period decimal separators
04 Default / placeholder values: system defaults like 999999 or 0
05 Currency symbol misinterpretation: USD vs CAD vs AUD parsed as generic '$'
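
Two of these failure modes, instalment text and decimal separators, can be partly caught at parse time, before any statistics run. The sketch below is a hedged illustration: `parse_price`, the warning labels, and the suffix pattern are hypothetical, and it assumes the decimal separator is declared per source and never guessed.

```python
import re

# Hypothetical pattern for instalment-style suffixes such as "/mo" or "per month".
PER_PERIOD = re.compile(
    r"/\s*(?:mo|month|wk|week|yr|year)\b|\bper\s+(?:month|week|year)\b",
    re.IGNORECASE,
)

def parse_price(raw, decimal_sep="."):
    """Return (value, warnings); value is None when no number can be parsed."""
    warnings = []
    if PER_PERIOD.search(raw):
        warnings.append("per_period_suffix")
    match = re.search(r"\d[\d.,]*", raw)
    if match is None:
        return None, warnings + ["no_number"]
    number = match.group().rstrip(".,")
    # Drop the thousands separator, then normalise the decimal separator.
    thousands_sep = "," if decimal_sep == "." else "."
    number = number.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return float(number), warnings
    except ValueError:
        return None, warnings + ["unparseable"]

print(parse_price("₹999/mo"))                     # numeric, but carries a warning
print(parse_price("€1.499,00", decimal_sep=","))  # separator declared for the source
print(parse_price("€1.499,00"))                   # wrong separator assumed
```

The last call shows why the statistical layer is still needed. With the wrong separator assumed, the string parses without any warning to a value a thousand times too small.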
// 06 — our architecture

Trust the schema, verify the distribution.

DataFlirt runs outlier detection asynchronously on the delivery queue. We maintain a rolling 30-day distribution profile for every numeric field in a pipeline. When a record exceeds the configured MAD threshold, it is routed to a dead-letter queue for human review. This ensures that a sudden, legitimate market shift doesn't break the pipeline, but a broken CSS selector doesn't corrupt your historical dataset.
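
A rolling distribution profile of this kind might look like the following Python sketch. The `RollingProfile` class is a hypothetical illustration. It assumes samples arrive in chronological order and keeps them in memory, which a real pipeline would likely replace with a persistent store.

```python
from collections import deque
from datetime import date, timedelta
from statistics import median

class RollingProfile:
    """Keeps the last `window_days` of values for one numeric field."""

    def __init__(self, window_days=30):
        self.window = timedelta(days=window_days)
        self.samples = deque()  # (day, value) pairs, oldest first

    def add(self, day, value):
        self.samples.append((day, value))
        # Evict samples that have aged out of the window.
        while day - self.samples[0][0] >= self.window:
            self.samples.popleft()

    def stats(self):
        """Return (median, MAD) of the current window; needs at least one sample."""
        values = [v for _, v in self.samples]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        return med, mad

profile = RollingProfile(window_days=30)
start = date(2026, 4, 1)
for i in range(40):
    profile.add(start + timedelta(days=i), 100 + i)  # made-up daily values
print(len(profile.samples), profile.stats())
```

After 40 daily samples only the most recent 30 remain, so the median and MAD track the current price level rather than stale history.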

outlier-eval.log

Evaluation metrics for a batch of real estate listings.

batch.id: re-IN-blr-042
records.total: 14,200
schema.passed: 14,200
eval.field: price_per_sqft
outliers.flagged: 12 records
outliers.reason: MAD > 3.5
batch.status: 14,188 delivered · 12 quarantined


// 07 — FAQ

Common questions.

About statistical bounds, handling false positives, and how DataFlirt prevents bad data from reaching your warehouse.

What is the difference between schema validation and outlier detection?
Schema validation checks whether the data is the right type (e.g., is this field a float?). Outlier detection checks whether the value makes logical sense (e.g., is it normal for a car to cost $4.50?). A record can pass schema validation cleanly and still be a catastrophic outlier because a selector grabbed the wrong element on the page.
How do you handle legitimate price drops, like Black Friday sales?
We use a combination of statistical bounds and hard business rules. A 50% drop might trigger a warning, but if it's Black Friday, the entire distribution shifts simultaneously. Our models look at peer records in the same batch. If one SKU drops 90%, it's an outlier. If 40% of the catalog drops 50%, the median shifts, and the records pass.
What happens to quarantined records?
They are routed to a dead-letter queue (DLQ) and excluded from the client's delivery payload. Our data engineering team reviews the DLQ daily. If the outlier is a genuine extraction error, we fix the selector and backfill. If it's a legitimate anomaly (e.g., a flash sale), we force-approve the record and it is delivered in the next batch.
Can I set custom business rules instead of statistical bounds?
Yes. For many pipelines, hard bounds are safer than statistical models. If you are scraping commercial real estate, you can set a hard rule: price_per_sqft must be between ₹2,000 and ₹50,000. Anything outside that range is quarantined immediately, bypassing the MAD calculation entirely.
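A hard rule of this kind can be expressed as a short-circuit in front of the statistical check. The Python sketch below is illustrative only: the `HARD_BOUNDS` table, the `route` function, and its return values are hypothetical names, and the scoring function is passed in so that it is only called when the hard bounds pass.

```python
# Hypothetical per-field rule table; the range is the one quoted above.
HARD_BOUNDS = {"price_per_sqft": (2_000, 50_000)}

def route(field_name, value, score_fn, threshold=3.5):
    """Return (status, reason). `score_fn` runs only if the hard bounds pass."""
    low, high = HARD_BOUNDS.get(field_name, (float("-inf"), float("inf")))
    if not (low <= value <= high):
        # Out of bounds: quarantine immediately, never compute the MAD score.
        return "QUARANTINED", f"hard bound violated on field '{field_name}'"
    if abs(score_fn(value)) > threshold:
        return "QUARANTINED", f"MAD threshold exceeded on field '{field_name}'"
    return "DELIVERED", None

calls = []

def fake_score(value):
    # Stand-in for a real modified Z-score; records that it was invoked.
    calls.append(value)
    return 0.2

print(route("price_per_sqft", 120, fake_score))   # fails the hard bound
print(route("price_per_sqft", 8000, fake_score))  # passes both checks
```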
How much latency does outlier detection add to the pipeline?
Almost none. The historical medians and MAD values are pre-computed and cached in Redis. Evaluating a new record against these cached thresholds takes less than 2 milliseconds per record. It runs inline during the transform step before S3 delivery.
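The reason the per-record check is cheap is that the MAD threshold can be converted into a fixed value range ahead of time. The sketch below illustrates the idea in Python, with a plain dictionary standing in for the cache. The key layout, function names, and history values are assumptions for the example.

```python
from statistics import median

def precompute_limits(history, threshold=3.5):
    """Turn |modified z| <= threshold into a fixed (low, high) value range."""
    med = median(history)
    mad = median(abs(v - med) for v in history)
    half_width = threshold * mad / 0.6745
    return med - half_width, med + half_width

# A plain dict stands in for the cache; keys are (pipeline, field), hypothetical.
cache = {}
history = [84000, 84500, 85000, 83500, 86000]  # made-up 30-day values
cache[("catalog", "price")] = precompute_limits(history)

def is_outlier(key, value):
    # Per-record work is one lookup and two comparisons.
    low, high = cache[key]
    return not (low <= value <= high)

print(is_outlier(("catalog", "price"), 1499))
print(is_outlier(("catalog", "price"), 84700))
```

The sort needed for the median and MAD happens once per refresh of the window, so the cost of scoring a record does not grow with the size of the history.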
Does DataFlirt use machine learning for outlier detection?
For standard numeric fields, no — robust statistics like MAD are faster, more interpretable, and easier to debug. We use ML (specifically Isolation Forests) only for multivariate anomaly detection, where a combination of fields (e.g., weight, dimensions, and price) look normal individually but are anomalous when combined.