
What is Data Downtime?

Data downtime is the period when a data pipeline delivers missing, stale, or structurally invalid records to downstream consumers. Unlike application downtime, where a server returns a 500 error, data downtime is often silent: the pipeline runs successfully, but the payload is poisoned by upstream schema drift or anti-bot tarpits. If your pricing models consume stale data for 12 hours before anyone notices, that's at least 12 hours of data downtime, before the time to resolve is even counted.

Data Quality · Pipeline Observability · Schema Drift · SLA · Silent Failures
// 02 — definitions

Silent failures.

When the scraper runs perfectly but the data is garbage, the pipeline is down. Here is how to measure and mitigate it.


TL;DR

Data downtime measures the time your data is inaccurate, missing, or stale, regardless of whether the infrastructure is running. In web scraping, it is almost entirely driven by silent failures: selector rot, type coercion errors, and undetected anti-bot mitigations. The goal is to catch these anomalies at the extraction layer before they pollute the data warehouse.

01 Definition & structure

Data downtime is the period during which a data system is delivering compromised data. It is calculated as the sum of Time to Detect (TTD) and Time to Resolve (TTR). In the context of web scraping, data is considered "down" if it violates any of three constraints:

  • Completeness: Expected fields or records are missing (e.g., pagination broke).
  • Accuracy/Validity: The data is malformed or typed incorrectly (e.g., a string instead of a float).
  • Freshness: The data is older than the required SLA (e.g., scraping a cached page).
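
As a rough illustration, the three constraints above could be checked per record along the following lines. This is a minimal Python sketch, not DataFlirt's validator: the field names (`sku`, `price`, `fetched_at`), the `check_record` helper, and the 12-hour freshness window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("sku", "price", "fetched_at")  # hypothetical schema
MAX_AGE = timedelta(hours=12)                     # hypothetical freshness SLA


def check_record(record: dict, now: datetime) -> list[str]:
    """Return the constraints this record violates (empty list if healthy)."""
    violations = []

    # Completeness: every required field must be present and non-null.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        violations.append("completeness")

    # Validity: price must be numeric, not a string such as "Contact for Price".
    price = record.get("price")
    if price is not None and (
        isinstance(price, bool) or not isinstance(price, (int, float))
    ):
        violations.append("validity")

    # Freshness: the record must have been fetched within the SLA window.
    fetched_at = record.get("fetched_at")
    if isinstance(fetched_at, datetime) and now - fetched_at > MAX_AGE:
        violations.append("freshness")

    return violations
```

A record with a string price fetched 13 hours ago would come back with both a validity and a freshness violation, which is the signal a pipeline would use to mark the data as "down".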

02 The silent failure problem

Software engineers are trained to monitor infrastructure: CPU, memory, HTTP 5xx rates. But a scraper can have 100% uptime, return HTTP 200 OKs, and successfully write to a database while delivering absolute garbage. If a target site changes a CSS class name, the scraper might quietly write null for every price field. The infrastructure dashboard is green, but the data pipeline is experiencing a catastrophic outage. This is why data downtime requires specialized observability.

03 Measuring the business impact

The cost of data downtime scales non-linearly with Time to Detect (TTD). If bad data is caught at the extraction layer, the cost is just the compute time to re-run the job. If bad data makes it into the data warehouse, it corrupts downstream dashboards and ML models. If it reaches the end-user or an automated pricing engine, the cost becomes reputational and financial. Shifting detection as far upstream as possible, to the extraction layer itself, is the only way to control this cost.

04 How DataFlirt handles it

We treat data contracts with the same severity as API contracts. Every DataFlirt pipeline runs an extraction validator that asserts the schema of every single record in real-time. If a batch of records violates the completeness or type thresholds, a circuit breaker trips. The bad data is quarantined, the delivery to the client's S3 bucket is halted, and our on-call engineers are paged. We guarantee that you will never ingest a poisoned dataset from our infrastructure.
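
The circuit-breaker behaviour described above can be sketched roughly as follows. This is an illustrative Python outline under stated assumptions, not DataFlirt's actual implementation: `BatchResult`, `is_valid`, and `process_batch` are hypothetical names, and the 0.99 threshold is borrowed from the validation trace later on this page.

```python
from dataclasses import dataclass, field

COMPLETENESS_THRESHOLD = 0.99  # assumed; taken from the trace on this page


@dataclass
class BatchResult:
    delivered: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)
    tripped: bool = False


def is_valid(record: dict) -> bool:
    # Hypothetical contract: price must be present and typed as a float.
    return isinstance(record.get("price"), float)


def process_batch(records: list[dict],
                  threshold: float = COMPLETENESS_THRESHOLD) -> BatchResult:
    """Deliver a batch only if enough of its records satisfy the contract."""
    if not records:
        return BatchResult()

    valid = [r for r in records if is_valid(r)]
    invalid = [r for r in records if not is_valid(r)]

    if len(valid) / len(records) < threshold:
        # Breaker trips: quarantine the whole batch and deliver nothing.
        return BatchResult(quarantined=list(records), tripped=True)

    # Within bounds: deliver valid records, set the few bad ones aside.
    return BatchResult(delivered=valid, quarantined=invalid)
```

In this sketch a tripped breaker withholds the entire batch rather than delivering the valid remainder, which mirrors the halt-and-quarantine preference stated above. Paging and S3 delivery are left out.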

05 Did you know?

According to industry surveys, over 70% of data downtime incidents are first discovered by business stakeholders or end-users, not by the data engineering team. This "silent failure" dynamic is the primary reason trust degrades between data consumers and data producers. Implementing automated schema validation is the fastest way to reverse this trend.

// 03 — the metrics

How to quantify data downtime.

Data downtime isn't just a binary state; it's a function of time, volume, and severity. DataFlirt tracks these metrics per pipeline to enforce our data quality SLAs.

Total Data Downtime = TTD + TTR
Time to Detect + Time to Resolve. TTD is usually the bottleneck. (Standard Data Engineering Metric)

Data Reliability Score = 1 − (Downtime_hours / Total_hours)
Uptime for data. 99.9% reliability means at most about 43 minutes of bad data in a 30-day month. (DataFlirt SLA Definition)

Poisoned Record Ratio = Records_failed_schema / Total_records_extracted
Used to trigger automated circuit breakers when anomalies spike. (DataFlirt Extraction Validator)
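
To make the arithmetic concrete, the three formulas above translate directly into a few small functions. This is a sketch only: the function names are illustrative, and `allowed_downtime_minutes` is an added helper (not a metric named on this page) that inverts the reliability score to show where the "about 43 minutes" figure comes from.

```python
def total_downtime_hours(ttd_hours: float, ttr_hours: float) -> float:
    """Total Data Downtime = TTD + TTR."""
    return ttd_hours + ttr_hours


def reliability_score(downtime_hours: float, total_hours: float) -> float:
    """Data Reliability Score = 1 - (Downtime_hours / Total_hours)."""
    return 1 - downtime_hours / total_hours


def allowed_downtime_minutes(target_score: float, total_hours: float) -> float:
    """Downtime budget implied by a target reliability score (helper, assumed)."""
    return (1 - target_score) * total_hours * 60


def poisoned_record_ratio(records_failed_schema: int, total_records: int) -> float:
    """Poisoned Record Ratio = Records_failed_schema / Total_records_extracted."""
    if total_records == 0:
        return 0.0  # nothing extracted, so nothing can be poisoned
    return records_failed_schema / total_records


# A 30-day month has 720 hours; a 99.9% target leaves roughly 43.2 minutes.
budget = allowed_downtime_minutes(0.999, 720)
```

For example, 12 hours of detection delay plus 2 hours of repair gives 14 hours of downtime, or a reliability score of about 0.981 over a 720-hour month.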
// 04 — pipeline observability

Catching schema drift before it ships.

A live trace of DataFlirt's extraction validator catching a silent failure. The target site changed their price formatting, triggering a quarantine event instead of shipping nulls.

schema validation · quarantine · alerting
// job.init: extract-b2b-pricing-eu
schema.version: "v4.2"
records.expected: 150,000

// extraction phase
batch_01.status: processing
record.id: "sku-99281"
field.price.raw: "Contact for Price" // upstream DOM change

// validation phase
assert.type(price): FAILED // expected float, got string
assert.completeness: 0.82 // dropped below 0.99 threshold

// circuit breaker
action: QUARANTINE_BATCH
delivery.s3: BLOCKED // preventing warehouse pollution
alert.pagerduty: DISPATCHED "Schema drift detected on price field"
pipeline.state: DATA_DOWNTIME_ACTIVE
// 05 — root causes

Where data downtime originates.

Ranked by frequency across DataFlirt's incident post-mortems. Infrastructure failures are rare; upstream target changes are the dominant cause of poisoned datasets.

INCIDENTS ANALYSED · 1,200+ post-mortems
AVG TTD (INDUSTRY) · 4.5 days
AVG TTD (DATAFLIRT) · < 5 minutes
01 Schema drift / selector rot · silent failure · Target site redesigns DOM, fields return null
02 Anti-bot silent tarpits · silent failure · 200 OK returned, but HTML contains fake data
03 Type coercion errors · transform failure · Currency symbols break float parsing downstream
04 Pagination breaks · completeness failure · Crawler stops at page 10 instead of 10,000
05 Target server caching · freshness failure · Target CDN serves stale HTML to the scraper
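
For the type coercion entry in the ranking above, one way to avoid a silent downstream break is to parse prices strictly and fail loudly on anything non-numeric. The sketch below is an assumption-laden example, not a DataFlirt component: `parse_price` is a hypothetical helper, it only strips a few leading currency symbols, and it treats commas as thousands separators (so it would misread European-style decimal commas).

```python
from decimal import Decimal, InvalidOperation

CURRENCY_SYMBOLS = "$€£"  # assumed set; extend for other markets


def parse_price(raw: str) -> Decimal:
    """Parse a scraped price string, raising ValueError instead of guessing."""
    # Drop surrounding whitespace, a leading currency symbol, and
    # thousands separators (assumes "," is never a decimal mark).
    cleaned = raw.strip().lstrip(CURRENCY_SYMBOLS).replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a numeric price: {raw!r}") from None
    if not value.is_finite():
        # Decimal accepts "NaN" and "Infinity"; neither is a usable price.
        raise ValueError(f"not a finite price: {raw!r}")
    return value
```

Raising on input such as "Contact for Price" turns a would-be null or garbage value into an explicit validation failure that can be counted and alerted on.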
// 06 — our architecture

Validate at the edge, never ship a broken schema.

DataFlirt eliminates silent data downtime by shifting validation to the extraction worker. Every record is asserted against a versioned schema contract before it hits the delivery queue. If a target site redesigns their product page and the price selector starts returning a string instead of a float, the worker quarantines the record and halts the job. We prefer a loud, immediate pipeline failure over quietly delivering a poisoned dataset to your warehouse.

Extraction Validator State

Real-time schema assertion on a B2B pricing pipeline.

pipeline.id df-pricing-eu-09
schema.contract v4.2 (enforced)
records.scanned 148,291
type.violations 0
null.rate 0.002 (within bounds)
circuit_breaker armed
downtime.status 0 seconds

// 07 — FAQ

Common questions.

Common questions about measuring data downtime, setting SLAs, and how DataFlirt guarantees data quality at scale.

What is the difference between infrastructure downtime and data downtime?
Infrastructure downtime means the scraper crashed or the server is offline (HTTP 500s, timeouts). Data downtime means the infrastructure is running perfectly, but the data it produces is wrong, missing, or stale. Data downtime is far more dangerous because it doesn't trigger standard DevOps alerts like CPU spikes or 5xx error rates.
How do you detect stale data if the schema is perfectly valid?
By tracking entity-level mutation rates. If a highly volatile e-commerce category suddenly shows zero price changes across 10,000 SKUs for 48 hours, the schema is valid but the data is likely stale (often because the target's CDN is serving cached responses to our requests). We use statistical anomaly detection on the payload values, not just the schema types, to catch freshness failures.
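
A very simplified version of a mutation-rate check might look like the sketch below. It compares two price snapshots keyed by SKU. The function names and the 1% minimum expected change rate are assumptions for illustration; a production check would use historical baselines per category rather than a fixed cutoff.

```python
def mutation_rate(previous: dict[str, float], current: dict[str, float]) -> float:
    """Fraction of SKUs present in both snapshots whose price changed."""
    shared = previous.keys() & current.keys()
    if not shared:
        return 0.0  # no overlap is treated as "no observed change"
    changed = sum(1 for sku in shared if previous[sku] != current[sku])
    return changed / len(shared)


def looks_stale(previous: dict[str, float],
                current: dict[str, float],
                min_expected_rate: float = 0.01) -> bool:
    """Flag a snapshot when far fewer prices moved than the category usually sees."""
    return mutation_rate(previous, current) < min_expected_rate
```

Two identical snapshots of a normally volatile category yield a mutation rate of zero and are flagged, even though every record passes schema validation.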
Who is responsible for data downtime in a vendor relationship?
If you buy raw proxies, you are responsible. If you buy a managed pipeline from DataFlirt, we are responsible. Our SLAs cover data reliability, not just uptime. If a target site changes their layout and the data drops, that counts against our SLA. We own the Time to Detect (TTD) and Time to Resolve (TTR).
How does DataFlirt minimize Time to Resolve (TTR) when a site changes?
We decouple the extraction logic from the fetch layer. When a selector breaks, we don't need to re-crawl the target. We patch the selector in our central registry, and the extraction workers immediately re-process the raw HTML payloads stored in our short-term raw data zone. TTR is often under 15 minutes, with zero missed records.
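
The idea of re-running extraction over stored payloads, instead of re-crawling, can be illustrated with a toy example. Everything here is hypothetical: `RAW_ZONE` stands in for the raw data zone, `SELECTOR_REGISTRY` for the central registry, and a regular expression stands in for a real CSS selector engine.

```python
import re
from typing import Optional

# Hypothetical raw data zone: HTML payloads kept from the original fetch.
RAW_ZONE = {
    "sku-1": '<span class="price-v2">19.99</span>',
    "sku-2": '<span class="price-v2">5.00</span>',
}

# Hypothetical central registry; this pattern predates the site redesign.
SELECTOR_REGISTRY = {"price": re.compile(r'class="price">([0-9.]+)<')}


def extract_price(html: str) -> Optional[float]:
    """Apply whatever price pattern is currently registered."""
    match = SELECTOR_REGISTRY["price"].search(html)
    return float(match.group(1)) if match else None


def reprocess(raw_zone: dict[str, str]) -> dict[str, Optional[float]]:
    """Re-extract every stored payload without fetching anything again."""
    return {sku: extract_price(html) for sku, html in raw_zone.items()}


broken = reprocess(RAW_ZONE)  # every price is None with the old pattern

# Patch the registry entry, then re-run over the same stored HTML.
SELECTOR_REGISTRY["price"] = re.compile(r'class="price-v2">([0-9.]+)<')
fixed = reprocess(RAW_ZONE)
```

Because the fetch already happened, the fix in this sketch costs one registry update and one pass over stored HTML, which is what keeps resolution time short.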
Is it better to deliver partial data or halt the pipeline entirely?
Halt and quarantine. Delivering partial or malformed data corrupts downstream data warehouses, ruins historical aggregations, and forces data engineers to run complex backfill scripts to clean up the mess. A delayed dataset is an operational headache; a corrupted dataset is a business risk.
Can data downtime have legal or compliance implications?
Yes. If your pipeline feeds automated trading algorithms, dynamic pricing engines, or compliance monitoring systems, acting on stale or poisoned data can lead to financial loss or regulatory breaches. This is why treating data as a strict contract — with circuit breakers that stop the flow of bad data — is a hard requirement for enterprise pipelines.