
What is Error Rate (Scraping)?

Error Rate (Scraping) is the percentage of extraction attempts that fail to return a valid, schema-compliant record. In data pipelines, not all errors are equal: a 403 Forbidden indicates an anti-bot block, a 404 indicates stale discovery, and a schema validation failure usually means the target site changed its layout. An aggregate error rate on its own tells you little; you must segment by failure domain to know whether to rotate proxies, update selectors, or throttle concurrency.

Observability · Pipeline Health · Anti-Bot Blocks · Schema Drift · SLOs
// 02 — definitions

Signal vs noise.

Why a 0% error rate usually means your monitoring is broken, and how to classify failures to keep pipelines running.


TL;DR

Error rate measures the ratio of failed extraction attempts to total attempts. Production pipelines typically target a sub-2% error rate, but the composition of those errors matters more than the absolute number. A spike in network timeouts calls for proxy rotation, while a spike in type coercion failures calls for a schema bump.

01 Definition & structure
Error rate in a scraping context is the ratio of failed extraction attempts to total attempts. A failure is not just an HTTP error; it includes network timeouts, proxy authentication failures, anti-bot challenges, and schema validation errors. Tracking the aggregate number is less important than segmenting the errors by domain (network, HTTP, extraction) to determine the correct automated response.
02 The taxonomy of scraping errors
Scraping errors fall into three distinct layers:
  • Network layer: DNS failures, connection resets, and proxy timeouts. Usually solved by rotating the exit node.
  • HTTP layer: 403 Forbidden (bot block), 429 Too Many Requests (rate limit), 503 Service Unavailable (target overload). Requires logic changes like backoff or fingerprint rotation.
  • Extraction layer: Missing required fields, type coercion failures, or truncated HTML. Requires human intervention to update selectors or schemas.
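The three layers above can be sketched as a small classifier. This is a minimal illustration, not DataFlirt's implementation: the `Layer` enum, the `classify` function, and its three inputs are hypothetical names chosen for this example.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    NETWORK = "network"
    HTTP = "http"
    EXTRACTION = "extraction"


def classify(transport_error: Optional[str],
             status: Optional[int],
             schema_ok: Optional[bool]) -> Optional[Layer]:
    """Return the failure layer for one attempt, or None if it succeeded.

    transport_error: label such as "dns_failure" or "proxy_timeout" when no
                     HTTP response was received at all (assumed convention).
    status:          HTTP status code, if a response arrived.
    schema_ok:       result of schema validation, if extraction ran.
    """
    if transport_error is not None:
        return Layer.NETWORK      # never reached the target: rotate the exit node
    if status is not None and status >= 400:
        return Layer.HTTP         # blocked, rate limited, or target overloaded
    if schema_ok is False:
        return Layer.EXTRACTION   # fetched fine, but the record is invalid
    return None
```

Checking the layers in this order means each attempt is counted once, under the first layer where it failed.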
03 Why 0% is a red flag
If your pipeline processes millions of records and reports a 0% error rate, your monitoring is almost certainly broken. At scale, target servers drop connections, residential proxies go offline, and products get deleted (404s). A perfect success rate usually means your error handler is swallowing exceptions and returning empty records as "success", silently poisoning your downstream dataset.
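One way to avoid counting empty records as successes is to validate every record before it is tallied. The sketch below assumes a hypothetical schema with two required fields, `title` and `price`; substitute your own.

```python
REQUIRED_FIELDS = ("title", "price")  # hypothetical schema for illustration


def is_valid(record: dict) -> bool:
    """A record is a success only if every required field is present and non-empty."""
    return all(record.get(field) not in (None, "", [], {})
               for field in REQUIRED_FIELDS)


def error_rate(records: list) -> float:
    """Share of records that fail validation; empty records count as failures."""
    if not records:
        return 0.0
    failed = sum(1 for record in records if not is_valid(record))
    return failed / len(records)
```

With this check in place, an error handler that returns `{}` on exception shows up in the error rate instead of hiding inside the success count.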
04 How DataFlirt handles it
We segment error rates by target, proxy pool, and failure type. An isolated 403 spike triggers an automatic proxy rotation and fingerprint adjustment; a schema failure quarantines the specific record and alerts an engineer without stopping the rest of the job. By isolating failures, we prevent a single broken selector from failing an entire 10-million-record batch.
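Segmenting by target, proxy pool, and failure type can be as simple as keying counters on a tuple. The class below is a hedged sketch of that idea, not DataFlirt's internal code; `SegmentedErrorRate` and the label strings in the usage are made up for the example.

```python
from collections import Counter
from typing import Optional


class SegmentedErrorRate:
    """Counts attempts per (target, proxy_pool) and failures per failure type."""

    def __init__(self) -> None:
        self.attempts = Counter()
        self.failures = Counter()

    def record(self, target: str, proxy_pool: str,
               failure_type: Optional[str] = None) -> None:
        """Log one attempt; pass failure_type only when the attempt failed."""
        self.attempts[(target, proxy_pool)] += 1
        if failure_type is not None:
            self.failures[(target, proxy_pool, failure_type)] += 1

    def rates(self) -> dict:
        """Failure rate for each (target, proxy_pool, failure_type) segment."""
        return {
            key: count / self.attempts[key[:2]]
            for key, count in self.failures.items()
        }
```

A 403 spike on one target and pool then stands out as a single hot segment rather than a small bump in a global average.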
05 The retry trap
Blindly retrying every error inflates your error rate and burns proxy bandwidth. 404s and schema errors should never be retried. Retrying a 403 without changing your approach just confirms to the anti-bot system that you are an automated script. Smart pipelines only retry transient errors (like 502s or timeouts) and use exponential backoff to avoid hammering the target.
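A retry policy along these lines might look like the sketch below. The function names, the 1-second base, and the 60-second cap are assumptions for illustration, and the random jitter is an addition that the text above does not mention.

```python
import random
from typing import Optional

# Transient failures worth retrying. 403, 404, and schema errors are
# deliberately absent: retrying them unchanged does not help.
RETRYABLE_STATUSES = {502, 503}


def should_retry(status: Optional[int], timed_out: bool = False) -> bool:
    """Retry only timeouts and transient upstream errors."""
    if timed_out:
        return True
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter; attempt counts from 0.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    until it reaches cap; the actual delay is drawn uniformly below it.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller would loop while `should_retry(...)` is true and the attempt budget is not exhausted, sleeping for `backoff_delay(attempt)` between tries.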
// 03 — pipeline math

Calculating true failure.

A naive error rate just counts HTTP 500s. A production error rate includes silent failures like empty JSON payloads and schema violations. Here is how DataFlirt calculates pipeline health.

True Error Rate: E = (http_err + schema_err + timeout) / total_attempts
Must include extraction failures, not just network drops. (DataFlirt Observability Standard)
Retry Amplification: A = base_req × (1 + retry_rate)
Retries multiply request volume, so high error rates inflate proxy costs in direct proportion to the retry rate. (Infrastructure Cost Modeling)
DataFlirt Delivery SLO: S = 1 − (quarantined_records / expected_records)
We guarantee >99% valid data delivery, absorbing the raw error rate internally. (Client SLA)
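The three formulas translate directly into code. The functions below mirror them term by term; the numbers in the usage lines are invented purely to show the arithmetic and are not DataFlirt measurements.

```python
def true_error_rate(http_err: int, schema_err: int, timeout: int,
                    total_attempts: int) -> float:
    """E = (http_err + schema_err + timeout) / total_attempts"""
    return (http_err + schema_err + timeout) / total_attempts


def retry_amplification(base_req: int, retry_rate: float) -> float:
    """A = base_req * (1 + retry_rate): total requests actually sent."""
    return base_req * (1 + retry_rate)


def delivery_slo(quarantined_records: int, expected_records: int) -> float:
    """S = 1 - quarantined_records / expected_records"""
    return 1 - quarantined_records / expected_records


# Illustrative numbers only.
E = true_error_rate(http_err=120, schema_err=60, timeout=20, total_attempts=10_000)
A = retry_amplification(base_req=10_000, retry_rate=0.25)
S = delivery_slo(quarantined_records=50, expected_records=10_000)
```

Here 200 failures out of 10,000 attempts give E = 2%, a 25% retry rate turns 10,000 base requests into 12,500 sent, and 50 quarantined records out of 10,000 expected give S of about 99.5%.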
// 04 — error telemetry

When the pipeline starts bleeding.

A live trace of an e-commerce pipeline experiencing a sudden layout change, showing how extraction errors are caught before they poison the dataset.

schema validation · quarantine · auto-pause
edge.dataflirt.io — live
CAPTURED
// job start: target_catalog_IN
req.status: 200 OK
req.latency: 840ms
dom.parse: success

// extraction phase
field.title: "Samsung 55-inch QLED"
field.price: null // selector .price-box-main failed
field.stock: true

// validation
schema.check: FAIL
error.type: "missing_required_field"
action: quarantine_record

// pipeline health monitor
metric.error_rate_5m: 14.2% // threshold exceeded
alert.triggered: "schema_drift_detected"
pipeline.state: PAUSED
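The health monitor at the end of the trace can be approximated with a rolling window. This is a sketch under stated assumptions: the trace does not say what the threshold is, so the 10% default, the minimum sample count, and the `ErrorRateMonitor` class itself are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Rolling-window error rate that latches into a paused state."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.10,
                 min_samples: int = 20) -> None:
        self.window_s = window_s        # 300 s matches the 5-minute metric
        self.threshold = threshold      # assumed value, not from the trace
        self.min_samples = min_samples  # avoid pausing on a handful of events
        self.events = deque()           # (timestamp, failed) pairs, oldest first
        self.paused = False

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(failed for _, failed in self.events) / len(self.events)

    def record(self, ts: float, failed: bool) -> bool:
        """Add one outcome, evict expired ones, and return the paused flag."""
        self.events.append((ts, failed))
        while self.events and self.events[0][0] <= ts - self.window_s:
            self.events.popleft()
        if (len(self.events) >= self.min_samples
                and self.error_rate() > self.threshold):
            self.paused = True          # stays paused until reset by an operator
        return self.paused
```

The pause is deliberately sticky: once schema drift is suspected, the monitor does not resume on its own just because old failures age out of the window.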
// 05 — failure domains

Where the errors actually come from.

Based on telemetry across DataFlirt's managed pipelines, here is the distribution of error types. Network instability is constant, but schema drift is the most disruptive.

PIPELINES MONITORED · 300+ active
WINDOW · 30d trailing
UPDATED · 2026-05-19
01 · Schema drift / Selector rot · 42% of errors · Silent failures caught by validation
02 · Anti-bot blocks / 403s · 28% of errors · Fingerprint or IP reputation flags
03 · Proxy timeouts · 18% of errors · Residential node churn
04 · Target server 503s · 9% of errors · Upstream capacity limits
05 · DNS resolution failures · 3% of errors · Edge routing issues
// 06 — observability

Don't just count errors; classify, isolate, and route them.

At DataFlirt, an error is treated as a routing decision. If a request hits a 429 Too Many Requests, the scheduler automatically backs off the concurrency for that specific target. If it hits a 403 Forbidden, the session is burned and the proxy ASN is temporarily down-weighted. If it hits a schema validation error, the request is successful but the data is quarantined. By treating errors as granular signals rather than a monolithic metric, we maintain 99.9% data delivery even when the underlying fetch success rate fluctuates.

Error Routing Logic

How our orchestrator handles different failure modes in real time.

err.429_too_many · throttle concurrency · retry
err.403_forbidden · burn session · rotate fingerprint
err.503_unavailable · exponential backoff · retry
err.404_not_found · drop URL · do not retry
err.schema_invalid · quarantine record · alert engineer
err.proxy_timeout · rotate IP · retry immediately
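The routing table above maps naturally onto a lookup. The sketch below copies the six rows as data; the `route` and `is_retried` helpers, and the choice to escalate unknown error codes to an engineer, are assumptions added for the example.

```python
# error code -> (first action, follow-up action), copied from the table above
ROUTES = {
    "err.429_too_many":    ("throttle concurrency", "retry"),
    "err.403_forbidden":   ("burn session", "rotate fingerprint"),
    "err.503_unavailable": ("exponential backoff", "retry"),
    "err.404_not_found":   ("drop URL", "do not retry"),
    "err.schema_invalid":  ("quarantine record", "alert engineer"),
    "err.proxy_timeout":   ("rotate IP", "retry immediately"),
}

# Assumed fallback: anything unclassified goes to a human rather than a retry loop.
FALLBACK = ("quarantine record", "alert engineer")


def route(error_code: str) -> tuple:
    """Look up the action pair for an error code."""
    return ROUTES.get(error_code, FALLBACK)


def is_retried(error_code: str) -> bool:
    """True when the follow-up action for this error is some form of retry."""
    return route(error_code)[1].startswith("retry")
```

Keeping the policy as data rather than branching logic makes it easy to review and to change per target.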


// 07 — FAQ

Common questions.

Common questions about acceptable failure rates, retry strategies, and how DataFlirt keeps pipelines resilient at scale.

What is an acceptable error rate for a scraping pipeline?
Sub-2% for stable targets, up to 5% for highly volatile ones. Anything higher means you are burning money on retries and proxy bandwidth. If your error rate is consistently 0%, your monitoring is likely swallowing silent failures like empty JSON payloads or CAPTCHA pages returning 200 OK.
Should I retry every failed request?
No. Retry 502s, 503s, and network timeouts. Never retry 404s, 400s, or schema validation failures. Retrying a 403 Forbidden without rotating the proxy and browser fingerprint just gets you banned faster and wastes resources.
How do anti-bot systems affect error rates?
They often disguise blocks. A sophisticated anti-bot system won't always return a 403; it might return a 200 OK with a CAPTCHA challenge or poisoned, fake data. If you only monitor HTTP status codes, your error rate will look perfect while your database fills with garbage.
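A first line of defence is to inspect the body of every 200 response, not just its status code. The marker strings below are hypothetical examples; real challenge pages vary by vendor, and a keyword heuristic like this can misfire on pages that legitimately mention CAPTCHAs, so treat it as a sketch rather than a detector.

```python
# Hypothetical phrases that often appear on challenge pages.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def looks_blocked(status: int, body: str) -> bool:
    """Flag explicit blocks (403/429) and 200 responses that look like challenges."""
    if status in (403, 429):
        return True
    lowered = body.lower()
    return status == 200 and any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Poisoned or fake data cannot be caught this way; that needs schema validation and spot checks against known-good values.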
How does DataFlirt handle sudden error spikes?
Our orchestrator pauses the specific job, alerts an on-call engineer, and preserves the queue state. We fix the selector or adjust the proxy routing, then resume the job from where it left off. The client receives their dataset on time, completely unaware of the mid-flight repair.
Does a high error rate pose legal risks?
Yes. Bombarding a server with failing requests—especially 403s or 429s—can be construed as a Denial of Service attack or a CFAA violation. Respecting rate limits and backing off on errors is a core compliance requirement for any legitimate scraping operation.
How do you calculate error rate when using a proxy pool?
You must separate proxy-level errors (e.g., 407 Proxy Auth Required, connection resets) from target-level errors. Proxy errors indicate infrastructure issues; target errors indicate scraping logic or anti-bot issues. Blending them into one metric makes debugging impossible.
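Keeping the two metrics apart can be sketched as follows. The outcome labels are made up for this example, and which failures count as proxy-level follows the answer above; in practice a connection reset can also originate at the target, so the set is an assumption to tune.

```python
from collections import Counter

# Assumed labels for failures attributed to the proxy infrastructure.
PROXY_ERRORS = {"proxy_auth_407", "connection_reset", "proxy_timeout"}


def split_error_rates(outcomes: list) -> dict:
    """Return separate proxy-level and target-level error rates.

    outcomes: one label per attempt, with "ok" marking a success.
    """
    total = len(outcomes)
    if total == 0:
        return {"proxy": 0.0, "target": 0.0}
    counts = Counter(
        "ok" if o == "ok" else ("proxy" if o in PROXY_ERRORS else "target")
        for o in outcomes
    )
    return {"proxy": counts["proxy"] / total, "target": counts["target"] / total}
```

Both rates share the same denominator, so they add up to the overall error rate while still pointing at different owners.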