
What is Crawl Delay?

Crawl delay is the minimum wait interval a crawler enforces between consecutive requests to the same host — declared by the target in robots.txt or imposed by the crawler's own rate-limiter to avoid triggering detection and overloading origin servers. Ignore it and you get blocked; respect it naively across a 50-million-URL frontier and your pipeline takes six months to complete.

robots.txt · Rate limiting · Politeness · Scheduler · Anti-bot
// 02 — definitions

Slow down
on purpose.

Crawl delay is not a courtesy setting — it's a scheduling constraint that determines throughput, detectability, and whether you finish the job before the data goes stale.


TL;DR

The Crawl-delay directive in robots.txt tells compliant bots how many seconds to wait between requests per host. Most serious targets set 2–10 seconds. At a 5s delay on a single connection, crawling 1M pages takes about 58 days, which is why production pipelines distribute across thousands of concurrent sessions, each honoring per-host delay independently.

01 Definition & structure
Crawl delay is a per-host scheduling constraint. It has two sources:
  • robots.txt — the Crawl-delay: N directive, scoped to a User-agent group
  • Crawler self-regulation — a politeness floor the crawler enforces regardless of what the target declares
The effective delay applied is the maximum of any declared directive and your own floor. Most schedulers track this per host, not globally — a 5-second delay on one host does not slow requests to other hosts running in parallel. The delay clock resets on each successful fetch, not on each request dispatch.
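As a minimal sketch of the "maximum of directive and floor" rule, the snippet below reads Crawl-delay with Python's standard urllib.robotparser and falls back to a politeness floor. The function name, the example-bot user agent, and the 1.5s default are illustrative assumptions, not a prescribed implementation. Note that the standard parser only accepts whole-number Crawl-delay values.

```python
from urllib.robotparser import RobotFileParser

POLITENESS_FLOOR = 1.5  # seconds; assumed default floor for this sketch


def effective_delay(robots_txt: str, user_agent: str, floor: float = POLITENESS_FLOOR) -> float:
    """Return max(declared Crawl-delay, floor) for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None when no directive applies
    if declared is None:
        return floor
    return max(float(declared), floor)


robots = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""
print(effective_delay(robots, "example-bot"))                      # directive above the floor
print(effective_delay("User-agent: *\nDisallow:", "example-bot"))  # no directive, floor applies
```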
02 How it works in practice
Your scheduler maintains a per-host token. When a URL is dequeued for a given host, the scheduler checks now - last_fetch_time >= crawl_delay. If the condition isn't met, the URL goes back into a hold state and the next available host slot is tried. This is why frontier design matters: a frontier that can only hold one host at a time will idle constantly on a 5s delay. Production frontiers hold thousands of host queues simultaneously so worker threads are never waiting on a single clock.
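The check described above can be sketched as a small in-memory frontier that keeps one queue and one clock per host. The class and method names are hypothetical, and the sketch is single-threaded; a production scheduler would also need to guard against two workers popping the same host at once.

```python
from collections import deque


class HostFrontier:
    """Per-host URL queues, each with its own delay clock."""

    def __init__(self):
        self.queues = {}      # host -> deque of pending URLs
        self.delay = {}       # host -> crawl delay in seconds
        self.last_fetch = {}  # host -> timestamp of the last completed fetch

    def add(self, host, url, crawl_delay):
        self.queues.setdefault(host, deque()).append(url)
        self.delay[host] = crawl_delay

    def pop_ready(self, now):
        """Return (host, url) for the first host whose delay has elapsed, else None."""
        for host, queue in self.queues.items():
            if not queue:
                continue
            last = self.last_fetch.get(host)
            if last is None or now - last >= self.delay[host]:
                return host, queue.popleft()
        return None  # every non-empty host is still inside its delay window

    def mark_fetched(self, host, now):
        self.last_fetch[host] = now


frontier = HostFrontier()
frontier.add("a.example", "https://a.example/1", 5.0)
frontier.add("a.example", "https://a.example/2", 5.0)
frontier.add("b.example", "https://b.example/1", 5.0)

host, url = frontier.pop_ready(now=0.0)  # first URL of a.example
frontier.mark_fetched(host, now=0.0)
print(frontier.pop_ready(now=1.0))       # a.example is held, so b.example is served
```

With only one host queue, the second call would return None and the worker would idle, which is the failure mode the paragraph describes.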
03 Delay vs. adaptive backoff
Crawl delay is proactive: it is applied unconditionally before a request. Adaptive backoff is reactive: it is triggered by a 429 or 503 response. Both constrain the same per-host slot, but they do not add up; the longer wait wins. If your crawl delay is 3s and you hit a 429 with Retry-After: 60, the host slot locks for 60s, not 60 + 3s (since 60 > 3). The two mechanisms are otherwise orthogonal: never mistake a rising 429 rate for a minor tuning problem with your delay setting. 429s mean you're overrunning server capacity or session fingerprint thresholds, not that your delay is slightly wrong.
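The 3s versus Retry-After: 60 example reduces to taking the larger of the two waits. A minimal sketch, with a hypothetical function name and timestamps expressed as plain seconds:

```python
def next_allowed_at(now, crawl_delay, retry_after=None):
    """Next timestamp at which the host slot may be used again."""
    # The longer of the two waits applies; they are not summed.
    wait = crawl_delay if retry_after is None else max(crawl_delay, retry_after)
    return now + wait


print(next_allowed_at(100.0, 3.0))        # 103.0: plain crawl delay
print(next_allowed_at(100.0, 3.0, 60.0))  # 160.0: Retry-After dominates
```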
04 How DataFlirt handles it
We parse and cache robots.txt per host with a 6-hour TTL, and feed Crawl-delay values directly into our distributed scheduler's per-host token store (Redis). Our own politeness floor sits at 1.5 seconds — no host gets hit faster than that even if robots.txt is absent. For high-value targets where we've negotiated or observed aggressive rate-limiting, we apply custom per-host overrides. The 429 rate across our active pipelines runs below 0.2% — meaning the delay logic is calibrated well enough that targets almost never push back.
05 The throughput trap: delay × scale
Teams routinely underestimate how much crawl delay compounds at scale. A 3-second delay on a single session caps you at 20 requests/minute — fine for a hobby project, useless for a 5M-product catalog refresh. The fix is not to violate the delay but to scale sessions: 200 concurrent sessions at 3s each yields 4,000 req/min on the same host. The Googlebot fleet famously runs millions of parallel connections, each individually polite. Scale is not the enemy of politeness — it's the mechanism that makes politeness economically viable.
// 03 — the math

How delay
shapes throughput.

These three formulas are what every crawl scheduler solves at runtime. Get the delay budget wrong and you either block your own pipeline or saturate the target and trigger rate-limiting defenses.

Effective throughput: T = N / (D + L)
N = concurrent sessions, D = crawl delay (s), L = avg latency (s). Only N scales throughput at fixed D. Source: standard crawler design
Time to crawl frontier: t = F / T = F · (D + L) / N ≈ (F · D) / N when L ≪ D
F = frontier size (URLs), D = delay, N = parallel sessions. Shows why delay dominates schedule length. Source: Mercator, 1999
DataFlirt per-host concurrency cap: C_host = floor(P_pool / H_active)
Proxy pool size divided by active host count; caps sessions per origin without starving the frontier. Source: internal scheduler v3.1
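To sanity-check the figures quoted on this page, the formulas above can be evaluated directly. This is a rough sketch: function names are illustrative, latency defaults to zero to match the page's back-of-envelope numbers, and the pool and host counts in the last line are made-up inputs.

```python
def throughput_per_second(sessions, delay_s, latency_s=0.0):
    """T = N / (D + L), in requests per second."""
    return sessions / (delay_s + latency_s)


def crawl_time_days(frontier_size, sessions, delay_s, latency_s=0.0):
    """t = F / T, converted from seconds to days."""
    seconds = frontier_size / throughput_per_second(sessions, delay_s, latency_s)
    return seconds / 86_400


def per_host_cap(pool_size, active_hosts):
    """C_host = floor(P_pool / H_active)."""
    return pool_size // active_hosts


print(round(crawl_time_days(1_000_000, 1, 5.0), 1))  # 57.9 days on one connection at 5s
print(round(throughput_per_second(200, 3.0) * 60))   # 4000 req/min with 200 sessions at 3s
print(round(throughput_per_second(500, 5.0)))        # 100 pages/s with 500 sessions at 5s
print(per_host_cap(10_000, 400))                     # 25 sessions per host (hypothetical inputs)
```

Adding a realistic latency, for example latency_s=1.0, lowers every throughput figure, which is why the delay-only numbers are best read as upper bounds.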
// 04 — what the scheduler sees

A host queue tick,
inside the rate-limiter.

Every host gets an independent delay clock. Here's a trace from our scheduler processing a single domain slot — from frontier pop to next-allowed-at update.

Crawl-delay: 3s · 429 watch · politeness token
edge.dataflirt.io — live
CAPTURED
// robots.txt parse — cached 6h
host: "shop.example.com"
crawl_delay_directive: 3 // seconds, from Crawl-delay: 3
override_floor: 1.5 // our min politeness floor
effective_delay: 3.0 // directive wins

// scheduler tick — host slot opens
frontier.pop: "https://shop.example.com/products/42891"
last_fetched_at: 1716294871.224
now: 1716294874.451
delta: 3.227s ≥ 3.0s — OK to fetch

// after fetch completes
response.status: 200
next_allowed_at: 1716294877.451 // now + 3.0s
host_slot.released: true
// 05 — delay sources

Where crawl delay
actually comes from.

Crawl delay is rarely a single number from a single place. In practice, the effective delay applied to any given host is the maximum of several competing constraints — some declared, some inferred, some self-imposed.

COMMON RANGE · 1–10 seconds
DEFAULT (no directive) · 500ms–2s
429 BACKOFF TYPICAL · 30–120 seconds
01 robots.txt Crawl-delay directive
explicit · Declared by site owner; most crawlers honor it
02 429 / 503 adaptive backoff
reactive · Rate-limit response triggers exponential wait
03 Crawler politeness floor
self-imposed · Hard minimum regardless of directive
04 Proxy pool availability
resource · No free session = implicit delay
05 JavaScript render time
latency · Browser overhead adds ~1–4s per page
// 06 — our scheduler

Per-host clocks,
not a global throttle.

Naive crawlers apply one rate limit globally. That's wrong — it either over-throttles small hosts or under-throttles large ones. DataFlirt runs per-host delay tokens in a distributed Redis clock so every origin gets exactly its declared wait, regardless of how many other hosts are being crawled simultaneously.
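One common way to build a per-host token on Redis is a key with a TTL equal to the delay, claimed with SET ... NX PX. The sketch below is an assumption about how such a token could look, not a description of DataFlirt's scheduler. It expects a client object exposing redis-py's set(name, value, nx=..., px=...) method, and the key prefix is hypothetical.

```python
def try_acquire_host(client, host, delay_s):
    """Atomically claim the host slot; True means this worker may fetch now."""
    key = f"crawl:next:{host}"  # hypothetical key naming
    ttl_ms = int(delay_s * 1000)
    # SET key value NX PX ttl succeeds only if no live token exists for the host.
    # redis-py returns True on success and None when the key already exists.
    return bool(client.set(key, "1", nx=True, px=ttl_ms))


# Usage with redis-py (requires a running Redis server):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   if try_acquire_host(client, "shop.example.com", 3.0):
#       ...fetch the next URL for that host...
```

Because the token expires a fixed time after it is claimed, this variant measures the delay from dispatch rather than from fetch completion; a scheduler that resets the clock on completion would rewrite the key once the response arrives.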

Host delay token — live state

A snapshot of a single host slot as managed by our scheduler cluster.

host: catalog.retailer.in
crawl_delay: 5s (robots.txt) · honored
last_fetch: 2026-05-21T08:14:22Z
next_allowed: 2026-05-21T08:14:27Z · countdown: 2.1s
backoff_active: false
session_pool: residential_IN · 3 active · healthy
429_count_1h: 0 · clean

// 07 — FAQ

Common
questions.

About crawl delay, robots.txt compliance, rate-limiting, and how to keep a high-throughput pipeline polite without sacrificing speed.

Does robots.txt Crawl-delay have legal force?
No. The Robots Exclusion Protocol is standardized as RFC 9309, but compliance is voluntary and it is not a statute; Crawl-delay itself is a nonstandard extension that the RFC does not define. Violating it won't get you arrested, but ignoring it aggressively enough to impair the server can expose you to CFAA claims in the US or IT Act liability in India. More practically, targets that see their delay directive ignored tend to upgrade to hard rate-limiting or block ranges.
What crawl delay should I use when robots.txt doesn't specify one?
A conservative starting floor is 1–2 seconds for high-traffic commercial sites, 500ms for well-resourced CDN-fronted targets. Watch your 429 rate — if it climbs above ~0.5% of requests, you're pushing too fast. Adaptive backoff handles the rest.
How do I crawl 10 million pages quickly if each host enforces a 5-second delay?
Horizontal scale across sessions, not per-session speed. With 500 concurrent sessions, each on a different IP, a 5s-delay host still yields 100 pages/second — that's 8.6 million pages/day. The trick is keeping the session pool fed and the frontier distributed so no session ever idles waiting for a slot.
Does crawl delay apply per IP or per crawler identity?
Technically per User-agent group in robots.txt. In practice, most anti-bot systems enforce delay per IP or per session fingerprint, not per bot name. That means rotating IPs resets the clock on the server side, independent of what your robots.txt parser is tracking locally.
What happens when a target returns 429, and how should the crawler respond?
Back off immediately and exponentially. First 429: wait the Retry-After value or 30s if absent. Subsequent 429s from the same host: double the interval up to a cap (typically 5–15 minutes). Log the event — a spike in 429s usually means a delay floor needs adjustment, not just a transient overload.
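The schedule above can be sketched as a single function. The 30s base comes from the answer; the 900s cap is one value picked from the 5–15 minute range, and the function name and the choice to start doubling from the first wait are assumptions of this sketch.

```python
def backoff_seconds(consecutive_429s, retry_after=None, base=30.0, cap=900.0):
    """Wait time after the Nth consecutive 429 from one host (N starts at 1)."""
    # First wait: Retry-After if the server sent one, otherwise the base interval.
    first = retry_after if retry_after is not None else base
    # Each further 429 doubles the wait, up to the cap.
    return min(first * 2 ** (consecutive_429s - 1), cap)


print([backoff_seconds(n) for n in range(1, 7)])
# [30.0, 60.0, 120.0, 240.0, 480.0, 900.0]
```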
Can crawl delay settings be different for different paths on the same host?
robots.txt Crawl-delay is host-scoped, not path-scoped. You can declare different delays for different User-agents, but not for different URL patterns. If you need path-level throttle differentiation, implement it in your scheduler logic — e.g. a shorter delay for product pages, a longer one for search result pages that are more expensive to render.
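A scheduler-side path rule could look like the sketch below: longest matching prefix wins, and the result never drops under a politeness floor. The prefixes, delay values, and function name are hypothetical; in practice, keep every path delay at or above the host's declared Crawl-delay.

```python
from urllib.parse import urlsplit

# Hypothetical path rules for one host, in seconds.
PATH_DELAYS = {
    "/search": 8.0,    # expensive to render, so slower
    "/products": 2.0,  # cheaper pages, so faster
}


def delay_for(url, path_delays=PATH_DELAYS, default=5.0, floor=1.5):
    """Pick the delay for a URL by longest matching path prefix."""
    path = urlsplit(url).path
    matches = [prefix for prefix in path_delays if path.startswith(prefix)]
    chosen = path_delays[max(matches, key=len)] if matches else default
    return max(chosen, floor)  # never go below the politeness floor


print(delay_for("https://shop.example.com/search?q=shoes"))   # 8.0
print(delay_for("https://shop.example.com/products/42891"))   # 2.0
print(delay_for("https://shop.example.com/about"))            # 5.0
```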