
What is Polite Crawling?

Polite crawling is the practice of rate-limiting requests to a target server — respecting Crawl-delay directives in robots.txt, honouring Retry-After headers, and spacing fetches to avoid saturating the target's infrastructure. It's not altruism: a crawler that hammers a server triggers IP bans, Cloudflare rate-limiting, and anti-bot escalation faster than almost any other signal. Politeness keeps you in the game long enough to complete the extraction.

Crawling · Rate Limiting · robots.txt · Crawl-delay · Server Etiquette
// 02 — definitions

Don't be the DDoS.

From the server's perspective, an impolite crawler and a low-grade DDoS look identical — both trigger the same defences. Understanding where that line is drawn determines whether you finish the job or get blocked mid-run.


TL;DR

Polite crawling means: read and respect robots.txt Crawl-delay, back off on 429s and 503s using Retry-After, space requests to avoid burst traffic, and rotate user-agents to avoid looking like one client hammering from one IP. The specific rate varies by target: a large CDN-backed retailer can handle 5 req/s per IP; a small independently hosted site might rate-limit at 0.5 req/s. DataFlirt calibrates crawl rates per target using a ramp-up probe before committing to a production rate.

01 Definition & structure
Polite crawling has three components that work together:
  • Crawl-delay compliance — reading robots.txt and enforcing the declared minimum gap between requests
  • Backoff on error signals — honouring Retry-After on 429 and 503 responses, or applying exponential backoff when no directive is present
  • Concurrency control — limiting simultaneous connections per IP, since 10 parallel threads at 1 req/s is the same as 1 thread at 10 req/s from the server's perspective
All three must be active simultaneously. Honouring Crawl-delay while running 20 concurrent connections defeats the purpose entirely.
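One way to keep Crawl-delay and concurrency control active together is to have every worker on an IP draw from a single shared limiter, so that adding threads does not add request rate. The sketch below is a minimal illustration, not a production limiter; `SharedGapLimiter` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import threading
import time


class SharedGapLimiter:
    """Enforces one minimum gap across every worker that shares this object."""

    def __init__(self, min_gap_s, clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0  # earliest time the next request may start

    def reserve(self):
        """Book the next free slot and return how long the caller must wait."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_gap_s
            return slot - now

    def wait(self):
        """Block until this caller's reserved slot arrives."""
        delay = self.reserve()
        if delay > 0:
            self._sleep(delay)


# Three workers asking at the same instant are spread 2 s apart.
limiter = SharedGapLimiter(2.0, clock=lambda: 100.0)
waits = [limiter.reserve() for _ in range(3)]  # [0.0, 2.0, 4.0]
```

Because slots are booked under a lock, ten threads sharing one limiter still produce one request per gap in total.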
02 How it works in practice
Before the first fetch, the crawler reads robots.txt and parses the Crawl-delay for its user-agent (and the wildcard * entry as fallback). That value becomes the minimum inter-request delay. A token bucket or leaky bucket rate limiter enforces it. On every response, the crawler checks for Retry-After. On 429 or 503, it backs off for at least the declared duration — or uses exponential backoff if no header is present — then resumes at a reduced rate. After N consecutive successful requests, it can step the rate back up to the calibrated maximum.
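The first step above, reading Crawl-delay for a specific user-agent with the wildcard entry as fallback, can be sketched with Python's standard-library `urllib.robotparser`. The helper name `declared_crawl_delay`, the `examplebot` agent, and the 1-second default are assumptions for illustration. Note that the standard-library parser appears to accept only whole-number Crawl-delay values, so fractional declarations would need a custom parser.

```python
from urllib.robotparser import RobotFileParser


def declared_crawl_delay(robots_txt, user_agent, default_s=1.0):
    """Return the Crawl-delay (seconds) that applies to user_agent.

    Falls back to default_s (an assumed conservative value) when
    robots.txt declares no usable delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_s


ROBOTS = "\n".join([
    "User-agent: examplebot",
    "Crawl-delay: 5",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Crawl-delay: 2",
])

specific = declared_crawl_delay(ROBOTS, "ExampleBot/1.0")  # matches the named entry
fallback = declared_crawl_delay(ROBOTS, "otherbot")        # uses the * entry
```

The returned value is the minimum inter-request delay to hand to whatever rate limiter the crawler uses.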
03 Finding the safe crawl rate for an unknown target
Crawl-delay declarations tell you the minimum gap the site asks for. They don't tell you the maximum rate your pipeline can safely sustain: a site might declare 1s (nominally 1 req/s) but block aggressively at 0.5 req/s if the underlying server is under-provisioned. The correct approach is a ramp-up probe: start at 0.1 req/s, increase by 0.1 req/s every 60 seconds, and record the rate at which 429s first appear. Set the production rate at 80% of that threshold. This takes under 30 minutes and prevents bans that take days to clear.
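The ramp-up probe can be sketched as a loop over candidate rates. This is a minimal sketch under stated assumptions: `run_step` is a hypothetical caller-supplied hook that crawls at the given rate for the given number of seconds and returns the worst HTTP status it saw, and the 3.0 req/s ceiling is an assumed cap, not a value from the text.

```python
def ramp_up_probe(run_step, start=0.1, step=0.1, max_rate=3.0,
                  step_seconds=60, safety=0.8):
    """Return (threshold, production_rate), or (None, None) if never limited.

    run_step(rate, seconds) is a hypothetical hook: crawl at `rate` req/s
    for `seconds` and return the worst HTTP status observed.
    """
    steps = round((max_rate - start) / step) + 1
    for i in range(steps):
        rate = round(start + i * step, 3)
        status = run_step(rate, step_seconds)
        if status in (429, 503):
            # First rate at which the target pushed back.
            return rate, round(rate * safety, 3)
    return None, None


# Simulated target that starts returning 429 at 0.9 req/s.
def fake_step(rate, seconds):
    return 429 if rate >= 0.9 else 200


threshold, production = ramp_up_probe(fake_step)  # (0.9, 0.72)
```

In a real probe the hook would also need to stop issuing requests as soon as the first 429 arrives, rather than finishing the step.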
04 How DataFlirt handles it
Every DataFlirt pipeline starts with a rate calibration pass — ramp-up probe, threshold detection, production rate set at 80% of threshold. We apply ±20% jitter on every delay interval by default. Concurrency is always 1 connection per IP unless the target has been explicitly tested for higher concurrency. Our rate limiter is per-IP and per-domain independently — a session using multiple IPs enforces the rate limit on each IP separately, not aggregated across the pool.
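A rough illustration of jittered delays combined with independent per-IP, per-domain scheduling is below. It is a sketch of the idea only and not DataFlirt's limiter; `PerIpDomainSchedule`, `jittered_delay`, and the example IPs are hypothetical.

```python
import random


def jittered_delay(base_delay_s, jitter=0.20, rng=random):
    """Scale base_delay_s by a uniform factor in [1 - jitter, 1 + jitter]."""
    return base_delay_s * rng.uniform(1.0 - jitter, 1.0 + jitter)


class PerIpDomainSchedule:
    """Tracks the next allowed send time separately per (ip, domain) pair."""

    def __init__(self, rate_per_s, jitter=0.20, rng=random):
        self.base_delay_s = 1.0 / rate_per_s
        self.jitter = jitter
        self.rng = rng
        self.next_allowed = {}

    def reserve(self, ip, domain, now):
        """Return when this request may be sent and book the following slot."""
        key = (ip, domain)
        send_at = max(now, self.next_allowed.get(key, now))
        gap = jittered_delay(self.base_delay_s, self.jitter, self.rng)
        self.next_allowed[key] = send_at + gap
        return send_at


schedule = PerIpDomainSchedule(0.72, rng=random.Random(0))
first_a = schedule.reserve("10.0.0.1", "example.com", now=0.0)   # 0.0
first_b = schedule.reserve("10.0.0.2", "example.com", now=0.0)   # 0.0, separate IP
second_a = schedule.reserve("10.0.0.1", "example.com", now=0.0)  # one jittered gap later
```

Keying the schedule on the pair means a second IP does not inherit the first IP's wait, and the same IP can crawl a different domain without delay.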
05 Common misconception: politeness only matters for small sites
Large CDN-backed sites have higher raw throughput capacity but more sophisticated rate detection. Cloudflare's rate limiting is per-IP, per-path-prefix, and per-ASN simultaneously — you can be within per-IP limits and still trigger a block at the path-prefix level if your crawler concentrates all fetches on /products/. Politeness on large sites means distributing load across paths and time windows, not just slowing down per-request. The detection is smarter; the strategy needs to match.
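If a target does limit per path prefix as described, one simple way to spread load across paths is to interleave the URL queue by its first path segment instead of draining one section at a time. The sketch below assumes the first segment is the relevant prefix, which may not match how a given target groups its paths; the function names and URLs are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def path_prefix(url):
    """First path segment, e.g. '/products' for '/products/page/1'."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return "/" + segments[0] if segments else "/"


def interleave_by_prefix(urls):
    """Reorder urls round-robin across path prefixes, keeping per-prefix order."""
    queues = defaultdict(deque)
    for url in urls:
        queues[path_prefix(url)].append(url)
    ordered = []
    while queues:
        for prefix in list(queues):  # copy keys so exhausted queues can be dropped
            ordered.append(queues[prefix].popleft())
            if not queues[prefix]:
                del queues[prefix]
    return ordered


urls = [
    "https://example.com/products/1",
    "https://example.com/products/2",
    "https://example.com/products/3",
    "https://example.com/blog/a",
    "https://example.com/about",
]
ordered = interleave_by_prefix(urls)
# products/1, blog/a, about, products/2, products/3
```

Interleaving only helps while several prefixes still have work queued; once one prefix is all that remains, spacing in time is the only lever left.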
// 03 — the model

What rate is safe for this target?

There's no universal safe crawl rate. The right number depends on server capacity, CDN presence, and existing bot defences. DataFlirt's rate calibration runs a ramp-up probe on every new target to find the threshold before setting the production rate below it.

Effective crawl rate: R_eff = 1 / (Crawl-delay + fetch_time)
Actual throughput is always lower than 1/Crawl-delay because fetch time adds to the gap. Source: robots.txt spec, REP RFC 9309
Backoff on 429 / 503: wait = Retry-After ?? min(2^attempt × base_delay, max_delay)
Honour Retry-After if present; fall back to exponential backoff with a cap. Source: HTTP RFC 9110 §10.2.3 (Retry-After)
Burst risk index: BRI = (req_count / window_s) / R_safe
BRI > 1.0 means you're above the safe rate, and ban probability rises sharply. Source: DataFlirt rate calibrator, v2026
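The three formulas translate directly into small functions. This is a worked sketch, reading the backoff term as 2 raised to the attempt number; the default `base_delay_s` and `max_delay_s` values are assumptions chosen for the example, not figures from the text.

```python
def effective_rate(crawl_delay_s, fetch_time_s):
    """R_eff = 1 / (Crawl-delay + fetch_time), in requests per second."""
    return 1.0 / (crawl_delay_s + fetch_time_s)


def backoff_wait(retry_after_s, attempt, base_delay_s=1.0, max_delay_s=300.0):
    """Honour Retry-After when given, else capped exponential backoff."""
    if retry_after_s is not None:
        return retry_after_s
    return min((2 ** attempt) * base_delay_s, max_delay_s)


def burst_risk_index(req_count, window_s, r_safe):
    """BRI = observed rate over the window divided by the safe rate."""
    return (req_count / window_s) / r_safe


r_eff = effective_rate(2.0, 0.3)          # about 0.435 req/s, below 1/2 = 0.5
wait_header = backoff_wait(60, 3)         # 60, the server's value wins
wait_exp = backoff_wait(None, 3)          # 8.0, i.e. 2^3 x 1 s
wait_capped = backoff_wait(None, 20)      # 300.0, clipped by max_delay_s
bri = burst_risk_index(30, 60, 0.45)      # about 1.11, above the safe rate
```

With a 2-second Crawl-delay and roughly 300 ms fetches, the crawler runs near 0.43 req/s without any extra throttling, which is why measured throughput sits below the nominal 0.5 req/s.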
// 04 — rate limiter trace

Throttle, backoff, and recover.

A polite crawl session hitting a mid-size retail site. The rate limiter catches a 429, backs off correctly, and resumes without triggering a harder block.

Crawl-delay: 2s · R_safe: 0.45 req/s · backoff: exponential
edge.dataflirt.io — live
CAPTURED
// robots.txt directive loaded
Crawl-delay: 2 // target declared 2s minimum gap
R_target: 0.45 req/s // we set 10% below to be safe

// normal operation
GET "/products/page/1" 200 OK 312ms
GET "/products/page/2" 200 OK 289ms
GET "/products/page/3" 200 OK 341ms

// rate limit hit
GET "/products/page/4" 429 Too Many Requests
Retry-After: 60 // server said wait 60s
action: BACKOFF 60s // honouring directive

// resumed after backoff
GET "/products/page/4" 200 OK 304ms
R_adjusted: 0.35 req/s // stepped down after 429
session.ban_events: 0
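The step-down after a 429 and the later step-up after a run of successes can be modelled with a small controller. The trace does not state the adjustment rule, so the fixed 0.10 req/s step down (chosen to reproduce 0.45 to 0.35), the 0.05 req/s step up, and the success count are all assumptions, and `AdaptiveRate` is a hypothetical name.

```python
class AdaptiveRate:
    """Steps the crawl rate down on 429/503 and back up after a success streak."""

    def __init__(self, max_rate, step_down=0.10, step_up=0.05,
                 successes_needed=50, floor=0.05):
        self.max_rate = max_rate
        self.rate = max_rate
        self.step_down = step_down
        self.step_up = step_up
        self.successes_needed = successes_needed
        self.floor = floor
        self._streak = 0  # consecutive 2xx responses since the last change

    def record(self, status):
        """Update the rate from one response status and return the new rate."""
        if status in (429, 503):
            self.rate = max(self.floor, round(self.rate - self.step_down, 3))
            self._streak = 0
        elif 200 <= status < 300:
            self._streak += 1
            if self._streak >= self.successes_needed:
                self.rate = min(self.max_rate, round(self.rate + self.step_up, 3))
                self._streak = 0
        return self.rate


controller = AdaptiveRate(0.45, successes_needed=3)
for status in (200, 200, 200):
    controller.record(status)          # stays at the 0.45 ceiling
after_429 = controller.record(429)     # 0.35
for status in (200, 200, 200):
    recovered = controller.record(status)  # 0.4 after three successes
```

The controller never exceeds the calibrated maximum, so recovery is bounded by the rate the probe originally found.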
// 05 — politeness signals

What servers use to detect rudeness.

Anti-bot systems and CDN rate limiters watch for specific patterns that separate polite crawlers from hammers. Understanding which signals matter most determines where to focus your rate-control strategy. Ranked by how quickly each triggers a block.

TARGETS CALIBRATED: 340+ sites
AVG SAFE RATE: 0.5–2 req/s per IP
UPDATED: 2026-05-19
01 Request rate per IP
fastest block trigger · burst above threshold → instant 429 or ban
02 Crawl-delay compliance
hard signal · ignoring declared delay → escalation fast
03 Retry-After compliance
strong signal · ignoring it → progressive IP blacklist
04 Concurrent connections per IP
medium signal · 20 parallel threads ≡ 20× the rate
05 Request timing variance
soft signal · perfectly regular cadence looks non-human
// 06 — our approach

Calibrated before the first production fetch.

We run a ramp-up probe on every new target: start at 0.1 req/s, step up by 0.1 every 60 seconds, watch for the first 429 or 503, then set production rate at 80% of that threshold. On that schedule each step issues about 60 × rate fetches, so reaching a 0.9 req/s threshold takes roughly 9 minutes and about 270 fetches. It saves the pipeline from a ban on page 1 of production. We also inject jitter (±15–25% on each delay interval) to avoid the perfectly regular cadence that some bot detectors flag explicitly.

Rate calibration — live session

Ramp-up probe results for a new e-commerce target before production crawl.

robots.Crawl-delay: 1s (read)
probe.threshold: 0.9 req/s (429 at this rate)
production.rate: 0.72 req/s (80% of threshold)
jitter.range: ±20% (applied)
concurrency.per_ip: 1 connection (enforced)
backoff.strategy: Retry-After → exp (active)
ban_events.30d: 0 (clean)


// 07 — FAQ

Common questions.

About crawl rate limits, robots.txt compliance, backoff strategies, and how DataFlirt calibrates crawl speed for every target.

What Crawl-delay should I use if robots.txt doesn't specify one?
Start conservative: 1–2 seconds per request per IP. Run a ramp-up probe to find where 429s start appearing, then set your production rate at 80% of that threshold. Large CDN-backed sites (Cloudflare, Fastly) can often handle 2–5 req/s. Small independent sites may rate-limit at 0.2–0.5 req/s. There's no safe universal default — probe first.
Does respecting robots.txt mean I have to obey Crawl-delay exactly?
robots.txt Crawl-delay sets a minimum gap between requests. Going slower than declared is always safe. Going faster is what triggers blocks. If the declared delay is 5 seconds and your pipeline needs more throughput, the answer is more IPs with independent rate limiting — not ignoring the directive on a single IP.
What's the difference between a 429 and a 503 during crawling?
A 429 (Too Many Requests) means you specifically hit a rate limit — slow down. A 503 (Service Unavailable) may mean the server is overloaded, which could be caused by your crawl or be entirely unrelated. Both should trigger a backoff, but 429 is the cleaner signal to act on. Always check for a Retry-After header on both.
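Retry-After can arrive either as a number of seconds or as an HTTP date, so a crawler has to handle both forms before it can back off correctly. The helper below is a minimal sketch using only the standard library; the name `retry_after_seconds` is hypothetical, and returning `None` for an unparseable value (so the caller falls back to exponential backoff) is a design assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header to seconds to wait, or None if unusable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "60"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further wait is required.
    return max(0.0, (when - now).total_seconds())


reference = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
from_seconds = retry_after_seconds("60")
from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=reference)
```

The same helper applies to both 429 and 503 responses, since either status may carry the header.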
Does adding jitter to delays actually matter?
Yes, for some detection stacks. Perfectly regular request cadence (exactly 2.000s between every request) is a bot signal — no human or realistic application produces it. Adding ±15–25% random jitter makes the timing distribution look closer to a real browser. It won't fool a sophisticated classifier on its own, but it removes a cheap signal.
Can you run multiple parallel connections and still be polite?
The Crawl-delay applies per crawler, not per connection. Multiple concurrent connections from the same IP multiply your effective request rate. If Crawl-delay: 2 means 0.5 req/s and you open 10 connections, you're hitting 5 req/s — the same as ignoring the directive. Polite parallelism means more IPs, not more connections per IP, each IP rate-limited independently.
Is polite crawling legally required?
Not in most jurisdictions in any direct sense — there's no statute mandating Crawl-delay compliance. But excessive crawl rates can form the basis of a computer abuse or unauthorised access claim in some jurisdictions, and courts have looked at whether the crawler obeyed stated access controls when assessing legality. Operationally, ignoring politeness is also just bad strategy — it gets you banned faster than any legal argument matters.