What is Scrape Rate?

Scrape rate is the number of pages or requests a pipeline sends to a target per unit of time — typically expressed as requests per second, per minute, or per hour. It controls the tradeoff between data freshness and detection risk: higher rates produce fresher data but consume more IP budget, stress the target's rate-limiting infrastructure, and increase the probability of triggering bot detection. Scrape rate is not a free variable — every target has an effective ceiling, and exceeding it reliably causes blocks.

Infrastructure · Rate Limiting · Politeness · Throughput · Bot Detection
// 02 — definitions

Fast enough
to matter.

Scrape rate sets the ceiling on data freshness. But rate and detection risk scale together — the right rate is the fastest you can go without signalling non-human behaviour to the target.

TL;DR

Scrape rate is requests per unit time from your pipeline to a target. Too low and your data is stale. Too high and you're blocked. The right rate is specific to each target, each proxy, and each time of day — and it changes as targets update their detection logic. Most production pipelines run at 0.3–2 req/s per exit IP; aggregate throughput comes from pool size, not from pushing individual IPs faster.

01 Definition & units

Scrape rate is measured at three levels simultaneously:

  • Per IP — requests per second from a single exit node to a single domain. This is the detection-relevant unit.
  • Per pipeline — aggregate requests per second across all IPs in the pool. This is the throughput unit.
  • Per endpoint — requests per second to a specific URL pattern. Rate limits are often endpoint-specific, not domain-wide.

These three rates can differ dramatically. A pipeline running 20 IPs at 0.5 req/s each has a per-IP rate of 0.5 and a pipeline rate of 10. Both numbers matter for different reasons.

02 Rate and detection risk

Detection risk doesn't scale linearly with rate — it scales faster. An IP making 1 req/s blends into normal user behaviour. The same IP making 5 req/s is in the top 0.01% of all users. Anti-bot classifiers flag velocity as one of the strongest bot signals precisely because real users almost never sustain high request rates.

This is why production pipelines distribute load across large IP pools at modest per-IP rates, rather than pushing a small number of IPs fast. The aggregate throughput can be identical; the detection profile is completely different.

03 Adaptive rate control

A fixed scrape rate set at pipeline launch is wrong within days. Target sensitivity changes as anti-bot vendors update their models. IP reputation changes as IPs age. Server load changes by hour of day and day of week.

Production schedulers adjust rate continuously: reduce on block signal, increase gradually after a clean streak, back off during peak target traffic hours, and restore after cooldown. The goal is to always be just below the detection threshold, not to hit a configured number regardless of what's happening on the target.
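The adjustment loop above can be sketched for a single IP-target channel. This is a minimal illustration that assumes an additive-increase, multiplicative-decrease policy; the policy choice, the `ChannelRate` name, and every default value are assumptions for the example, not DataFlirt's scheduler logic. Peak-hour backoff and cooldown timers are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class ChannelRate:
    """Rate state for one IP-target pair. All defaults are illustrative."""
    rate: float = 0.5          # current req/s
    floor: float = 0.1         # never drop below this
    ceiling: float = 2.0       # configured per-IP ceiling
    step: float = 0.1          # additive increase after a clean streak
    backoff: float = 0.5       # multiplicative decrease on a block signal
    streak_needed: int = 200   # clean responses required before increasing
    clean_streak: int = 0

    def record(self, blocked: bool) -> float:
        """Update the rate from one response outcome and return the new rate."""
        if blocked:
            # Reduce immediately on a block signal and restart the streak.
            self.rate = max(self.floor, self.rate * self.backoff)
            self.clean_streak = 0
        else:
            self.clean_streak += 1
            if self.clean_streak >= self.streak_needed:
                # Increase gradually, never past the ceiling.
                self.rate = min(self.ceiling, self.rate + self.step)
                self.clean_streak = 0
        return self.rate
```

Cutting sharply and recovering slowly keeps the channel below the threshold most of the time, at the cost of some unused headroom after each block.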

04 How DataFlirt manages rate per pipeline

Each pipeline has a configured throughput target and a per-IP rate ceiling. The scheduler allocates pool IPs to hit the throughput target while staying within the per-IP ceiling. When an IP is blocked or cooled down, the scheduler redistributes load to the remaining pool — the throughput target is maintained as long as enough IPs are available.
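As a rough sketch of that allocation rule, assuming an even split across whichever IPs are currently available (the `allocate` function and the even-split policy are hypothetical, chosen only to illustrate the constraint):

```python
def allocate(throughput_target: float,
             per_ip_ceiling: float,
             available_ips: list[str]) -> dict[str, float]:
    """Split the throughput target evenly across available IPs, capped per IP."""
    if not available_ips:
        return {}
    # Each IP gets an equal share, but never more than the per-IP ceiling.
    per_ip = min(per_ip_ceiling, throughput_target / len(available_ips))
    return {ip: per_ip for ip in available_ips}


pool = [f"ip-{n}" for n in range(1, 9)]   # 8 hypothetical exit IPs
full = allocate(4.0, 0.5, pool)           # 8 IPs at 0.5 req/s each
degraded = allocate(4.0, 0.5, pool[:7])   # one IP cooling down
print(sum(full.values()), sum(degraded.values()))  # 4.0 3.5
```

With seven IPs the ceiling binds and the aggregate drops to 3.5 req/s, which is why the target only holds while enough IPs are available.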

We publish a freshness SLO per pipeline: for a target with 50,000 pages at 5 req/s aggregate, a full refresh cycle takes ~2.8 hours. Clients see that number, not just a raw rate figure.
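The refresh-cycle figure is plain arithmetic on scope and throughput. A small sketch (the function name is illustrative):

```python
def full_refresh_hours(pages_in_scope: int, throughput_rps: float) -> float:
    """Hours needed to recrawl every page once at a steady aggregate rate."""
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    return pages_in_scope / throughput_rps / 3600


print(round(full_refresh_hours(50_000, 5.0), 2))  # 2.78
```

This assumes the aggregate rate holds for the whole cycle; cooldowns and retries stretch the real figure.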

05 The rate limit that isn't in the HTTP response

Many targets don't return 429. They return 200 with degraded content — a bot-wall page, a CAPTCHA challenge HTML body, or a silently empty data section. The HTTP layer looks clean; the data layer is broken.

Rate limit detection that only watches HTTP status codes misses the majority of actual rate limit events on sophisticated targets. The correct monitoring target is response body validation — hash the structure of successful responses and alert when the structure changes, regardless of status code.
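One way to sketch structural validation, using only the Python standard library. It assumes that the set of tag and class combinations is a good enough fingerprint of a page's shape; that choice, and the names `structure_hash` and `looks_degraded`, are assumptions for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _ShapeCollector(HTMLParser):
    """Collects the set of distinct tag/class combinations in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: set[str] = set()

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tokens.add(tag + "." + ".".join(sorted(classes.split())))


def structure_hash(html: str) -> str:
    """Hash the page's shape (tags and classes), ignoring text and repeat counts."""
    collector = _ShapeCollector()
    collector.feed(html)
    collector.close()
    signature = "|".join(sorted(collector.tokens))
    return hashlib.sha256(signature.encode("utf-8")).hexdigest()


def looks_degraded(html: str, known_good_hashes: set[str]) -> bool:
    """True when the shape matches no known-good page, whatever the status code."""
    return structure_hash(html) not in known_good_hashes
```

Hashing a set rather than the full tag sequence means a listing with 12 rows and one with 40 rows produce the same fingerprint, while a CAPTCHA page does not. Real targets usually need several known-good hashes per endpoint, since legitimate templates vary.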

// 03 — the model

Rate, throughput,
and detection risk.

Throughput and detection risk are coupled through the per-IP rate. These three relationships define the parameter space DataFlirt's rate scheduler operates within for every pipeline.

Aggregate throughput: T = rate_per_ip × pool_size
Scale by pool size, not by pushing individual IPs harder. (DataFlirt scheduler model)

Detection risk (simplified): R ∝ rate_per_ip² × session_age
Risk scales superlinearly with rate. Doubling rate more than doubles detection risk. (Empirical fit, DataFlirt fleet data)

Data freshness: F = 1 / (pages_in_scope / T)
Time to recrawl full scope = pages_in_scope ÷ T. F is recrawls per unit time. (Pipeline freshness SLO)
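To make the coupling concrete, the sketch below compares two ways of reaching the same aggregate throughput under the simplified risk relationship. It takes the squared-rate model at face value and reports risk only relative to a baseline channel; the function names are illustrative.

```python
def aggregate_throughput(rate_per_ip: float, pool_size: int) -> float:
    """Aggregate req/s across the pool."""
    return rate_per_ip * pool_size


def relative_risk(rate_per_ip: float, baseline_rate: float,
                  session_age_ratio: float = 1.0) -> float:
    """Per-IP risk relative to a baseline channel, with risk growing as rate squared."""
    return (rate_per_ip / baseline_rate) ** 2 * session_age_ratio


# Same 5 req/s aggregate, two very different per-IP profiles.
one_fast = relative_risk(5.0, baseline_rate=1.0)   # a single IP at 5 req/s
five_slow = relative_risk(1.0, baseline_rate=1.0)  # each of five IPs at 1 req/s
print(aggregate_throughput(5.0, 1), aggregate_throughput(1.0, 5))  # 5.0 5.0
print(one_fast, five_slow)  # 25.0 1.0
```

Under this model the single fast IP carries 25 times the per-IP risk of each slow one, for identical throughput.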
// 04 — rate scheduler trace

Rate decisions
in real time.

Rate scheduler log for one active pipeline. Shows per-IP rate assignment, aggregate throughput calculation, and a rate reduction triggered by a block signal on one exit node.

adaptive rate · 2 req/s aggregate · block signal detection
edge.dataflirt.io — live
CAPTURED
// pipeline config
target: "indiamart.com/proddetail"
rate.base: 0.5 req/s // per IP
pool.size: 8 IPs
throughput.target: 4.0 req/s

// 14:32:01 — normal operation
ip.49.36.xx.1: 0.5 req/s · 200 OK
ip.49.36.xx.2: 0.5 req/s · 200 OK
ip.49.36.xx.3: 0.5 req/s · 200 OK

// 14:34:17 — block signal on one IP
ip.49.36.xx.4: 429 Too Many Requests
scheduler.action: "cooldown ip.49.36.xx.4 for 15m"
scheduler.action: "redistribute to pool remainder"
throughput.current: 3.5 req/s // 7 IPs × 0.5

// 14:49:17 — IP restored
ip.49.36.xx.4: resumed · 0.3 req/s // conservative re-entry
// 05 — rate signals

What sets the
right rate.

Five factors that determine the sustainable scrape rate for a given target and proxy combination. These are not independent — a change in any one shifts the others.

PIPELINES TRACKED · 300+ active
RATE ADJUSTMENTS · per session
UPDATED · 2026-05-19

01 Target rate limit sensitivity
req/s ceiling · Platform-level, varies by endpoint

02 IP reputation / prior history
trust score · Fresh IPs tolerate higher rates

03 Session age
requests per IP · Old sessions attract more scrutiny

04 Time of day / traffic pattern
target server load · High-traffic hours are more forgiving

05 Response time variance
p95 latency · Slow targets need lower rates
// 06 — DataFlirt's adaptive rate scheduler

Not a fixed rate.
A continuous negotiation.

DataFlirt's rate scheduler adjusts per-IP rates in real time based on response signals: 429s trigger immediate cooldown, response time spikes trigger rate reduction, and clean 200 streaks allow gradual rate increase up to the configured ceiling. The scheduler treats each IP-target pair as an independent channel with its own current rate, not a single global throughput dial.

Rate scheduler state

Live rate allocation across the IP pool for one active scraping pipeline.

pipeline.throughput: 4.0 req/s target
pool.active: 8 IPs
rate.per_ip: 0.5 req/s
blocks.last_1h: 1 IP · 429
cooldowns.active: 0 IPs
throughput.actual: 3.9 req/s
freshness.full_cycle: ~42 min at current rate

// 07 — FAQ

Common
questions.

About sustainable rate limits, adaptive scheduling, the relationship between rate and detection, and how DataFlirt manages throughput across multi-IP pools.

What scrape rate should I use?
Start at 0.5 req/s per exit IP and measure block rate over 1,000 requests. If block rate is zero, increase by 0.2 req/s and repeat. Stop when you see the first 429 or bot-wall response. The sustainable rate is roughly 0.7× the rate where you first see blocks. Never push a single IP above 2 req/s on retail targets, because detection risk scales superlinearly with rate.
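That ramp-up procedure can be sketched as a loop. The `block_rate_at` callback is hypothetical: it stands in for running a batch of about 1,000 requests at the given rate and returning the observed fraction of blocked responses (429s or bot-wall bodies). The default values come from the procedure itself.

```python
from typing import Callable


def calibrate_rate(block_rate_at: Callable[[float], float],
                   start: float = 0.5,
                   step: float = 0.2,
                   hard_cap: float = 2.0,
                   safety: float = 0.7) -> float:
    """Ramp the per-IP rate until blocks appear, then back off to a safe fraction."""
    rate = start
    last_clean = None
    while rate <= hard_cap:
        if block_rate_at(rate) > 0:
            # First blocks seen: settle at roughly 0.7x this rate.
            return round(safety * rate, 3)
        last_clean = rate
        rate = round(rate + step, 10)  # avoid float drift across steps
    # No blocks seen below the cap: stay at the last rate that was measured.
    return last_clean if last_clean is not None else start
```

For example, a target that starts blocking at 1.1 req/s calibrates to about 0.77 req/s per IP. The loop never probes above the 2 req/s cap, so a clean run settles at the highest rate it tested.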
How do I increase throughput without increasing per-IP rate?
Add IPs to the pool. If 1 IP at 0.5 req/s is sustainable, 20 IPs at 0.5 req/s each gives you 10 req/s aggregate throughput. This is how production pipelines scale — pool size is the throughput lever, not pushing individual IPs faster. A single IP that handles 5 req/s is far more exposed than five IPs at 1 req/s each.
What's the difference between scrape rate and crawl delay?
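Scrape rate is your operational metric: how fast your pipeline actually runs. Crawl delay is the minimum wait between requests to a domain, requested in robots.txt via the Crawl-delay directive (a non-standard extension that not every site sets). The two use inverse units: a Crawl-delay of N seconds implies a ceiling of 1/N req/s, so a 10-second delay caps you at 0.1 req/s against that domain. Your scrape rate should always be at or below that ceiling. Ignoring crawl delay is both bad practice and a reliable path to IP-level blocks.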
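A short sketch of turning a robots.txt crawl delay into a rate ceiling, using Python's standard `urllib.robotparser`. The helper name and the `my-bot` user agent are placeholders; the robots.txt text is passed in directly so the example makes no network call.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def max_rate_from_robots(robots_txt: str, user_agent: str) -> Optional[float]:
    """Return the req/s ceiling implied by Crawl-delay, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if not delay:
        return None
    # A delay of N seconds between requests means at most 1/N requests per second.
    return 1.0 / float(delay)


robots = "User-agent: *\nCrawl-delay: 2\nDisallow: /private\n"
print(max_rate_from_robots(robots, "my-bot"))  # 0.5
```

When the function returns a ceiling, the scheduler's rate for that domain would be the lower of this value and the measured sustainable rate.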
How does rate limiting differ between endpoints on the same site?
Significantly. Category listing pages often tolerate higher rates than product detail pages. API endpoints have different limits than HTML pages. Search endpoints are almost always the most sensitive. Measure each endpoint class separately and set rate limits per endpoint, not per domain. A global domain rate limit will either be too conservative for listings or too aggressive for search.
How do I detect that I've hit a rate limit?
In order of reliability: HTTP 429 (explicit), HTML body containing bot-wall content despite 200 status (silent block), response time increase of 3× or more (server-side throttling), and empty response bodies. Monitor all four. Silent 200s with bot-wall HTML are the hardest to catch — validate response bodies, not just status codes.
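The four signals can be checked in that order with a small classifier. This is a sketch under assumptions: the marker strings are hypothetical examples of bot-wall text, and `baseline_s` stands for whatever typical response time you have recorded for the endpoint.

```python
from typing import Optional, Sequence


def classify_rate_limit(status: int,
                        body: str,
                        elapsed_s: float,
                        baseline_s: float,
                        wall_markers: Sequence[str] = ("captcha", "verify you are human"),
                        ) -> Optional[str]:
    """Return the first matching rate-limit signal, or None for a clean response."""
    if status == 429:
        return "http_429"                  # explicit rate limit
    lowered = body.lower()
    if any(marker in lowered for marker in wall_markers):
        return "bot_wall"                  # silent block behind a 200
    if baseline_s > 0 and elapsed_s >= 3 * baseline_s:
        return "throttled"                 # response time up 3x or more
    if not body.strip():
        return "empty"                     # empty response body
    return None
```

Substring markers are a crude check for bot-wall content; validating the response body's structure is more robust on targets whose block pages change wording.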
Can DataFlirt maintain a specific throughput SLO?
Yes, with pool-size guarantees. If your pipeline needs 10 req/s sustained against a specific target, we size the IP pool and set per-IP rates to deliver that with headroom for block events. We monitor actual vs target throughput and automatically expand the active pool when blocks reduce effective rate below the SLO threshold.