
What is Pages Per Minute?

Pages per minute (PPM) is the aggregate throughput metric of a scraping pipeline, measuring how many complete, validated HTML documents or JSON payloads are successfully fetched and parsed within a 60-second window. While engineers often focus on requests per second (RPS) at the network layer, PPM is the actual business metric: it accounts for retries, proxy timeouts, CAPTCHA blocks, and extraction failures. A high RPS with a low PPM means you are burning bandwidth on failed requests.

Throughput · Pipeline Metrics · Concurrency · Yield Ratio · SLA

// 02 — definitions

Throughput, measured in value.

Why raw request rates are a vanity metric, and how to measure the actual velocity of your data extraction pipeline.


TL;DR

Pages per minute tracks the number of successful, fully extracted records your pipeline produces. It is the denominator in every cost-per-record calculation. If your RPS is 50 (3,000 requests per minute) but your PPM is 300, only one request in ten yields a record: a 90% failure or retry rate, usually caused by proxy exhaustion or aggressive anti-bot tarpitting.

01 Definition & structure

Pages per minute (PPM) is the definitive measure of a scraping pipeline's throughput. Unlike Requests Per Second (RPS), which only measures outbound network activity, PPM measures the volume of completed work: pages fetched, parsed, validated, and stored. It accounts for the entire lifecycle of a scrape job, including proxy negotiation, TLS handshakes, DOM parsing, and schema validation.

02 RPS vs PPM

RPS is an infrastructure metric; PPM is a business metric. If a target implements a soft block (e.g., a CAPTCHA challenge), your scraper might retry the request 5 times before succeeding. Your RPS will spike, but your PPM will plummet. Tracking the ratio between the two (the Yield Ratio) is the most reliable way to detect silent blocking and proxy degradation.

03 The bottleneck stack

PPM is constrained by the slowest component in your pipeline. Common bottlenecks include:

Network Latency: Slow proxies or geographically distant target servers.
Compute Overhead: Heavy JavaScript rendering in headless browsers.
Parsing Inefficiency: Complex XPath queries blocking the CPU thread.
Database I/O: Slow write speeds when persisting extracted records.
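
One way to find which of these stages is the constraint is to time each one and compare the totals. The sketch below is only an illustration under stated assumptions: the `StageTimer` helper and the stage names are hypothetical, and the `time.sleep` calls stand in for real fetch, parse and store work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def slowest(self):
        # The stage with the largest accumulated time is the likely bottleneck.
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
with timer.stage("fetch"):
    time.sleep(0.1)    # stand-in for a network round trip
with timer.stage("parse"):
    time.sleep(0.01)   # stand-in for DOM parsing
with timer.stage("store"):
    time.sleep(0.005)  # stand-in for a database write

print(timer.slowest())  # prints "fetch"
```

In a real pipeline the totals would be accumulated over many pages, since a single sample says little about tail latency.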

04 How DataFlirt handles it

We design our pipelines around guaranteed PPM SLAs. Instead of static concurrency limits, our orchestration engine dynamically scales worker nodes based on real-time latency and success rates. At a constant success rate, throughput is proportional to concurrency divided by latency, so if a target's response time degrades from 500ms to 2000ms, our system automatically quadruples the concurrency to maintain the agreed-upon PPM and keep data feeds on schedule even when the target slows down.
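
The scaling rule above can be sketched by rearranging the throughput equation to solve for concurrency. This is a minimal illustration, not DataFlirt's scheduler: the function name `required_concurrency` and the 6,000 PPM target are assumptions chosen for the example.

```python
import math


def required_concurrency(target_ppm, avg_latency_sec, success_rate):
    """Concurrent requests needed to sustain target_ppm.

    Rearranges Effective PPM = Concurrency x (60 / latency) x success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return math.ceil(target_ppm * avg_latency_sec / (60 * success_rate))


before = required_concurrency(6000, 0.5, 1.0)  # target responding in 500ms
after = required_concurrency(6000, 2.0, 1.0)   # target degraded to 2000ms
print(before, after, after / before)  # prints "50 200 4.0"
```

The 4x ratio holds only while the success rate stays constant; if the extra concurrency triggers blocking, the success rate drops and the requirement grows further.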

05 The concurrency trap

A common mistake is attempting to increase PPM simply by increasing concurrency. If the bottleneck is target server capacity or WAF rate limits, adding more concurrent requests will actually decrease your PPM as the server begins dropping connections or issuing 429 Too Many Requests errors. Sustainable PPM scaling requires distributing requests across a wider IP pool and randomizing request intervals.
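
A minimal sketch of the last point, assuming a simple round-robin over the proxy pool and uniformly distributed jitter around a base delay. The proxy labels, the `schedule` function and the default delay values are hypothetical placeholders; real schedulers usually weight proxies by health rather than cycling blindly.

```python
import itertools
import random


def jittered_delays(base_delay_sec, jitter_fraction, rng):
    """Yield delays spread uniformly around base_delay_sec.

    jitter_fraction=0.5 gives delays in [0.5 * base, 1.5 * base].
    """
    low = base_delay_sec * (1 - jitter_fraction)
    high = base_delay_sec * (1 + jitter_fraction)
    while True:
        yield rng.uniform(low, high)


def schedule(urls, proxies, base_delay_sec=1.0, jitter_fraction=0.5, seed=None):
    """Pair each URL with a proxy (round-robin) and a randomized delay."""
    rng = random.Random(seed)
    proxy_cycle = itertools.cycle(proxies)
    delays = jittered_delays(base_delay_sec, jitter_fraction, rng)
    return [(url, next(proxy_cycle), next(delays)) for url in urls]


plan = schedule(
    urls=[f"https://example.com/page/{i}" for i in range(4)],
    proxies=["proxy-a", "proxy-b"],  # placeholder proxy labels
    seed=42,
)
for url, proxy, delay in plan:
    print(f"{proxy} waits {delay:.2f}s then fetches {url}")
```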

// 03 — throughput math

How do you calculate true PPM?

PPM is a function of concurrency, latency, and success rate. DataFlirt's scheduler uses these variables to autoscale worker nodes dynamically and maintain strict delivery SLAs.

Effective PPM = Concurrency × (60 / Avg_Latency_sec) × Success_Rate

The baseline throughput equation for any distributed crawler. Source: Queueing Theory
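
As a worked instance of the equation, the sketch below plugs in illustrative inputs. The function name `effective_ppm` and the numbers (100 concurrent requests, 1.5 s average latency, 80% success) are assumptions for the example, not measurements.

```python
def effective_ppm(concurrency, avg_latency_sec, success_rate):
    """Effective PPM = Concurrency x (60 / Avg_Latency_sec) x Success_Rate."""
    if avg_latency_sec <= 0:
        raise ValueError("avg_latency_sec must be positive")
    return concurrency * (60 / avg_latency_sec) * success_rate


# Each slot completes 60 / 1.5 = 40 attempts per minute; 80% of them succeed.
print(round(effective_ppm(100, 1.5, 0.8)))  # prints 3200
```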

Yield Ratio = PPM / (RPS × 60)

Measures pipeline efficiency. A value below 0.5 indicates severe blocking or timeouts. Source: DataFlirt SLOs
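
Applied to the TL;DR figures (50 RPS, 300 PPM), the ratio can be computed as follows. The helper names are hypothetical; the 0.5 cut-off is the one quoted in the caption.

```python
def yield_ratio(ppm, rps):
    """Yield Ratio = PPM / (RPS x 60)."""
    if rps <= 0:
        raise ValueError("rps must be positive")
    return ppm / (rps * 60)


def classify(ratio, severe_below=0.5):
    # 0.5 is the severe-blocking threshold quoted in the caption.
    return "severe blocking or timeouts" if ratio < severe_below else "ok"


ratio = yield_ratio(ppm=300, rps=50)
print(round(ratio, 2), classify(ratio))  # prints "0.1 severe blocking or timeouts"
```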

Cost per 1M Pages = (1,000,000 / PPM) × Cost_per_minute

Infrastructure cost normalized to successful throughput. Source: FinOps Standard
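
A short worked example shows why PPM, not request volume, drives unit cost. The fleet cost of 0.50 per minute (currency unspecified) and both PPM values are hypothetical.

```python
def cost_per_million_pages(ppm, cost_per_minute):
    """Cost per 1M Pages = (1,000,000 / PPM) x Cost_per_minute."""
    if ppm <= 0:
        raise ValueError("ppm must be positive")
    return (1_000_000 / ppm) * cost_per_minute


# Same fleet, same spend per minute; only the successful throughput differs.
healthy = cost_per_million_pages(ppm=10_000, cost_per_minute=0.50)
degraded = cost_per_million_pages(ppm=5_000, cost_per_minute=0.50)
print(healthy, degraded)  # prints "50.0 100.0"
```

Halving PPM at a fixed infrastructure spend doubles the cost of every delivered page.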

// 04 — pipeline telemetry

Monitoring PPM across a distributed fleet.

Live telemetry from a DataFlirt worker cluster scraping a major real estate portal. Notice the gap between attempted requests and successfully extracted pages.

Prometheus · worker-node-04 · real-estate-IN


// 1-minute trailing window
http.requests.attempted: 14,250
http.requests.success_200: 12,105
http.requests.blocked_403: 2,145

// extraction layer
parser.documents.processed: 12,105
parser.validation.passed: 11,890
parser.validation.failed: 215 // schema drift

// throughput metrics
metric.rps_outbound: 237.5
metric.ppm_effective: 11,890
metric.yield_ratio: 0.83 // acceptable

// autoscaler
target.ppm_sla: 15,000
status: "scaling workers +4 to meet SLA"
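
The derived metrics in this capture can be reproduced from the raw counters, which also shows where the lost pages went. This is a sketch: the dictionary keys are shortened, hypothetical names for the counters above, and the calculation assumes the counters cover exactly one 60-second window.

```python
counters = {
    "attempted": 14_250,
    "success_200": 12_105,
    "blocked_403": 2_145,
    "validation_passed": 11_890,
    "validation_failed": 215,
}
target_ppm_sla = 15_000

rps_outbound = counters["attempted"] / 60
ppm_effective = counters["validation_passed"]
observed_yield = ppm_effective / counters["attempted"]
block_rate = counters["blocked_403"] / counters["attempted"]
validation_fail_rate = counters["validation_failed"] / counters["success_200"]
sla_shortfall = target_ppm_sla - ppm_effective

print(f"rps_outbound={rps_outbound:.1f}")                  # 237.5
print(f"yield_ratio={observed_yield:.2f}")                 # 0.83
print(f"block_rate={block_rate:.1%}")                      # 15.1%
print(f"validation_fail_rate={validation_fail_rate:.1%}")  # 1.8%
print(f"sla_shortfall={sla_shortfall}")                    # 3110
```

In this window, 403 blocks account for far more of the yield loss than schema drift, and the pipeline is 3,110 pages per minute short of its SLA.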

// 05 — throughput killers

Where your PPM actually goes.

The most common reasons a high-concurrency pipeline fails to deliver a high pages-per-minute yield, based on DataFlirt's incident post-mortems.

Pipelines analyzed: 850+
Metric: Yield Drop Causes
Updated: 2026-05-19

01 Anti-bot tarpitting (silent retries): Server holds connection open to bleed concurrency.
02 Proxy pool exhaustion (timeouts): High RPS leads to IP bans and connection drops.
03 DOM parsing bottlenecks (CPU bound): Heavy XPath/CSS selectors blocking the event loop.
04 Target rate limits (HTTP 429s): Hard limits enforced by the origin server WAF.
05 Headless browser overhead (memory thrashing): Playwright/Puppeteer contexts crashing under load.

// 06 — DataFlirt architecture

Scale the yield, not just the request rate.

At DataFlirt, we don't bill on requests — we bill on successful records. This aligns our infrastructure incentives with your business goals. Our orchestration layer continuously monitors the yield ratio (PPM vs RPS). If a target starts tarpitting our IPs, throwing more concurrency at it will just burn proxy bandwidth. Instead, the scheduler automatically rotates the fingerprint profile, shifts the ASN mix, and throttles the RPS until the yield ratio recovers, ensuring a stable PPM without triggering permanent bans.

worker-throughput.json

Real-time throughput configuration for a DataFlirt worker node.

target.domain: retail-catalog-eu
concurrency.max: 120 threads
latency.p95: 850ms
ppm.current: 8,450
yield.ratio: 0.94
throttle.threshold: yield < 0.75
status: optimal
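
The throttle-until-recovery behaviour can be sketched with an additive-increase, multiplicative-decrease (AIMD) rule. AIMD is a technique borrowed from TCP congestion control and is swapped in here purely for illustration; the source does not state which control law the DataFlirt scheduler uses. The function name, backoff factor, recovery step and RPS bounds are all assumptions; only the 0.75 threshold comes from the configuration above. Fingerprint rotation and ASN shifting are not modelled.

```python
def next_rps(current_rps, yield_ratio, threshold=0.75,
             backoff=0.5, recover_step=5.0, min_rps=1.0, max_rps=500.0):
    """Return the RPS to use for the next measurement window.

    Below the threshold: cut RPS multiplicatively to relieve pressure.
    At or above it: recover additively toward max_rps.
    """
    if yield_ratio < threshold:
        return max(min_rps, current_rps * backoff)
    return min(max_rps, current_rps + recover_step)


rps = 200.0
for observed_yield in [0.94, 0.70, 0.60, 0.80, 0.90]:
    rps = next_rps(rps, observed_yield)
    print(f"yield={observed_yield:.2f} -> rps={rps}")
```

Cutting quickly and recovering slowly keeps the request rate from bouncing straight back into the block that caused the yield drop.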


// 07 — FAQ

Common questions.

Common questions about scaling throughput, managing concurrency, and optimizing pages per minute in production scraping pipelines.


Why is my RPS high but my PPM low?

You have a low yield ratio. This usually means your requests are failing, timing out, or returning CAPTCHAs, forcing your pipeline to retry. You are generating network traffic but not extracting data. Check your proxy success rates and anti-bot block rates.

How do I increase my pages per minute?

The naive answer is to increase concurrency (more threads/workers). The correct answer is to optimize latency and success rate first. Use faster proxies, strip unnecessary headers, block image/font loading in headless browsers, and ensure your IP reputation is clean to avoid tarpitting.
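
Blocking image and font loading can be sketched with Playwright's request routing in Python. This is a minimal example, assuming the `playwright` package and a Chromium build are installed; the function names `block_heavy_assets` and `fetch_html` are hypothetical, and error handling and timeouts are omitted.

```python
BLOCKED_RESOURCE_TYPES = {"image", "font"}


def block_heavy_assets(route):
    """Abort requests for assets the parser never reads; let the rest through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


def fetch_html(url):
    """Load a page with images and fonts blocked and return its HTML."""
    # Imported lazily so the handler above can be exercised without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.route("**/*", block_heavy_assets)  # intercept every request
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_html("https://example.com")` would return the rendered HTML without downloading images or fonts, which typically reduces both latency and memory per page.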

What is a good PPM for a single worker node?

It depends entirely on the target and the stack. A single Node.js worker doing pure HTTP/JSON scraping can easily hit 3,000+ PPM. If you are running Playwright with full JavaScript rendering, a single 4-core worker might max out at 60–100 PPM before memory thrashing occurs.
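
The per-worker figures above translate into very different fleet sizes. The sketch assumes throughput scales linearly with worker count (which ignores proxy and target limits) and uses an illustrative aggregate target of 15,000 PPM; `workers_needed` is a hypothetical helper.

```python
import math


def workers_needed(target_ppm, ppm_per_worker):
    """Workers required to reach target_ppm, rounded up to whole nodes."""
    if ppm_per_worker <= 0:
        raise ValueError("ppm_per_worker must be positive")
    return math.ceil(target_ppm / ppm_per_worker)


target = 15_000  # illustrative aggregate target
print(workers_needed(target, 3_000))  # pure HTTP/JSON worker: 5
print(workers_needed(target, 100))    # Playwright worker, upper end: 150
print(workers_needed(target, 60))     # Playwright worker, lower end: 250
```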

How does DataFlirt guarantee PPM SLAs?

We over-provision our worker pools and use dynamic routing. If a target slows down (increasing latency) or blocks a subnet (decreasing success rate), our orchestrator automatically spins up additional workers in different regions to maintain the aggregate PPM required to hit your delivery deadline.

Does increasing PPM increase my risk of getting blocked?

Yes, if you don't scale your proxy pool proportionally. If you double your PPM using the same 1,000 residential IPs, the request density per IP doubles, making you highly visible to rate-limiting WAFs. High PPM requires a massive, distributed IP pool.
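
The density argument is simple arithmetic, sketched below. It assumes requests are spread evenly across the pool, and the 5,000 PPM baseline is hypothetical; `requests_per_ip_per_minute` is an illustrative helper, not an established metric name.

```python
def requests_per_ip_per_minute(ppm, yield_ratio, pool_size):
    """Average outbound requests each IP carries per minute.

    Attempts per minute = PPM / yield ratio, spread evenly over the pool.
    """
    if pool_size <= 0 or not 0 < yield_ratio <= 1:
        raise ValueError("pool_size must be positive and yield_ratio in (0, 1]")
    return ppm / yield_ratio / pool_size


print(requests_per_ip_per_minute(5_000, 1.0, 1_000))   # baseline: 5.0
print(requests_per_ip_per_minute(10_000, 1.0, 1_000))  # PPM doubled: 10.0
print(requests_per_ip_per_minute(10_000, 1.0, 2_000))  # pool doubled too: 5.0
```

A falling yield ratio raises the density as well, because every lost page costs extra attempts on the same IPs.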

Should I measure PPM before or after data validation?

Always after. A page that returns a 200 OK but contains an "Access Denied" message or an empty JSON array is not a successful page. True PPM only counts records that pass your schema validation and are written to your database or delivery bucket.
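
A validation gate of this kind might look like the sketch below. The schema (a non-empty JSON list of objects that each carry an `id`) and the function name `is_successful_page` are hypothetical; a production check would use the pipeline's real schema and a broader set of soft-block signatures.

```python
import json


def is_successful_page(status_code, body):
    """True only if the response is a 200 carrying usable JSON records."""
    if status_code != 200:
        return False
    if "access denied" in body.lower():
        return False  # soft block served with a 200 status
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Hypothetical schema: a non-empty list of objects that each have an "id".
    return (
        isinstance(payload, list)
        and len(payload) > 0
        and all(isinstance(item, dict) and "id" in item for item in payload)
    )


responses = [
    (200, '[{"id": 1, "price": 10}]'),    # valid record
    (200, "<html>Access Denied</html>"),  # block page behind a 200
    (200, "[]"),                          # empty payload
    (403, "Forbidden"),                   # hard block
]
true_ppm_count = sum(is_successful_page(code, body) for code, body in responses)
print(true_ppm_count)  # prints 1
```

Three of the four responses would count toward a naive "200 OK" tally or toward RPS, but only one counts toward PPM.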
