
What is Scraping Frequency?

Scraping frequency is how often a pipeline re-runs against the same target: daily, hourly, every 15 minutes, or continuously. It determines data freshness at the dataset level, as opposed to scrape rate, which controls request speed within a single run. Frequency is a business decision with infrastructure consequences: higher frequency means fresher data, higher proxy costs, more extraction runs, and more cumulative exposure to target anti-bot systems. Getting it wrong in either direction costs money: too low means stale data that no one trusts; too high means spending on freshness that the use case doesn't require.


// 02 — definitions

How fresh does it need to be?

Frequency is a business question first and an infrastructure question second. The right answer depends on how quickly the underlying data changes and how quickly stale data causes a downstream decision error.


TL;DR

Scraping frequency controls how often the pipeline re-runs and therefore how old the data can get between refreshes. Real-time price monitoring needs minute-level frequency. Competitive intelligence might tolerate daily. Choosing frequency without measuring data change velocity is guesswork: scraping too often wastes infrastructure budget; scraping too rarely produces datasets that downstream consumers quietly stop trusting.

01 Definition & the two axes of throughput

Scraping has two distinct throughput dimensions that are often conflated:

Scrape rate — requests per second within a single run. Controls speed of execution and detection risk per IP.
Scraping frequency — how often a run is scheduled. Controls data freshness at the dataset level.

You can have low rate and high frequency (many slow runs) or high rate and low frequency (few fast runs). The right combination depends on target rate limits, data change velocity, and your freshness SLO. Treating them as the same variable produces pipelines that are either over-built or chronically stale.

02 How to measure data change velocity

Before setting a frequency, measure how fast the target data actually changes. Sample 500–1,000 representative pages, scrape them at T=0, then again at T=1h, T=6h, and T=24h, and compare field values across the snapshots.

Most datasets have a bimodal change distribution: a small fraction of pages changes frequently (popular products, live pricing) and a large fraction is stable (long-tail catalogue, static content). Setting frequency based on the high-velocity fraction and applying it to the whole scope wastes significant budget. Change-triggered re-scraping is more efficient when variance is high.
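The snapshot comparison described above can be sketched roughly as follows. This is a minimal illustration, not DataFlirt's implementation: the snapshot layout (URL mapped to a dict of field values), the function names, and the `min_changes` threshold are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed layout: one snapshot maps each URL to its extracted field values.
Snapshot = Dict[str, Dict[str, str]]


def page_change_counts(snapshots: List[Snapshot]) -> Dict[str, int]:
    """Count, per URL, how many consecutive snapshot pairs differ in any field."""
    counts: Dict[str, int] = {}
    for earlier, later in zip(snapshots, snapshots[1:]):
        # Only compare URLs present in both snapshots of the pair.
        for url in earlier.keys() & later.keys():
            changed = earlier[url] != later[url]
            counts[url] = counts.get(url, 0) + int(changed)
    return counts


def split_by_velocity(
    counts: Dict[str, int], min_changes: int = 2
) -> Tuple[List[str], List[str]]:
    """Split URLs into a high-velocity group and a stable group."""
    fast = sorted(u for u, c in counts.items() if c >= min_changes)
    stable = sorted(u for u, c in counts.items() if c < min_changes)
    return fast, stable


# Toy snapshots at T=0, T=1h, T=6h, T=24h (hypothetical data).
t0 = {"/p/1": {"price": "10"}, "/p/2": {"price": "20"}, "/p/3": {"price": "30"}}
t1 = {"/p/1": {"price": "11"}, "/p/2": {"price": "20"}, "/p/3": {"price": "30"}}
t6 = {"/p/1": {"price": "12"}, "/p/2": {"price": "20"}, "/p/3": {"price": "31"}}
t24 = {"/p/1": {"price": "13"}, "/p/2": {"price": "20"}, "/p/3": {"price": "31"}}

counts = page_change_counts([t0, t1, t6, t24])
fast, stable = split_by_velocity(counts)
print(fast, stable)
```

On real data the threshold would be chosen after looking at the distribution of change counts; if it really is bimodal, the split point tends to be obvious.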

03 Frequency, scope, and infrastructure cost

Total daily request volume = scope × (24 / frequency_hours), where frequency_hours is the interval between runs in hours. This number drives everything: proxy cost, compute, extraction worker capacity, and storage. Doubling frequency doubles request volume, and cost scales linearly with it; there's no economy of scale.

The common mistake is setting frequency in the requirements phase without knowing scope size, then being surprised by infrastructure costs in production. Scope × frequency × cost-per-request is the correct early estimate. Most clients who come to DataFlirt after a self-build failure have a pipeline where this product was never calculated.
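As a rough sketch of that early estimate, the function below applies the scope × frequency × cost-per-request product. The function names and the cost-per-request figure are illustrative assumptions; the scope and cadence values are arbitrary examples.

```python
def daily_requests(scope_urls: int, interval_hours: float) -> float:
    """Requests per day for a scope re-scraped every `interval_hours` hours."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return scope_urls * (24 / interval_hours)


def daily_cost(scope_urls: int, interval_hours: float, cost_per_request: float) -> float:
    """Scope x runs-per-day x cost-per-request, in the currency of cost_per_request."""
    return daily_requests(scope_urls, interval_hours) * cost_per_request


# Hypothetical blended cost per request (proxy + compute); not a real price.
COST_PER_REQUEST = 0.001

for label, interval in [("hourly", 1.0), ("every 15 min", 0.25)]:
    reqs = daily_requests(10_000, interval)
    cost = daily_cost(10_000, interval, COST_PER_REQUEST)
    print(f"{label}: {reqs:,.0f} requests/day, ~{cost:,.2f} per day")
```

Running this with the real scope size before committing to a cadence is the point of the exercise; the absolute numbers matter less than seeing how quickly they grow as the interval shrinks.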

04 How DataFlirt handles mixed-frequency pipelines

We segment every pipeline's URL scope by page class and assign independent frequency tiers. A retail pipeline might run: top-500 pricing pages every 15 minutes, active-category listings every 2 hours, full catalogue once daily. The scheduler manages the three tiers independently — a delay in the full catalogue run doesn't block the high-frequency pricing tier.

For clients with tight freshness SLOs on a subset of their data, this approach typically reduces total request volume by 40–60% compared to uniform high-frequency across the full scope.
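One way to picture independent tiers is a scheduler that evaluates each tier against its own interval, so a late run in one tier never delays another. The sketch below is an assumption about how such a check could look, not DataFlirt's scheduler; the `Tier` class, the tier names, and the timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Tier:
    name: str
    interval: timedelta       # target cadence for this page class
    last_started: datetime    # when this tier's most recent run began


def due_tiers(tiers: List[Tier], now: datetime) -> List[str]:
    """Return the tiers whose interval has elapsed, each judged on its own clock."""
    return [t.name for t in tiers if now - t.last_started >= t.interval]


now = datetime(2026, 5, 19, 12, 0)
tiers = [
    Tier("top-500-pricing", timedelta(minutes=15), datetime(2026, 5, 19, 11, 45)),
    Tier("active-categories", timedelta(hours=2), datetime(2026, 5, 19, 10, 30)),
    # The catalogue tier is overdue here; that does not hold back the pricing tier.
    Tier("full-catalogue", timedelta(days=1), datetime(2026, 5, 17, 3, 0)),
]

print(due_tiers(tiers, now))
```

Because each tier carries its own `last_started`, the overdue catalogue run and the on-time pricing run are both reported as due without either waiting on the other.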

05 Why daily scraping often isn't daily

"Daily" means data is at most 24 hours old — but that's only true if the pipeline runs reliably in under 24 hours, every day, without failures. A pipeline that takes 22 hours to complete and occasionally fails has effective data ages that vary from 24 hours to 46+ hours, depending on when failures occur in the run.

Frequency SLOs should be stated as maximum staleness, not as run cadence: "data must be no more than 26 hours old, measured at delivery time" is a testable SLO. "Daily run" is a schedule — and a schedule with no monitoring is just a cron job hoping for the best.
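A maximum-staleness SLO of this kind is straightforward to test at delivery time. The following is a minimal sketch under assumed names (`staleness_at_delivery`, `meets_slo`) and made-up timestamps; the 26-hour limit is the example figure from the text above.

```python
from datetime import datetime, timedelta


def staleness_at_delivery(last_successful_scrape: datetime, delivered_at: datetime) -> timedelta:
    """Age of the data at the moment it is delivered."""
    return delivered_at - last_successful_scrape


def meets_slo(
    last_successful_scrape: datetime,
    delivered_at: datetime,
    max_staleness: timedelta = timedelta(hours=26),
) -> bool:
    """True if the delivered data is no older than the SLO allows."""
    return staleness_at_delivery(last_successful_scrape, delivered_at) <= max_staleness


scraped = datetime(2026, 5, 18, 3, 0)

# Normal day: delivered 25.5 hours after the scrape.
print(meets_slo(scraped, datetime(2026, 5, 19, 4, 30)))

# The next run failed, so the following delivery still carries the old data (49.5 hours).
print(meets_slo(scraped, datetime(2026, 5, 20, 4, 30)))
```

The check depends only on the timestamp of the last successful scrape, so a failed or slow run surfaces as an SLO violation even though the schedule itself fired on time.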

// 03 — the model

Matching frequency to change velocity.

The right scraping frequency is derived from the data's change velocity and the business's staleness tolerance. These three models define the decision framework DataFlirt uses when scoping a new pipeline.

Data staleness: S = t_now − t_last_scrape

Staleness is absolute age. The SLO is that S must be less than the staleness tolerance. (Pipeline freshness SLO)

Change velocity: V = fields_changed / (total_fields × Δt)

If V is low, high frequency wastes budget. Measure before committing. (DataFlirt change detection layer)

Optimal frequency: f_opt ≈ k × V / staleness_tolerance

k ≈ 2–5 depending on acceptable miss rate for changes between runs. (DataFlirt scheduling model)
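The change-velocity quantity can be computed directly from two snapshots of the same pages. The sketch below assumes each snapshot maps a URL to a dict of field values and measures Δt in hours; the function name and the sample data are illustrative, and the unit of V simply follows from the unit chosen for Δt.

```python
from typing import Dict

# Assumed layout: URL -> {field name: field value}
Snapshot = Dict[str, Dict[str, str]]


def change_velocity(before: Snapshot, after: Snapshot, delta_t_hours: float) -> float:
    """V = fields_changed / (total_fields * delta_t), in changes per field per hour."""
    if delta_t_hours <= 0:
        raise ValueError("delta_t_hours must be positive")
    total = 0
    changed = 0
    for url, old_fields in before.items():
        new_fields = after.get(url, {})
        for name, old_value in old_fields.items():
            total += 1
            # A field missing from the later snapshot counts as changed.
            if new_fields.get(name) != old_value:
                changed += 1
    if total == 0:
        raise ValueError("no fields to compare")
    return changed / (total * delta_t_hours)


before = {
    "/p/1": {"price": "10", "stock": "in"},
    "/p/2": {"price": "20", "stock": "in"},
}
after = {
    "/p/1": {"price": "11", "stock": "in"},
    "/p/2": {"price": "20", "stock": "in"},
}

# 1 of 4 fields changed over 2 hours.
print(change_velocity(before, after, delta_t_hours=2.0))
```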

// 04 — frequency scheduler trace

Scheduled runs across three pipelines.

Scheduler log showing three pipelines with different frequency settings. Illustrates how frequency, scope size, and scrape rate combine to determine resource consumption per day.



// pipeline A: spot pricing — high frequency
pipeline.A.cadence: "every 15m"
pipeline.A.scope: 2,400 URLs
pipeline.A.runs_per_day: 96
pipeline.A.requests_per_day: 230,400

// pipeline B: category catalogue — medium frequency
pipeline.B.cadence: "every 6h"
pipeline.B.scope: 48,000 URLs
pipeline.B.runs_per_day: 4
pipeline.B.requests_per_day: 192,000

// pipeline C: competitor intelligence — low frequency
pipeline.C.cadence: "daily 03:00 IST"
pipeline.C.scope: 120,000 URLs
pipeline.C.runs_per_day: 1
pipeline.C.requests_per_day: 120,000

// total load
aggregate.requests_per_day: 542,400
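The figures in this log can be reproduced, and extended with the minimum aggregate scrape rate each pipeline needs to finish a run inside its cadence window. This is a sketch under one stated assumption: each run issues exactly one request per URL in scope (which is what the log's own numbers imply). The dictionary layout and function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Scope and cadence taken from the log above; intervals expressed in seconds.
PIPELINES = {
    "A": {"scope": 2_400, "interval_s": 15 * 60},
    "B": {"scope": 48_000, "interval_s": 6 * 60 * 60},
    "C": {"scope": 120_000, "interval_s": 24 * 60 * 60},
}


def runs_per_day(interval_s: int) -> int:
    return SECONDS_PER_DAY // interval_s


def requests_per_day(scope: int, interval_s: int) -> int:
    # Assumes one request per URL per run.
    return scope * runs_per_day(interval_s)


def min_aggregate_rate(scope: int, interval_s: int) -> float:
    """Lowest req/s (summed across all IPs) that completes a run before the next one starts."""
    return scope / interval_s


total = 0
for name, p in PIPELINES.items():
    reqs = requests_per_day(p["scope"], p["interval_s"])
    rate = min_aggregate_rate(p["scope"], p["interval_s"])
    total += reqs
    print(f"pipeline {name}: {reqs:,} requests/day, needs >= {rate:.2f} req/s aggregate")

print(f"aggregate: {total:,} requests/day")
```

The rate column is where frequency and scrape rate meet: if target rate limits or pool size cannot sustain that aggregate rate, the cadence is not feasible for that scope.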

// 05 — frequency by use case

What each use case actually needs.

Frequency requirements by data category, based on DataFlirt's pipeline data across retail, B2B, and intelligence use cases. Most teams over-estimate required frequency for catalogue data and under-estimate it for pricing data.

USE CASES MAPPED: across 300+ pipelines
FREQUENCY RANGE: 1 min to 7 days
UPDATED: 2026-05-19

01 · Spot pricing / inventory · 1–15 min cadence · Prices move faster than most think
02 · Dynamic availability · 15–60 min cadence · Stock changes intraday on popular SKUs
03 · Product catalogue / listings · 6–24h cadence · New listings, attribute changes
04 · Competitive intelligence · daily cadence · Pricing strategy, not tick data
05 · Regulatory / compliance data · weekly cadence · Policy pages, T&C, disclosures

// 06 — DataFlirt's frequency and change detection

Run when data changes, not on a fixed clock.

For pipelines where change velocity is uneven — most data is stable, a fraction changes frequently — DataFlirt uses change-triggered re-scraping rather than fixed-clock scheduling. A lightweight hash check on key fields identifies changed pages; only those pages are re-scraped at the full rate. Stable pages are re-validated at low frequency. This reduces total requests by 40–70% for catalogue-heavy pipelines without reducing freshness on the pages that matter.
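A hash check on key fields might look roughly like the sketch below. It is an illustration only: the source does not describe how the key fields are fetched or hashed, so the record layout, the choice of SHA-256 over a JSON encoding, and the function names are all assumptions.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def fingerprint(record: Dict[str, object], key_fields: Iterable[str]) -> str:
    """Stable hash over the selected key fields only."""
    payload = json.dumps(
        {field: record.get(field) for field in key_fields},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_urls(
    previous_hashes: Dict[str, str],
    probe_records: Dict[str, Dict[str, object]],
    key_fields: List[str],
) -> List[str]:
    """URLs whose key-field hash differs from the stored one (new URLs included)."""
    return [
        url
        for url, record in probe_records.items()
        if previous_hashes.get(url) != fingerprint(record, key_fields)
    ]


KEY_FIELDS = ["price", "availability"]  # hypothetical key fields

old = {
    "/p/1": {"price": "10", "availability": "in", "blurb": "old text"},
    "/p/2": {"price": "20", "availability": "in", "blurb": "same"},
}
stored = {url: fingerprint(rec, KEY_FIELDS) for url, rec in old.items()}

probe = {
    "/p/1": {"price": "10", "availability": "in", "blurb": "new text"},  # only a non-key field moved
    "/p/2": {"price": "19", "availability": "in", "blurb": "same"},      # key field moved
    "/p/3": {"price": "5", "availability": "in", "blurb": "new page"},   # not seen before
}

print(changed_urls(stored, probe, KEY_FIELDS))
```

Only the URLs returned here would go to the full re-scrape queue; the rest would fall back to a low-frequency revalidation pass. Restricting the hash to key fields keeps cosmetic page changes from triggering unnecessary re-scrapes.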

Frequency scheduler config

Scheduling parameters for a mixed-frequency retail intelligence pipeline.

pipeline.id: retail-intel-IN-028
pricing.cadence: every 30m
catalogue.cadence: every 12h
change_detect.active: yes
changed_pages.rate: ~18% per 30m window
requests.saved: ~62% vs full re-scrape
freshness.slo: < 35 min for pricing


// 07 — FAQ

Common questions.

About frequency selection, cost implications, change-triggered scheduling, and how DataFlirt sets cadences that match business requirements without over-spending on unnecessary freshness.


How do I decide what scraping frequency I need?

Measure change velocity on your target before committing to a frequency. Run two scrapes 1 hour apart and compare field values. If 2% of prices changed, hourly is probably right. If 0.01% changed, daily is fine. The goal is to catch meaningful changes before they affect downstream decisions — not to minimise staleness regardless of cost.

What does scraping frequency cost in practice?

It scales linearly with both scope and frequency. A 10,000-page pipeline running hourly makes 240,000 requests per day. The same pipeline running every 15 minutes makes 960,000. Proxy cost, compute, and extraction processing all scale with request volume. Most teams underestimate this until they get the first infrastructure bill. For budgeting, the right formula is scope × frequency × cost_per_request.

What's the difference between scraping frequency and scrape rate?

Frequency is how often the pipeline starts a new run: daily, hourly, every 15 minutes. Scrape rate is how fast requests go out during a run: 0.5 req/s, 2 req/s. They're independent axes. Depending on scope, a daily pipeline running at 5 req/s and an hourly pipeline running at 0.5 req/s can produce the same total daily request volume with very different freshness profiles.

Can I scrape some pages more frequently than others within the same pipeline?

Yes, and you probably should. High-velocity pages (top SKUs, key competitors, out-of-stock items) should run at a higher cadence than stable catalogue pages. DataFlirt supports per-URL-class frequency tiers within a single pipeline — the pricing pages for the top 500 SKUs can run every 15 minutes while the full 50,000-SKU catalogue runs daily.

Does higher frequency increase the risk of getting blocked?

Not directly — if the per-IP rate within each run stays constant. Blocks come from request velocity per IP, not from how often you run. A pipeline that runs hourly at 0.5 req/s per IP is no more detectable than one that runs daily at 0.5 req/s per IP. The cumulative IP exposure increases with frequency though, which makes IP rotation more important at high frequencies.

What's the highest frequency DataFlirt supports?

Minute-level for targeted spot-price pipelines on a small URL set: we have pipelines running every 90 seconds on 500–2,000 URLs. For larger scopes, the practical ceiling is set by how fast the full scope can be scraped within the cadence window, which depends on scope size, target rate limits, and pool size. We calculate the feasible frequency range during scoping.
