
What is Pagination Detection?

Pagination detection is the automated process of identifying how a target website structures its multi-page datasets—whether via URL offsets, cursor tokens, infinite scroll APIs, or GraphQL edges—and dynamically adapting the crawler's traversal logic to extract the complete record set. For data pipelines, failing to detect the correct pagination boundary means either missing 80% of a catalog or entering an infinite loop that burns proxy bandwidth and triggers anti-bot bans.

Crawling · Cursor Tokens · Infinite Scroll · Offset Pagination · Traversal Logic
// 02 — definitions

Finding the next page.

The algorithmic challenge of determining where a dataset ends and how to request the subsequent chunk without breaking state or triggering rate limits.


TL;DR

Pagination detection maps the traversal path for multi-page datasets. Modern targets rarely use simple ?page=2 URL parameters anymore; they rely on opaque cursor tokens, encrypted state blobs, or dynamic API offsets. A production crawler must reverse-engineer this logic on the fly to ensure 100% record extraction without duplicating requests or hitting infinite loops.

01 Definition & structure

Pagination detection is the logic a crawler uses to understand how a dataset is split across multiple requests. Without it, a scraper only ever sees the first 50 items of a 50,000-item catalog. The structure typically falls into one of three schemas:

  • Offset/Page: ?page=2 or ?offset=50. Simple, but prone to data drift.
  • Cursor/Keyset: ?after=token123. A pointer to the last seen record. Highly consistent but forces sequential crawling.
  • Link Headers: HTTP Link headers as defined in RFC 8288 (which obsoletes RFC 5988), giving the exact URL for the rel="next" page. Common in REST APIs.
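
As a rough illustration of the Link header schema, the sketch below pulls the rel="next" target out of a raw header value. The function name next_link and the example.com URLs are illustrative assumptions, and the regex only covers the common <url>; rel="next" form rather than every variation the RFC permits.

```python
import re
from typing import Optional

# One link-value looks like: <url>; param=value; param=value
_LINK_VALUE = re.compile(r'<([^>]*)>([^,]*)')

def next_link(link_header: str) -> Optional[str]:
    """Return the URL tagged rel="next", or None if the header has none."""
    for url, params in _LINK_VALUE.findall(link_header):
        # rel may carry several space-separated relation types.
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params, re.IGNORECASE)
        if rel and "next" in rel.group(1).lower().split():
            return url
    return None

header = ('<https://api.example.com/items?page=2>; rel="next", '
          '<https://api.example.com/items?page=9>; rel="last"')
print(next_link(header))
```

A missing rel="next" entry is the boundary signal for this schema, so a None return is what ends the traversal.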

02 How it works in practice

In a production pipeline, the crawler parses the first response payload (HTML or JSON) and applies a set of heuristics to find the "next" trigger. For HTML, this means evaluating CSS selectors for a "Next Page" button or link and extracting its href attribute. For JSON APIs, it involves scanning the payload for common keys such as hasNextPage, next_cursor, or links.next. The crawler then constructs the subsequent request, adding the required state token or computed offset, and repeats until the boundary condition is met.
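
The JSON side of that heuristic could be sketched roughly as follows. The key names come from the paragraph above, plus endCursor as used by GraphQL connections; the function name find_next and the tuple it returns are assumptions made for illustration, and real targets will need a longer key list.

```python
from typing import Any, Optional, Tuple

# Assumed cursor key names; extend per target.
CURSOR_KEYS = ("next_cursor", "endCursor")

def find_next(payload: Any) -> Optional[Tuple[str, str]]:
    """Depth-first search of decoded JSON for a pagination hint."""
    if isinstance(payload, dict):
        # An explicit hasNextPage: false ends the search in this object.
        if payload.get("hasNextPage") is False:
            return None
        for key in CURSOR_KEYS:
            if payload.get(key):
                return ("cursor", payload[key])
        links = payload.get("links")
        if isinstance(links, dict) and links.get("next"):
            return ("url", links["next"])
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return None
    for child in children:
        hit = find_next(child)
        if hit:
            return hit
    return None
```

Returning the kind of hint alongside its value lets the caller decide whether to append a cursor parameter or follow a ready-made URL.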

03 The infinite scroll illusion

Infinite scroll is a UI presentation, not a data structure. Under the hood, the browser is executing standard paginated API requests triggered by scroll events. Naive scraping scripts use headless browsers to inject JavaScript that scrolls the page down, waits for the DOM to mutate, and repeats. This is slow and wastes compute on rendering. Proper pagination detection bypasses the DOM: it captures the XHR or fetch requests the page issues on scroll (the same ones shown in the browser's network tab) and replicates those API calls directly.
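
Once the scroll-triggered request has been identified, replaying it reduces to a plain loop. The sketch below assumes a hypothetical fetch_page callable that wraps the captured HTTP call and a payload shaped like {"items": [...], "next_cursor": ...}; both are stand-ins, since the real request and response shape depend on the target.

```python
from typing import Any, Callable, Iterator, Optional

def traverse(fetch_page: Callable[[Optional[str]], dict],
             max_pages: int = 10_000) -> Iterator[Any]:
    """Yield every item by replaying the paginated API call directly."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard ceiling as a safety net
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            return
```

Injecting fetch_page keeps the traversal logic independent of the HTTP client, so the same loop can be exercised against a stub before it touches a live endpoint.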

04 How DataFlirt handles it

We decouple traversal logic from extraction logic. Our engine uses AST parsing to automatically detect the pagination schema of a target API during the scoping phase. Once the schema is mapped, our worker fleet executes the traversal at the network layer. We implement strict loop-detection hashing on every page payload to ensure we never burn proxy bandwidth on duplicate data, and we dynamically segment queries to bypass hard result limits on enterprise targets.

05 The infinite loop trap

A common anti-scraping tactic (or just poor backend engineering) is the infinite loop trap. When a crawler requests ?page=999 on a category that only has 10 pages, the server doesn't return a 404 or an empty array. Instead, it returns a 200 OK containing the exact same data as Page 1. If the crawler's pagination detection only looks for a 200 status code and a non-empty payload, it will crawl forever, racking up massive proxy bills and eventually triggering an IP ban.
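
One way to guard against this trap is to fingerprint the record IDs on each page and stop when a fingerprint repeats. The sketch below is an assumption-laden illustration, not DataFlirt's implementation: the names page_fingerprint and LoopDetector are hypothetical, and it remembers every earlier page rather than only the previous one, so that page 1 resurfacing at page 11 is caught on the first repeat.

```python
import hashlib
from typing import Iterable, Set

def page_fingerprint(record_ids: Iterable) -> str:
    """Order-insensitive SHA-256 digest of the record IDs on one page."""
    joined = "\n".join(sorted(str(r) for r in record_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

class LoopDetector:
    """Remembers the fingerprint of every page seen in this traversal."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_repeat(self, record_ids: Iterable) -> bool:
        fingerprint = page_fingerprint(record_ids)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Hashing IDs instead of the raw payload avoids false negatives when the server injects timestamps or tracking tokens into otherwise identical responses.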

// 03 — the traversal math

How deep is the dataset?

Estimating total crawl depth and identifying infinite loops before they burn proxy bandwidth. DataFlirt's scheduler uses these heuristics to allocate worker concurrency and detect traversal anomalies.

Expected Pages = ceil(total_records / page_size)
Standard offset heuristic. Targets often spoof the reported total to break naive crawlers, so always verify it against the actual payload.

Loop Detection Hash: H(page_n) == H(page_{n-1})
DataFlirt anomaly detection. If consecutive pages yield identical record hashes, the boundary is reached.

Cursor Entropy = -Σ_c P(c) · log2 P(c)
Information theory. P(c) is the relative frequency of character c in the cursor. High entropy suggests an encrypted state token rather than a sequential offset.
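
The page-count and entropy heuristics above can be worked through in a few lines. This sketch uses the standard Shannon form H = -Σ P(c) · log2 P(c) with P(c) taken as the per-character frequency within the token, which is an assumed reading of the formula; the function names are illustrative.

```python
import math
from collections import Counter

def expected_pages(total_records: int, page_size: int) -> int:
    """Ceiling division: a partial final page still costs one request."""
    return (total_records + page_size - 1) // page_size

def shannon_entropy(token: str) -> float:
    """Bits per character, computed from character frequencies in the token."""
    if not token:
        return 0.0
    n = len(token)
    return -sum((k / n) * math.log2(k / n) for k in Counter(token).values())

print(expected_pages(14_250, 50))   # 285
print(shannon_entropy("abcd"))      # 2.0
```

Short tokens give noisy entropy estimates, so the value is better treated as one signal among several than as a classifier on its own.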
// 04 — traversal trace

Intercepting a GraphQL cursor.

A live trace of a DataFlirt worker identifying and extracting a base64-encoded cursor from a headless e-commerce API, bypassing the DOM entirely.

GraphQL · Cursor-based · XHR Interception
edge.dataflirt.io — live
CAPTURED
// request page 1
GET /api/graphql?query=ProductList&first=50
status: 200 OK

// parse response payload
pageInfo.hasNextPage: true
pageInfo.endCursor: "YXJyYXljb25uZWN0aW9uOjQ5"

// validate cursor
cursor.decoded: "arrayconnection:49"
cursor.type: "opaque_offset"

// request page 2
GET /api/graphql?query=ProductList&first=50&after=YXJyYXljb25uZWN0aW9uOjQ5
status: 200 OK
records.extracted: 50

// boundary detection (page 14)
pageInfo.hasNextPage: false
pipeline.status: traversal complete
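
The cursor validation step in the trace can be reproduced with the standard library. In this sketch the label "opaque_offset" is taken from the trace, while the fallback label "opaque", the function names, and the prefix:number test are assumptions for illustration.

```python
import base64
import binascii
from typing import Optional

def decode_cursor(cursor: str) -> Optional[str]:
    """Return the UTF-8 text inside a base64 cursor, or None if it does not decode."""
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    try:
        return base64.b64decode(padded, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None

def classify_cursor(cursor: str) -> str:
    """Label cursors that are just an encoded 'prefix:number' offset."""
    decoded = decode_cursor(cursor)
    if decoded is None:
        return "opaque"
    prefix, _, tail = decoded.rpartition(":")
    return "opaque_offset" if prefix and tail.isdigit() else "opaque"

print(decode_cursor("YXJyYXljb25uZWN0aW9uOjQ5"))    # arrayconnection:49
print(classify_cursor("YXJyYXljb25uZWN0aW9uOjQ5"))  # opaque_offset
```

A cursor that decodes to a plain offset can in principle be synthesised for any position, whereas a truly opaque one has to be obtained from the previous response.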
// 05 — failure modes

Why traversal breaks down.

The most common reasons crawlers fail to extract the complete dataset, ranked by occurrence across DataFlirt's monitoring fleet.

PIPELINES MONITORED: 850+ active
TRAVERSAL ERRORS: per 10k pages
UPDATED: 2026-05-19
01 Opaque cursor expiration
   state loss · Token expires before the next request is dispatched

02 Infinite loop trap
   bandwidth burn · Site returns page 1 data for out-of-bounds requests

03 Hard limit caps
   data loss · Target artificially caps results at 1000 items

04 Dynamic page sizes
   offset drift · Inconsistent record counts per page break offset math

05 DOM selector rot
   breakage · Next button class changes, breaking UI-driven crawlers
// 06 — DataFlirt's engine

Stop clicking next, start intercepting the state.

Relying on DOM interactions to trigger pagination is slow, expensive, and fragile. DataFlirt's extraction engine intercepts the underlying XHR requests, identifies the pagination schema—whether it's an offset, a cursor, or a session-bound token—and reconstructs the API calls directly. This shifts traversal from the presentation layer to the network layer, reducing compute costs by 80% and eliminating DOM-related breakage.

Traversal state monitor

Live state of a cursor-based extraction job on a B2B directory.

job.id traverse-b2b-099
pagination.type graphql_cursor
records.extracted 14,250
pages.traversed 285
cursor.entropy high · encrypted state
loop.detected false
pipeline.health nominal


// 07 — FAQ

Common questions.

About pagination schemas, infinite loops, hard limits, and how DataFlirt ensures complete dataset extraction.

What is the difference between offset and cursor pagination?
Offset pagination uses simple skip/limit arithmetic (e.g., ?offset=100&limit=50). It's easy to crawl but suffers from data drift if records are added or deleted during the crawl, because every later offset shifts. Cursor pagination uses a pointer to a specific record (e.g., ?after=token123). Each request resumes from a fixed position rather than a shifting count, so it avoids duplicates and skipped records as long as the sort key is stable and unique, but the crawler must read the token from each response before it can request the next page.
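
The drift described in that answer is easy to reproduce in memory. The following sketch is a toy simulation under stated assumptions: the table is a list of ascending integer IDs, a single delete lands after the first page is fetched, and the function names are hypothetical.

```python
from typing import Callable, List

def crawl_offset(table: List[int], page_size: int,
                 after_first_page: Callable[[List[int]], None]) -> List[int]:
    """Walk the table with ?offset=N style slicing."""
    seen: List[int] = []
    offset = 0
    while True:
        page = table[offset:offset + page_size]
        if not page:
            return seen
        seen.extend(page)
        if offset == 0:
            after_first_page(table)  # simulate a write landing mid-crawl
        offset += page_size

def crawl_keyset(table: List[int], page_size: int,
                 after_first_page: Callable[[List[int]], None]) -> List[int]:
    """Walk the table with ?after=<last id> style filtering (ids ascending)."""
    seen: List[int] = []
    last_id = None
    while True:
        page = [r for r in table if last_id is None or r > last_id][:page_size]
        if not page:
            return seen
        seen.extend(page)
        if last_id is None:
            after_first_page(table)
        last_id = page[-1]

def delete_record_2(table: List[int]) -> None:
    table.remove(2)

offset_result = crawl_offset(list(range(1, 11)), 3, delete_record_2)
keyset_result = crawl_keyset(list(range(1, 11)), 3, delete_record_2)
print(offset_result)  # record 4 is never returned
print(keyset_result)
```

Deleting record 2 shifts every later row up by one position, so the offset crawl's second page starts at record 5 and record 4 is silently lost; the keyset crawl resumes from "id greater than 3" and is unaffected.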
How do you detect an infinite pagination loop?
Naive crawlers look for a 404 status code or an empty array. Modern targets often return a 200 OK with Page 1's data when you request Page 999. We detect this by hashing the extracted record IDs on each page; if Hash(Page N) == Hash(Page N-1), we've hit the boundary and terminate the loop.
How do you bypass a hard limit of 1,000 results?
Many platforms (like Elasticsearch-backed catalogs) cap pagination at 1,000 or 10,000 results to protect their database. To extract a 50,000-item category, we dynamically segment the search space. We apply granular filters (e.g., price ranges, specific brands, or date bounds) to create sub-queries that each return fewer than 1,000 results, then stitch them together.
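
A minimal sketch of that segmentation, assuming a single integer filter such as price and a hypothetical count(lo, hi) callable that reports how many results a sub-query would return (many search APIs expose a total-hits figure that could serve this role). It is an illustration of the bisection idea, not DataFlirt's segmenter.

```python
from typing import Callable, List, Tuple

def segment_range(lo: int, hi: int,
                  count: Callable[[int, int], int],
                  cap: int = 1000) -> List[Tuple[int, int]]:
    """Split [lo, hi) into sub-ranges that each return fewer than cap results."""
    n = count(lo, hi)
    if n == 0:
        return []
    if n < cap or hi - lo <= 1:
        # Small enough, or a single value that cannot be split further;
        # the latter would need a second filter dimension (brand, date, ...).
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return segment_range(lo, mid, count, cap) + segment_range(mid, hi, count, cap)
```

Each returned range becomes its own paginated sub-query, and because the ranges are disjoint the results can be concatenated without deduplication on this dimension.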
How does DataFlirt handle infinite scroll?
We don't use headless browsers to simulate scrolling—that's computationally wasteful. Infinite scroll is just a UI pattern wrapped around an API. We intercept the network traffic, identify the XHR request triggered by the scroll event, extract the pagination parameters, and replicate the API calls directly at the network layer.
Is it legal to bypass UI pagination limits?
Accessing publicly available data via the underlying API rather than the UI is generally considered lawful, provided you aren't bypassing authentication or breaching rate limits. The API is the server's chosen method of delivering the data. However, aggressively segmenting queries to dump an entire database can trigger ToS violations or anti-bot blocks if not paced correctly.
Can you parallelise cursor-based pagination?
Strict cursor pagination is inherently sequential—you need the token from Page 1 to request Page 2. However, DataFlirt parallelises the extraction by discovering entry points. We crawl the category tree or sitemap to find hundreds of distinct starting nodes, then run sequential cursor traversal on each node concurrently across our worker fleet.
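
The entry-point approach might be sketched as below: each node is walked sequentially, and the nodes themselves are spread across a thread pool. The fetch_page(node, cursor) callable and the {"items", "next_cursor"} payload shape are hypothetical placeholders for the real request, and a production version would add retries, pacing, and a page ceiling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional

FetchPage = Callable[[str, Optional[str]], dict]

def crawl_node(node: str, fetch_page: FetchPage) -> List[Any]:
    """Sequential cursor walk within a single entry point."""
    records: List[Any] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(node, cursor)
        records.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:
            return records

def crawl_all(nodes: List[str], fetch_page: FetchPage,
              workers: int = 8) -> Dict[str, List[Any]]:
    """Run one sequential walk per node, with nodes processed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda n: crawl_node(n, fetch_page), nodes)
        return dict(zip(nodes, results))
```

Threads suit this workload because each walk spends most of its time waiting on network I/O; the per-node ordering constraint is preserved while total wall-clock time scales with the number of entry points.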