
What are Paginated Listing Pages?

Paginated listing pages are structural patterns used by e-commerce sites, directories, and search engines to split large datasets across multiple sequential views. Whether implemented via URL parameters, offset limits, or opaque cursors, they dictate how a crawler traverses a catalog. Mishandling pagination logic doesn't just slow down your pipeline — it guarantees duplicate records, infinite loops, and incomplete datasets.

Site Structure · Traversal · Cursor · Offset · Crawl Depth
// 02 — definitions

Traversing
the catalog.

How sites chunk data to save database load, and how crawlers reverse-engineer those chunks to reconstruct the full dataset.


TL;DR

Pagination splits a large result set into manageable pages. Crawlers must identify the 'next page' mechanism — URL offsets, API cursors, or 'Load More' buttons — and traverse it until the end condition is met. The primary challenge at scale is parallelizing this traversal without missing records due to real-time inventory shifts.

01 Definition & structure

Paginated listing pages are the standard method for delivering large collections of items — products, articles, or search results — over HTTP. Instead of returning 50,000 records in a single 100MB payload, the server returns them in chunks of 20 to 100.

There are three primary structural patterns:

  • Offset / Page Number: ?page=2 or ?offset=100. Easy to parallelize, but susceptible to data shifts.
  • Cursor / Keyset: ?after=eyJpZCI6NDV9. A pointer to the last seen record. Prevents data shifts but forces sequential crawling.
  • Infinite Scroll: A UI pattern that triggers offset or cursor API calls via JavaScript as the user scrolls down.
02 How it works in practice

A crawler must first identify the pagination mechanism. For HTML pages, this means finding the CSS selector for the rel="next" link or the "Next Page" button. For APIs, it means extracting the next_cursor token from the JSON response.

The crawler then enters a loop: fetch page, extract items, extract next token, construct next URL, repeat. The loop terminates when the next token is null, the "Next" button is absent, or the extracted item count is zero. At scale, this loop is managed by a distributed queue to handle retries and proxy rotations.
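The loop described above can be sketched as follows. This is a minimal illustration, not DataFlirt's implementation: the response shape (`items`, `next_cursor`), the `fetch_page` callable, and the in-memory `FAKE_API` are all assumptions standing in for a real HTTP fetcher.

```python
from typing import Callable, Iterator, Optional

# Hypothetical response shape: {"items": [...], "next_cursor": str or None}
Page = dict


def crawl(fetch_page: Callable[[Optional[str]], Page],
          max_pages: int = 1000) -> Iterator[dict]:
    """Yield items page by page until an end condition is met."""
    cursor: Optional[str] = None
    for _ in range(max_pages):        # depth guard against runaway loops
        page = fetch_page(cursor)
        items = page.get("items", [])
        if not items:                 # end condition: extracted item count is zero
            return
        yield from items
        cursor = page.get("next_cursor")
        if cursor is None:            # end condition: next token is null
            return


# In-memory stand-in for a paginated API (assumption, for illustration only).
FAKE_API = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "c1"},
    "c1": {"items": [{"id": 3}], "next_cursor": "c2"},
    "c2": {"items": [], "next_cursor": "c3"},   # empty page with a live cursor
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_API[cursor]


collected = list(crawl(fake_fetch))
```

Checking for an empty page before reading the next token matters here: the third fake page still advertises a cursor, and following it would never end.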

03 The shifting inventory problem

When scraping a highly active target (like a real estate portal or a fast-fashion site) using offset pagination, the underlying database is constantly changing. If a new listing is published while you are on Page 2, everything shifts down. The last item on Page 2 becomes the first item on Page 3.

When your crawler hits Page 3, it extracts that item again. Conversely, if an item is deleted, records shift up, and your crawler will skip an item entirely. This is why robust pipelines never rely on pagination for data integrity: they enforce strict primary-key deduplication downstream. Deduplication only removes the repeats, though; skipped items have to be recovered by a follow-up pass.
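A small simulation makes the shift concrete. It is a sketch under assumed conditions: a newest-first catalog held in a Python list, an `id` field as the primary key, and a hypothetical `dedupe_by_key` helper playing the role of the downstream sink.

```python
def paginate(catalog, page, page_size):
    """Offset pagination over a newest-first list."""
    start = (page - 1) * page_size
    return catalog[start:start + page_size]


def dedupe_by_key(records, key="id"):
    """Keep the first record seen for each primary key."""
    unique = {}
    for record in records:
        unique.setdefault(record[key], record)
    return list(unique.values())


catalog = [{"id": i} for i in range(10, 0, -1)]   # ids 10..1, newest first
page_size = 4

seen = []
seen += paginate(catalog, 1, page_size)   # ids 10, 9, 8, 7
seen += paginate(catalog, 2, page_size)   # ids 6, 5, 4, 3
catalog.insert(0, {"id": 11})             # new listing published mid-crawl
seen += paginate(catalog, 3, page_size)   # ids 3, 2, 1 (id 3 appears twice)

deduped = dedupe_by_key(seen)
```

The crawl collects 11 records for 10 distinct ids, and the sink collapses them back to 10. Note that id 11 is never seen at all, because it landed on a page the crawler had already passed.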

04 How DataFlirt handles it

We treat pagination as a discovery mechanism, not a data contract. For offset targets, our scheduler generates the full URL queue upfront and fetches them concurrently across our residential proxy pool. For cursor targets, we attempt to decode the cursor to parallelize it; if opaque, we shard the category using secondary filters (like price bands or brands) to create dozens of smaller, parallel sequential crawls.
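One way to picture the price-band sharding is a recursive split that halves a band until each piece fits under the target's depth limit. This is only a sketch: `count` is a hypothetical callable returning the result count for a half-open price band `[lo, hi)`, and the 10,000 limit and synthetic prices are assumptions for the demo.

```python
from typing import Callable, List, Tuple


def shard_price_range(count: Callable[[int, int], int], lo: int, hi: int,
                      limit: int = 10_000) -> List[Tuple[int, int]]:
    """Split [lo, hi) into bands whose result count is at most `limit`."""
    # A band of width 1 cannot be split further; it may still exceed the
    # limit and would then need a second filter (for example, brand).
    if count(lo, hi) <= limit or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (shard_price_range(count, lo, mid, limit)
            + shard_price_range(count, mid, hi, limit))


# Synthetic catalog: 45,000 items spread evenly over prices 0..999.
prices = [i % 1000 for i in range(45_000)]


def count_in_band(lo: int, hi: int) -> int:
    return sum(lo <= p < hi for p in prices)


bands = shard_price_range(count_in_band, 0, 1000)
band_counts = [count_in_band(lo, hi) for lo, hi in bands]
```

Half-open bands keep an item priced exactly on a boundary from being fetched by two shards. Each resulting band can then be crawled as its own sequential cursor walk, in parallel with the others.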

Every pipeline includes automated depth-limit detection and infinite-loop circuit breakers, ensuring we never waste compute on broken target logic.

05 Did you know?

Many major e-commerce platforms intentionally spoof the "Total Results" count on Page 1 for SEO purposes. A category might claim "145,000 results found", but if you try to paginate past Page 100, the server returns a 404 or loops back to Page 1. This is a hard limit imposed by their search index (usually Elasticsearch or Solr) to prevent deep-paging memory exhaustion; in Elasticsearch, the index.max_result_window setting defaults to 10,000. The only way to get all 145,000 items is to slice the category into smaller, highly specific sub-queries.

// 03 — traversal math

Calculating
crawl depth.

Pagination dictates how many requests are required to extract a category. DataFlirt uses these models to pre-allocate concurrency budgets before a crawl begins.

Total Pages (Offset): P = ceil(total_results / page_size)
Basic offset math. Fails if the target spoofs total_results for SEO. (Standard SQL-backed pagination)

Offset Calculation: O = (page_number − 1) × page_size
Used to generate concurrent URL queues for offset-based targets. (DataFlirt URL generator)

Sequential Crawl Time: T = P × latency_per_page
Cursor pagination allows only one request in flight per crawl, so T cannot be reduced by adding workers unless the crawl is sharded. (Pipeline execution model)
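A worked example of the three formulas, using assumed figures: 45,000 results, 100 items per page, and 0.5 seconds of latency per page. The URL pattern on `example.com` is a placeholder, not a real target.

```python
import math


def total_pages(total_results: int, page_size: int) -> int:
    """P = ceil(total_results / page_size)"""
    return math.ceil(total_results / page_size)


def offset_for(page_number: int, page_size: int) -> int:
    """O = (page_number - 1) * page_size"""
    return (page_number - 1) * page_size


def sequential_crawl_seconds(pages: int, latency_per_page: float) -> float:
    """T = P * latency_per_page"""
    return pages * latency_per_page


PAGE_SIZE = 100
pages = total_pages(45_000, PAGE_SIZE)                        # 450 pages
offsets = [offset_for(p, PAGE_SIZE) for p in range(1, pages + 1)]
# Placeholder URL pattern; a real target defines its own parameters.
urls = [f"https://example.com/listing?offset={o}&limit={PAGE_SIZE}"
        for o in offsets]
seconds = sequential_crawl_seconds(pages, 0.5)                # 225.0 seconds
```

With offsets known upfront, all 450 URLs can be queued at once; the sequential figure applies only when each request depends on the previous response.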
// 04 — pagination trace

Navigating a cursor
and hitting a limit.

A live trace of a crawler navigating a cursor-based API endpoint. Notice how the sequential nature of cursors forces synchronous fetching, and how the target enforces a hard depth limit.

Cursor Pagination · API Scraping · Depth Limit
edge.dataflirt.io — live
CAPTURED
// init category crawl: laptops
target: "https://api.target.com/v1/products?category=laptops"
pagination.type: "cursor"

// page 1
fetch: "?limit=100"
items_extracted: 100
next_cursor: "eyJvZmZzZXQiOjEwMH0=" // valid

// page 2
fetch: "?limit=100&cursor=eyJvZmZzZXQiOjEwMH0="
items_extracted: 100
next_cursor: "eyJvZmZzZXQiOjIwMH0=" // valid

// ... skipping to page 101
fetch: "?limit=100&cursor=eyJvZmZzZXQiOjEwMDAwfQ=="
response: 400 Bad Request
error.message: "Result window is too large, limit is 10000"

// recovery strategy
action: sharding category by price range
status: restarted with 5 sub-queries
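The cursors in this trace are not actually opaque: they are base64-encoded JSON offsets. The sketch below decodes one and regenerates the rest without fetching the intervening pages. It assumes the target accepts re-encoded cursors with exactly this compact JSON layout; signed or encrypted cursors would be rejected.

```python
import base64
import json


def decode_cursor(token: str) -> dict:
    """Decode a base64-encoded JSON cursor."""
    return json.loads(base64.b64decode(token))


def encode_cursor(payload: dict) -> str:
    """Re-encode a payload using compact JSON, matching the trace's format."""
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return base64.b64encode(raw).decode()


first = decode_cursor("eyJvZmZzZXQiOjEwMH0=")          # page 1's next_cursor
blocked = decode_cursor("eyJvZmZzZXQiOjEwMDAwfQ==")    # the cursor that got a 400

# Every cursor below the 10,000-result window, generated without crawling.
LIMIT = 100
WINDOW = 10_000
cursors = [encode_cursor({"offset": o}) for o in range(LIMIT, WINDOW, LIMIT)]
```

Decoding shows why page 101 failed: its cursor requests offset 10,000, which already sits at the edge of the window. The generated cursors cover pages 2 through 100 and can be dispatched concurrently.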
// 05 — failure modes

Where pagination
breaks pipelines.

Ranked by frequency of occurrence across DataFlirt's active pipelines. Pagination seems simple until you hit edge cases designed to protect backend databases or trap naive bots.

Pipelines monitored: 300+ active
Pagination fails: 14% of total errors
Updated: 2026-05-19
01 Hard depth limits
Elasticsearch 10k limit · Target refuses to serve pages beyond a certain offset

02 Infinite loop traps
Circular canonicals · Page 50 links back to Page 50, trapping the crawler

03 Shifting inventory
Duplicate records · Items pushed to the next page while crawling

04 Stale cursors
Token expiration · Cursor expires before the next request is made

05 Missing 'Next' selectors
A/B test breakage · UI changes break the CSS selector for the next button
// 06 — our architecture

Parallelizing the sequential,
breaking the cursor bottleneck.

Offset pagination is easy to parallelize: you divide the total count by page size and dispatch 100 concurrent workers. Cursor pagination, however, is inherently sequential — you need page 1's response to get page 2's cursor. DataFlirt bypasses this bottleneck by reverse-engineering the cursor payload (often just base64-encoded timestamps or IDs) or using secondary sort filters to create artificial shards, turning a 4-hour sequential crawl into a 3-minute parallel extraction.

pagination.strategy

DataFlirt's automated traversal config for a major e-commerce target.

target.type GraphQL API
pagination.mode opaque_cursor
cursor.decoded base64(timestamp) · predictable
sharding.active true · split by category+brand
concurrency 64 workers
duplicate.rate 0.02% · deduped at sink
pipeline.status active · 99.9% coverage


// 07 — FAQ

Common
questions.

About offset vs cursor pagination, infinite loops, handling shifting inventory, and how DataFlirt scales traversal.

What is the difference between offset and cursor pagination?
Offset pagination uses a page number or skip count (e.g., ?offset=100). It's easy to parallelize but suffers from performance degradation on deep pages and data shifts. Cursor pagination uses a pointer to a specific record (e.g., ?after=token123). It's highly performant for the database and prevents data shifts, but forces crawlers to fetch sequentially.
How do you handle items shifting between pages during a crawl?
If a new item is added to Page 1 while you are crawling Page 3, an older item gets pushed to Page 4. When you reach Page 4, you'll scrape that item again. We handle this by enforcing strict primary-key deduplication at the delivery sink. We never rely on the target's pagination stability for data uniqueness.
What happens when a site caps pagination at 10,000 results?
This is the classic Elasticsearch index.max_result_window limit, which defaults to 10,000. If a category has 45,000 items, you can only reach the first 10,000. DataFlirt bypasses this by dynamically sharding the request: we apply filters (e.g., price ranges $0-$10, $10-$20) to ensure no single sub-query exceeds the 10,000 item limit, allowing full extraction.
How does DataFlirt detect infinite pagination loops?
Some sites have broken logic where Page 100's "Next" button links to Page 100, or the API returns an empty list but a valid next cursor. Our scheduler tracks the hash of the extracted payload and the URL state. If the payload hash repeats or the URL state stagnates for three consecutive requests, the worker aborts and flags the pipeline for review.
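The check described in this answer could look roughly like the sketch below. The `LoopBreaker` class and its field names are hypothetical, and "three consecutive requests" is read here as the same URL or the same payload hash appearing three times in a row; the production scheduler may define it differently.

```python
import hashlib
import json


class LoopBreaker:
    """Abort when the URL or the payload hash stays the same N times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_hash = None
        self.last_url = None
        self.hash_repeats = 0
        self.url_repeats = 0

    def should_abort(self, url: str, payload) -> bool:
        # sort_keys makes the hash independent of dict key order.
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        self.hash_repeats = self.hash_repeats + 1 if digest == self.last_hash else 1
        self.url_repeats = self.url_repeats + 1 if url == self.last_url else 1
        self.last_hash, self.last_url = digest, url
        return (self.hash_repeats >= self.threshold
                or self.url_repeats >= self.threshold)


# An API that keeps returning an empty list with a fresh cursor each time.
breaker = LoopBreaker()
verdicts = [breaker.should_abort(f"/products?cursor=c{i}", []) for i in range(3)]
```

In the demo the URL changes on every request, so only the payload hash catches the loop, and the third identical empty page trips the breaker.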
Do you need a headless browser for 'Load More' buttons?
Rarely. "Load More" buttons or infinite scroll events almost always trigger a background XHR/fetch request to an API endpoint returning JSON or HTML fragments. We intercept that API call during the scoping phase and replicate it directly in our HTTP fetchers. Using Playwright just to click a button is a massive waste of compute.
Is scraping paginated public data legal?
Yes, accessing publicly available, indexable data via pagination is generally lawful, supported by precedents like hiQ v. LinkedIn. Pagination is just a structural delivery mechanism. We respect robots.txt directives, adhere to rate limits, and never bypass authentication to access paginated endpoints.