
What is API Response Pagination?

API response pagination is the mechanism servers use to divide large datasets into manageable, sequential chunks across multiple HTTP requests. For scraping pipelines, it dictates how you traverse an endpoint to extract a complete dataset without triggering rate limits or memory exhaustion. Handling pagination incorrectly leads to duplicate records, infinite loops, or silent data loss when the underlying dataset shifts during the crawl.

Network Layer · Cursor · Offset · Data Completeness · API Scraping
// 02 — definitions

Traversing the data sequence.

The structural patterns APIs use to serve millions of records without crashing, and how scrapers must adapt to capture every row.


TL;DR

Pagination splits massive API payloads into discrete pages using offsets, cursors, or page tokens. It's a fundamental constraint in data extraction. The difference between a fragile script and a production pipeline is how it handles cursor expiration, rate limits, and dataset mutations while traversing thousands of pages.

01 Definition & structure
API response pagination is a design pattern that restricts the size of the data returned in a single HTTP response. Instead of returning a million records at once, which could overload the server and time out the client, the API returns a subset (a "page") along with metadata describing how to fetch the next subset. For data pipelines, mastering pagination means writing logic that reliably follows these pointers until the dataset is exhausted.
02 Offset vs. Cursor vs. Page Token
There are three dominant paradigms:
  • Offset: Uses ?limit=100&offset=200. Easy to parallelize, but vulnerable to data shifting (if a record is deleted, everything shifts up, causing you to skip a row).
  • Cursor: Uses a unique identifier ?after_id=994. Highly stable and performant, but strictly sequential.
  • Page Token: Uses an opaque string ?token=xyz123. Often stateful on the server side, meaning it can expire if you wait too long between requests.
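As a rough illustration of the offset and cursor paradigms, here is a minimal Python sketch. The page shape (`items`, `next_cursor`), the `fetch_page` callables, and the in-memory fake APIs are all assumptions made for the example; a real endpoint will name these fields differently, and a page-token API would follow the same loop as the cursor version.

```python
from typing import Callable, Iterator, List, Optional

def iter_offset(fetch_page: Callable[[int, int], List[dict]], limit: int = 100) -> Iterator[dict]:
    """Walk an offset-paginated source until an empty page comes back."""
    offset = 0
    while True:
        items = fetch_page(offset, limit)
        if not items:
            break
        yield from items
        # Advance by what was actually returned, in case the server caps page size.
        offset += len(items)

def iter_cursor(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Walk a cursor-paginated source until next_cursor is None."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# In-memory stand-ins for a real API (hypothetical, for demonstration only).
DATA = [{"id": i} for i in range(1, 251)]

def fake_offset_api(offset: int, limit: int) -> List[dict]:
    return DATA[offset:offset + limit]

def fake_cursor_api(cursor: Optional[str]) -> dict:
    start = int(cursor) if cursor else 0
    end = start + 100
    return {"items": DATA[start:end], "next_cursor": str(end) if end < len(DATA) else None}

offset_rows = list(iter_offset(fake_offset_api))
cursor_rows = list(iter_cursor(fake_cursor_api))
```

Both iterators take the fetch function as a parameter, so the traversal logic can be exercised without any network access.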
03 The data mutation problem
The biggest silent failure in API scraping is dataset mutation during an offset-based crawl. If you are scraping an active e-commerce catalog and a product is added to page 1 while you are fetching page 5, all subsequent records shift down. When you request page 6, you will ingest a duplicate of the last item from page 5. Conversely, if an item is deleted, you will silently skip a record. This is why cursor-based pagination is heavily preferred for data integrity.
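The effect is easy to reproduce on a toy dataset. The sketch below simulates an offset-based crawl over a Python list that is mutated after the first page; the `crawl_with_offsets` helper and the ten-item catalog are invented for illustration.

```python
def crawl_with_offsets(catalog, limit, mutate=None, after_page=1):
    """Crawl a list with offset pagination, optionally mutating it mid-crawl."""
    seen = []
    offset = 0
    page_no = 0
    while True:
        page = catalog[offset:offset + limit]
        if not page:
            break
        seen.extend(page)
        page_no += 1
        offset += limit
        if mutate is not None and page_no == after_page:
            mutate(catalog)  # the dataset changes between two requests
    return seen

# A new item is inserted at the front after page 1: the last item of page 1
# shifts onto page 2 and is ingested twice.
with_insert = crawl_with_offsets(list(range(1, 11)), 5, mutate=lambda c: c.insert(0, 0))

# The first item is deleted after page 1: everything shifts up and the item
# that should have opened page 2 is never seen.
with_delete = crawl_with_offsets(list(range(1, 11)), 5, mutate=lambda c: c.pop(0))
```

In the insert case item 5 appears twice; in the delete case item 6 never appears, and nothing in the responses signals either problem.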
04 How DataFlirt handles it
We treat pagination as a critical state machine. Our workers persist their exact position to a Redis cluster after every successful page extraction. If a proxy gets banned or the target API goes down for maintenance, the job suspends and resumes later from the exact cursor or offset. We also deploy automated loop detection to catch APIs that improperly return the final page infinitely, ensuring pipelines terminate cleanly.
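A checkpoint-and-resume loop along these lines might look like the following sketch. It is not DataFlirt's implementation: a plain dict stands in for Redis, and the key format, state fields, and `fetch_page` signature are assumptions made for the example.

```python
import json
from typing import Callable, List, MutableMapping, Optional

def run_job(
    job_id: str,
    fetch_page: Callable[[Optional[str]], dict],
    store: MutableMapping[str, str],  # stand-in for a Redis-like key-value store
    sink: List,
) -> None:
    """Traverse a cursor API, checkpointing after every successful page."""
    key = f"pagination:{job_id}"  # hypothetical key format
    state = json.loads(store[key]) if key in store else {"cursor": None, "done": False}
    while not state["done"]:
        page = fetch_page(state["cursor"])  # may raise; the last checkpoint survives
        sink.extend(page["items"])
        state["cursor"] = page["next_cursor"]
        state["done"] = state["cursor"] is None
        store[key] = json.dumps(state)

# Demo: the third request fails once, then the job is resumed.
DATA = list(range(10))
calls = {"n": 0}

def flaky_fetch(cursor: Optional[str]) -> dict:
    calls["n"] += 1
    if calls["n"] == 3:
        raise ConnectionError("simulated 502")
    start = int(cursor or 0)
    end = start + 3
    return {"items": DATA[start:end], "next_cursor": str(end) if end < len(DATA) else None}

store: dict = {}
sink: list = []
try:
    run_job("demo", flaky_fetch, store, sink)
except ConnectionError:
    pass  # the worker "dies" here; the checkpoint is still in the store
run_job("demo", flaky_fetch, store, sink)  # resumes from the saved cursor
```

Note that a crash between writing to the sink and writing the checkpoint would replay one page, so a design like this gives at-least-once delivery and the sink should deduplicate on a record key.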
05 Did you know?
Many modern APIs impose a hard limit on offset depth (commonly 10,000 records) to protect their database from expensive deep-paging queries. Requests past the limit, such as ?offset=10001, are typically rejected with an error (often a 400). To extract datasets larger than this limit, scrapers must slice their queries dynamically using filters (like date ranges or price brackets) so that no single query matches more than 10,000 records.
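One way to build such slices is to bisect a date range until every piece fits under the cap. The sketch below assumes the API can report how many records match a date filter; `count_in_range` is a hypothetical callable standing in for that lookup, and the uniform 1,000-records-per-day dataset is invented for the demo.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

DateRange = Tuple[date, date]  # inclusive on both ends

def slice_ranges(
    start: date,
    end: date,
    count_in_range: Callable[[date, date], int],
    cap: int = 10_000,
) -> List[DateRange]:
    """Split [start, end] into date ranges that each match at most `cap` records."""
    # A single day cannot be split further by date; if it still exceeds the cap,
    # a second filter dimension (price, category, ...) would be needed.
    if start == end or count_in_range(start, end) <= cap:
        return [(start, end)]
    mid = start + timedelta(days=(end - start).days // 2)
    return (
        slice_ranges(start, mid, count_in_range, cap)
        + slice_ranges(mid + timedelta(days=1), end, count_in_range, cap)
    )

def fake_count(start: date, end: date) -> int:
    """Pretend the catalog gains exactly 1,000 records per day."""
    return ((end - start).days + 1) * 1_000

slices = slice_ranges(date(2024, 1, 1), date(2024, 1, 31), fake_count)
```

Each resulting range can then be paginated fully on its own, since none of them reaches the offset ceiling.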
// 03 — the math

Calculating the traversal depth.

Understanding the bounds of a paginated endpoint is critical for capacity planning. DataFlirt uses these models to allocate worker concurrency and estimate pipeline completion times.

Offset calculation: Offset = (Page − 1) × Limit
The standard SQL-backed pagination formula. Vulnerable to data shifts. (Source: standard REST convention)
Total requests required: Reqs = ⌈Total_Records / Page_Size⌉
Determines the minimum number of HTTP calls to exhaust the endpoint. (Source: pipeline capacity model)
DataFlirt parallelization factor: Workers = min(Max_Concurrency, Total_Records / Chunk_Size)
How we slice offset-based endpoints to reduce a 10-hour crawl to 15 minutes. (Source: internal scheduler logic)
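The three formulas translate directly into code. The sketch below is a worked example with illustrative numbers; rounding the worker count up to a whole number is an assumption on our part, since the formula as stated can produce a fraction.

```python
def offset_for(page: int, limit: int) -> int:
    """Offset = (Page - 1) x Limit, with pages numbered from 1."""
    return (page - 1) * limit

def requests_required(total_records: int, page_size: int) -> int:
    """Reqs = ceil(Total_Records / Page_Size), using integer arithmetic."""
    return (total_records + page_size - 1) // page_size

def workers_for(max_concurrency: int, total_records: int, chunk_size: int) -> int:
    """Workers = min(Max_Concurrency, Total_Records / Chunk_Size), rounded up (assumption)."""
    return min(max_concurrency, requests_required(total_records, chunk_size))

page_3_offset = offset_for(page=3, limit=100)           # matches ?limit=100&offset=200
calls_for_50k = requests_required(50_000, 1_000)        # 50 requests at 1,000 per page
calls_for_50k_plus_1 = requests_required(50_001, 1_000) # a partial last page costs one more call
workers = workers_for(32, 500_000, 10_000)              # 50 chunks, capped at 32 workers
```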
// 04 — pipeline execution

Following the cursor through 50,000 records.

A live trace of a DataFlirt worker traversing a cursor-based API. Notice the state persistence and the handling of a mid-crawl rate limit.

Cursor-based · Stateful resume · Rate limit backoff
edge.dataflirt.io — live
CAPTURED
// init traversal
target.endpoint: "https://api.target.com/v3/catalog"
pagination.type: "cursor"

// page 1
GET "?limit=1000"
status: 200 OK records: 1000
next_cursor: "eyJpZCI6MTAwMH0="
state.checkpoint: saved to redis

// page 42
GET "?limit=1000&cursor=eyJpZCI6NDIwMDB9"
status: 429 Too Many Requests
retry_after: 15s
worker.action: sleeping 15000ms

// page 42 (retry)
GET "?limit=1000&cursor=eyJpZCI6NDIwMDB9"
status: 200 OK records: 1000

// page 50 (final)
next_cursor: null
pipeline.status: exhausted · 50,000 records extracted
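The rate-limit handling in the trace above (a 429, a 15-second wait, then a retry of the same cursor) could be sketched as follows. The `send` callable and its `(status, headers, body)` return shape are assumptions made so the example runs without a network; only the delta-seconds form of `Retry-After` is handled, and the header lookup assumes that exact key casing.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical transport signature: send(url) -> (status, headers, body).
Response = Tuple[int, Mapping[str, str], dict]

def get_with_backoff(
    send: Callable[[str], Response],
    url: str,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Retry the same URL while the server answers 429, honoring Retry-After."""
    for attempt in range(max_retries + 1):
        status, headers, body = send(url)
        if status != 429:
            return status, body  # other statuses are left to the caller
        if attempt < max_retries:
            retry_after = headers.get("Retry-After", "")
            # Fall back to exponential backoff when the header is missing
            # or is not a plain number of seconds.
            wait = float(retry_after) if retry_after.isdigit() else float(2 ** attempt)
            sleep(wait)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")

# Demo: one 429 with Retry-After: 15, then success. Sleeps are recorded, not performed.
responses = iter([
    (429, {"Retry-After": "15"}, {}),
    (200, {}, {"records": 1000}),
])
waits: list = []
status, body = get_with_backoff(
    lambda url: next(responses),
    "?limit=1000&cursor=eyJpZCI6NDIwMDB9",
    sleep=waits.append,
)
```

Because the cursor is part of the URL and is not advanced until a page succeeds, the retry requests exactly the same page.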
// 05 — failure modes

Where pagination breaks down.

Ranked by frequency across DataFlirt's API extraction pipelines. Pagination seems simple until you hit scale, at which point state management and edge cases dominate.

Pipelines monitored: 412 active
Pagination type: API endpoints
Updated: 2026-05-19
01 Dataset mutation during crawl
duplicates/misses · Records shift across page boundaries while scraping
02 Cursor expiration / timeout
state loss · Taking too long between requests invalidates the token
03 Infinite pagination loops
logic error · API returns the same cursor or last page repeatedly
04 Hard offset limits
max 10,000 · Elasticsearch/Solr rejecting deep offset queries
05 Inconsistent page sizes
validation fail · API returns fewer records than limit despite more existing
// 06 — our architecture

Traverse deeply, without losing your place.

DataFlirt's extraction engine treats pagination as a stateful, resumable operation. We don't just follow next links blindly in memory. We persist cursor state to Redis after every successful batch. If a worker dies, a proxy rotates, or an API throws a 502 on page 4,000, the pipeline resumes exactly where it left off. For offset-based APIs that support it, we partition the total record space using date or category filters and extract chunks concurrently, bypassing hard offset limits and drastically reducing crawl time.

pagination.state

Live state of a resumable pagination job in the DataFlirt scheduler.

job.id: api-traverse-882
strategy: cursor-based
records.yielded: 42,000
current_cursor: eyJpZCI6NDIwMDB9
checkpoint.age: 1.2s ago
rate_limit.hits: 3 (handled)
loop_detection: active (clean)


// 07 — FAQ

Common questions.

About pagination strategies, handling hard limits, parallelization, and how DataFlirt ensures data completeness across massive API endpoints.

What is the difference between offset and cursor pagination?
Offset pagination uses limit and offset parameters (e.g., skip 100, take 50). It's easy to implement but degrades on deep pages and is vulnerable to data shifts: if a record is added or deleted mid-crawl, everything after it shifts, causing duplicates or missed rows. Cursor pagination uses a unique pointer (e.g., after_id=994). It is stable against data shifts and performs well at any depth, but it cannot jump directly to an arbitrary page.
How do you bypass the 10,000 record offset limit?
Many APIs backed by search engines such as Elasticsearch or Solr cap offsets at around 10,000 to prevent memory exhaustion (in Elasticsearch this is the default value of index.max_result_window). To extract 500,000 records, you cannot simply paginate deeper. We use filter slicing: we dynamically inject filters (like narrow date ranges, price brackets, or alphabetical prefixes) to ensure no single query matches more than 10,000 records, paginating fully within each slice.
Can you parallelize cursor-based pagination?
Strictly speaking, no. Cursor pagination is inherently sequential: you need the response of page N to get the cursor for page N+1. However, DataFlirt parallelizes the overall job by splitting the initial query into orthogonal segments (e.g., one worker per category or date range), allowing multiple sequential cursor chains to run concurrently.
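A minimal sketch of that idea, assuming the API accepts a segment filter alongside the cursor: each segment runs its own sequential cursor chain, and a thread pool runs the chains side by side. The `fetch_page(segment, cursor)` signature, the category names, and the fake catalog are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

def crawl_segment(fetch_page: Callable[[str, Optional[str]], dict], segment: str) -> List:
    """Follow one cursor chain to the end; strictly sequential within the segment."""
    records: List = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(segment, cursor)
        records.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:
            return records

def crawl_all(
    fetch_page: Callable[[str, Optional[str]], dict],
    segments: List[str],
    max_workers: int = 4,
) -> Dict[str, List]:
    """Run one cursor chain per segment concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda seg: crawl_segment(fetch_page, seg), segments)
        return dict(zip(segments, results))

# In-memory stand-in for a segmented API (hypothetical data).
CATALOG = {"books": list(range(5)), "toys": list(range(100, 103))}

def fake_fetch(segment: str, cursor: Optional[str]) -> dict:
    rows = CATALOG[segment]
    start = int(cursor) if cursor is not None else 0
    end = start + 2
    return {"items": rows[start:end], "next_cursor": str(end) if end < len(rows) else None}

result = crawl_all(fake_fetch, ["books", "toys"])
```

The segments must not overlap and must cover the whole dataset together, otherwise the combined result will contain duplicates or gaps.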
How does DataFlirt prevent infinite pagination loops?
Poorly implemented APIs sometimes return the final page repeatedly instead of an empty array or null cursor. We maintain a rolling hash of the last three response payloads and track cursor values. If the payload hash matches exactly, or the cursor fails to advance, our loop detection circuit breaks the traversal and marks the endpoint as exhausted.
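A loop detector in this spirit could be sketched as below. This is an interpretation, not DataFlirt's code: the `LoopDetector` class is hypothetical, payloads are hashed with SHA-256 over their JSON form, and "the cursor fails to advance" is approximated by checking whether the cursor has been seen before.

```python
import hashlib
import json
from collections import deque
from typing import Optional

class LoopDetector:
    """Flags a page whose payload or cursor repeats what was recently seen."""

    def __init__(self, window: int = 3) -> None:
        self.recent_hashes = deque(maxlen=window)  # hashes of the last `window` payloads
        self.seen_cursors = set()

    def is_loop(self, items: list, next_cursor: Optional[str]) -> bool:
        digest = hashlib.sha256(json.dumps(items, sort_keys=True).encode()).hexdigest()
        repeated_payload = digest in self.recent_hashes
        stalled_cursor = next_cursor is not None and next_cursor in self.seen_cursors
        self.recent_hashes.append(digest)
        if next_cursor is not None:
            self.seen_cursors.add(next_cursor)
        return repeated_payload or stalled_cursor

# Demo: the API keeps returning its final page with the same cursor.
detector = LoopDetector()
pages = [
    ([{"id": 1}, {"id": 2}], "a"),
    ([{"id": 3}, {"id": 4}], "b"),
    ([{"id": 3}, {"id": 4}], "b"),  # same payload, same cursor
]
flags = [detector.is_loop(items, cursor) for items, cursor in pages]
```

A traversal loop would call `is_loop` on every page and stop, marking the endpoint as exhausted, the first time it returns True.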
What happens if a cursor expires mid-crawl?
Some APIs (for example, certain AWS services and GraphQL endpoints) issue time-sensitive cursors that expire if they are not used within a short window, such as 5 minutes. If a rate limit or network error delays the worker and the cursor expires, DataFlirt's state manager falls back to the last known stable anchor, re-fetches the previous page to generate a fresh cursor, and resumes without duplicating data.
Is it legal to scrape paginated public APIs?
Yes, accessing publicly available data via an exposed API is generally lawful, provided you do not bypass authentication or breach specific terms of service. However, aggressive pagination can trigger denial-of-service protections. We strictly adhere to rate limits and concurrency caps to ensure our extraction remains non-disruptive to the target's infrastructure.