What is Cursor-Based Pagination?

Cursor-based pagination is a method of traversing large datasets by passing a unique pointer—the cursor—from the current response into the next request, rather than using page numbers. For data pipelines, it's the only reliable way to scrape high-velocity feeds without dropping records or ingesting duplicates when the underlying database shifts mid-crawl. If you're pulling millions of rows from a live API, offset pagination will fail you; cursors are mandatory.

Network Layer · API Scraping · Data Consistency · Stateful Traversal · O(1) Query
// 02 — definitions

Pointers, not pages.

Why relying on page numbers for live data guarantees pipeline corruption, and how cursors solve the shifting-dataset problem.

TL;DR

Cursor-based pagination uses a stable reference point (such as a timestamp or a base64-encoded ID) to fetch the next batch of records. Offset pagination skips or duplicates items when the dataset changes during the scrape; a cursor anchored to an immutable, uniquely ordered field does neither. Cursors are the standard pattern in modern APIs, including GraphQL (Relay-style connections), Twitter, and Stripe.

01 Definition & structure

Cursor-based pagination is a technique for iterating through a dataset by using a pointer to a specific item. Instead of asking the server for "page 5", the client asks for "100 items starting after the item with ID 4592".

The cursor is typically returned in the metadata of the API response. It can be a raw database ID, a timestamp, or an opaque, base64-encoded string that the server decodes to find the exact index position. Because the cursor points to a physical record rather than a relative offset, it is immune to data shifting caused by real-time inserts or deletes.
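
A minimal sketch of the opaque-cursor idea described above, assuming the server packs the last-seen ID into base64-encoded JSON. The names encode_cursor, decode_cursor, and page_after are hypothetical, and a real API may use a different payload or encrypt it.

```python
import base64
import json

def encode_cursor(last_id: int) -> str:
    """Pack the last-seen ID into an opaque, URL-safe token."""
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(token: str) -> int:
    """Recover the last-seen ID from a token produced by encode_cursor."""
    return json.loads(base64.urlsafe_b64decode(token))["id"]

def page_after(records, cursor, limit):
    """Return up to `limit` records whose id is greater than the cursor's id.

    `records` is assumed to be a list of dicts sorted by a unique "id".
    """
    after_id = decode_cursor(cursor) if cursor else float("-inf")
    items = [r for r in records if r["id"] > after_id][:limit]
    # Only issue a next cursor when more records remain after this page.
    has_more = bool(items) and any(r["id"] > items[-1]["id"] for r in records)
    next_cursor = encode_cursor(items[-1]["id"]) if has_more else None
    return items, next_cursor
```

Calling page_after(records, encode_cursor(4592), 100) corresponds to the "100 items starting after the item with ID 4592" request: the position is defined by the record itself, not by how many rows precede it.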

02 How it works in practice

In a scraping pipeline, the worker makes an initial request without a cursor parameter. The API returns the first batch of data, plus a next_cursor token. The worker extracts this token and appends it to the URL of the subsequent request (e.g., ?cursor=xyz). This loop continues until the API returns a null cursor or an empty data array, signaling the end of the dataset.
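
The loop above can be sketched as a small generator. This is an illustration, not DataFlirt's worker code: fetch_page is a hypothetical callable that performs the HTTP request (for example GET ?cursor=xyz) and returns the parsed JSON, and the data / meta.next_cursor response shape is an assumption.

```python
from typing import Callable, Iterator, Optional

def traverse(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every record by following next_cursor until the feed ends."""
    cursor = None  # the initial request carries no cursor parameter
    while True:
        payload = fetch_page(cursor)
        records = payload.get("data", [])
        yield from records
        cursor = payload.get("meta", {}).get("next_cursor")
        # A null cursor or an empty data array signals the end of the dataset.
        if not cursor or not records:
            break
```

Keeping the transport behind fetch_page means the same loop can be exercised against canned pages in a test and against a real HTTP client in production.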

03 The offset pagination trap

Offset pagination (?limit=100&offset=5000) is a trap for data engineers. At the database level, the server must fetch 5,100 rows, discard the first 5,000, and return the last 100. As the offset grows, query time degrades linearly, often triggering API timeouts on deep scrapes.

Worse, offset positions are relative, so any write during the crawl moves them. If a record is inserted ahead of your position, every subsequent record shifts down by one and the next page repeats a record you already have. If a record is deleted, everything shifts up and the next page skips one. Cursors eliminate both the performance penalty and the data corruption risk.
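
A small simulation makes the shift concrete. It assumes a newest-first feed of integer IDs held in a Python list; offset_page and cursor_page are illustrative helpers, not part of any API.

```python
def offset_page(feed, offset, limit):
    """Slice by position, as LIMIT/OFFSET does."""
    return feed[offset:offset + limit]

def cursor_page(feed, before_id, limit):
    """Return the next items strictly older than the last ID already seen."""
    return [x for x in feed if x < before_id][:limit]

feed = list(range(10, 0, -1))        # newest first: 10, 9, ..., 1
first = offset_page(feed, 0, 3)      # [10, 9, 8]

after_insert = [11] + feed           # a new record lands at the top
dup_page = offset_page(after_insert, 3, 3)      # [8, 7, 6]: 8 is repeated

after_delete = [x for x in feed if x != 9]      # an already-seen record is removed
gap_page = offset_page(after_delete, 3, 3)      # [6, 5, 4]: 7 is skipped

# The cursor version returns the same page in both cases.
stable_insert = cursor_page(after_insert, first[-1], 3)   # [7, 6, 5]
stable_delete = cursor_page(after_delete, first[-1], 3)   # [7, 6, 5]
```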

04 How DataFlirt handles opaque cursors

Many modern targets obfuscate their cursors to prevent scrapers from guessing the next page or parallelizing the workload. We treat these as black boxes. Our extraction layer isolates the cursor token using JSON pathing, validates its format against a known schema, and injects it verbatim into the next request.
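
As a rough sketch of the extract, validate, and pass-through steps: the path ("meta", "next_cursor") and the base64-like pattern below are assumptions for illustration, and a real target's schema would replace them.

```python
import re

# Assumed token shape: base64 or base64url characters with optional padding.
CURSOR_PATTERN = re.compile(r"[A-Za-z0-9_\-+/]+={0,2}")

def extract_cursor(payload, path=("meta", "next_cursor")):
    """Walk `path` through the response and return the cursor verbatim."""
    node = payload
    for key in path:
        if not isinstance(node, dict) or key not in node:
            # A missing field is schema drift, not the end of the feed.
            raise KeyError(f"pagination field missing at {key!r}")
        node = node[key]
    if node is None:
        return None  # explicit null: traversal is complete
    if not isinstance(node, str) or not CURSOR_PATTERN.fullmatch(node):
        raise ValueError("cursor does not match the expected format")
    return node  # never decoded or modified, only checked
```

Separating "field is null" from "field is absent" keeps a silently dropped pagination object from being mistaken for a finished crawl.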

To scale this, we use a decoupled architecture: a single coordinator thread rapidly traverses the cursor chain, fetching only the metadata, while hundreds of stateless workers process the heavy data payloads in parallel.

05 Did you know: cursor expiration

Cursors are not always permanent. Some APIs generate a temporary server-side cache of your query and issue a cursor that acts as a session key. If you pause your scraper for 15 minutes to respect a rate limit, the cache is flushed and the cursor expires. Attempting to use it will result in a 400 Bad Request. Robust pipelines must track the last known record ID to rebuild the query state if the cursor dies.
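
One way to sketch that recovery path, under two loud assumptions: the API accepts some "start after this ID" filter (called after_id here, a hypothetical parameter), and fetch_page raises CursorExpired when the server rejects a stale cursor. Both names are invented for the example.

```python
class CursorExpired(Exception):
    """Raised by fetch_page when the API rejects a stale cursor (e.g. HTTP 400)."""

def traverse_with_checkpoint(fetch_page, max_restarts=3):
    """Follow cursors, rebuilding the query from the last record ID on expiry."""
    cursor, last_id, restarts = None, None, 0
    while True:
        try:
            # after_id is only sent when there is no live cursor to resume from.
            payload = fetch_page(
                cursor=cursor,
                after_id=last_id if cursor is None else None,
            )
        except CursorExpired:
            restarts += 1
            if restarts > max_restarts:
                raise
            cursor = None  # drop the dead cursor; restart from the checkpoint
            continue
        for record in payload["data"]:
            last_id = record["id"]  # checkpoint: last record actually delivered
            yield record
        cursor = payload["meta"]["next_cursor"]
        if cursor is None or not payload["data"]:
            return
```

The checkpoint only advances when a record is yielded, so a restart resumes after the last delivered record rather than after the last requested page.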

// 03 — the performance model

Why deep pagination kills offset queries.

Offset pagination degrades linearly as you go deeper into a dataset. Cursor pagination maintains constant time. DataFlirt enforces cursor traversal for any API target exceeding 10,000 records.

Offset Query Cost = O(N), where N = offset + limit
The database must scan and discard all preceding rows before returning the page. (Standard SQL behavior)
Cursor Query Cost = O(1) in page depth = index_lookup(cursor_id)
The database seeks directly to the indexed cursor value. The seek itself is an O(log n) B-tree lookup in table size, but it does not grow with page depth, so performance stays flat. (B-Tree indexing)
Pipeline Consistency = C = 1.0
Exactly-once delivery holds only if the cursor field is immutable, unique, and strictly ordered. (DataFlirt extraction SLO)
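
The two query shapes can be compared side by side with Python's built-in sqlite3 module. The feed table and its columns are made up for the example; the point is only that both queries return the same page while the keyset form seeks through the primary-key index instead of walking past the offset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO feed (id, body) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 10001)],
)

limit, offset = 100, 5000

# Offset form: the engine steps over `offset` rows before returning any.
offset_rows = conn.execute(
    "SELECT id FROM feed ORDER BY id LIMIT ? OFFSET ?", (limit, offset)
).fetchall()

# Keyset (cursor) form: seek to the last-seen id, then read `limit` rows.
last_seen_id = 5000
cursor_rows = conn.execute(
    "SELECT id FROM feed WHERE id > ? ORDER BY id LIMIT ?", (last_seen_id, limit)
).fetchall()

same_page = offset_rows == cursor_rows
```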
// 04 — API traversal trace

Following the cursor through a live feed.

A live trace of a DataFlirt worker paginating through a high-velocity social media API. Notice how the cursor token is extracted from the metadata and injected into the subsequent request.

JSON API · Base64 Cursor · Stateful Traversal
edge.dataflirt.io — live
CAPTURED
// initial request (no cursor)
GET /api/v2/feed?limit=100
response.status: 200 OK
payload.data.length: 100
payload.meta.next_cursor: "eyJpZCI6MTIzNDU2Nzg5LCJ0cyI6MTY4NDU0MzIxMH0="

// decode cursor for logging (optional)
cursor.decoded: {"id":123456789,"ts":1684543210}

// subsequent request
GET /api/v2/feed?limit=100&cursor=eyJpZCI6...
response.status: 200 OK
payload.data.length: 100
payload.meta.next_cursor: "eyJpZCI6MTIzNDU2Nzk5LCJ0cyI6MTY4NDU0MzI1NX0="

// final request (end of feed)
GET /api/v2/feed?limit=100&cursor=eyJpZCI6...
response.status: 200 OK
payload.data.length: 42
payload.meta.next_cursor: null // traversal complete
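
The optional decode step in the trace can be reproduced with the standard library. The helper name is hypothetical, and it assumes the token is plain base64-encoded JSON, which holds for the tokens in this trace but not for encrypted or server-side cursors.

```python
import base64
import json

def decode_cursor_for_logging(token):
    """Best-effort decode for logs; returns None if the token is not base64 JSON."""
    try:
        return json.loads(base64.b64decode(token, validate=True))
    except ValueError:
        # Covers invalid base64, invalid UTF-8, and invalid JSON alike.
        return None

first = decode_cursor_for_logging("eyJpZCI6MTIzNDU2Nzg5LCJ0cyI6MTY4NDU0MzIxMH0=")
# first == {"id": 123456789, "ts": 1684543210}
second = decode_cursor_for_logging("eyJpZCI6MTIzNDU2Nzk5LCJ0cyI6MTY4NDU0MzI1NX0=")
# second == {"id": 123456799, "ts": 1684543255}
```

The decoded value is for logging only; the request still carries the original token unchanged.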
// 05 — failure modes

Where cursor traversals break.

Cursors solve data consistency, but they introduce stateful dependencies. If a cursor expires or a request drops, the pipeline must know how to recover without starting from scratch.

PIPELINES MONITORED · 140+ API targets
CURSOR FAILURES · 0.8% of runs
UPDATED · 2026-05-19
01 Cursor expiration (time-bound): Token TTL expires before the next request is made
02 Session binding (auth-bound): Cursor is tied to the specific IP or auth token that generated it
03 Opaque token rotation (format shift): Target changes the encryption or encoding of the cursor payload
04 Missing next_cursor field (schema drift): API silently drops the pagination metadata object
05 Rate limit resets (state loss): 429 Too Many Requests forces a backoff that outlasts the cursor TTL
// 06 — DataFlirt's architecture

Stateful traversal, stateless workers.

Cursor pagination is inherently stateful—you need the result of page N to request page N+1. This breaks naive parallel crawling. DataFlirt solves this by decoupling the cursor discovery from the data extraction. A single lightweight coordinator thread traverses the cursors, fetching only the metadata, and pushes the raw cursor tokens to a distributed queue. Stateless workers then pull those cursors and fetch the actual data payloads at maximum concurrency.
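
A toy version of this split, using a thread-safe in-process queue in place of a distributed one. It assumes the target allows a cheap metadata-only request and that a cursor stays valid until a worker uses it; fetch_meta and fetch_data are hypothetical callables supplied by the caller.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit

def coordinator(fetch_meta, cursor_queue, n_workers):
    """Walk the cursor chain sequentially and publish each cursor."""
    cursor = None  # None stands for the first page
    while True:
        cursor_queue.put(cursor)
        cursor = fetch_meta(cursor)  # returns the next cursor, or None at the end
        if cursor is None:
            break
    for _ in range(n_workers):
        cursor_queue.put(STOP)

def worker(fetch_data, cursor_queue, results, lock):
    """Pull cursors and fetch the heavy payload for each one."""
    while True:
        cursor = cursor_queue.get()
        if cursor is STOP:
            break
        records = fetch_data(cursor)
        with lock:
            results.extend(records)

def run(fetch_meta, fetch_data, n_workers=4):
    cursor_queue, results, lock = queue.Queue(), [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(fetch_data, cursor_queue, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    coordinator(fetch_meta, cursor_queue, n_workers)
    for t in threads:
        t.join()
    return results
```

Results arrive in completion order rather than feed order, so downstream code should sort or key by record ID if ordering matters.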

Cursor Coordinator State

Live state of a cursor traversal job on a B2B directory API.

job.id traverse-b2b-099
traversal.mode cursor-decoupled
cursors.discovered 14,205
cursors.expired 0
worker.concurrency 250 nodes
rate_limit.status 429 detected · backing off
pipeline.status active · 99.9% consistency

// 07 — FAQ

Common questions.

About cursor mechanics, parallelization strategies, token expiration, and how DataFlirt scales stateful API scraping.

Why is offset pagination bad for scraping?
Suppose the page size is 100. If you request ?offset=100 and a new record is inserted at the top of the dataset before you request ?offset=200, every record shifts down by one, so the record that sat at position 199 now sits at position 200 and you scrape it twice. A deletion shifts records the other way, and you skip one entirely. Cursors lock onto a specific record ID or timestamp, making them immune to upstream database shifts.
Can you parallelize cursor-based pagination?
Not natively. Because page 2's cursor is inside page 1's response, it's a strictly sequential chain. However, DataFlirt parallelizes this by reverse-engineering the cursor format (e.g., if it's just a base64-encoded timestamp, we can generate cursors synthetically for different time slices) or by decoupling the fast cursor-discovery thread from the heavy data-extraction workers.
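
For the synthetic-cursor case, a sketch under a strong assumption: the target's cursor is nothing more than a base64-encoded Unix timestamp, which has to be confirmed per target. The function name and the (cursor, stop_ts) pairing are illustrative.

```python
import base64

def time_slice_cursors(start_ts, end_ts, slices):
    """Split [start_ts, end_ts) into equal slices, one starting cursor each.

    Returns (cursor, stop_ts) pairs: a worker starts at `cursor` and stops
    once it sees a record at or past `stop_ts`.
    """
    step = (end_ts - start_ts) / slices
    edges = [int(start_ts + i * step) for i in range(slices)] + [end_ts]
    return [
        (base64.b64encode(str(lo).encode()).decode(), hi)
        for lo, hi in zip(edges, edges[1:])
    ]
```

Each worker then runs an ordinary sequential traversal inside its own slice, so the chains are independent of one another.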
What happens if a cursor expires?
Some APIs cache the query state server-side and issue a cursor with a TTL (time to live), often in the range of 5 to 15 minutes. If your scraper hits a rate limit and backs off for 20 minutes, the cursor dies. DataFlirt handles this by checkpointing the last successfully scraped record ID and initiating a fresh query using that ID as a synthetic starting point.
Are cursors tied to specific IP addresses or sessions?
Increasingly, yes. Targets protected by anti-bot systems like Cloudflare or Akamai may bind a cursor token to the TLS fingerprint or IP address that requested it. If you rotate your proxy between page 1 and page 2, the API can return a 403 or an invalid cursor error. We maintain sticky proxy sessions for the duration of a single cursor chain to prevent this.
How do you handle opaque or encrypted cursors?
If a cursor is an encrypted JWT or a server-side cache key, you cannot synthesize it. You must treat it as a black box, extract it exactly as provided, and pass it back. If the target rotates the encryption key mid-crawl, the pipeline will fail. We monitor cursor format consistency on every response to catch these rotations instantly.
How does DataFlirt handle infinite scroll on web pages?
Infinite scroll is just cursor-based pagination disguised as a UI feature. Instead of rendering the page in a headless browser and simulating scroll events (which is slow and expensive), we intercept the XHR/Fetch requests in the network tab, isolate the API endpoint, and traverse the cursors directly via raw HTTP requests.