
What is Scraper State Persistence?

Scraper state persistence is the architectural practice of continuously serialising a crawler's runtime context (session cookies, pagination cursors, visited-URL Bloom filters, and proxy bindings) to an external datastore such as Redis. When a worker crashes, for example from an out-of-memory error or a proxy timeout, a new worker can load the saved state and resume the job from the last checkpoint. Without it, long-running extraction jobs are fragile monoliths that restart from zero on every failure.

Infrastructure · Checkpointing · Session Management · Redis · Distributed Crawling

// 02 — definitions

Survive the
crash.

Why treating your scraping workers as ephemeral, stateless compute nodes requires making your session data highly durable.


TL;DR

Scraper state persistence separates the compute (the worker) from the context (the session). By writing cookies, local storage, and job progress to a fast key-value store every few seconds, pipelines can survive node failures, rotate IPs without losing authentication, and distribute a single logical crawl across hundreds of physical machines.

01 Definition & structure

Scraper state persistence is the mechanism of saving a crawler's internal context to an external, highly available datastore. A complete state payload typically includes:

auth — Session cookies, bearer tokens, and CSRF tokens.
progress — Pagination cursors, queue offsets, and depth counters.
history — A Bloom filter or hash set of already-visited URLs.
routing — The specific proxy IP or ASN bound to the session to maintain geographic consistency.

By externalising this data, the scraping worker becomes a stateless function that simply executes instructions based on the loaded context.
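As a rough illustration of the payload described above, here is a minimal sketch of a serialisable state object. The class name `ScraperState`, its field names, and the sample values are assumptions made for this example, not DataFlirt's actual schema; the visited-URL history is represented only by the key of a shared filter rather than by the filter itself.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ScraperState:
    """Hypothetical state payload mirroring the four groups listed above."""
    auth: dict = field(default_factory=dict)      # cookies, bearer and CSRF tokens
    progress: dict = field(default_factory=dict)  # cursors, queue offsets, depth
    history_key: str = ""                         # name of the shared visited-URL filter
    routing: dict = field(default_factory=dict)   # proxy IP or ASN bound to the session

    def to_json(self) -> str:
        # sort_keys keeps the serialised form stable between checkpoints
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ScraperState":
        return cls(**json.loads(raw))


# Illustrative values only
state = ScraperState(
    auth={"session_id": "example-token"},
    progress={"cursor": "page=412", "depth": 3},
    history_key="visited:job-1",
    routing={"proxy": "203.0.113.7", "asn": 64500},
)
restored = ScraperState.from_json(state.to_json())
```

Because the worker only needs `from_json` to rebuild its context, any process that can read the datastore can take over the session.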

02 How it works in practice

When a worker boots up, it queries the orchestrator for a job assignment. Instead of just receiving a URL, it receives a state ID. The worker pulls the state payload from Redis, injects the cookies into its HTTP client or headless browser, sets its proxy configuration, and resumes fetching. Every few seconds, or after every successful extraction, the worker calculates the diff of its state (e.g., new cookies set by the server, updated pagination cursor) and commits it back to Redis.
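The load, diff, and commit cycle could look roughly like the sketch below. It uses an in-memory stand-in (`InMemoryStore`) instead of a real Redis client so that it runs anywhere; the function names, the key `session:demo`, and the state shape are all hypothetical.

```python
import json


class InMemoryStore:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def load_state(store, state_id):
    """Fetch and decode the state payload, or start with an empty one."""
    raw = store.get(state_id)
    return json.loads(raw) if raw else {"cookies": {}, "cursor": None, "proxy": None}


def diff_state(old, new):
    """Return only the top-level keys whose values changed."""
    return {k: v for k, v in new.items() if old.get(k) != v}


def commit_state(store, state_id, committed, current):
    """Write the state back only if something changed; return the diff."""
    changes = diff_state(committed, current)
    if changes:
        store.set(state_id, json.dumps({**committed, **changes}))
    return changes


store = InMemoryStore()
store.set("session:demo", json.dumps(
    {"cookies": {"sid": "a"}, "cursor": "page=412", "proxy": "203.0.113.7"}))

committed = load_state(store, "session:demo")       # worker boots and mounts state
current = dict(committed, cursor="page=413")        # one page later
changes = commit_state(store, "session:demo", committed, current)
```

Here only the cursor changed, so `changes` contains just that key. A production version would more likely apply the diff server-side (for example with per-field hash writes) instead of rewriting the whole payload.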

03 The authentication preservation angle

Logging in is expensive. It often requires solving a CAPTCHA, executing heavy JavaScript, or burning a high-quality residential proxy request. If a worker crashes and loses its session, you have to pay that login cost again. Worse, target servers monitor login frequency; authenticating 50 times an hour for the same account is a massive red flag. State persistence allows a single authenticated session to outlive the physical machine that created it, drastically reducing block rates.

04 How DataFlirt handles it

We engineer for failure. Our infrastructure assumes any worker node could be terminated at any moment. We run a dedicated Redis cluster for state checkpointing, completely separate from our URL queues. Checkpoints occur asynchronously every 5 seconds. When our orchestrator detects a worker heartbeat timeout, it immediately provisions a replacement, mounts the latest checkpoint (at most 5 seconds old), and re-queues the last unacknowledged URL. The pipeline self-heals without human intervention.
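The heartbeat check and re-queue step might be sketched as follows. This is not DataFlirt's orchestrator: the 10-second timeout, the function names, and the data structures are assumptions chosen to keep the example small.

```python
from collections import deque

HEARTBEAT_TIMEOUT_S = 10.0  # assumed value; the text does not state the timeout


def find_dead_workers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout)


def recover(worker_id, assignments, in_flight, queue):
    """Release the dead worker's session and re-queue its unacknowledged URL.

    Returns the state ID that a replacement worker should mount.
    """
    state_id = assignments.pop(worker_id)
    url = in_flight.pop(worker_id, None)
    if url is not None:
        queue.appendleft(url)  # retry it first so crawl order is preserved
    return state_id


last_heartbeat = {"worker-041": 118.0, "worker-042": 100.0}
assignments = {"worker-041": "session:a", "worker-042": "session:b"}
in_flight = {"worker-042": "https://example.com/catalog?page=413"}
queue = deque(["https://example.com/catalog?page=414"])

dead = find_dead_workers(last_heartbeat, now=120.0)
state_to_mount = [recover(w, assignments, in_flight, queue) for w in dead]
```

Re-queuing the unacknowledged URL means a page can occasionally be fetched twice, which is usually an acceptable price for never skipping one.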

05 The Bloom filter advantage

One of the hardest parts of state persistence at scale is tracking visited URLs. If you are crawling 100 million pages, storing every URL as a string in Redis requires many gigabytes of RAM and slows down state syncs. We use Redis Bloom filters. A Bloom filter can tell a worker "you have definitely not visited this URL" or "you probably have visited this URL" in a small fraction of that memory: roughly 120 MB for 100 million URLs at a 1% false-positive rate. The trade-off is that a false positive causes an unvisited URL to be skipped. For crawl history at this scale, it is one of the few approaches that stays practical.
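To make the memory trade-off concrete, here is a minimal pure-Python Bloom filter with the standard sizing formula m = -n · ln(p) / (ln 2)². It is an illustrative sketch, not the Redis implementation; the class and function names are made up for this example, and the double-hashing scheme over SHA-256 is one common choice among several.

```python
import hashlib
import math


def bloom_size_bits(n_items, fp_rate):
    """Optimal number of bits for n items at the target false-positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


class BloomFilter:
    def __init__(self, n_items, fp_rate):
        self.m = bloom_size_bits(n_items, fp_rate)
        # Optimal hash count k = (m / n) * ln 2
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(n_items=1000, fp_rate=0.01)
seen.add("https://example.com/catalog?page=412")

# Sizing for a large crawl: about 9.6 bits per URL at a 1% false-positive rate
size_mb = bloom_size_bits(100_000_000, 0.01) / 8 / 1_000_000
```

A membership test never returns a false negative, so a URL that was added is always reported as seen; the only error mode is occasionally reporting an unseen URL as seen.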

// 03 — the recovery math

Calculating the cost
of amnesia.

State persistence isn't just about reliability; it's a direct optimization of compute and proxy bandwidth. DataFlirt models the cost of state loss to tune our checkpoint frequency and minimize wasted work.

Wasted compute penalty = T_elapsed × C_worker × P_crash

The financial cost of starting from zero on a long-running job without checkpoints. Source: standard reliability model
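A quick worked example of the penalty formula, under one assumed reading of its symbols: T_elapsed as hours of work at risk, C_worker as cost per worker-hour, and P_crash as the probability of a crash during that window. All numbers are made up for illustration.

```python
def wasted_compute_cost(t_elapsed_h, c_worker_per_h, p_crash):
    """Expected cost of redoing the work that a crash would destroy."""
    return t_elapsed_h * c_worker_per_h * p_crash


# Hypothetical job: 6 hours in, $0.40 per worker-hour, 25% chance of a crash
no_checkpoint = wasted_compute_cost(6.0, 0.40, 0.25)

# With a checkpoint every 5 seconds, at most 5 seconds of work is at risk
with_checkpoint = wasted_compute_cost(5 / 3600, 0.40, 0.25)
```

With these inputs the expected loss drops from about $0.60 per worker to a small fraction of a cent, before counting the proxy bandwidth and login costs that a restart would also repeat.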

Checkpoint overhead = S_bytes / B_network × F_freq

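Share of wall-clock time spent transmitting serialised state (payload size over network bandwidth, times checkpoint frequency). Each individual write must be kept under 50ms to avoid stalling the event loop. Source: DataFlirt infrastructure SLO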
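The overhead formula can be checked with a few lines of arithmetic. The 45 KB payload and 5-second interval come from figures elsewhere on this page; the 1 Gbit/s link (125,000,000 bytes per second) is an assumption, and the model ignores round-trip latency and serialisation CPU time.

```python
def checkpoint_overhead(s_bytes, b_network_bytes_per_s, f_per_s):
    """Return (seconds per checkpoint write, fraction of time spent writing)."""
    transfer_s = s_bytes / b_network_bytes_per_s
    return transfer_s, transfer_s * f_per_s


# 45 KB payload, assumed 1 Gbit/s link, one checkpoint every 5 seconds
transfer_s, busy_fraction = checkpoint_overhead(45_000, 125_000_000, 1 / 5)
within_budget = transfer_s < 0.050  # the 50ms per-write budget
```

Under these assumptions the transfer itself takes well under a millisecond, so in practice the per-write time is dominated by network round trips, not by bandwidth.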

DataFlirt recovery time = T_provision + T_mount + T_resume

Total time from worker death to replacement worker fetching the next URL. Currently < 800ms. Source: internal telemetry, v2026.5

// 04 — worker lifecycle

A worker dies.
The session lives.

Trace of a worker hitting a fatal out-of-memory crash, followed by the orchestrator spinning up a replacement that mounts the same session state.

Redis · Playwright · Auto-healing


// worker-042 active
state.checkpoint: committed "cursor: page=412"
state.cookies: synced "session_id=9a8b7c..."
memory.usage: 1.8GB // leak detected in target SPA

// crash event
FATAL: Out of memory (OOM killer)
worker-042: terminated

// orchestrator intervention
worker-043: provisioned "node-pool-alpha"
state.mount: "fetch from redis://cluster-01/session:9a8b7c"
browser.context: restored "14 cookies, 2 localStorage keys"

// resume
queue.pop: "https://target.com/catalog?page=413"
status: 200 OK // auth maintained, no login required

// 05 — state payloads

What actually gets
serialised.

The components of a scraper's state, ranked by their frequency of mutation and impact on pipeline continuity if lost.

AVG STATE SIZE · 45 KB
SYNC FREQUENCY · 5 seconds
DATASTORE · Redis Cluster

01 Visited URL registry · Bloom filter · Prevents infinite loops and duplicate extraction
02 Authentication tokens · Cookies/JWTs · Bypasses the need to re-login on worker rotation
03 Pagination cursors · Job progress · Ensures the crawler resumes at the exact right page
04 Proxy affinity bindings · IP routing · Keeps the session tied to the same residential exit node
05 Local storage / IndexedDB · DOM state · Maintains client-side SPA state across browser restarts

// 06 — our architecture

Ephemeral workers,
durable sessions.

At DataFlirt, we treat every browser instance as hostile and prone to sudden death. Our workers are entirely stateless. Every 5 seconds, the active session context is serialised and pushed to a distributed Redis cluster. When a node inevitably fails — whether from a memory leak, a proxy ban, or a cloud provider preemption — the orchestrator spins up a replacement, injects the saved state, and resumes the crawl with less than 800ms of pipeline interruption. The target server never sees a broken session.

Session State Monitor

Live telemetry from a distributed crawl job recovering from a node failure.

job.id: crawl-b2b-catalog-09
active.workers: 128 nodes
state.datastore: Redis · 3ms latency
checkpoint.freq: 5000ms
last.crash: 14s ago · worker-042
recovery.time: 742ms
auth.drops: 0


// 07 — FAQ

Common
questions.

About checkpointing, session management, distributed crawling, and how DataFlirt maintains state across millions of requests.


Why not just keep the state in memory?

Because memory is tied to the process. If the process dies (from an unhandled exception, an out-of-memory error caused by a bloated SPA, or a server reboot), the state dies with it. For a job scraping 10,000 pages, losing that in-memory state means starting over. Externalising state to Redis makes the worker disposable and the job resumable.

How do you handle cookies expiring while saved in state?

State persistence isn't just dumb storage; it requires lifecycle management. When a worker mounts a state, it checks the expiry timestamps on the JWTs or cookies. If they are within 5 minutes of expiration, the worker triggers a silent refresh flow before resuming the crawl queue. The updated tokens are then immediately committed back to the state store.
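The expiry check on mount could be sketched like this. The 5-minute window comes from the answer above; the cookie shape (a dict with `name` and an `expires` Unix timestamp, where a non-positive value means a session cookie with no fixed expiry) and the function names are assumptions for illustration.

```python
import time

REFRESH_WINDOW_S = 5 * 60  # refresh when within 5 minutes of expiry


def needs_refresh(expires_at, now=None, window=REFRESH_WINDOW_S):
    """True if the credential has expired or will expire within the window."""
    now = time.time() if now is None else now
    return expires_at - now <= window


def expiring_cookies(cookies, now=None):
    """Names of cookies that should trigger a refresh before the crawl resumes.

    Cookies without a positive 'expires' value are treated as session
    cookies and skipped.
    """
    return [c["name"] for c in cookies
            if c.get("expires", -1) > 0 and needs_refresh(c["expires"], now)]


cookies = [
    {"name": "sid", "expires": 1200},    # 200 s left at now=1000
    {"name": "pref", "expires": 5000},   # 4000 s left
    {"name": "tmp", "expires": -1},      # session cookie, no expiry
]
stale = expiring_cookies(cookies, now=1000)
```

A worker would run this right after mounting the state and, if the list is non-empty, perform the refresh flow and commit the new tokens before popping the next URL.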

What is the performance overhead of saving state constantly?

Minimal, if engineered correctly. We don't serialise the entire DOM. We extract only the critical context (cookies, local storage keys, cursors) which usually totals under 50KB. Writing 50KB to a VPC-local Redis cluster takes roughly 2–4 milliseconds. We run this asynchronously so it never blocks the main extraction event loop.
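One way to keep checkpoint writes off the extraction path is a background task, sketched here with `asyncio`. The `save` coroutine is a hypothetical stand-in for an asynchronous datastore write, and the intervals are shortened so the example finishes quickly.

```python
import asyncio
import json


async def checkpoint_loop(get_state, save, interval_s, stop):
    """Persist state every interval until `stop` is set, then once more."""
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake immediately on shutdown
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
        await save(json.dumps(get_state()))


async def main():
    state = {"cursor": 0}
    saved = []

    async def save(payload):  # hypothetical stand-in for an async Redis write
        saved.append(payload)

    stop = asyncio.Event()
    task = asyncio.create_task(
        checkpoint_loop(lambda: state, save, 0.01, stop))

    for page in range(1, 4):  # the extraction loop keeps running meanwhile
        state["cursor"] = page
        await asyncio.sleep(0.02)

    stop.set()
    await task  # the loop writes a final checkpoint before exiting
    return saved


saved = asyncio.run(main())
```

The final write on shutdown guarantees the last cursor is persisted on a clean exit; after a hard crash, the most recent periodic checkpoint is what the replacement worker mounts.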

How does DataFlirt track visited URLs across hundreds of workers?

We use distributed Bloom filters in Redis. A standard hash set for 50 million URLs would consume gigabytes of RAM and choke the network. A Bloom filter can probabilistically check whether a URL has been visited using tens of megabytes of shared memory (about 60 MB at a 1% false-positive rate), allowing hundreds of concurrent workers to avoid duplicate fetches with a single fast lookup.

Can state persistence help avoid anti-bot detection?

Absolutely. Anti-bot systems flag clients that request 500 pages but have no session history, or clients that log in repeatedly from different IPs. By persisting the session state and proxy affinity, a new worker looks exactly like the old worker to the target server. The trust score built up by the previous worker is inherited.

Is it legal to persist session tokens and cookies?

Yes. Persisting cookies and session tokens is exactly what a standard web browser does when you close and reopen it. You are simply managing the client-side state that the server explicitly asked you to hold. As long as the underlying access is authorized, how you store the client state is an infrastructure implementation detail.
