
What is Distributed Locking?

Distributed locking is a concurrency control mechanism that ensures only one scraper worker in a fleet can access a specific resource — a target URL, an authenticated session, or a specific proxy IP — at any given time. Without it, parallel workers will inevitably fetch the same page twice, trigger concurrent login bans, or blow past target rate limits. It's the synchronization layer that makes horizontal scaling safe.

Concurrency · Redis · Redlock · Race Conditions · State Management
// 02 — definitions

Coordinate
the fleet.

How to stop 500 parallel workers from trampling each other, duplicating work, and getting your accounts banned.


TL;DR

Distributed locking uses a centralized store (typically Redis) to grant mutually exclusive access to a resource. When a worker needs to scrape a rate-limited endpoint or use a shared account, it requests a lock. If granted, it proceeds; if denied, it yields. It prevents race conditions in high-concurrency scraping pipelines.

01 Definition & structure
A distributed lock is a mechanism that provides mutually exclusive access to a shared resource across multiple independent processes or machines. In a scraping context, the "resource" is usually an authenticated session, a specific proxy IP, or a strict rate-limit bucket for a target domain. The lock state is held in a fast, centralized datastore like Redis.
02 Why scraping needs it
When you scale from 1 worker to 100 workers, race conditions emerge. If two workers try to use the same B2B account credentials simultaneously, the target site will often invalidate the session or ban the account. If 50 workers hit the same domain in the same millisecond, you can trip the target's bot protection (a Cloudflare block, for example). Distributed locking makes workers coordinate their actions, acting as a traffic cop for the fleet.
03 The Redlock algorithm
A widely used algorithm for Redis-based locking is Redlock. Instead of relying on a single Redis instance (a single point of failure), Redlock requires the worker to acquire the lock on a majority of independent Redis nodes (e.g., 3 out of 5) before the lock's TTL runs out. Because two workers cannot both hold a majority at once, this guards against split-brain scenarios where a network partition leads two different workers to believe they both hold the lock, provided clock drift between nodes stays small relative to the TTL.
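The majority rule can be sketched in a few lines. This is a simplified illustration, not a full Redlock implementation: `InMemoryNode` is a hypothetical stand-in for an independent Redis node, and `acquire_majority`, `drift_ms`, and the key name are assumptions made for the example.

```python
import time
import uuid


class InMemoryNode:
    """Stand-in for one independent Redis node (hypothetical, for illustration)."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}  # key -> (token, expiry in ms)

    def set_nx_px(self, key, token, ttl_ms, now_ms):
        """Mimic SET key token NX PX ttl_ms; True on success."""
        if not self.up:
            raise ConnectionError("node unreachable")
        held = self.data.get(key)
        if held is not None and held[1] > now_ms:
            return False  # another worker holds an unexpired lock
        self.data[key] = (token, now_ms + ttl_ms)
        return True

    def delete_if_owner(self, key, token):
        """Mimic a compare-and-delete release."""
        if self.up and key in self.data and self.data[key][0] == token:
            del self.data[key]


def acquire_majority(nodes, key, ttl_ms, drift_ms=10, clock=time.monotonic):
    """Try to lock `key` on a majority of nodes; return (token, validity_ms) or None."""
    token = uuid.uuid4().hex
    quorum = len(nodes) // 2 + 1
    start_ms = clock() * 1000
    granted = 0
    for node in nodes:
        try:
            if node.set_nx_px(key, token, ttl_ms, start_ms):
                granted += 1
        except ConnectionError:
            pass  # an unreachable node simply does not count
    elapsed_ms = clock() * 1000 - start_ms
    # Time left on the lock once acquisition time and assumed drift are subtracted.
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if granted >= quorum and validity_ms > 0:
        return token, validity_ms
    # Failed: undo partial grants so other workers are not blocked until expiry.
    for node in nodes:
        node.delete_if_owner(key, token)
    return None
```

With five nodes of which two are down, the first caller still reaches the quorum of three; a second caller for the same key gets `None` until the lock is released or expires.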
04 How DataFlirt handles it
We use a dynamic heartbeat locking system. Hardcoded TTLs are dangerous because page load times are unpredictable. Our workers acquire a lock with a short 5-second TTL. A background thread sends a heartbeat to Redis every 2 seconds to extend the lock as long as the browser is still rendering the page. If the worker container OOMs or crashes, the heartbeat stops, and the lock clears itself in 5 seconds, preventing deadlocks.
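A heartbeat of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not DataFlirt's actual worker code: the `Heartbeat` class, the `extend` callable, and `EXTEND_SCRIPT` are hypothetical names. The Lua script extends the TTL only if the caller still owns the key.

```python
import threading

# Extend the TTL only if the stored token is still ours (KEYS[1] = lock key,
# ARGV[1] = our token, ARGV[2] = new TTL in milliseconds).
EXTEND_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
else
    return 0
end
"""


class Heartbeat:
    """Calls `extend` at a fixed interval until stopped or until an extension fails."""

    def __init__(self, extend, interval_s=2.0):
        self._extend = extend          # callable returning True if the lock was extended
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.lost = threading.Event()  # set when the lock could not be extended
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self._interval_s):
            if not self._extend():
                self.lost.set()
                return

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False
```

Assuming a redis-py client, `extend` could be `lambda: client.eval(EXTEND_SCRIPT, 1, key, token, 5000) == 1`. The scrape runs inside `with Heartbeat(extend):` and should abort if `lost` becomes set, since another worker may already own the resource.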
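05 The unsafe release trap
The most common bug in custom scraping infrastructure is the "unsafe release." Worker A acquires a lock. Worker A is slow, so the lock expires. Worker B acquires the lock. Worker A finally finishes and blindly deletes the lock key, which now belongs to Worker B. Worker C then acquires the free key while B is still running, and you have concurrent execution. The fix is to store a unique token (a UUID, which this page calls a fencing token) as the lock value and release through a Lua script that deletes the key only if the stored token matches the worker's own. Strictly speaking, a fencing token is a monotonically increasing number checked by the protected resource itself; the UUID here only proves ownership at release time.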
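The race is easier to see when replayed step by step. The sketch below uses `FakeLockStore`, a hypothetical in-memory stand-in with a manually advanced clock, to contrast a blind delete with a token-checked release; none of these names come from a real library.

```python
class FakeLockStore:
    """In-memory stand-in for a Redis key space with millisecond TTLs."""

    def __init__(self):
        self.now_ms = 0
        self._data = {}  # key -> (token, expires_at_ms)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > self.now_ms:
            return entry
        self._data.pop(key, None)  # lazily drop expired entries
        return None

    def acquire(self, key, token, ttl_ms):
        """Like SET key token NX PX ttl_ms."""
        if self._live(key):
            return False
        self._data[key] = (token, self.now_ms + ttl_ms)
        return True

    def holder(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def unsafe_release(self, key):
        """Like a bare DEL: removes the lock no matter who owns it."""
        self._data.pop(key, None)

    def safe_release(self, key, token):
        """Like the compare-and-delete script: delete only if the token matches."""
        if self.holder(key) == token:
            del self._data[key]
            return 1
        return 0


store = FakeLockStore()
key = "resource:acct_991"

store.acquire(key, "token-A", ttl_ms=5000)     # Worker A takes the lock
store.now_ms = 6000                            # A is slow; its lock has expired
store.acquire(key, "token-B", ttl_ms=5000)     # Worker B takes the lock

released = store.safe_release(key, "token-A")  # A finishes late: returns 0
holder_after_safe = store.holder(key)          # still "token-B"

store.unsafe_release(key)                      # what a blind DEL by A would do
holder_after_unsafe = store.holder(key)        # None: Worker C could now enter
```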
// 03 — lock mechanics

How long should
a lock live?

Setting the right Time-To-Live (TTL) is the hardest part of distributed locking. Too short, and the lock expires before the scrape finishes. Too long, and a crashed worker stalls the pipeline.

Safe TTL baseline: $\mathrm{TTL} = T_{\text{max\_scrape}} + T_{\text{clock\_drift}} + 2\,\mathrm{s}$
Buffer for network latency and worst-case page load times. Source: distributed systems heuristics.
Redlock quorum: $Q = \lfloor N_{\text{nodes}} / 2 \rfloor + 1$
Majority consensus required across Redis nodes to safely acquire a lock. Source: Redis Redlock algorithm.
Heartbeat extension: $T_{\text{extend}} = 0.3 \times \mathrm{TTL}$
Background thread interval to keep the lock alive while the worker is still active. Source: DataFlirt worker architecture.
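As a quick worked example, the three rules translate into the helpers below. The function names and the 50 ms drift figure are assumptions chosen for illustration; the 14-second worst-case scrape and the 5-second heartbeat TTL are the figures used elsewhere on this page.

```python
def safe_ttl_ms(max_scrape_ms, clock_drift_ms, margin_ms=2000):
    """TTL = T_max_scrape + T_clock_drift + 2 s, all in milliseconds."""
    return max_scrape_ms + clock_drift_ms + margin_ms


def redlock_quorum(n_nodes):
    """Smallest strict majority: floor(N / 2) + 1."""
    return n_nodes // 2 + 1


def heartbeat_interval_ms(ttl_ms, fraction=0.3):
    """Interval between lock extensions, as a fraction of the TTL."""
    return round(ttl_ms * fraction)


ttl = safe_ttl_ms(max_scrape_ms=14_000, clock_drift_ms=50)
quorum = redlock_quorum(5)
interval = heartbeat_interval_ms(5_000)
print(ttl, quorum, interval)  # 16050 3 1500
```

For a 5-second TTL the 0.3 factor gives 1.5 seconds, slightly tighter than the 2-second heartbeat quoted elsewhere on this page; either leaves room for at least one missed extension before the lock expires.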
// 04 — redis trace

Acquiring a lock
for an auth session.

A worker attempts to use a shared B2B account and requests a lock on the account ID. Another worker already holds it, so the first attempt fails, and the worker backs off and retries.

Redis · Redlock · Auth pool
// Worker 42 requesting account lock
CMD: SET resource:acct_991 worker_42_uuid NX PX 5000
reply: (nil) // lock held by Worker 17
action: backoff 500ms

// Retry after backoff
CMD: SET resource:acct_991 worker_42_uuid NX PX 5000
reply: OK
status: lock acquired

// Scrape complete, releasing lock safely via Lua script
CMD: EVAL "if redis.call('get',KEYS[1]) == ARGV[1] then return redis.call('del',KEYS[1]) else return 0 end" 1 resource:acct_991 worker_42_uuid
reply: 1 // lock released
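The same acquire, back off, release cycle could be wrapped in client code along these lines. This is a sketch that assumes a redis-py-compatible `client` (its `set(..., nx=True, px=...)` and `eval(script, numkeys, ...)` methods); `acquire_with_backoff`, `release`, and the retry parameters are hypothetical names and values.

```python
import random
import time
import uuid

# Same compare-and-delete logic as the trace above.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""


def acquire_with_backoff(client, key, ttl_ms=5000, attempts=5,
                         base_delay_s=0.5, sleep=time.sleep):
    """Return an ownership token once the lock is acquired, or None after `attempts` tries."""
    token = uuid.uuid4().hex
    for attempt in range(attempts):
        # redis-py returns True when SET ... NX succeeds and None when the key exists.
        if client.set(key, token, nx=True, px=ttl_ms):
            return token
        if attempt < attempts - 1:
            # Jitter keeps waiting workers from retrying in lockstep.
            sleep(base_delay_s * (1 + random.random()))
    return None


def release(client, key, token):
    """Delete the lock only if this worker still owns it; True if it was deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

A worker would call `acquire_with_backoff(client, "resource:acct_991")`, run the scrape if a token comes back, and call `release` in a `finally` block so the lock is freed even when the scrape raises.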
// 05 — failure modes

Where distributed
locks break.

Locking introduces its own class of distributed systems failures. Ranked by frequency of occurrence in unmanaged scraping infrastructure.

LOCK STORE: Redis Cluster
AVG HOLD TIME: 1.2 seconds
UPDATED: 2026-05-19
01 TTL Expiry (Slow Worker)
concurrency leak · Worker takes too long, lock expires, second worker enters

02 Unsafe Release
data corruption · Worker deletes a lock that now belongs to someone else

03 Crashed Worker (No TTL)
deadlock · Worker dies holding lock, resource permanently stalled

04 Clock Drift
split brain · Redis nodes disagree on time, granting multiple locks

05 Redis Network Partition
split brain · Master/replica failover loses lock state
// 06 — our architecture

Locking without stalling,

using heartbeat extensions.

Static lock TTLs are a trap for web scraping. A page might take 800ms to load today and 14 seconds tomorrow due to target server load. If the lock expires at 10 seconds, a second worker grabs it while the first is still running, and you get concurrent execution. DataFlirt uses a dynamic heartbeat model: workers acquire a short 5-second lock, and a background thread extends it every 2 seconds as long as the browser context is still active. If the worker hard-crashes, the heartbeat stops and the lock expires within 5 seconds.

Lock Manager State

Live view of a distributed lock on a high-value target domain.

resource.id domain:target.com:rate_limit
lock.holder worker-node-8f4a
lock.ttl 5000ms
heartbeat.status active
extensions.count 4
contention.queue 12 workers waiting
fencing.token uuid-v7-99a1


// 07 — FAQ

Common
questions.

Common questions about concurrency control, Redis locking, and avoiding deadlocks in scraping pipelines.

Why not just use a local thread lock?
Local locks only work within a single process. When you scale horizontally to 50 containers, they don't share memory. You need a centralized store like Redis to coordinate across the network, otherwise Container A has no idea what Container B is doing.
What happens if a worker crashes while holding a lock?
If implemented correctly, the lock has a TTL (Time-To-Live) and will automatically expire, freeing the resource for the next worker. If implemented poorly (no TTL), the resource is deadlocked forever and requires manual intervention to clear the key.
How does DataFlirt handle rate limits across thousands of IPs?
We use distributed token buckets backed by Redis. Before a worker routes a request through a specific proxy IP to a specific domain, it must acquire a token. If the bucket is empty, the worker yields and processes a different domain, ensuring we never exceed the target's threshold.
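The yield-and-move-on behaviour can be illustrated with an in-process token bucket. This sketch is not the Redis-backed implementation described above: a fleet-wide version would keep the token count and timestamp in Redis and update them atomically. `TokenBucket`, `next_domain`, and the example domains are hypothetical.

```python
import time


class TokenBucket:
    """In-process token bucket; a fleet-wide version would keep this state in Redis."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; never blocks."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def next_domain(buckets, pending_domains):
    """Return the first pending domain whose bucket has a token, else None."""
    for domain in pending_domains:
        if buckets[domain].try_acquire():
            return domain
    return None
```

A worker loop would call `next_domain` before each request and skip to other work (or sleep briefly) when it returns `None`, rather than blocking on a single exhausted bucket.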
Is Redis the only way to do this?
No. ZooKeeper, etcd, and even PostgreSQL advisory locks work. But Redis is a common choice for scraping because it is fast and scraping locks are highly ephemeral (held for milliseconds to seconds). The overhead of etcd consensus is usually overkill for a scraping queue.
How do you prevent a slow worker from deleting a new worker's lock?
By storing a unique token (usually a UUID; this page calls it a fencing token) as the lock value when acquiring the lock. The release script (a Lua script in Redis) checks that the stored value matches the worker's token before deleting. If it doesn't match, the original lock expired and a new worker owns it, so the slow worker leaves it alone.
Can distributed locking prevent duplicate URL scraping?
Yes, but a deduplication queue or a Bloom filter is usually more efficient for URL state. Distributed locking is better suited for active resource constraints: "Only one worker can use Account A right now" or "Only one request per second to this specific API endpoint."
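For comparison, URL deduplication with a Bloom filter might look like the minimal sketch below. The sizes, hash construction, and the `BloomFilter` and `should_scrape` names are illustrative assumptions; a production pipeline would more likely use a tuned library or a shared store.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def should_scrape(seen, url):
    """True the first time a URL is offered, False on (probable) repeats."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

Unlike a lock, this never needs releasing: a URL is marked once and stays marked, which is why it suits "already scraped" state better than "currently in use" state.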