
What is Memory Usage Per Worker?

Memory Usage Per Worker is the RAM footprint of a single scraping process while it is alive and actively processing requests. In headless browser pipelines, it dictates how many concurrent tabs you can pack onto a single node before triggering an Out-Of-Memory (OOM) kill. For HTTP-only scrapers, it caps the size of your async connection pool. Controlling this metric is the difference between a profitable data pipeline and one that burns cash on idle cloud compute.

Infrastructure · Resource Allocation · OOM Kills · Headless Browsers · Concurrency
// 02 — definitions

RAM is
the ceiling.

CPU rarely bottlenecks a modern scraping pipeline. Memory is the hard limit that dictates your concurrency, your cloud bill, and your crash rate.


TL;DR

A single Playwright worker can consume anywhere from 150MB to 1.2GB of RAM depending on the target's DOM complexity and memory leaks. HTTP-only workers sit around 30-80MB. If you don't monitor and cap memory usage per worker, your orchestrator will eventually kill the container mid-scrape, dropping data and forcing expensive retries.

01 Definition & structure
Memory Usage Per Worker measures the amount of RAM consumed by a single, isolated scraping process. In a distributed architecture, a "worker" might be a Docker container running a Node.js script, a Celery task executing a Python function, or a dedicated headless browser instance. This metric includes the runtime environment overhead, the network buffers, and—crucially—the memory required to parse and store the target's response payload.
02 How it dictates pipeline economics
Cloud compute is billed by CPU and RAM. Because scraping is heavily I/O bound (waiting on network requests), CPU is rarely fully utilized, so memory becomes the limiting factor for concurrency. If your worker requires 500MB of RAM, a 4GB server can safely run about 6 concurrent workers once headroom is reserved for the OS and a safety buffer. If you optimize that worker down to 100MB, the same server can run 30 workers. Halving per-worker memory roughly doubles the workers per node, which roughly halves your infrastructure cost per worker.
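The arithmetic above can be reproduced with a minimal Python sketch. The 1 GB reserve for the OS and safety buffer is an assumption chosen to match the "about 6" figure; the text does not state the actual reserve, and `workers_per_node` is a hypothetical helper name.

```python
def workers_per_node(node_ram_mb: int, reserved_mb: int, worker_peak_mb: int) -> int:
    """Whole workers that fit in the RAM left over after the reserve."""
    usable_mb = node_ram_mb - reserved_mb
    if usable_mb <= 0 or worker_peak_mb <= 0:
        return 0
    return usable_mb // worker_peak_mb


NODE_RAM_MB = 4096   # the 4GB server from the text
RESERVED_MB = 1024   # assumed OS + safety reserve (not stated in the text)

heavy = workers_per_node(NODE_RAM_MB, RESERVED_MB, 500)  # 500MB worker
light = workers_per_node(NODE_RAM_MB, RESERVED_MB, 100)  # 100MB worker
print(heavy, light)  # prints "6 30"
```

With a different reserve the absolute counts change, but the ratio between the two worker sizes stays roughly the same.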
03 The headless browser penalty
Transitioning from an HTTP client (like requests or axios) to a headless browser (like Playwright) increases memory usage per worker by an order of magnitude. The browser must allocate memory for the V8 JavaScript engine, the rendering pipeline, the GPU process (even if software-emulated), and the DOM tree. A single complex page with heavy JavaScript can easily push a browser tab's memory footprint past 1GB.
04 How DataFlirt handles it
We treat memory as a highly volatile resource. Our orchestrator monitors the RSS (Resident Set Size) of every worker in real time. Instead of waiting for the Linux OOM killer to violently terminate a process, we use predictive recycling. If a worker's memory growth curve indicates it will hit its limit within the next 5 minutes, we stop routing new URLs to it, let it finish its current queue, and cleanly restart the process.
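One way to sketch the predictive recycling idea is a linear extrapolation over recent RSS samples. This is an illustration, not DataFlirt's actual model: the least-squares fit, the function names, and the sample format are all assumptions. Only the 5-minute horizon comes from the text.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since worker start, RSS in MB)


def growth_mb_per_sec(samples: Sequence[Sample]) -> float:
    """Least-squares slope of RSS over time; 0.0 if it cannot be estimated."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def seconds_until_limit(samples: Sequence[Sample], limit_mb: float) -> Optional[float]:
    """Projected seconds until RSS reaches limit_mb, or None if RSS is not growing."""
    slope = growth_mb_per_sec(samples)
    if slope <= 0:
        return None
    current_mb = samples[-1][1]
    return max(0.0, (limit_mb - current_mb) / slope)


def should_drain(samples: Sequence[Sample], limit_mb: float,
                 horizon_sec: float = 300.0) -> bool:
    """True when the worker is projected to hit its limit within the horizon."""
    eta = seconds_until_limit(samples, limit_mb)
    return eta is not None and eta <= horizon_sec
```

A worker growing at 1 MB/s with 280 MB of headroom would be drained; the same worker with 304 MB of headroom would keep receiving URLs for now.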
05 Did you know?
Closing a tab in Puppeteer or Playwright does not necessarily return its memory to the operating system right away. The V8 engine holds onto heap memory until a garbage collection cycle runs. If you rapidly open and close tabs in a single browser context, memory usage can spike dramatically before the GC has a chance to clean up, often resulting in a crash.
// 03 — capacity planning

How many workers
fit on a node?

Node capacity is a simple division problem, but you must account for the OS overhead and the orchestrator's safety buffer. DataFlirt provisions nodes at 85% maximum memory utilization to absorb sudden DOM spikes.

Max Concurrency = Cmax = (RAM_total - RAM_os) / RAM_worker_peak
Always divide by peak usage, not average, to survive target site anomalies. (Infrastructure sizing model)
Memory Leak Rate = ΔM = (RAM_t2 - RAM_t1) / Requests
If ΔM > 0 over a long window, the worker must be periodically recycled. (DataFlirt telemetry)
Cost per Worker = Node_Cost / Cmax
The financial metric that determines if a pipeline is commercially viable. (FinOps standard)
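The leak-rate and cost formulas above translate directly into code. In this minimal sketch Cmax is taken as an input, and the example figures (RSS readings, request count, node cost) are hypothetical values chosen only to exercise the functions.

```python
def leak_rate_mb_per_request(ram_t1_mb: float, ram_t2_mb: float, requests: int) -> float:
    """Delta-M: memory growth per request between two RSS readings."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return (ram_t2_mb - ram_t1_mb) / requests


def cost_per_worker(node_cost: float, c_max: int) -> float:
    """Node cost spread across the maximum number of workers it can hold."""
    if c_max <= 0:
        raise ValueError("c_max must be positive")
    return node_cost / c_max


# Hypothetical readings: RSS grew from 310 MB to 580 MB across 225 requests.
delta_m = leak_rate_mb_per_request(310.0, 580.0, 225)
print(f"leak rate: {delta_m:.2f} MB/request")

# Hypothetical node cost of 120.0 per month at two different Cmax values.
print(cost_per_worker(120.0, 6), cost_per_worker(120.0, 30))
```

A positive `delta_m` sustained over a long window is the signal that the worker needs periodic recycling.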
// 04 — memory profile trace

Watching a worker
bleed RAM.

A live memory profile of a Playwright worker scraping a React-heavy e-commerce site. Notice the steady climb in heap size until the garbage collector kicks in—or fails to.

Playwright/Node.js · heap snapshot · OOM warning
edge.dataflirt.io — live
CAPTURED
// worker-04 initialization
[00:00:00] process.start: 84 MB
[00:00:02] browser.launch: 142 MB

// processing batch 1 (100 URLs)
[00:05:12] heap.used: 310 MB page.dom_nodes: 42,105
[00:10:45] heap.used: 580 MB page.dom_nodes: 89,400

// memory leak detected in target SPA
[00:15:22] heap.used: 890 MB WARN: approaching limit
[00:16:01] v8.garbage_collection: triggered
[00:16:03] heap.used: 810 MB WARN: GC ineffective

// orchestrator intervention
[00:16:05] cgroup.memory.limit: 1024 MB
[00:16:10] process.kill: SIGTERM sent
[00:16:12] worker-04: terminated (OOM prevention)
[00:16:15] worker-04-respawn: active (84 MB)
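A trace like the one above can be turned into a growth figure with a few lines of parsing. The sketch below assumes the `[hh:mm:ss] heap.used: N MB` line shape shown in the capture; the regex and function name are illustrative, not part of any DataFlirt tooling.

```python
import re

# Matches lines such as "[00:05:12] heap.used: 310 MB ..."
HEAP_LINE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+heap\.used:\s+(\d+)\s*MB")


def parse_heap_samples(trace: str):
    """Return (elapsed_seconds, heap_mb) tuples for every heap.used line."""
    samples = []
    for line in trace.splitlines():
        match = HEAP_LINE.search(line)
        if match:
            hours, minutes, seconds, heap_mb = map(int, match.groups())
            samples.append((hours * 3600 + minutes * 60 + seconds, heap_mb))
    return samples


TRACE = """\
[00:05:12] heap.used: 310 MB page.dom_nodes: 42,105
[00:10:45] heap.used: 580 MB page.dom_nodes: 89,400
[00:15:22] heap.used: 890 MB WARN: approaching limit
[00:16:03] heap.used: 810 MB WARN: GC ineffective
"""

samples = parse_heap_samples(TRACE)
(t_first, mb_first), (t_peak, mb_peak) = samples[0], samples[2]
growth_per_min = (mb_peak - mb_first) / (t_peak - t_first) * 60
print(f"{growth_per_min:.1f} MB/min before GC")  # prints "57.0 MB/min before GC"
print(f"GC reclaimed {mb_peak - samples[3][1]} MB")  # prints "GC reclaimed 80 MB"
```

At roughly 57 MB per minute, the 80 MB reclaimed by the collection cycle buys well under two minutes of extra runtime.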
// 05 — the memory hogs

Where the RAM
actually goes.

The components of a scraping worker that consume the most memory, ranked by average footprint across DataFlirt's headless fleet.

Sample size: 10M+ sessions
Window: 30d trailing
Updated: 2026-05-19
01 Rendered DOM & CSSOM: ~400-800 MB · SPAs with infinite scroll hold massive node trees
02 JavaScript Heap (V8): ~200-500 MB · Target site memory leaks become your memory leaks
03 Browser Process Overhead: ~100-150 MB · The baseline cost of running Chromium
04 Network Response Buffers: ~50-100 MB · Holding large JSON/HTML payloads before parsing
05 Proxy & TLS State: ~10-30 MB · Connection pooling and certificate caches
// 06 — DataFlirt's architecture

Recycle early,
never wait for the OOM killer.

Memory leaks in modern web applications are inevitable. If a target site's React app leaks 2MB per page navigation, your long-running Playwright worker will eventually crash. DataFlirt doesn't try to fix the target's code. We use aggressive, telemetry-driven worker recycling. We track the Memory Usage Per Worker in real time, and when a worker crosses its dynamic high-water mark, we gracefully drain its active requests, terminate the process, and spin up a fresh one. The pipeline never halts, and we never hit the hard cgroup limits.

Worker Telemetry Stream

Live metrics from a single worker node in our US-East cluster.

worker.id node-7a-worker-12
type playwright-chromium
uptime 02h 14m 30s
memory.current 642 MB
memory.limit 1024 MB
leak_rate.est +1.2 MB / req
recycle.scheduled in 45 requests
status draining gracefully
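The fields in this stream are enough to estimate how many more requests a worker can take before it reaches a given memory mark. The sketch below is a simple linear projection. The 696 MB high-water mark is a hypothetical value picked so the result lines up with the 45 requests shown; the stream itself does not state the mark.

```python
import math
from typing import Optional


def requests_until(mark_mb: float, current_mb: float,
                   leak_mb_per_req: float) -> Optional[int]:
    """Requests that fit before memory is projected to reach mark_mb.

    Returns None when there is no measurable growth, 0 when already at the mark.
    """
    if leak_mb_per_req <= 0:
        return None
    headroom_mb = mark_mb - current_mb
    if headroom_mb <= 0:
        return 0
    return math.floor(headroom_mb / leak_mb_per_req)


CURRENT_MB = 642.0       # memory.current from the stream
HARD_LIMIT_MB = 1024.0   # memory.limit from the stream
LEAK_MB_PER_REQ = 1.2    # leak_rate.est from the stream
HIGH_WATER_MB = 696.0    # hypothetical mark, not stated in the stream

print(requests_until(HARD_LIMIT_MB, CURRENT_MB, LEAK_MB_PER_REQ))  # prints 318
print(requests_until(HIGH_WATER_MB, CURRENT_MB, LEAK_MB_PER_REQ))  # prints 45
```

Recycling at the lower mark rather than the hard limit leaves room for a single heavy page to spike memory without reaching the cgroup ceiling.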

// 07 — FAQ

Common
questions.

About memory profiling, preventing crashes, handling heavy SPAs, and how DataFlirt scales headless infrastructure efficiently.

What is a normal memory footprint for a scraper?
It depends entirely on the stack. A pure HTTP scraper (like Go's colly or Python's httpx) usually sits between 30–80 MB per worker. A headless browser (Playwright/Puppeteer) requires a baseline of ~150 MB just to launch, and can easily spike to 800 MB+ when rendering complex Single Page Applications or infinite scroll pages.
Why does my Playwright scraper use more memory over time?
You are likely experiencing a memory leak, but it's probably not your code. Modern web apps often keep references to detached DOM nodes or event listeners when navigating between views, so the garbage collector can never reclaim them. Because your headless browser is executing their JavaScript, their memory leak becomes your memory leak. The solution is to recycle the browser context periodically.
How do I prevent Out-Of-Memory (OOM) crashes?
Never run a worker indefinitely. Implement a maximum-requests-per-worker limit (e.g., restart the browser every 100 URLs). Additionally, block unnecessary resources like images, media, and third-party tracking scripts at the network level. This prevents the browser from allocating memory to render pixels you'll never look at.
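Both pieces of advice in this answer can be sketched in a few lines of Python. The handler is written against the shape of Playwright for Python's route objects (`route.request.resource_type`, `route.abort()`, `route.continue_()`) and would typically be registered with `page.route("**/*", handle_route)`. The blocked-type set and the `RequestBudget` class are illustrative assumptions, and blocking third-party trackers would additionally need a host list that is not shown here.

```python
# Resource types the scraper never needs to render (assumed set).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def handle_route(route) -> None:
    """Abort requests for blocked resource types; let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


class RequestBudget:
    """Counts processed URLs so the caller knows when to restart the browser."""

    def __init__(self, max_requests: int = 100) -> None:
        self.max_requests = max_requests
        self.used = 0

    def record(self) -> None:
        """Call once per processed URL."""
        self.used += 1

    @property
    def exhausted(self) -> bool:
        """True once the worker has reached its request cap."""
        return self.used >= self.max_requests
```

In a worker loop, the caller would check `budget.exhausted` after each URL and, when it turns true, close the browser, launch a fresh one, and start a new budget.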
Is it legal for sites to intentionally cause memory leaks to crash bots?
Generally, yes. This is a known anti-bot tactic often called "tarpitting." A server might feed an infinite stream of junk data or a recursive JavaScript loop specifically designed to exhaust a scraper's heap. Legally, they are just serving code; operationally, it's your responsibility to set strict timeouts and memory limits on your workers to survive it.
How does DataFlirt handle memory spikes on massive e-commerce pages?
We intercept and abort requests for non-essential assets (CSS, fonts, images) before they reach the rendering engine. For pages with massive DOMs, we extract the required JSON state directly from the page's script tags or XHR responses rather than waiting for the browser to build a 100,000-node DOM tree.
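Pulling embedded JSON state out of script tags needs no browser at all. The sketch below uses only Python's standard library and assumes the state lives in `<script type="application/json">` tags; real sites vary, and the sample markup, tag id, and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collects the parsed payload of every <script type="application/json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so buffer them.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip script bodies that are not valid JSON


def extract_json_state(html: str) -> list:
    """Return every JSON payload embedded in the page's script tags."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads


SAMPLE_HTML = (
    '<html><body><script id="state" type="application/json">'
    '{"sku": "A1", "price": 19.99}</script><div>rendered later</div></body></html>'
)
print(extract_json_state(SAMPLE_HTML))
```

Because the parser never builds a DOM tree, memory use scales with the size of the JSON payload rather than with the number of rendered nodes.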
Should I scale vertically or horizontally for memory-hungry scrapes?
Horizontally. Packing 50 headless browsers onto one 128GB RAM instance creates a huge blast radius: if one worker triggers a kernel panic or exhausts shared resources, you lose all 50. Distributing smaller batches of workers across many 8GB or 16GB nodes provides better fault isolation and more predictable memory management.
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h