
What is a Scraper Out-of-Memory Error?

A scraper out-of-memory (OOM) error occurs when a scraping process exhausts its allocated RAM, causing the OS or container runtime to forcefully terminate the worker. In headless browser scraping, the usual causes are detached DOM nodes, unclosed browser contexts, or large JSON payloads accumulating in memory before they are written to disk. When an OOM kill takes down a worker mid-run, the pipeline drops records, corrupts state, and requires expensive manual backfills.

OOM · Memory Leak · Headless Browsers · Node.js · Container Limits
// 02 — definitions

The silent
pipeline killer.

Why your scraping workers are crashing at 3 AM, and how memory leaks compound across long-running extraction jobs.


TL;DR

A scraper out-of-memory error happens when your Node.js or Python process exceeds its heap limit, or when a Docker container hits its cgroup memory ceiling. It's the most common failure mode for Playwright and Puppeteer pipelines, where unclosed pages and detached DOM elements silently consume gigabytes of RAM until the kernel's OOM killer steps in.

01 Definition & structure
An Out-of-Memory (OOM) error happens when a scraping script attempts to allocate more RAM than the system or runtime allows. In Node.js, V8 aborts the process with a fatal error ending in JavaScript heap out of memory. In Python, a failed allocation raises MemoryError. In containerized environments such as Kubernetes or Docker, the kernel's OOM killer terminates the process with SIGKILL, which surfaces as exit code 137 (128 + 9) and usually leaves no stack trace in the application logs.
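A minimal sketch of how a supervisor might read a worker's exit status, assuming a Linux host. The helper name describe_exit is hypothetical. Note that exit code 137 only says the process received SIGKILL; the OOM killer is the usual sender in a memory-limited container, but not the only possible one.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Explain a worker exit status, flagging probable OOM kills."""
    # Shells and container runtimes report "killed by signal N" as 128 + N;
    # Python's subprocess module reports the same event as -N.
    if returncode > 128:
        signum = returncode - 128
    elif returncode < 0:
        signum = -returncode
    else:
        return f"exited on its own with status {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        return f"exited with unrecognised status {returncode}"
    if signum == signal.SIGKILL:
        return f"killed by {name}: possible OOM kill, check kernel or container runtime logs"
    return f"killed by {name}"
```

Calling describe_exit(137) and describe_exit(-9) yields the same SIGKILL message, which is the cue to look at kernel or container runtime logs rather than the application log.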
02 The headless browser trap
Headless browsers are the leading cause of OOMs. A single Chromium instance can easily consume 200MB of RAM. If a scraper opens new tabs (pages) in a loop without properly closing the parent BrowserContext, the browser retains the cache, cookies, and detached DOM nodes for every visited URL. Over thousands of iterations, this silent accumulation exhausts the host machine's memory.
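One way to avoid that accumulation is to give each unit of work its own context and close it in a finally block. The sketch below assumes a browser object with the shape of Playwright's Python sync API (new_context, new_page, goto, content, close); the function name scrape_batch is hypothetical, and one context per URL is the most conservative choice rather than a recommendation.

```python
def scrape_batch(browser, urls):
    """Yield (url, html) pairs, using a fresh context that is always closed."""
    for url in urls:
        context = browser.new_context()  # isolated cookies, cache and storage
        try:
            page = context.new_page()
            page.goto(url)
            html = page.content()
        finally:
            # Closing the context also closes the pages that belong to it,
            # even when goto() or content() raised.
            context.close()
        yield url, html
```

Creating a context per URL costs some startup time; sharing one context across a small batch and closing it afterwards is a reasonable middle ground if throughput matters more.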
03 Data accumulation leaks
Beyond browser overhead, the most common architectural flaw causing OOMs is storing extracted data in memory. A script that pushes scraped JSON objects into a global results = [] array will eventually crash on large crawls. Production pipelines must stream data — writing records to an NDJSON file, a database, or an S3 buffer immediately after extraction, allowing the garbage collector to free the memory.
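As a rough illustration of the streaming pattern, the sketch below appends each record to an NDJSON file the moment it is produced. The function name stream_to_ndjson and the file path are assumptions for the example; a database or S3 writer would follow the same shape.

```python
import json


def stream_to_ndjson(records, path):
    """Append each record to an NDJSON file as soon as it is produced."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for record in records:  # records can be a lazy generator
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because records is consumed one item at a time, only the current record is referenced here; nothing in this function builds a results list.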
04 How DataFlirt handles it
We design our infrastructure around the assumption that all complex scraping tasks eventually leak memory. Our worker nodes are ephemeral. The orchestrator monitors the heap usage of every active container. When a worker reaches 80% of its memory limit, it is instructed to finish its current URL, flush its state to Redis, and gracefully exit. A fresh worker immediately spins up to take its place, ensuring 100% uptime with zero dropped records.
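The recycle decision can be sketched in a few lines. This is an illustration under assumptions, not DataFlirt's orchestrator: process and used_mb are hypothetical callables supplied by the caller, and the returned index stands in for the state that would be flushed to an external store before exiting.

```python
def run_until_recycle(urls, process, used_mb, limit_mb, threshold=0.8):
    """Process URLs until memory crosses the threshold; return the resume index."""
    for index, url in enumerate(urls):
        process(url)  # always finish the current URL first
        if used_mb() >= threshold * limit_mb:
            return index + 1  # checkpoint: the next URL a fresh worker should run
    return len(urls)
```

The caller persists the returned index, exits cleanly, and lets the replacement worker start from that position.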
05 Did you know?
Node.js does not size its heap to match your server. V8's default old-space limit depends on the Node.js version and the memory it detects: older 64-bit releases capped it at roughly 1.5GB, and newer releases typically still default to a few gigabytes at most. If you run a scraper on a 32GB AWS instance without passing the --max-old-space-size flag, Node can still crash with an OOM error while most of the server's memory sits untouched.
// 03 — memory math

How fast will
you crash?

Memory exhaustion is a function of leak rate per page and the number of pages processed per worker lifecycle. DataFlirt calculates this to set optimal worker recycling intervals.

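Time to OOM = $T = \dfrac{\mathrm{RAM}_{\text{limit}} - \mathrm{RAM}_{\text{base}}}{\mathrm{Leak}_{\text{page}} \times \mathrm{Rate}}$
If you leak 5MB per page at 2 pages/sec, memory grows by 10MB every second and a 2GB container dies in ~3 minutes. Capacity Planning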
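Node.js Heap Limit = max_old_space_size ≈ 1536 MB (older 64-bit default)
Historical V8 default; newer Node.js releases derive the default from detected memory. Either way, Node crashes at its heap limit even if the server has 64GB of RAM. V8 Engine Defaults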
DataFlirt Worker Lifespan = $L_{\max} = 0.8 \times \dfrac{\mathrm{RAM}_{\text{limit}}}{\mathrm{Leak}_{\text{avg}}}$
We gracefully recycle workers at 80% capacity to guarantee zero dropped records. Internal SLO
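The two formulas above translate directly into code. The sketch assumes memory in MB, a per-page leak in MB, and reads the worker lifespan as a page count; those units and the function names are assumptions, since the formulas do not state them.

```python
def time_to_oom_seconds(ram_limit_mb, ram_base_mb, leak_mb_per_page, pages_per_sec):
    """Seconds until the limit is reached at a constant per-page leak."""
    return (ram_limit_mb - ram_base_mb) / (leak_mb_per_page * pages_per_sec)


def max_pages_before_recycle(ram_limit_mb, leak_avg_mb_per_page, threshold=0.8):
    """Pages a worker may process before hitting the recycle threshold."""
    return threshold * (ram_limit_mb / leak_avg_mb_per_page)


# 5 MB leaked per page at 2 pages/sec in a 2048 MB container, zero baseline.
seconds = time_to_oom_seconds(2048, 0, 5, 2)
pages = max_pages_before_recycle(2048, 5)
```

With these inputs, seconds comes out at about 205 (roughly 3.4 minutes) and pages at about 328, so a leak this aggressive calls for recycling within a few hundred pages.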
// 04 — the crash trace

A memory leak
in real time.

A Node.js Playwright worker scraping a single-page application. Notice the heap growth over 40 minutes before the kernel intervenes.

Node.js · Playwright · OOM Killer
// worker init
heap.used: 124 MB

// 10 minutes (1,500 pages)
heap.used: 640 MB
browser.contexts: 45 // warning: unclosed contexts

// 30 minutes (4,500 pages)
heap.used: 1420 MB
gc.pause: 450ms // garbage collection thrashing

// 42 minutes
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
kernel: Out of memory: Killed process 1402 (node)
worker.status: CRASHED
pipeline.recovery: restarting worker 04
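The trace gives enough data points to estimate the leak rate. The sketch below takes the three (pages, heap MB) samples from the capture, treating worker init as page 0, and computes the average growth per page between consecutive samples; the function name leak_per_page is hypothetical.

```python
def leak_per_page(samples):
    """Average heap growth per page between consecutive (pages, heap_mb) samples."""
    return [
        (heap_b - heap_a) / (pages_b - pages_a)
        for (pages_a, heap_a), (pages_b, heap_b) in zip(samples, samples[1:])
    ]


# (pages processed, heap.used in MB) taken from the trace above
trace = [(0, 124), (1500, 640), (4500, 1420)]
rates = leak_per_page(trace)
```

For this capture the two intervals work out to roughly 0.34 and 0.26 MB per page, a small per-page leak that still exhausts the heap within the hour at this crawl rate.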
// 05 — leak sources

Where the RAM
actually goes.

The most common culprits for memory exhaustion in scraping pipelines, ranked by frequency across DataFlirt's incident post-mortems.

INCIDENTS ANALYSED: 1,200+ OOMs
PRIMARY STACK: Node.js / Python
UPDATED: 2026-05-19
01 Unclosed browser contexts
Playwright/Puppeteer · Pages closed, but context retains cache
02 In-memory data arrays
Architecture flaw · Accumulating records instead of streaming
03 Detached DOM elements
JS execution · Evaluating scripts that retain node references
04 Unbounded request queues
Network layer · Interception queues growing faster than processed
05 Heavy HTML parsers
Data extraction · Loading 10MB strings into Cheerio/JSDOM
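For the unbounded-queue item in the list above, the usual remedy is a hard cap plus backpressure. A minimal sketch using Python's standard queue module; the helper name enqueue_with_backpressure and the cap of 1000 are illustrative assumptions.

```python
import queue


def enqueue_with_backpressure(pending, item, timeout=1.0):
    """Try to enqueue; report failure instead of letting the queue grow."""
    try:
        pending.put(item, timeout=timeout)  # blocks while the queue is full
        return True
    except queue.Full:
        return False  # caller should slow down, retry later, or drop the item


# A bounded queue: producers stall once 1000 requests are waiting.
pending = queue.Queue(maxsize=1000)
```

The producer now pays for a slow consumer in latency rather than in RAM, which is the trade you want in a long-running crawl.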
// 06 — DataFlirt's architecture

Assume it leaks,

and recycle before it crashes.

You cannot write a perfectly memory-safe browser automation script for the modern web. Single-page applications leak memory by design, and headless browsers inherit those leaks. DataFlirt's infrastructure assumes every worker is slowly dying. We enforce strict memory bounds and proactively recycle worker processes when they hit 80% of their heap limit. State is persisted externally, so a recycled worker picks up the exact URL it left off without dropping a single record.

worker-memory-profile

Live telemetry of a DataFlirt extraction worker hitting its recycle threshold.

worker.id w-492-alpha
heap.limit 2048 MB
heap.used 1640 MB
action graceful_recycle
state.persisted true
records.dropped 0
uptime 4h 12m


// 07 — FAQ

Common
questions.

About memory leaks, container limits, headless browser overhead, and how DataFlirt prevents OOMs on massive crawls.

Why does my scraper work locally but OOM in Docker?
Locally, your process can consume all available system RAM and swap space. In Docker, it is constrained by cgroup limits. If you set a container memory limit of 1GB but don't configure Node.js or Python to respect that limit, the runtime will try to allocate more memory than the container allows, triggering an immediate kernel kill.
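A worker can discover its own ceiling at startup by reading the cgroup files the kernel exposes. The sketch below checks the standard cgroup v2 and v1 locations on Linux; the function name and the root parameter (there to make the lookup testable) are assumptions for the example.

```python
from pathlib import Path


def container_memory_limit_bytes(root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    candidates = [
        Path(root) / "memory.max",                        # cgroup v2
        Path(root) / "memory" / "memory.limit_in_bytes",  # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next layout
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        # Note: cgroup v1 reports "no limit" as a very large integer instead.
        return int(raw)
    return None
```

The result can then drive runtime settings, for example a heap cap or a recycle threshold set comfortably below the container limit.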
How do I fix Playwright memory leaks?
The most common mistake is closing page objects but leaving context objects open. Browser contexts isolate cookies and cache; if you don't call context.close(), that data stays in RAM until the browser itself is closed. Additionally, avoid holding long-lived references in Node.js to JSHandles that point at objects inside the browser, and dispose of handles once you are done with them.
Does increasing --max-old-space-size fix the problem?
No, it only delays it. If you have a memory leak, giving the process 8GB of RAM instead of 1.5GB just means it will crash in 12 hours instead of 2 hours. It also makes garbage collection pauses significantly longer, which can cause network timeouts. Fix the leak or implement worker recycling.
How does DataFlirt prevent OOMs on massive crawls?
We use a combination of streaming writes and ephemeral workers. Extracted data is immediately flushed to disk or S3 — never held in an array. Workers are treated as disposable; our orchestrator monitors heap usage and gracefully shuts down and replaces workers before they hit the critical threshold.
What's the difference between a memory leak and high memory usage?
High memory usage is a flat plateau — the scraper loads a 50MB JSON file, parses it, and RAM usage stays high but stable. A memory leak is a continuous upward slope — RAM usage grows by 2MB on every loop iteration and never drops, eventually guaranteeing a crash.
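The plateau-versus-slope distinction can be checked mechanically on a series of heap samples. This is a deliberately crude heuristic, not a profiler: it skips a warm-up window and asks whether the remaining samples still grow by more than a chosen amount per sample. The function name and both default parameters are assumptions.

```python
def looks_like_leak(heap_mb, warmup=2, min_growth_mb=1.0):
    """Heuristic: True if heap keeps climbing after the warm-up samples."""
    steady = heap_mb[warmup:]
    if len(steady) < 2:
        return False  # not enough data after warm-up to judge
    growth_per_sample = (steady[-1] - steady[0]) / (len(steady) - 1)
    return growth_per_sample > min_growth_mb


plateau = [10, 60, 60, 61, 60, 60]      # jumps once, then stays flat
leak = [100 + 2 * i for i in range(10)]  # grows 2 MB on every iteration
```

Here looks_like_leak(plateau) is False and looks_like_leak(leak) is True. Real heap curves are noisier because of garbage collection, so sample over a long enough window before trusting the verdict.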
Should I use Cheerio instead of Puppeteer to save memory?
Yes, absolutely. If the target data is in the initial HTML response, using a headless browser is a massive waste of RAM and CPU. Cheerio and lxml parse raw text and use a fraction of the memory footprint of a full Chromium instance. Reserve browsers strictly for JavaScript-rendered content.
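To show how little machinery static HTML needs, here is a sketch that pulls a page title out of raw markup with no browser involved. It uses Python's built-in html.parser as a stand-in for Cheerio or lxml so the example has no dependencies; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # text may arrive in several chunks


def extract_title(html):
    """Parse raw HTML text and return its title, without rendering anything."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

The whole job runs inside the scraper's own process on a plain string, with no Chromium instance, page, or context to leak.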