What is a Memory Leak in a Scraper?

A memory leak in a scraper occurs when an extraction process continuously allocates memory without releasing it, eventually exhausting available RAM. In headless browser pipelines, this typically stems from unclosed contexts, detached DOM nodes, or accumulating event listeners. Left unchecked, the operating system's OOM killer terminates the worker, causing silent data loss, dropped records, and cascading pipeline failures.

OOM · Garbage Collection · Headless Browsers · Node.js · Infrastructure
// 02 — definitions

RAM goes in,
nothing comes out.

The slow, silent killer of long-running extraction jobs. Why your scraper runs perfectly for an hour and then vanishes without a stack trace.

TL;DR

A memory leak happens when your scraper holds references to objects it no longer needs. Because those objects are still reachable, the garbage collector in Python or Node.js cannot free them. The process footprint grows until the OS intervenes with a SIGKILL, terminating the worker mid-extraction and dropping any unflushed data.

01 Definition & structure
A memory leak in a scraping context occurs when the application allocates memory for an operation (like opening a page, parsing a DOM, or storing a record) but fails to release that memory back to the system when the operation is complete. Because the garbage collector still sees active references to these objects, it cannot free them. Over time, the process's memory footprint grows linearly with the number of pages scraped until it hits the system limit.
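A minimal Python sketch of why reachability is the deciding factor. The `Page` class and the `results_cache` list are hypothetical stand-ins, not part of any scraping library, and the immediate release of the second page relies on CPython's reference counting.

```python
import gc
import weakref


class Page:
    """Hypothetical stand-in for a parsed page held in memory."""

    def __init__(self, html):
        self.html = html


# A module-level list that outlives every call to scrape(): the classic leak.
results_cache = []


def scrape(keep_reference):
    page = Page("<html>" + "x" * 10_000 + "</html>")
    if keep_reference:
        results_cache.append(page)  # still reachable after the function returns
    # A weak reference lets us observe the object without keeping it alive.
    return weakref.ref(page)


leaked = scrape(keep_reference=True)
released = scrape(keep_reference=False)
gc.collect()

print("leaked page still alive:", leaked() is not None)
print("released page still alive:", released() is not None)
```

The page appended to `results_cache` survives collection because something can still reach it; the other one is freed as soon as its last reference disappears.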
02 How it works in practice
Most scraping scripts start small and fast. As the loop iterates, unclosed resources accumulate. The garbage collector runs more frequently and takes longer to execute (GC thrashing), causing the scraper's CPU usage to spike and its extraction speed to plummet. Eventually, the operating system detects that the process has exceeded its allowed memory bounds and issues a SIGKILL. The script dies instantly, taking any unflushed data in memory with it.
03 Common causes in scraping
In modern web scraping, the most notorious culprits are headless browsers. Failing to call context.close() in Playwright or Puppeteer leaves the entire browser context (cookies, cache, DOM tree) in memory. Other common causes include appending extracted records to a global array instead of streaming them to disk, or attaching event listeners (like page.on('response')) inside a loop without removing them, creating thousands of duplicate listeners.
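One way the first two causes could be avoided, sketched in Python against Playwright's sync API. `scrape_titles`, `write_jsonl`, and the record shape are hypothetical names for illustration; the only Playwright calls assumed are `browser.new_context()`, `context.new_page()`, `page.goto()`, `page.title()`, and `context.close()`.

```python
import json


def scrape_titles(browser, urls):
    """Yield one record per URL, closing the browser context every time."""
    for url in urls:
        context = browser.new_context()
        try:
            page = context.new_page()
            page.goto(url)
            yield {"url": url, "title": page.title()}
        finally:
            # Runs on success, on exceptions, and when the generator is closed early.
            context.close()


def write_jsonl(records, path):
    """Stream records to disk one line at a time instead of building a list."""
    written = 0
    with open(path, "a", encoding="utf-8") as sink:
        for record in records:
            sink.write(json.dumps(record) + "\n")
            sink.flush()  # hand the line to the OS so it outlives this process
            written += 1
    return written
```

With a browser obtained from `sync_playwright()` and `p.chromium.launch()`, a call such as `write_jsonl(scrape_titles(browser, urls), "out.jsonl")` keeps at most one context and one record in memory at a time.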
04 How DataFlirt handles it
We treat memory leaks as an inevitability of running third-party web code. Our infrastructure relies on graceful worker recycling. Every extraction worker is monitored for RSS memory growth. When a worker approaches 85% of its container limit, the orchestrator stops assigning it new tasks. The worker finishes its active extraction, flushes its payload to the delivery queue, and shuts down cleanly. A fresh container takes its place, ensuring the pipeline runs indefinitely without OOM crashes.
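The recycling behaviour described above can be approximated in a few lines. This is a simplified single-process sketch, not DataFlirt's orchestrator: `run_worker`, `read_rss_mb`, `extract`, and `flush` are hypothetical names, and the memory reading is injected so the loop can be exercised without a real container.

```python
from collections import deque

DRAIN_THRESHOLD = 0.85  # fraction of the container limit, as stated above


def run_worker(queue, extract, read_rss_mb, limit_mb, flush):
    """Process URLs until the queue is empty or memory crosses the threshold."""
    buffer = []
    state = "idle"
    while queue:
        if read_rss_mb() >= DRAIN_THRESHOLD * limit_mb:
            state = "draining"  # take no new work; leftover URLs stay queued
            break
        # The check happens between tasks, so an extraction is never cut short.
        buffer.append(extract(queue.popleft()))
    flush(buffer)  # always flush before exit so no record dies in memory
    return state


if __name__ == "__main__":
    rss = {"mb": 1000.0}

    def fake_extract(url):
        rss["mb"] += 300.0  # pretend every page leaks 300 MB
        return {"url": url}

    delivered = []
    pending = deque(f"https://example.com/{i}" for i in range(5))
    print(run_worker(pending, fake_extract, lambda: rss["mb"], 2000.0, delivered.extend))
    print(len(delivered), "records delivered,", len(pending), "URLs left for the next worker")
```

In a real deployment the RSS reading might come from the container runtime or a library such as `psutil`; the URLs left in the queue are what a fresh worker picks up.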
05 The SPA inheritance problem
Many developers spend hours debugging their scraper's memory profile, only to discover their code is flawless. The leak is actually in the target website. Single-Page Applications (SPAs) are notorious for leaking memory during client-side routing. If your scraper navigates through 1,000 products on an SPA without forcing a hard page reload, the browser tab can eventually crash under the weight of the target site's own detached DOM nodes.
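A common mitigation is to throw the tab away on a schedule so the target site's leaked nodes go with it. The sketch below assumes Playwright's sync API in Python; `in_batches`, `scrape_with_rotation`, and the default of 50 pages per context are illustrative choices, not measured values.

```python
def in_batches(urls, size):
    """Split the URL list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def scrape_with_rotation(browser, urls, pages_per_context=50):
    """Yield a record per URL, discarding the browser context after each batch."""
    for batch in in_batches(urls, pages_per_context):
        context = browser.new_context()
        try:
            page = context.new_page()
            for url in batch:
                page.goto(url)  # full navigation rather than a client-side route change
                yield {"url": url, "title": page.title()}
        finally:
            context.close()  # drops whatever the target site leaked in this batch
```

Smaller batches cost more context start-up time but cap how much inherited leakage a single tab can accumulate.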
// 03 — the memory model

How fast will
the worker die?

Memory exhaustion is a function of the leak rate and the worker's memory headroom (its limit minus its baseline footprint). DataFlirt's orchestrator monitors the derivative of memory growth to preemptively recycle workers before they hit the ceiling.

Time to OOM: $T_{\text{oom}} = (M_{\text{limit}} - M_{\text{base}}) / L_{\text{rate}}$
Time until the OS kills the process. M is memory, L is leak per second. (Systems Engineering 101)
Leak Rate: $L_{\text{rate}} = \Delta M / \Delta t$
Measured over a rolling 5-minute window to smooth out GC pauses. (DataFlirt telemetry)
DataFlirt Worker Lifespan: $W_{\max} = \min(10000\ \text{pages},\ M_{\text{threshold}})$
Workers are gracefully recycled at 85% memory capacity or 10k pages. (Internal SLO)
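The three rules above translate fairly directly into code. This is a sketch under assumptions: the function and class names are hypothetical, memory is in MB, time is in seconds, and the rolling estimate simply compares the oldest and newest samples inside the window rather than fitting a line.

```python
from collections import deque


def time_to_oom(limit_mb, base_mb, leak_mb_per_s):
    """Seconds until the limit is reached: (limit - base) / leak rate."""
    if leak_mb_per_s <= 0:
        return float("inf")  # no growth means no OOM from this leak
    return (limit_mb - base_mb) / leak_mb_per_s


class LeakRateEstimator:
    """Estimate the leak rate (delta M / delta t) over a rolling time window."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, rss_mb) pairs, oldest first

    def add(self, timestamp_s, rss_mb):
        self.samples.append((timestamp_s, rss_mb))
        # Drop samples that have fallen out of the window.
        while self.samples and timestamp_s - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def rate(self):
        if len(self.samples) < 2:
            return None
        (t0, m0), (t1, m1) = self.samples[0], self.samples[-1]
        return (m1 - m0) / (t1 - t0) if t1 > t0 else None


def should_recycle(pages, rss_mb, limit_mb, max_pages=10_000, threshold=0.85):
    """Recycle at 85% of the memory limit or 10k pages, whichever comes first."""
    return pages >= max_pages or rss_mb >= threshold * limit_mb


if __name__ == "__main__":
    # Same leak, two limits: a bigger box only postpones the crash.
    print(time_to_oom(2048, 124, 0.5), "seconds at a 2 GiB limit")
    print(time_to_oom(8192, 124, 0.5), "seconds at an 8 GiB limit")
```

The 0.5 MB/s leak and 124 MB baseline in the usage lines are made-up inputs chosen only to show the scaling.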
// 04 — the OOM trace

Watching a worker
bleed to death.

A Node.js Playwright worker scraping a single-page application. The developer forgot to close browser contexts between iterations. Watch the heap grow until the kernel steps in.

Node.js · Playwright · SIGKILL
// worker initialization
worker.id: "ext-node-04" pid: 14092
mem.rss: 124 MB nominal

// iteration 100
pages.processed: 100
mem.rss: 450 MB growing
gc.pause: 12 ms

// iteration 850
pages.processed: 850
mem.rss: 1.8 GB critical
gc.pause: 840 ms // thrashing

// kernel intervention
dmesg: Out of memory: Killed process 14092 (node)
worker.status: SIGKILL (signal 9)
pipeline.state: job failed — 14 records lost in buffer
// 05 — leak sources

Where the bytes
get trapped.

Ranked by frequency across failed client scripts audited by DataFlirt engineers. Headless browser mismanagement accounts for the vast majority of fatal leaks.

Audited scripts: 1,200+
Primary cause: Playwright contexts
Updated: 2026-05-19
01 Unclosed browser contexts
Playwright / Puppeteer · Failing to call context.close()
02 Global array accumulation
Application logic · Appending records without flushing to disk
03 Detached DOM nodes
Browser engine · Holding JS references to elements after navigation
04 Unresolved Promises
Async flow · Hanging network requests preventing garbage collection
05 Event listener accumulation
Node.js · Adding page.on('request') inside a loop
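For the last entry in the list, the fix is usually to attach the listener once and detach it when done. A minimal sketch using Playwright's Python sync API (`page.on`, `page.remove_listener`, `response.status`); the function name `collect_statuses` is hypothetical.

```python
def collect_statuses(page, urls):
    """Navigate through `urls` with exactly one 'response' listener attached."""
    statuses = []

    def on_response(response):
        statuses.append(response.status)

    page.on("response", on_response)  # registered once, outside the loop
    try:
        for url in urls:
            page.goto(url)
    finally:
        # Detach so the closure (and the list it captures) can be collected.
        page.remove_listener("response", on_response)
    return statuses
```

Moving the `page.on(...)` call inside the `for` loop would register a new copy of the handler on every iteration, which is the accumulation pattern described above.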
// 06 — our architecture

Assume everything leaks,

and build the orchestrator to survive it.

You cannot write a perfectly leak-free browser automation script for the modern web. Many single-page applications leak memory on their own; your scraper just inherits their bugs. DataFlirt's orchestrator assumes every worker is dying from the moment it boots. We monitor each worker's resident set size (RSS) and its GC pause duration. When a worker crosses the 85% memory threshold, the orchestrator stops routing new URLs to it, lets it finish its current extraction, flushes the buffer to S3, and gracefully terminates the container. A fresh worker takes its place. Zero dropped records, zero pager alerts.

Worker lifecycle telemetry

Live memory profile of a DataFlirt extraction worker.

worker.id ext-pool-992
pages.extracted 4,102 nominal
mem.rss_current 1.2 GB
mem.rss_limit 2.0 GB
gc.pause_avg 14 ms healthy
leak.derivative +1.4 MB/page
action graceful_recycle_pending
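As a worked example, the figures in this panel can be turned into an estimate of how many more pages the worker could process before reaching the 85% drain threshold, if the leak rate held steady. The conversion of 1 GB to 1000 MB is an assumption; the panel does not say which unit convention it uses.

```python
GB_IN_MB = 1000.0  # assumption: decimal units

rss_current_mb = 1.2 * GB_IN_MB   # mem.rss_current
rss_limit_mb = 2.0 * GB_IN_MB     # mem.rss_limit
leak_mb_per_page = 1.4            # leak.derivative
drain_fraction = 0.85             # drain threshold quoted elsewhere on this page

drain_at_mb = drain_fraction * rss_limit_mb
headroom_mb = drain_at_mb - rss_current_mb
pages_until_drain = headroom_mb / leak_mb_per_page

print(f"drain threshold: {drain_at_mb:.0f} MB")
print(f"headroom: {headroom_mb:.0f} MB")
print(f"pages until drain at the current leak rate: {pages_until_drain:.0f}")
```

Under these assumptions the worker has roughly 500 MB of headroom, or about 357 more pages.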

// 07 — FAQ

Common
questions.

Common questions about diagnosing, preventing, and surviving memory leaks in high-volume scraping pipelines.

Why doesn't a try/catch block handle memory leaks?
Because an Out-Of-Memory (OOM) event is not an application-level exception. When your scraper exhausts available RAM, the Linux kernel's OOM killer steps in and sends a SIGKILL (Signal 9) to the process. It cannot be caught, intercepted, or handled by your code. The process simply ceases to exist.
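You can confirm this from inside a process. The Python sketch below tries to register a handler for SIGTERM and for SIGKILL using the standard `signal` module; `can_install_handler` is a hypothetical helper name, and the check only applies on POSIX systems where SIGKILL exists.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process register a handler for `signum`."""
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # OSError: the kernel refuses (SIGKILL, SIGSTOP).
        # ValueError: signal handlers can only be set from the main thread.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # restore whatever was there before
    return True


# SIGKILL only exists on POSIX systems, hence the getattr guard.
sigkill = getattr(signal, "SIGKILL", None)
if sigkill is not None:
    print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
    print("SIGKILL catchable:", can_install_handler(sigkill))
```

This is why graceful shutdown logic has to run before the limit is hit: once the kernel picks the process, no cleanup code gets a turn.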
How do I find the leak in my Node.js scraper?
Use the --inspect flag and take heap snapshots via Chrome DevTools. Take one snapshot after 10 pages, and another after 100 pages. Compare the two and look for objects allocated between snapshots that were never garbage collected. In Playwright, it's almost always an unclosed BrowserContext.
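The same two-snapshot comparison works in other runtimes. As an analogue rather than the Node.js workflow itself, here is a Python sketch using the standard `tracemalloc` module; `leaky_store` and `process_page` are hypothetical and exist only to give the diff something to find.

```python
import tracemalloc

leaky_store = []  # hypothetical leak: grows on every "page"


def process_page(index):
    leaky_store.append("record-%d-" % index + "x" * 2_000)


tracemalloc.start()

for i in range(10):
    process_page(i)
early = tracemalloc.take_snapshot()  # baseline after 10 pages

for i in range(10, 100):
    process_page(i)
late = tracemalloc.take_snapshot()   # comparison point after 100 pages
tracemalloc.stop()

# Rank source lines by how much memory they gained between the two snapshots.
growth = late.compare_to(early, "lineno")
for stat in growth[:3]:
    print(stat)
```

The line inside `process_page` should appear near the top of the report, which is the kind of signal to look for in a real scraper.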
Is it my code leaking, or the target website?
Often, it's the website. Modern Single-Page Applications (SPAs), such as those built with React or Next.js, frequently leak memory on their own. If you navigate within a single SPA page 500 times without doing a hard reload, the browser tab can crash. This is why you should isolate extractions into fresh browser contexts and close them frequently.
How does DataFlirt prevent data loss during an OOM?
We don't let it OOM. Our orchestrator tracks the RSS memory of every worker container. When a worker hits 85% of its limit, it is marked as "draining." It finishes its current URL, flushes its data buffer to the delivery sink, and shuts down gracefully. A new worker spins up to take the next URL in the queue.
Should I just increase the memory limit on my server?
No. Increasing the memory limit from 2GB to 8GB doesn't fix a memory leak; it just changes the crash interval from every 2 hours to every 8 hours. It's a temporary band-aid that wastes infrastructure budget. Fix the leak, or implement graceful worker recycling.
What's the difference between a memory leak and high memory usage?
High memory usage is static — your scraper loads a massive 500MB JSON file into memory, processes it, and frees it. The footprint is large, but stable. A memory leak is dynamic — the footprint grows continuously over time, page by page, until the system crashes.