What is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. For scraping pipelines, it bridges the gap between static HTTP fetching and full JavaScript execution, allowing you to render single-page applications, intercept network requests, and interact with the DOM just like a human user. However, its default configuration is notoriously loud, leaking dozens of automation flags that modern anti-bot systems detect instantly.

Headless · Node.js · CDP · Browser Automation · JavaScript Rendering
// 02 — definitions

Driving Chrome via code.

The standard tool for rendering JavaScript-heavy pages, and the baseline against which all anti-bot detection scripts are calibrated.

TL;DR

Puppeteer controls a real Chromium instance via the Chrome DevTools Protocol (CDP). It's essential for scraping SPAs and intercepting XHR requests, but out-of-the-box it leaks navigator.webdriver = true and other automation signatures. Running it at scale requires significant memory budgeting and fingerprint patching.

01 Definition & structure
Puppeteer is an official Google Node.js library that provides a high-level API over the Chrome DevTools Protocol (CDP). It allows developers to programmatically launch Chromium, navigate to URLs, evaluate JavaScript in the page context, and intercept network traffic. It is the foundation for thousands of scraping pipelines, automated testing suites, and PDF generation services.
02 How it works in practice
A typical Puppeteer script launches a browser instance, opens a new page, and navigates to a target URL. Unlike simple HTTP clients, Puppeteer waits for the DOM to load and JavaScript to execute. You can use page.waitForSelector() to ensure dynamic content has rendered, and page.evaluate() to run extraction logic directly inside the browser context, returning the structured data back to your Node.js environment.
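The flow described above can be sketched roughly as follows. This is a minimal illustration, not a production scraper: `scrapeText`, `cleanText`, the 15-second timeout, and the `domcontentloaded` wait condition are assumptions made for the example, and the URL and selector are supplied by the caller.

```javascript
// Collapse runs of whitespace so extracted text is easy to compare and store.
function cleanText(raw) {
  return String(raw).replace(/\s+/g, ' ').trim();
}

// Launch -> navigate -> wait for rendered content -> extract inside the page.
// `url` and `selector` are placeholders supplied by the caller.
async function scrapeText(url, selector) {
  // Imported lazily so cleanText() can be used and tested without a browser installed.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Block until the dynamic content exists in the DOM (rejects on timeout).
    await page.waitForSelector(selector, { timeout: 15000 });
    // Runs in the browser context; only serializable values come back to Node.js.
    const texts = await page.evaluate(
      (sel) => Array.from(document.querySelectorAll(sel), (n) => n.textContent || ''),
      selector
    );
    return texts.map(cleanText);
  } finally {
    // Always release the browser process, even when navigation or extraction fails.
    await browser.close();
  }
}
```

Calling `scrapeText('https://example.com', 'h1')` would resolve to an array of cleaned strings. Keeping the DOM access inside `page.evaluate()` and the post-processing in Node.js makes the post-processing testable without a browser.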
03 The fingerprinting problem
Puppeteer was built for testing, not stealth. By default, it broadcasts its automated nature loudly: navigator.webdriver is true, the headless user-agent contains "HeadlessChrome", and the CDP session itself leaves side effects that in-page scripts can probe for. Modern anti-bot systems like Cloudflare and DataDome check for these exact signatures and will issue a 403 Forbidden or a CAPTCHA challenge before the page even finishes loading.
04 How DataFlirt handles it
We maintain support for Puppeteer to run legacy client extraction scripts, but our infrastructure routes these jobs through a custom, managed browser pool. We don't rely on fragile JavaScript stealth plugins. Instead, we use Chromium binaries patched at the C++ level to remove CDP leaks and webdriver flags entirely. This ensures that even when a client submits a basic Puppeteer script, it executes with the fingerprint of a standard residential Chrome user.
05 Did you know?
The core team that built Puppeteer at Google eventually moved to Microsoft to build Playwright. This is why the two APIs look remarkably similar, but Playwright benefits from years of hindsight regarding browser context isolation, auto-waiting, and cross-browser architecture.
// 03 — resource math

How expensive is a browser?

Headless browsers are massive resource hogs compared to HTTP clients. DataFlirt's fleet scheduler uses these baseline calculations to pack containers without triggering Out-Of-Memory (OOM) kills during concurrent page loads.

Memory per worker: M = Base + (Tabs × DOM_Size)
Base ≈ 150 MB; each tab adds 50–200 MB depending on media and JS heap. (Source: fleet telemetry)

CPU utilization: C = JS_Execution + Layout_Paint
Blocking ads, fonts, and images reduces C by ~40%. (Source: Chrome DevTools profiling)

DataFlirt concurrency limit: W = (Node_RAM / M_peak) × 0.85
The 15% safety buffer for memory spikes prevents cascading container crashes. (Source: internal SLO)
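As a worked example of the memory and concurrency formulas, the sketch below plugs in the baseline figures quoted above. The 16 GB node size and the worst case of 4 tabs per worker are assumptions chosen for illustration, as is rounding W down to a whole number of workers.

```javascript
// M = Base + (Tabs × DOM_Size), all values in MB.
function memoryPerWorker(baseMb, tabs, perTabMb) {
  return baseMb + tabs * perTabMb;
}

// W = (Node_RAM / M_peak) × 0.85, rounded down to whole workers.
function concurrencyLimit(nodeRamMb, peakMb, safety = 0.85) {
  return Math.floor((nodeRamMb / peakMb) * safety);
}

// Worst case: 150 MB base plus 4 tabs at the 200 MB upper bound.
const peak = memoryPerWorker(150, 4, 200);
// Assumed node size: 16 GB of RAM.
const workers = concurrencyLimit(16 * 1024, peak);
console.log({ peak, workers }); // → { peak: 950, workers: 14 }
```

Sizing against the peak rather than the average is what keeps the packing safe: an average-based estimate would admit more workers, but a burst of heavy pages could then push the container past its limit.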
// 04 — cdp trace

Intercepting XHR before it renders.

A live Puppeteer trace intercepting a GraphQL pricing endpoint. We block images and fonts to save bandwidth, then capture the JSON response directly without parsing the DOM.

Node.js · CDP · Network Interception
edge.dataflirt.io — live
CAPTURED
// init browser context
browser.launch: headless: true, args: ['--disable-gpu']
page.setRequestInterception: true

// request routing
req.resourceType: "image" → abort()
req.resourceType: "font" → abort()
req.url: "https://api.target.com/graphql" → continue()

// response capture
res.status: 200
res.headers['content-type']: "application/json"
data.extracted: 42 records

// teardown
page.close: success
memory.freed: 118 MB
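A script that would produce a trace like the one above might look like this. It is a sketch under stated assumptions: `routeRequest`, `isTargetResponse`, and `captureJson` are illustrative names, and the page URL and endpoint prefix are passed in by the caller rather than taken from any real target.

```javascript
const BLOCKED_TYPES = new Set(['image', 'font']);

// Pure routing decision, mirroring the trace: abort heavy assets, pass everything else.
function routeRequest(resourceType) {
  return BLOCKED_TYPES.has(resourceType) ? 'abort' : 'continue';
}

// True for the JSON response we want to capture. `endpoint` is a caller-supplied URL prefix.
function isTargetResponse(url, status, contentType, endpoint) {
  return url.startsWith(endpoint) && status === 200 && /application\/json/.test(contentType || '');
}

async function captureJson(pageUrl, endpoint) {
  // Imported lazily so the pure helpers above work without a browser installed.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch({ headless: true, args: ['--disable-gpu'] });
  try {
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      // Another handler may already have resolved this request.
      if (req.isInterceptResolutionHandled()) return;
      if (routeRequest(req.resourceType()) === 'abort') req.abort();
      else req.continue();
    });
    // Start waiting before navigation begins so the response cannot be missed.
    const [res] = await Promise.all([
      page.waitForResponse((r) =>
        isTargetResponse(r.url(), r.status(), r.headers()['content-type'], endpoint)
      ),
      page.goto(pageUrl, { waitUntil: 'domcontentloaded' }),
    ]);
    // Parse the payload straight from the network layer; the DOM is never touched.
    return await res.json();
  } finally {
    await browser.close();
  }
}
```

Once interception is enabled, every request must be either aborted or continued, otherwise the page stalls. That is why the handler has an explicit `else` branch.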
// 05 — detection vectors

Why default Puppeteer gets blocked.

Anti-bot systems look for specific CDP artifacts and JavaScript runtime leaks that prove the browser is being driven by automation rather than a human.

BLOCK RATE: 99% on default
VENDORS: Cloudflare, DataDome
UPDATED: 2026-05-19
01 navigator.webdriver flag
100% confidence · true whenever Chrome is launched with automation enabled, headless or not

02 CDP runtime artifacts
High confidence · cdc_adoQpoasnfa76pfcZLmcfl_ globals are a ChromeDriver (Selenium) artifact; Puppeteer sessions are caught through side effects of the CDP Runtime domain instead

03 Headless User-Agent
High confidence · Contains 'HeadlessChrome' string

04 Missing plugins/languages
Medium confidence · Empty arrays in headless environments

05 WebGL vendor strings
Medium confidence · Mesa / SwiftShader software rendering
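To make these vectors concrete, here is a minimal sketch of the kind of check a detection script could run. It is an illustration only, not any vendor's actual logic: `automationSignals` and its signal names are invented for this example, and it covers only the navigator-based vectors (WebGL and CDP probing are left out).

```javascript
// Returns the names of automation signals found in a navigator-like object.
function automationSignals(nav) {
  const signals = [];
  // Vector 01: the webdriver flag.
  if (nav.webdriver === true) signals.push('webdriver');
  // Vector 03: the headless user-agent string.
  if (/HeadlessChrome/.test(nav.userAgent || '')) signals.push('headless-ua');
  // Vector 04: empty plugin and language lists.
  if (!nav.plugins || nav.plugins.length === 0) signals.push('no-plugins');
  if (!nav.languages || nav.languages.length === 0) signals.push('no-languages');
  return signals;
}

// Mock objects standing in for the browser's navigator.
const defaultHeadless = {
  webdriver: true,
  userAgent: 'Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/124.0.0.0 Safari/537.36',
  plugins: [],
  languages: [],
};
const regularUser = {
  webdriver: false,
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36',
  plugins: [{ name: 'PDF Viewer' }],
  languages: ['en-US', 'en'],
};
console.log(automationSignals(defaultHeadless)); // → [ 'webdriver', 'headless-ua', 'no-plugins', 'no-languages' ]
console.log(automationSignals(regularUser)); // → []
```

To audit your own setup, the same function can be serialized into the page with `` page.evaluate(`(${automationSignals})(navigator)`) ``, which returns the list of signals the real browser exposes.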
// 06 — operational scale

Browsers are heavy, so we only use them when HTTP fails.

Running Puppeteer for every request is a rookie mistake that inflates cloud bills by 10x. At DataFlirt, we use headless browsers strictly as a fallback or for token harvesting. If the data is in the initial HTML or a discoverable API, we use fast, concurrent HTTP clients. When we do need a browser, we route it through our patched Chromium builds that strip CDP leaks at the binary level, bypassing the need for fragile stealth plugins.

Puppeteer Job Profile

Resource consumption for a single SPA render job.

job.type spa_render
browser.engine chromium_124_patched
memory.peak 184 MB
network.blocked images, media, fonts
stealth.status binary_patch
execution.time 1.8s
bot_score 0.04

// 07 — FAQ

Common questions.

About Puppeteer, Playwright, stealth plugins, and how to scale browser automation without bankrupting your infrastructure.

Should I use Puppeteer or Playwright?
For new scraping projects, use Playwright. It was built by the same core team that created Puppeteer but offers better cross-browser support (WebKit, Firefox), auto-waiting mechanisms, and a more robust architecture for handling multiple contexts. Puppeteer remains heavily used in legacy codebases, but Playwright is the modern standard.
Does puppeteer-extra-plugin-stealth still work?
Barely. Stealth plugins rely on JavaScript injections to overwrite properties like navigator.webdriver. Modern anti-bot vendors execute their checks before these injections run, or use iframe sandboxing to bypass the overrides. True stealth requires patching the Chromium binary itself, which is how DataFlirt operates.
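One reason JavaScript-level overrides are fragile can be shown without a browser. The sketch below is a simplified illustration, not the actual code of any stealth plugin or anti-bot vendor: `naivePatch`, `looksPatched`, and `FakeNavigator` are hypothetical names, and the fake class stands in for Chrome, where `webdriver` is defined on `Navigator.prototype` rather than on the `navigator` instance.

```javascript
// What a naive JS-level patch does: shadow the getter on the navigator instance.
function naivePatch(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined, configurable: true });
}

// What a detector can do in response: an own `webdriver` property on the
// instance does not exist in an unmodified browser, so its presence is a signal.
function looksPatched(nav) {
  return Object.prototype.hasOwnProperty.call(nav, 'webdriver');
}

// Stand-in for the browser: the flag is a getter on the prototype.
class FakeNavigator {
  get webdriver() {
    return true;
  }
}

const nav = new FakeNavigator();
const before = { value: nav.webdriver, patched: looksPatched(nav) };
naivePatch(nav);
const after = { value: nav.webdriver, patched: looksPatched(nav) };
console.log(before, after);
// → { value: true, patched: false } { value: undefined, patched: true }
```

In a real script such a patch would be injected with `page.evaluateOnNewDocument()`. The override hides the original value but leaves its own trace, which is the general weakness of fixing fingerprints from inside the page.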
Is it legal to scrape with a headless browser?
The legality of scraping depends on the data accessed and the terms of service, not the tool used. Whether you use curl, Puppeteer, or a manual copy-paste, accessing public data is generally lawful. However, using a browser to bypass technical barriers or execute authenticated actions carries different legal weight.
How do you handle memory leaks in Puppeteer?
Never keep a single browser instance alive indefinitely. The standard pattern is to launch a browser, process a batch of URLs using isolated browser contexts, and then kill the entire browser process after a set number of requests (e.g., 100). This guarantees memory is reclaimed by the OS.
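The recycling pattern can be sketched like this. Treat it as a starting point under assumptions: `crawl`, `chunk`, and the caller-supplied `handlePage(page, url)` callback are hypothetical names, and `browser.createBrowserContext()` is the method name in recent Puppeteer releases (older versions expose `createIncognitoBrowserContext()` instead).

```javascript
// Split work into fixed-size batches; each batch gets a fresh browser process.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// `handlePage` is a hypothetical caller-supplied async function: (page, url) => result.
async function crawl(urls, handlePage, batchSize = 100) {
  // Imported lazily so chunk() can be used and tested without a browser installed.
  const { default: puppeteer } = await import('puppeteer');
  const results = [];
  for (const batch of chunk(urls, batchSize)) {
    const browser = await puppeteer.launch({ headless: true });
    try {
      for (const url of batch) {
        // Isolated context: cookies and cache are not shared between URLs.
        const context = await browser.createBrowserContext();
        try {
          const page = await context.newPage();
          results.push(await handlePage(page, url));
        } finally {
          // Closing the context also closes its pages.
          await context.close();
        }
      }
    } finally {
      // Ending the browser process hands all of its memory back to the OS.
      await browser.close();
    }
  }
  return results;
}
```

The `try`/`finally` pairs matter as much as the batching: a context or browser that is not closed after an exception is exactly the kind of leak the pattern is meant to prevent.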
How does DataFlirt scale browser automation?
We separate the orchestration layer from the browser layer. Our workers request a WebSocket endpoint from a managed browser pool. The pool handles instance lifecycle, proxy routing, and binary patching. If a browser crashes, the worker just reconnects to a new one.
Can I block images to speed up scraping?
Yes, and you should. Using page.setRequestInterception(true) to abort requests for images, stylesheets, and fonts reduces bandwidth by up to 80% and significantly lowers CPU usage. Just ensure the target site doesn't rely on image load events to trigger the data rendering you need.