What is Playwright?

Playwright is an open-source automation library developed by Microsoft that controls Chromium, WebKit, and Firefox through a single API. Unlike older WebDriver-based tools, which send each command as a separate HTTP request, it keeps a persistent connection to the browser (the Chrome DevTools Protocol for Chromium, and comparable protocols for Firefox and WebKit), enabling bidirectional event listening and low-latency DOM manipulation. For modern scraping pipelines, it is a common default engine for executing JavaScript-heavy targets, though its out-of-the-box fingerprint is readily flagged by enterprise anti-bot systems.

Headless · CDP · Automation · JavaScript Rendering · Browser Engine

// 02 — definitions

Beyond the
DOM.

Why the shift from HTTP fetching to full browser automation became mandatory, and how Playwright won the infrastructure war.

TL;DR

Playwright drives real browsers to render React, Vue, and Angular applications before extracting data. It has displaced Puppeteer and Selenium in many scraping stacks by offering cross-browser support, auto-waiting, and network interception out of the box. However, running Playwright at scale adds substantial compute overhead (roughly 10x to 50x the cost of a raw HTTP GET) and requires extensive fingerprint patching to survive Cloudflare or DataDome.

01 Definition & structure

Playwright is a library for Node.js, Python, Java, and .NET that provides a high-level API for controlling browsers, headless or headed. It talks directly to the browser engine over a persistent connection (a local pipe or a WebSocket): the Chrome DevTools Protocol (CDP) for Chromium, and equivalent remote-debugging protocols for its patched WebKit and Firefox builds. This architecture allows it to intercept network traffic, mock responses, and execute JavaScript in the page context, with results returned asynchronously to the client script.

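02 How it works in practice

In a scraping pipeline, Playwright is deployed to handle Single Page Applications (SPAs) where the initial HTTP response is little more than an empty <div id="root"></div>. Playwright loads the page, executes the React/Vue bundles, waits for the page to settle (typically until the network goes idle or a target element appears), and then serializes the fully rendered DOM back to the scraper for extraction. Note that Playwright's auto-waiting applies to element actions such as clicks; waiting for network activity to settle is a separate, explicit step.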
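The render-then-serialize flow described above might look like the following minimal sketch, written against Playwright's Python sync API. The helper names `is_empty_spa_shell` and `render_html` are hypothetical, the empty-root check is only a rough heuristic that assumes a React-style mount point, and the sketch waits for a caller-supplied selector rather than for network idle, which is one possible choice among several.

```python
import re

# Heuristic (assumption): an SPA shell exposes an empty mount point
# such as <div id="root"></div> or <div id="app"></div>.
_EMPTY_ROOT = re.compile(
    r'<div\s+id=["\'](?:root|app)["\']\s*>\s*</div>', re.IGNORECASE
)


def is_empty_spa_shell(html: str) -> bool:
    """Return True when the raw HTML contains an empty root/app mount point."""
    return bool(_EMPTY_ROOT.search(html))


def render_html(url: str, ready_selector: str, timeout_ms: int = 15_000) -> str:
    """Load `url` in headless Chromium and return the rendered DOM as HTML."""
    # Imported lazily so the heuristic above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # Wait for an element that only exists after the client-side render.
            page.wait_for_selector(ready_selector, timeout=timeout_ms)
            return page.content()  # serialized, fully rendered DOM
        finally:
            browser.close()
```

A scraper could call `is_empty_spa_shell` on the raw HTTP response first and only pay for `render_html` when the shell is empty.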

03 The fingerprinting problem

By default, Playwright is easy to identify as automation. It sets navigator.webdriver to true, leaves automation-specific artifacts in the page's JavaScript environment, and can present a TLS and header profile that differs from a stock consumer Chrome install. When you point a vanilla Playwright script at a Cloudflare-protected site, the edge and its challenge script read these signals and will often serve a 403 Forbidden or a repeating CAPTCHA challenge instead of the page content.

04 How DataFlirt handles it

We treat Playwright as a rendering engine, not a stealth tool. Our infrastructure runs custom-compiled Chromium binaries that have automation flags removed at the C++ level. We manage the TLS handshake and proxy routing outside of Playwright entirely. When a pipeline requires JS rendering, the request is routed to a warm, pre-patched browser context, ensuring the execution environment is indistinguishable from a standard consumer browser.

05 Playwright vs Puppeteer

Playwright was built by the same core team that originally created Puppeteer at Google, before they moved to Microsoft. Puppeteer remains centered on Chromium, whereas Playwright treats WebKit and Firefox as first-class targets. More importantly for scraping, Playwright makes "Browser Contexts" a core part of its API: isolated, incognito-like sessions that share a single browser process but have separate cookies and cache, drastically reducing the memory overhead of concurrent scraping.
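To make the Browser Context idea concrete, here is a small sketch that visits each URL in its own context while reusing one browser process. The function name `titles_in_isolated_contexts` is hypothetical, and the loop is sequential for readability; a production scraper would more likely run contexts concurrently.

```python
from typing import Any, Dict, List


def titles_in_isolated_contexts(browser: Any, urls: List[str]) -> Dict[str, str]:
    """Visit each URL in its own browser context and return {url: page title}.

    `browser` is expected to be a Playwright `Browser` (sync API). Every
    context gets its own cookies and cache but shares the browser process.
    """
    titles: Dict[str, str] = {}
    for url in urls:
        context = browser.new_context()  # fresh, incognito-like session
        try:
            page = context.new_page()
            page.goto(url)
            titles[url] = page.title()
        finally:
            context.close()  # releases this session's pages, cookies, and cache
    return titles


# Usage with a real browser (requires Playwright and its browsers installed):
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       print(titles_in_isolated_contexts(browser, ["https://example.com"]))
#       browser.close()
```

Because the function only relies on `new_context`, `new_page`, `goto`, `title`, and `close`, it can be exercised with a stub browser in unit tests.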

// 03 — the compute cost

What does rendering
actually cost?

Running a full browser costs roughly 10x to 50x more in compute and time than a raw HTTP GET. DataFlirt's fleet scheduler uses these models to decide when to fall back to Playwright and when to stick to raw HTTP.

Memory per browser instance = M_base + (Tabs × M_tab) + DOM_Leak

Chromium base (M_base) is ~150MB. Each active tab (M_tab) adds 30–80MB depending on DOM size. Source: DataFlirt infrastructure telemetry
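As a worked example of the memory model, the sketch below plugs in the figures quoted above (~150MB base, 30–80MB per tab). The function name is hypothetical, and because the page does not quantify DOM_Leak, it defaults to zero here as an explicit assumption.

```python
def estimate_memory_mb(
    tabs: int,
    mb_per_tab: float,
    base_mb: float = 150.0,  # Chromium base, per the figure quoted above
    leak_mb: float = 0.0,    # DOM_Leak: not quantified in the text, assumed 0
) -> float:
    """Memory = M_base + (Tabs x M_tab) + DOM_Leak, in megabytes."""
    if tabs < 0:
        raise ValueError("tabs must be non-negative")
    return base_mb + tabs * mb_per_tab + leak_mb


# 20 tabs at the low and high ends of the 30-80MB range.
low_mb = estimate_memory_mb(tabs=20, mb_per_tab=30)   # 150 + 20 * 30 = 750.0
high_mb = estimate_memory_mb(tabs=20, mb_per_tab=80)  # 150 + 20 * 80 = 1750.0
```

The upper bound of 1750MB for 20 tabs lines up with the "at least 2GB" provisioning guidance given in the FAQ once some headroom is added.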

Execution latency = T_network + T_render + T_idle_wait

Waiting for the NetworkIdle event typically adds 500ms–2s of artificial delay to every scrape. Source: Playwright lifecycle events

DataFlirt render ratio = Req_playwright / Req_total

Target ratio is < 0.15. We only render when the target data is provably absent from the raw HTML. Source: internal SLO
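The render ratio is simple enough to compute directly. The sketch below is illustrative only: the function names and the request counts in the example are made up, and only the 0.15 threshold comes from the text.

```python
TARGET_RENDER_RATIO = 0.15  # threshold quoted above


def render_ratio(playwright_requests: int, total_requests: int) -> float:
    """Render ratio = Req_playwright / Req_total."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= playwright_requests <= total_requests:
        raise ValueError("playwright_requests must be between 0 and total_requests")
    return playwright_requests / total_requests


def within_target(ratio: float, target: float = TARGET_RENDER_RATIO) -> bool:
    """True when the ratio is strictly below the target, matching '< 0.15'."""
    return ratio < target


# Hypothetical counts: 120 rendered requests out of 1,000 total.
example_ratio = render_ratio(120, 1000)
example_ok = within_target(example_ratio)
```

Note that the comparison is strict, so a pipeline sitting exactly at 0.15 would count as over target under this reading.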

// 04 — CDP trace

Intercepting network
requests on the fly.

A live trace of Playwright blocking images and analytics scripts via the Chrome DevTools Protocol (CDP) to reduce bandwidth and speed up the DOM ready state.

CDP · Network Interception · Chromium

edge.dataflirt.io — live

CAPTURED

// init browser context
cdp.connect: "ws://127.0.0.1:9222/devtools/browser/..."
context.new: success "viewport: 1920x1080"

// route interception rules
route.add: "**/*.{png,jpg,jpeg,webp,gif}" abort
route.add: "**/*analytics*.js" abort

// page navigation
page.goto: "https://target-ecommerce.com/category/laptops"
net.request: "GET /category/laptops" 200 OK
net.request: "GET /assets/hero-banner.jpg" ABORTED (blocked by client)
net.request: "GET /api/v1/products?cat=laptops" 200 OK

// lifecycle events
event.domcontentloaded: 412ms
event.networkidle: 1205ms

// extraction
page.evaluate: "document.querySelectorAll('.product-card').length"
result: 24 extracted
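The trace above corresponds roughly to the following sketch in Playwright's Python sync API. It is an approximation under stated assumptions: the helper names are hypothetical, and instead of the extension-based glob rules shown in the trace it filters on the request's resource type plus an "analytics" substring, which is a related but not identical rule.

```python
from typing import Any

BLOCKED_RESOURCE_TYPES = {"image"}
BLOCKED_SCRIPT_FRAGMENTS = ("analytics",)


def should_block(url: str, resource_type: str) -> bool:
    """Block images, plus scripts whose URL mentions an analytics fragment."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    return resource_type == "script" and any(
        fragment in url.lower() for fragment in BLOCKED_SCRIPT_FRAGMENTS
    )


def handle_route(route: Any) -> None:
    """Playwright route handler: abort blocked requests, pass the rest through."""
    request = route.request
    if should_block(request.url, request.resource_type):
        route.abort()
    else:
        route.continue_()


def count_product_cards(url: str) -> int:
    """Navigate with interception enabled and count `.product-card` elements."""
    # Imported lazily so the filtering logic is usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            context = browser.new_context(viewport={"width": 1920, "height": 1080})
            context.route("**/*", handle_route)  # intercept every request
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            return page.evaluate(
                "document.querySelectorAll('.product-card').length"
            )
        finally:
            browser.close()
```

Keeping the decision in a pure function (`should_block`) makes the blocking rules testable without launching a browser.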

// 05 — execution bottlenecks

Where Playwright
loses milliseconds.

The primary latency contributors when running headless Chromium at scale. Numbers reflect median execution times across our residential proxy pool.

Sample size: 1.2M renders
Environment: Linux / Xvfb
Updated: 2026-05-19

01 · Proxy TLS negotiation · network bound · Residential proxy handshakes add massive latency
02 · JavaScript execution · CPU bound · Parsing and executing heavy React/Vue bundles
03 · NetworkIdle waiting · logic bound · Waiting for background XHRs to settle
04 · Browser context init · I/O bound · Spinning up isolated profiles per request
05 · CDP serialization · memory bound · Passing large DOM snapshots over WebSockets

// 06 — our stack

Patched at the source,

not just at the JavaScript layer.

Standard Playwright stealth plugins rely on JavaScript to overwrite navigator properties, which modern anti-bot scripts readily detect by inspecting the prototype chain. DataFlirt compiles custom Chromium binaries that strip automation flags at the C++ level. We run these patched browsers on bare-metal nodes and proxy the CDP connection, so the client script never executes on the same machine as the browser. This physical separation is intended to keep the automation client's own timing behavior, such as Node.js event-loop stalls, from showing up in timing-based detection.

Playwright Worker Node

Live telemetry from a DataFlirt rendering node executing a JS-heavy target.

engine.binary df-chromium-v124.0.6367.60

stealth.level native C++ patch

navigator.webdriver undefined

cdp.proxy active · 12ms latency

memory.usage 1.4 GB / 32 GB · healthy

active.contexts 14 concurrent

bot.score 0.04 · human

// 07 — FAQ

Common
questions.

Common questions about Playwright scaling, stealth modes, resource consumption, and DataFlirt's rendering infrastructure.

Why use Playwright instead of Selenium?

Selenium's classic WebDriver protocol sends each command to the browser driver as a separate HTTP request. This adds latency and makes bidirectional communication (like intercepting network requests) cumbersome. Playwright keeps a persistent connection to the browser (the Chrome DevTools Protocol, CDP, in the case of Chromium), allowing it to listen to network events, modify requests on the fly, and execute JavaScript without a per-command HTTP round trip.

Can Playwright bypass Cloudflare or DataDome?

Out of the box, no. Playwright sets navigator.webdriver to true and leaves many other automation fingerprints. While plugins like playwright-stealth help against basic checks, enterprise WAFs detect them via JS prototype chain inspection. Bypassing tier-1 anti-bots requires compiling custom browser binaries and managing TLS fingerprints at the network layer.

How much RAM does a Playwright instance need?

A bare Chromium instance takes ~150MB. Each open page (tab) adds 30–80MB depending on DOM complexity. If you are running 20 concurrent tabs, expect to provision at least 2GB of RAM (150MB + 20 × 80MB ≈ 1.75GB at the upper bound, plus headroom). Memory leaks are common in long-running Playwright processes, so close contexts as soon as you are done with them and restart the browser periodically.
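One way to apply the "close contexts, restart periodically" advice is a small wrapper that recycles the browser after a fixed number of contexts. This is a sketch, not a description of DataFlirt's implementation: the `RecyclingBrowser` class is hypothetical, and the default of 200 contexts per browser is an arbitrary placeholder to tune against your own memory telemetry.

```python
from typing import Any, Callable


class RecyclingBrowser:
    """Restart the underlying browser after a fixed number of contexts.

    `launch` is any zero-argument callable returning a Playwright-style
    browser, for example `lambda: p.chromium.launch()`.
    """

    def __init__(self, launch: Callable[[], Any], max_contexts: int = 200) -> None:
        if max_contexts < 1:
            raise ValueError("max_contexts must be at least 1")
        self._launch = launch
        self._max_contexts = max_contexts  # assumption: tune per workload
        self._browser = launch()
        self._served = 0

    def run(self, task: Callable[[Any], Any]) -> Any:
        """Run `task(page)` in a fresh context and always close that context."""
        if self._served >= self._max_contexts:
            self._browser.close()  # drop whatever memory the old process leaked
            self._browser = self._launch()
            self._served = 0
        context = self._browser.new_context()
        self._served += 1
        try:
            return task(context.new_page())
        finally:
            context.close()

    def close(self) -> None:
        """Shut down the current browser process."""
        self._browser.close()
```

A caller would wrap each scrape as `recycler.run(lambda page: ...)`, so context cleanup happens even when the task raises.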

Is scraping with Playwright legal?

The tool you use does not dictate legality. In the US, the Ninth Circuit's rulings in hiQ v. LinkedIn indicate that scraping publicly accessible data is unlikely to violate the CFAA, whether you use a raw HTTP client or Playwright, although the same litigation showed that breach-of-contract claims based on a site's terms can still succeed. If you use Playwright to bypass authentication, solve CAPTCHAs to access gated content, or ignore robots.txt directives, you increase your exposure to CFAA or ToS breach claims.

How does DataFlirt scale Playwright?

We don't run Playwright on serverless functions (AWS Lambda) due to cold starts and memory limits. We run persistent, patched Chromium binaries on bare-metal Kubernetes clusters. We pool browser contexts and route incoming extraction requests to warm tabs via a custom CDP proxy, reducing the per-request overhead from 2 seconds to under 100ms.

Should I use Playwright for every target?

No. Rendering a page with Playwright costs roughly 10x to 50x more in compute and time than a raw HTTP GET. Always inspect the target's network traffic first. If the data is in the initial HTML payload or available via a backend JSON API, use a standard HTTP client. Only use Playwright when the data is strictly rendered client-side and the API is heavily secured.
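The "inspect first, render last" rule can be expressed as a cheap pre-check on the raw HTML. The sketch below uses only the standard library; the function names and the marker string are hypothetical, and it covers only the "data is already in the initial HTML" case, not discovery of a backend JSON API.

```python
import urllib.request


def choose_strategy(raw_html: str, required_marker: str) -> str:
    """Return "http" if the marker is already in the raw HTML, else "playwright"."""
    return "http" if required_marker in raw_html else "playwright"


def fetch_raw_html(url: str, timeout_s: float = 10.0) -> str:
    """Plain HTTP GET with the standard library (no JavaScript execution)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example with inline HTML; the marker is a hypothetical class attribute.
MARKER = 'class="product-card"'
server_rendered = '<ul><li class="product-card">Laptop A</li></ul>'
spa_shell = '<div id="root"></div><script src="/bundle.js"></script>'

strategy_for_server_rendered = choose_strategy(server_rendered, MARKER)
strategy_for_spa_shell = choose_strategy(spa_shell, MARKER)
```

In practice the marker would be whatever string or attribute reliably accompanies the target data, checked on the output of `fetch_raw_html` before any browser is launched.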

$ dataflirt scope --new-project --target=playwright READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h