
What is Scrapy-Playwright?

Scrapy-Playwright is a download handler that bridges Scrapy's high-concurrency spider architecture with Playwright's headless browser rendering capabilities. It allows data engineers to scrape JavaScript-heavy single-page applications (SPAs) without abandoning Scrapy's robust item pipelines, middleware, and scheduling ecosystem. While powerful, it introduces significant memory overhead and requires careful context management to prevent zombie browser processes from crashing the scraping node.

Scraping Infrastructure · JavaScript Rendering · Headless Browser · Python · Concurrency
// 02 — definitions

The best of
both worlds.

Combining the industry-standard crawling framework with modern browser automation — and the operational headaches that come with it.


TL;DR

Scrapy-Playwright replaces Scrapy's default HTTP and HTTPS download handlers with a Playwright-backed one. Requests flagged with playwright=True are routed through real browser contexts, which execute JavaScript and return the fully rendered DOM to your spider callbacks; unflagged requests still take the regular HTTP path. It solves the SPA scraping problem but cuts a node's throughput on rendered requests from thousands of requests per second to dozens.

01 Definition & structure
Scrapy-Playwright is a Python library that integrates Microsoft's Playwright into the Scrapy ecosystem as a custom download handler. When a Scrapy spider yields a request with playwright=True in its meta dictionary, the request bypasses Scrapy's default Twisted HTTP downloader. Instead, it is routed to a headless browser instance managed by Playwright, which fetches the URL, executes the JavaScript, waits for the configured load condition (Playwright's default is the load event; network idle is an option), and returns the rendered HTML to the spider's parse callback.
02 How it works in practice
You install the plugin and update your settings.py to replace the default HTTP/HTTPS download handlers and to run Twisted on the asyncio reactor. In your spider, you write standard Scrapy callbacks. The Playwright-specific part lives in the meta of the scrapy.Request() you yield: by passing page methods there, you can instruct the browser to wait for specific CSS selectors, click buttons, or scroll the page before the HTML is captured. The callback receives an ordinary response and does not need to know that a full Chromium instance did the heavy lifting.
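The setup described above can be sketched roughly as follows. This is a minimal illustration, not a production spider: the spider name and the div.product-price selector are assumptions made up for the example, and the two sections would normally live in separate files. It uses start_requests for brevity; recent Scrapy releases also offer an async start() method.

```python
# --- settings.py -----------------------------------------------------------
# Route HTTP and HTTPS downloads through the scrapy-playwright handler and
# run Twisted on top of asyncio, which the handler requires.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# --- spiders/product.py ----------------------------------------------------
import scrapy
from scrapy_playwright.page import PageMethod


class ProductSpider(scrapy.Spider):
    name = "spa_product"  # hypothetical spider name

    def start_requests(self):
        yield scrapy.Request(
            "https://target.com/spa-product",
            meta={
                # Opt this request into browser rendering.
                "playwright": True,
                # Page methods run in order before the HTML is captured.
                "playwright_page_methods": [
                    # Assumed selector: wait until the price has rendered.
                    PageMethod("wait_for_selector", "div.product-price"),
                    # Scroll to the bottom to trigger lazy-loaded content.
                    PageMethod(
                        "evaluate",
                        "window.scrollBy(0, document.body.scrollHeight)",
                    ),
                ],
            },
            callback=self.parse,
        )

    def parse(self, response):
        # An ordinary Scrapy response: the usual selectors work unchanged.
        yield {"price": response.css("div.product-price::text").get()}
```

Requests yielded without the playwright key keep using the plain HTTP path, so the same spider can mix rendered and unrendered fetches.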
03 Context and page management
Playwright operates using Browser Contexts (isolated incognito-like sessions) and Pages (tabs). Scrapy-Playwright manages these contexts asynchronously. To maintain performance, you configure the plugin to reuse contexts across multiple requests while strictly limiting the maximum number of concurrent pages. If you allow too many concurrent pages, browser memory grows with every open tab, the Python asyncio event loop falls behind, and the OS eventually kills the process for running out of memory (OOM).
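A settings fragment along these lines caps how many contexts and pages can be open at once. The setting names are scrapy-playwright's; the numeric values are illustrative assumptions and should be sized to the RAM of the node.

```python
# settings.py (fragment) -- values are assumptions, tune them per node.
CONCURRENT_REQUESTS = 16

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Upper bound on simultaneously open browser contexts.
PLAYWRIGHT_MAX_CONTEXTS = 4
# Upper bound on simultaneously open pages (tabs) inside each context.
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
# Navigation timeout in milliseconds, so a stuck page cannot hang forever.
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000

# Worst-case number of open pages implied by the two caps above.
MAX_OPEN_PAGES = PLAYWRIGHT_MAX_CONTEXTS * PLAYWRIGHT_MAX_PAGES_PER_CONTEXT
```

MAX_OPEN_PAGES is not a scrapy-playwright setting; it is computed here only to make the worst-case page count explicit when sizing memory.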
04 How DataFlirt handles it
We rarely run pure Scrapy-Playwright spiders. Instead, we build hybrid spiders. Our custom middleware intercepts requests and attempts a fast, cheap HTTP fetch first. We parse the raw HTML; if the required data is missing (e.g., a React root div is empty), we drop the response and requeue the URL with Playwright enabled. This ensures we only pay the massive CPU/RAM tax of browser rendering when the target site absolutely forces us to.
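The fallback decision can be sketched with two small helpers. This is not DataFlirt's middleware: the function names, the empty-root heuristic, and the marker strings are assumptions chosen for illustration, and the snippet deliberately uses only the standard library.

```python
import re
from typing import Optional, Sequence

# Assumed heuristic: an empty <div id="root"> or <div id="app"> suggests the
# page is a client-side rendered shell with no data in the raw HTML.
EMPTY_ROOT = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)


def needs_browser_render(html: str, required_markers: Sequence[str]) -> bool:
    """Return True when the cheap HTTP response lacks the data we need."""
    if EMPTY_ROOT.search(html):
        return True
    return not all(marker in html for marker in required_markers)


def playwright_retry_meta(meta: dict) -> Optional[dict]:
    """Build meta for a rendered retry, or None if we already rendered."""
    if meta.get("playwright"):
        return None  # already went through the browser; do not loop
    return {**meta, "playwright": True}


# Inside a Scrapy callback the two helpers could be combined like this:
#
#     if needs_browser_render(response.text, ('class="price"',)):
#         retry_meta = playwright_retry_meta(response.meta)
#         if retry_meta is not None:
#             yield response.request.replace(meta=retry_meta, dont_filter=True)
#             return
```

The dont_filter=True in the commented usage matters because the URL has already been seen once, so Scrapy's duplicate filter would otherwise drop the retry.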
05 The memory leak trap
The most common failure mode for Scrapy-Playwright in production is the slow memory leak. If a spider callback that holds a Playwright page (requested with the playwright_include_page meta key) throws an unhandled exception before the page is closed, that browser tab becomes a "zombie." It remains open in the background, consuming ~150MB of RAM. Over a 24-hour crawl, these zombies accumulate until the server crashes. Robust error handling and strict context-rotation policies are mandatory for long-running hybrid spiders.
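The core of the fix is to close the page in a finally block so that an exception during extraction cannot strand the tab. The sketch below demonstrates the pattern with a stand-in FakePage class (an assumption, present only so the example runs without a browser); in a real spider the page would come from response.meta["playwright_page"], and an errback would close failure.request.meta["playwright_page"] in the same way.

```python
import asyncio


class FakePage:
    """Stand-in for a Playwright Page, used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False

    async def content(self):
        return "<html></html>"

    async def close(self):
        self.closed = True


async def extract_with_cleanup(page, extract):
    """Run extract(html) and close the page even if extraction raises."""
    try:
        html = await page.content()
        return extract(html)
    finally:
        # Runs on success and on failure, so no zombie tab is left behind.
        await page.close()


def broken_extract(html):
    raise ValueError("selector not found")


async def main():
    page = FakePage()
    try:
        await extract_with_cleanup(page, broken_extract)
    except ValueError:
        pass  # the error still propagates; the page is closed regardless
    return page.closed


if __name__ == "__main__":
    print(asyncio.run(main()))
```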
// 03 — the concurrency model

How many pages
can you render?

Adding a headless browser to Scrapy fundamentally changes its performance profile. DataFlirt models Playwright concurrency based on available RAM, not network I/O.

Max concurrent pages: C_max = (RAM_total - OS_base) / RAM_per_page
Expect ~150–300 MB per Playwright page context depending on media. (Infrastructure sizing model)
Effective crawl rate: R_eff = C_max / T_render
T_render includes network idle time and JS execution, often 2–5 seconds. (Scrapy-Playwright throughput)
DataFlirt hybrid ratio: H = Req_playwright / Req_httpx
We keep H < 0.05. Only render JS when strictly necessary. (Internal SLO)
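A worked example of the three formulas, under assumed inputs: a 16 GB node, 2 GB reserved for the OS, 300 MB per page (the upper end of the range above), a 3.5 second render time, and hypothetical request counts. None of these inputs are measurements.

```python
import math


def max_concurrent_pages(ram_total_mb, os_base_mb, ram_per_page_mb):
    """C_max: whole pages that fit in the RAM left after the OS baseline."""
    return math.floor((ram_total_mb - os_base_mb) / ram_per_page_mb)


def effective_crawl_rate(c_max, t_render_s):
    """R_eff: rendered pages per second at full page concurrency."""
    return c_max / t_render_s


def hybrid_ratio(req_playwright, req_http):
    """H: rendered requests per plain HTTP request."""
    return req_playwright / req_http


# Assumed inputs, see the paragraph above.
c_max = max_concurrent_pages(16 * 1024, 2 * 1024, 300)   # 14336 / 300 -> 47
r_eff = effective_crawl_rate(c_max, 3.5)                  # about 13.4 pages/s
h = hybrid_ratio(400, 10_000)                             # 0.04, below 0.05
```

With these inputs a node tops out at 47 concurrent pages and roughly 13 rendered pages per second, which is the "dozens, not thousands" regime described earlier.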
// 04 — spider execution trace

A Scrapy-Playwright
request lifecycle.

Trace of a single item extraction from a React-based e-commerce site, showing the handoff between Scrapy's async engine and the Playwright context.

asyncio · playwright context · yield item
// Scrapy engine yields request
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://target.com/spa-product>

// Scrapy-Playwright handler takes over
[scrapy-playwright] INFO: Allocating new browser context...
[playwright.network] DEBUG: Requesting main document
[playwright.page] WARN: Blocked 14 tracking scripts
[playwright.page] INFO: Waiting for network idle (500ms)
[scrapy-playwright] INFO: Page rendered. Returning HTML to spider.

// Back to Scrapy spider callback
[spider.parse] DEBUG: Extracted item: {"price": 149.99, "stock": true}
[scrapy-playwright] INFO: Closing browser context
[scrapy.statscollect] INFO: item_scraped_count: 1
// 05 — failure modes

Where hybrid
crawlers break.

Ranked by frequency of pipeline failure across DataFlirt's hybrid Scrapy deployments. Browser automation introduces stateful failure modes that pure HTTP scrapers never encounter.

HYBRID SPIDERS: 140+ active
AVG RAM/NODE: 16 GB
UPDATED: 2026-05-19
01 Zombie browser contexts
Memory leak · Contexts fail to close, exhausting node RAM over time

02 Network idle timeouts
Hangs · Third-party scripts never finish loading, blocking the yield

03 Event loop blocking
Async error · Sync code in async callbacks freezes the entire Scrapy engine

04 Unhandled dialogs
State trap · Unexpected JS alerts pause the Playwright page indefinitely

05 Anti-bot detection
Fingerprint · Default Playwright signatures caught by Cloudflare/DataDome
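For the event loop blocking failure mode in particular, one common mitigation is to push blocking calls onto a worker thread so the loop keeps servicing other coroutines. The sketch below shows this with asyncio.to_thread (Python 3.9+); parse_large_payload is a hypothetical blocking step and the heartbeat task exists only to show that the loop stays responsive.

```python
import asyncio
import time


def parse_large_payload(seconds: float) -> str:
    """Hypothetical blocking call, e.g. a synchronous driver or file write."""
    time.sleep(seconds)
    return "done"


async def heartbeat(stop: asyncio.Event, interval: float = 0.01) -> int:
    """Count how often the event loop gets to run while other work happens."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(interval)
    return ticks


async def main():
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    # Calling parse_large_payload(0.2) directly here would freeze the loop
    # for the full 0.2 s; to_thread runs it in a worker thread instead.
    result = await asyncio.to_thread(parse_large_payload, 0.2)
    stop.set()
    ticks = await beat
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Threads help with blocking I/O and sleeps; heavy CPU-bound parsing still competes for the interpreter lock and may be better moved to a separate process.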
// 06 — our architecture

Render only when necessary,
fall back to HTTP immediately.

At DataFlirt, we treat Scrapy-Playwright as a surgical tool, not a default downloader. Our middleware intercepts requests and attempts a pure HTTP fetch first. If the target fields are missing from the raw DOM (indicating client-side rendering), the request is requeued with the playwright=True meta flag. This hybrid approach maintains 95% of pure Scrapy's speed while guaranteeing 100% data completeness for SPA targets.

hybrid-spider.stats

Live telemetry from a hybrid Scrapy node processing a React-based catalog.

node.id sp-hybrid-04
req.total 45,210
req.pure_http 42,900 (94.8%)
req.playwright 2,310 (5.2%)
browser.contexts 12 active
memory.usage 4.2 GB (75% capacity)
items.extracted 45,198 (synced)


// 07 — FAQ

Common
questions.

About integrating Playwright with Scrapy, managing memory, handling proxies, and scaling hybrid infrastructure.

What is the difference between Scrapy-Playwright and Scrapy-Splash?
Splash is effectively deprecated. It relies on an outdated QtWebKit engine and a custom Lua scripting interface. Scrapy-Playwright drives current Chromium, Firefox, and WebKit builds through Playwright, which speaks the Chrome DevTools Protocol (CDP) to Chromium and browser-specific protocols to the other two. The result is far better compatibility with the modern web and native Python async/await support.
Can I route Scrapy-Playwright requests through a proxy?
Yes, but it's handled differently than standard Scrapy requests. Because Playwright manages its own network stack at the browser level, you configure the proxy there, either for the whole browser via PLAYWRIGHT_LAUNCH_OPTIONS or per context via playwright_context_kwargs in the request meta, rather than relying on Scrapy's standard HttpProxyMiddleware.
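A per-context proxy can be expressed as request meta along these lines. The helper name, context name, and proxy address are hypothetical; the meta keys are scrapy-playwright's, and the proxy dictionary follows the shape Playwright accepts when creating a browser context.

```python
def playwright_proxy_meta(context_name, server, username=None, password=None):
    """Build request meta that renders through a proxied browser context."""
    proxy = {"server": server}
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    return {
        "playwright": True,
        # Requests sharing this name reuse the same context (and proxy).
        "playwright_context": context_name,
        # Only applied when the named context does not exist yet.
        "playwright_context_kwargs": {"proxy": proxy},
    }


# Hypothetical values for illustration.
meta = playwright_proxy_meta(
    "proxy-eu-1", "http://proxy.example:3128", "user", "pass"
)
```

Because the kwargs are only read when a context is first created, rotating proxies means rotating context names as well.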
How do I prevent memory leaks when running it for days?
Limit the maximum number of pages per context and force context rotation. Set PLAYWRIGHT_MAX_PAGES_PER_CONTEXT in your settings. Additionally, aggressively block images, fonts, and media via request interception (scrapy-playwright exposes this as the PLAYWRIGHT_ABORT_REQUEST setting) to keep the memory footprint per page under 100 MB.
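Resource blocking can be sketched as a predicate assigned to scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST setting. The set of blocked types is an assumption to adjust per target, and the SimpleNamespace objects below are stand-ins for Playwright request objects so the predicate can be exercised without a browser.

```python
from types import SimpleNamespace

# Assumed block list: heavy resources that rarely carry the data we extract.
BLOCKED_RESOURCE_TYPES = {"image", "font", "media"}


def should_abort_request(request) -> bool:
    """Return True for requests the browser should not download."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# settings.py (fragment): scrapy-playwright calls this for every request
# the browser makes and aborts those for which it returns True.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request

# Quick check with stand-in request objects.
blocked = should_abort_request(SimpleNamespace(resource_type="image"))
allowed = not should_abort_request(SimpleNamespace(resource_type="document"))
```

Blocking stylesheets or scripts as well saves more memory but can break pages whose content depends on them, so those are left out of this sketch.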
How does DataFlirt scale Scrapy-Playwright pipelines?
We deploy them on Kubernetes pods with strict memory limits (usually 8GB–16GB) and auto-scale horizontally based on the Playwright request queue depth. If a pod OOMs due to a zombie context, the orchestrator kills it and the dead-letter queue safely redistributes the lost requests to healthy nodes.
Is it legal to scrape SPAs using browser automation?
Rendering JavaScript does not change the legal framework of web scraping. The same rules apply: respect robots.txt, do not bypass authentication to access private data, and do not cause denial-of-service conditions. Browser automation simply changes the technical mechanism of access, not the authorization level.
Why not just use raw Playwright without Scrapy?
Raw Playwright is a browser automation tool, not a crawling framework. If you drop Scrapy, you lose built-in duplicate filtering, robust retry middleware, concurrent request scheduling, item pipelines, and standard export formats. Scrapy-Playwright lets you keep the data engineering infrastructure while solving the rendering problem.