
What is Scrapy Splash?

Scrapy Splash is a legacy integration that connects the Scrapy framework to Splash, a lightweight, scriptable browser based on Qt WebKit. It was historically the standard way to render JavaScript-heavy pages without breaking Scrapy's asynchronous event loop. Today, it is largely obsolete—replaced by modern CDP-based integrations like Scrapy-Playwright—but remains embedded in thousands of older enterprise data pipelines.

Scrapy · JS Rendering · Legacy · Twisted · WebKit

// 02 — definitions

The legacy
render bridge.

How Python developers historically solved the JavaScript rendering problem before headless Chrome became the universal standard.


TL;DR

Scrapy Splash routes Scrapy requests through a separate Splash HTTP API server. The Splash server renders the DOM, executes custom Lua scripts to interact with the page, and returns the final HTML back to the Scrapy spider. It's fast but runs an outdated WebKit engine that modern anti-bot systems flag instantly.

01 Definition & structure

Scrapy Splash consists of two parts: a standalone HTTP API server (Splash) written in Python/Twisted/Qt, and a Scrapy middleware (scrapy-splash) that routes requests to it. Instead of Scrapy fetching a URL directly, it asks the Splash server to fetch it, render the JavaScript, and return the final HTML.

02 How it works in practice

When a spider yields a SplashRequest, the middleware intercepts it and forwards it to the Splash server's /render.html or /execute endpoint. The Splash server loads the page in its Qt WebKit engine, waits for the configured wait interval (or until the request times out), and sends the rendered DOM back to Scrapy as a text response (scrapy-splash wraps it in its own SplashTextResponse or SplashJsonResponse classes rather than a plain HtmlResponse), so the spider can parse it with normal CSS/XPath selectors.
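To make the round trip concrete, here is a minimal sketch of what a render.html call looks like at the HTTP level, without Scrapy in the picture. It only builds the URL and does not send anything. The helper name `build_render_html_url`, the `http://localhost:8050` base URL, and the example target are illustrative assumptions; `url`, `wait`, and `timeout` are query parameters of Splash's render.html endpoint.

```python
from urllib.parse import urlencode


def build_render_html_url(splash_base, target_url, wait=0.5, timeout=30):
    """Return the GET URL that asks Splash to render target_url.

    Hypothetical helper for illustration; scrapy-splash builds an
    equivalent request internally when a spider yields a SplashRequest.
    """
    query = urlencode({"url": target_url, "wait": wait, "timeout": timeout})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # Assumed local Splash instance and example target page.
    print(build_render_html_url("http://localhost:8050",
                                "https://example.com/products"))
```

Fetching that URL with any HTTP client returns the rendered HTML, which is the same payload the middleware hands back to the spider callback.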

03 The Lua scripting bottleneck

To interact with a page (e.g., clicking a button or scrolling), developers had to write custom Lua scripts and send them to the Splash /execute endpoint. This created massive developer friction: Python engineers had to maintain complex Lua strings embedded inside their Python spiders, making debugging and syntax highlighting nearly impossible.
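The friction is easier to see in code. The sketch below shows the typical shape: a Lua program stored as a Python string and shipped to /execute as the `lua_source` field of a JSON body. The function name `build_execute_payload` and the script itself are illustrative assumptions; extra JSON fields such as `url` and `wait` become available to the script as `args.url` and `args.wait`.

```python
import json

# Lua lives inside a Python string: no linting, no highlighting,
# and syntax errors only surface when Splash runs the script.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def build_execute_payload(target_url, wait=2.0):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    return json.dumps({
        "lua_source": LUA_SCRIPT,
        "url": target_url,   # read in Lua as args.url
        "wait": wait,        # read in Lua as args.wait
    })
```

Every interaction (a click, a scroll, a form fill) grows the string rather than the Python code, which is why these scripts tend to become hard to review and test.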

04 How DataFlirt handles it

We don't use Splash. We migrated our last internal Splash cluster in 2023. Our Scrapy pipelines rely entirely on scrapy-playwright for JavaScript rendering. Playwright allows us to write interaction logic natively in Python async coroutines, supports modern Chromium/WebKit engines, and integrates seamlessly with our anti-bot fingerprinting stack.

05 Did you know?

Splash was originally created by Scrapinghub (now Zyte) to solve the exact problem that headless Chrome eventually solved natively. Before Chrome 59 introduced headless mode in 2017, using a custom Qt WebKit wrapper like Splash was one of the only viable ways to render JavaScript at scale without running a full X11 virtual frame buffer.

// 03 — the performance model

Why Splash
was fast (and fragile).

Splash used a shared Qt event loop rather than spawning isolated browser processes. This made it highly concurrent but vulnerable to memory leaks and modern fingerprinting.

Splash concurrency limit: $\text{Workers} = \dfrac{\text{RAM}_{\text{available}}}{300\,\text{MB}}$

Splash instances are memory-hungry and leak over time. Source: legacy infrastructure sizing.
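As a worked example of the sizing rule above, the sketch below applies it to a given amount of RAM. The 300 MB per-worker figure comes from the formula; the function name and the 16 GB input are assumptions for illustration, and real capacity will be lower once leaks accumulate.

```python
def max_splash_workers(ram_available_mb, per_worker_mb=300):
    """Upper bound on concurrent Splash workers for the given RAM budget."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    # Integer division: a partial worker is not a usable worker.
    return ram_available_mb // per_worker_mb


if __name__ == "__main__":
    # 16 GB host, expressed in MB.
    print(max_splash_workers(16 * 1024))
```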

Render latency: $T_{\text{total}} = T_{\text{network}} + T_{\text{lua\_exec}} + T_{\text{dom\_ready}}$

Lua script execution adds overhead to standard DOM rendering. Source: Splash execution model.

DataFlirt migration ROI: $\Delta = \dfrac{\text{SuccessRate}_{\text{playwright}} - \text{SuccessRate}_{\text{splash}}}{\text{Cost}_{\text{compute}}}$

Playwright costs 3x more compute but yields 99% fewer 403s. Source: internal migration audit, 2023.

// 04 — splash lua script

Executing Lua
on the Splash server.

A scripted Splash request sends a Lua script to the Splash API's /execute endpoint; the script drives the WebKit instance before returning the HTML.

Lua · Splash API · Legacy WebKit


-- splash_request.lua
function main(splash, args)
  -- set the user agent before navigating so it applies to the page request;
  -- modern anti-bots flag this user-agent instantly
  splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)...")
  assert(splash:go(args.url))
  assert(splash:wait(2.0))
  -- return the rendered HTML plus a screenshot and the network log
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

// Scrapy log output
DEBUG: Crawled (200) <GET http://splash:8050/execute>
WARNING: Cloudflare challenge detected in rendered HTML
ERROR: Splash timeout (LUA_ERROR)

// 05 — deprecation drivers

Why pipelines
abandon Splash.

The primary failure modes that drive engineering teams to migrate legacy Scrapy Splash pipelines to modern CDP-based alternatives.

SPLASH USAGE: < 2% of new pipelines
MIGRATION TARGET: Scrapy-Playwright
UPDATED: 2026-05-19

01 Anti-bot detection · fatal · Outdated WebKit engine fails modern JS challenges
02 Memory leaks · operational · Long-running Splash instances require aggressive restarting
03 Lua script complexity · DX friction · Maintaining Lua scripts inside Python spiders is an anti-pattern
04 Lack of CDP support · technical · Cannot use modern Chrome DevTools Protocol features
05 Stagnant maintenance · ecosystem · Core project sees minimal updates compared to Playwright

// 06 — the migration path

Retiring Splash,

upgrading to Playwright without rewriting the spider.

DataFlirt migrated its last internal Splash cluster in 2023. The transition to scrapy-playwright means removing the Splash middleware, registering the Playwright download handler, and replacing Lua scripts with Python async coroutines. Playwright consumes more RAM per concurrent page, but the reduction in anti-bot blocks and the elimination of Lua-related technical debt yield a large net gain in pipeline reliability.
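As a rough illustration of what replacing a Lua script with a coroutine looks like, the sketch below is a Python counterpart to a go / wait / return-html-and-screenshot Lua script. It assumes a Playwright `Page`-like object is passed in; `goto`, `wait_for_timeout`, `content`, and `screenshot` are Playwright page methods, while the function name `render_page` and the 2000 ms default are illustrative choices.

```python
async def render_page(page, url, wait_ms=2000):
    """Navigate, pause, and return the rendered HTML and a screenshot.

    `page` is assumed to be a Playwright Page (or any object exposing
    the same async methods).
    """
    await page.goto(url)                  # Lua: splash:go(args.url)
    await page.wait_for_timeout(wait_ms)  # Lua: splash:wait(2.0)
    return {
        "html": await page.content(),     # Lua: splash:html()
        "png": await page.screenshot(),   # Lua: splash:png()
    }
```

In a scrapy-playwright spider, one way to get such a page is to set `playwright_include_page` to True in the request meta and read `response.meta["playwright_page"]` inside an async callback, closing the page when done. The logic is ordinary Python, so it can be unit-tested against a stub page without a browser.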

scrapy-playwright migration diff

Key configuration changes when moving a Scrapy project off Splash.

DOWNLOADER_MIDDLEWARES: remove SplashCookiesMiddleware and SplashMiddleware
DOWNLOAD_HANDLERS: add ScrapyPlaywrightDownloadHandler
TWISTED_REACTOR: AsyncioSelectorReactor
scripting_language: Lua → Python async
browser_engine: Qt WebKit → Chromium / WebKit
anti_bot_pass_rate: 12% → 98.5%
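In a project's settings.py, the configuration side of that change might look like the fragment below. This is a sketch rather than a complete settings file: the dotted paths are the ones the scrapy-playwright and scrapy-splash projects document, and the commented-out lines list Splash entries that a typical scrapy-splash setup contains and that would be deleted. Your project may have a different subset.

```python
# settings.py (fragment) -- after migrating off Splash

# Route HTTP(S) downloads through Playwright instead of the default handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Splash-era entries to delete (shown commented out for reference):
# SPLASH_URL = "http://splash:8050"
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy_splash.SplashCookiesMiddleware": 723,
#     "scrapy_splash.SplashMiddleware": 725,
# }
# DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```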


// 07 — FAQ

Common
questions.

Common questions about Scrapy Splash, its legacy status, and how to modernize JavaScript-heavy Scrapy pipelines.


Is Scrapy Splash officially deprecated?

While the repository still exists, it is functionally stagnant. The Scrapy maintainers and the broader community have overwhelmingly shifted to scrapy-playwright for modern JavaScript rendering. Building a new pipeline on Splash today is strongly discouraged.

Why does Splash fail against Cloudflare and DataDome?

Splash uses a custom, outdated Qt WebKit engine. Modern anti-bot systems check for specific Chromium or modern WebKit JavaScript APIs, canvas rendering quirks, and TLS fingerprints. Splash fails these checks instantly, resulting in immediate blocks.

Do I need to rewrite my entire Scrapy spider to migrate off Splash?

No. The core Scrapy architecture (Items, Pipelines, Spiders) remains identical. You only need to change the Request yields to pass meta={'playwright': True} instead of meta={'splash': ...} (or a SplashRequest), and rewrite any custom Lua interaction scripts as Python async functions.
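The request-level change is small enough to show as plain dictionaries. In this sketch the two helper functions are hypothetical and exist only for illustration; the keys they emit (`splash` with `endpoint` and `args`, and `playwright` / `playwright_include_page`) are the meta keys the respective plugins read.

```python
def splash_meta(endpoint="render.html", wait=2.0):
    """Request.meta as a scrapy-splash spider would set it (before)."""
    return {"splash": {"endpoint": endpoint, "args": {"wait": wait}}}


def playwright_meta(include_page=False):
    """Request.meta as a scrapy-playwright spider would set it (after)."""
    meta = {"playwright": True}
    if include_page:
        # Exposes the Playwright page to the callback for custom interaction.
        meta["playwright_include_page"] = True
    return meta


if __name__ == "__main__":
    print(splash_meta())
    print(playwright_meta(include_page=True))
```

Either dictionary is passed as `meta=` on an ordinary `scrapy.Request`, so callbacks, Items, and Pipelines stay untouched.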

How did DataFlirt handle the compute cost increase when migrating off Splash?

Playwright is heavier than Splash. We offset the compute cost by implementing selective rendering—only routing requests through Playwright if the target data is provably missing from the raw HTML response. This dropped our render volume by 60%, making the migration cost-neutral.
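A selective-rendering check of this kind can be sketched in a few lines. This is not DataFlirt's implementation: the function name, the use of regular expressions, and the example patterns are all assumptions. The idea is simply to fetch the raw HTML first and escalate to a browser only when a required field cannot be found in it.

```python
import re


def needs_browser_render(raw_html, required_patterns):
    """Return True if any required pattern is missing from the raw HTML.

    required_patterns: regexes that must each match for the raw,
    non-rendered response to be considered sufficient.
    """
    return any(re.search(pattern, raw_html) is None
               for pattern in required_patterns)


if __name__ == "__main__":
    static_page = '<span class="price">19.99</span><h1>Widget</h1>'
    js_shell = '<div id="root"></div><script src="app.js"></script>'
    patterns = [r'class="price"', r"<h1>"]
    print(needs_browser_render(static_page, patterns))  # data already present
    print(needs_browser_render(js_shell, patterns))     # needs rendering
```

In a spider, the first callback would run this check on the plain response and only re-yield the request with Playwright enabled when it returns True.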

Can Splash handle modern Single Page Applications (SPAs)?

Barely. While it can execute JavaScript, its outdated engine struggles with modern React, Vue, or Angular features, often throwing silent JavaScript errors that result in empty or partially rendered DOMs.

What is the difference between Splash and Selenium?

Splash is an HTTP rendering API, most often used with Scrapy, that uses Lua for scripting and a shared Qt event loop. Selenium is a general-purpose browser automation framework built on the WebDriver protocol that drives standalone browser processes. Both are largely superseded by Playwright for scraping.
