
What is Scrapy Middleware?

Scrapy middleware is a framework-level hook system that sits between Scrapy's engine and its downloader (downloader middleware) or its spiders (spider middleware), allowing you to intercept, modify, or drop requests and responses in transit. It's the standard integration point for proxy rotation, user-agent spoofing, retry logic, and custom anti-bot bypass mechanisms. If you are building a production pipeline in Python, middleware is where your infrastructure logic lives, keeping your spider code strictly focused on data extraction.

Scrapy · Python · Request Hooks · Proxy Integration · Anti-bot
// 02 — definitions

Intercepting
the wire.

The architectural layer where infrastructure meets extraction, allowing you to manipulate HTTP traffic before it hits the network or the spider.


TL;DR

Scrapy middleware provides a chain of hooks executed on every request and response. Downloader middleware handles network-level concerns like proxies and headers, while spider middleware processes what enters and leaves your spider callbacks (responses, items, follow-up requests) and routes exceptions. It's the backbone of any scalable Scrapy deployment.

01 Definition & structure
Scrapy middleware is a framework-level hook system that intercepts traffic. Downloader middleware sits between the Engine and the Downloader, processing Requests before they go to the internet and Responses before they return to the spider. Spider middleware sits between the Engine and the Spider, processing Responses on their way into your callbacks and the Items and new Requests those callbacks yield.
02 How it works in practice
A downloader middleware class implements up to three main methods, all optional: process_request, process_response, and process_exception. When the engine hands a request to the downloader, it passes through the process_request method of every active middleware in order. If any middleware returns a Response directly (e.g., from a cache), the remaining process_request methods and the download itself are skipped, and the response is sent back up the chain through process_response.
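
The three hooks described above can be sketched as a plain Python class. This is a minimal illustration, not a production middleware: the class name, the default user-agent string, and the `last_status` meta key are hypothetical, and the method signatures follow the classic `(request, ..., spider)` form.

```python
class HeaderTaggingMiddleware:
    """Hypothetical downloader middleware showing the three hooks."""

    def __init__(self, user_agent="example-bot/1.0"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Outbound hook. Returning None lets the request continue to the
        # next middleware and, eventually, the downloader.
        request.headers["User-Agent"] = self.user_agent
        return None

    def process_response(self, request, response, spider):
        # Inbound hook. Return a Response to pass it on, or a Request to
        # have Scrapy reschedule it instead.
        request.meta["last_status"] = response.status
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download (or an earlier process_request) raises.
        # Returning None leaves the exception to other middleware.
        return None
```

Scrapy does not require a base class here; the middleware is activated by listing its import path in the DOWNLOADER_MIDDLEWARES setting.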
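03 The order of operations (Weights)
Middleware execution order is determined by the integer assigned to each class in the DOWNLOADER_MIDDLEWARES (or SPIDER_MIDDLEWARES) setting in settings.py. Lower numbers sit closer to the engine, so they execute first for requests and last for responses. Relative order matters most when one middleware depends on what another has set. A retried request is rescheduled and travels through the whole process_request chain again, so a proxy-rotation middleware gets the chance to assign a fresh proxy on the next pass, provided it overwrites the proxy already stored in request.meta rather than reusing the burned one. For reference, Scrapy's defaults place RetryMiddleware at 550 and HttpProxyMiddleware at 750; a custom rotator that sets request.meta['proxy'] typically uses a number below 750 so the built-in proxy middleware can still process its choice.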
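
As a rough illustration of the ordering rule, the fragment below shows a settings.py-style dictionary plus a small helper that prints the resulting call order. The `myproject.middlewares.*` paths and the chosen numbers are hypothetical, and the helper ignores the built-in defaults that Scrapy merges in, so treat it as a sketch of the rule rather than of Scrapy's actual loader.

```python
# settings.py-style fragment; paths under "myproject" are hypothetical.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingUserAgentMiddleware": 500,
    "myproject.middlewares.AuthSignerMiddleware": 545,
    "myproject.middlewares.ProxyRotatorMiddleware": 740,
    # Assigning None disables a built-in middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def call_order(middlewares):
    """Return (request_order, response_order) for the enabled entries."""
    enabled = {path: n for path, n in middlewares.items() if n is not None}
    # process_request runs in increasing order of the assigned number...
    request_order = sorted(enabled, key=enabled.get)
    # ...and process_response runs in the reverse order.
    response_order = list(reversed(request_order))
    return request_order, response_order


if __name__ == "__main__":
    outbound, inbound = call_order(DOWNLOADER_MIDDLEWARES)
    print("requests: ", [p.rsplit(".", 1)[-1] for p in outbound])
    print("responses:", [p.rsplit(".", 1)[-1] for p in inbound])
```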
04 How DataFlirt handles it
We decouple infrastructure from extraction. Our clients write pure Scrapy spiders focused solely on XPath and CSS selectors. At runtime, our orchestration layer injects a proprietary, compiled middleware stack that handles residential proxy routing, TLS fingerprinting, and automated CAPTCHA solving. The spider code remains clean, and the infrastructure logic runs at C-level speeds without blocking the Python event loop.
05 The silent failure mode
Blocking the Twisted reactor. Because Scrapy runs all of its callbacks on a single reactor thread, if you use a synchronous library like requests or psycopg2 inside a middleware hook to fetch a proxy or log an event, the entire scraper halts until that network call completes. No other request makes progress in the meantime, so effective concurrency collapses. Always use Twisted's asynchronous APIs, coroutine (async def) hooks, or deferToThread for I/O inside middleware.
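
One way to keep a blocking call off the event loop is to push it to a worker thread. The sketch below uses the standard library's `asyncio.to_thread` in a coroutine hook rather than Twisted's deferToThread; it assumes Scrapy is running with its asyncio-based reactor so that asyncio awaitables work inside middleware. `fetch_proxy_blocking` and the proxy URL are hypothetical stand-ins for a real synchronous lookup.

```python
import asyncio
import time


def fetch_proxy_blocking():
    """Hypothetical stand-in for a synchronous call (HTTP, DB, Redis)."""
    time.sleep(0.05)  # simulates network latency
    return "http://127.0.0.1:8080"


class AsyncProxyMiddleware:
    """Sketch: assign a proxy without blocking the event loop."""

    async def process_request(self, request, spider):
        # The blocking function runs in a worker thread; the loop stays
        # free to service other requests while this coroutine awaits.
        request.meta["proxy"] = await asyncio.to_thread(fetch_proxy_blocking)
        return None
```

Calling `fetch_proxy_blocking()` directly inside the hook would work functionally, but every other in-flight request would wait for it to return.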
// 03 — middleware latency

How much overhead
does middleware add?

Every request passes through the middleware chain twice (outbound and inbound). Heavy synchronous operations here will bottleneck the entire Twisted reactor, destroying your concurrency.

Total Middleware Overhead: $T = \sum t_{\mathrm{req}} + \sum t_{\mathrm{res}}$
Keep it under 5 ms per request to maintain high concurrency. [Scrapy Performance Tuning]

Effective Concurrency: $C = \dfrac{\mathrm{CONCURRENT\_REQUESTS}}{1 + T}$
High middleware latency directly reduces active network connections. [Twisted Reactor Model]

DataFlirt Rust-Bridge Latency: $L = 0.4\ \mathrm{ms} + \mathrm{proxy\_auth\_time}$
Offloading proxy signing to Rust keeps the Python event loop unblocked. [Internal SLO]
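
To check a middleware stack against the overhead sum above, one option is to time each hook and add the results. The sketch below is illustrative only: the `timed` decorator, the helper names, and the way timings are collected are assumptions, and only the 5 ms budget comes from the text.

```python
import time
from functools import wraps

BUDGET_MS = 5.0  # per-request budget quoted above


def timed(hook_timings):
    """Decorator factory: append each call's duration (ms) to hook_timings."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                hook_timings.append((time.perf_counter() - start) * 1000.0)
        return wrapper
    return decorator


def total_overhead_ms(req_timings, res_timings):
    """T = sum of process_request times + sum of process_response times."""
    return sum(req_timings) + sum(res_timings)


def within_budget(total_ms, budget_ms=BUDGET_MS):
    """True when the summed hook time stays under the budget."""
    return total_ms < budget_ms
```

For example, three hooks measured at 0.2 ms, 0.3 ms, and 0.1 ms sum to 0.6 ms, comfortably inside the budget; a single 7.5 ms synchronous lookup would exceed it on its own.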
// 04 — middleware trace

A request's journey
through the chain.

Debug log of a single Scrapy request passing through a production downloader middleware stack, handling proxy assignment, header rotation, and a 403 retry.

process_request · process_response · Twisted reactor
// outbound request
[Engine] Routing <GET https://target.com/api/v1/data>
[Middleware: UserAgent] process_request: assigning Chrome 124
[Middleware: ProxyRotator] process_request: assigning IP 104.28.x.x
[Middleware: AuthSigner] process_request: injecting HMAC-SHA256

// network layer
[Downloader] Fetching <GET https://target.com/api/v1/data>
[Downloader] Received 403 Forbidden (45ms)

// inbound response
[Middleware: AntiBot] process_response: detected Cloudflare challenge
[Middleware: Retry] process_response: scheduling retry (attempt 1/3)

// retry loop
[Engine] Re-routing <GET https://target.com/api/v1/data>
[Middleware: ProxyRotator] process_request: rotating to clean IP 198.51.x.x
[Downloader] Fetching <GET https://target.com/api/v1/data>
[Downloader] Received 200 OK (120ms)
[Middleware: AntiBot] process_response: payload validated
[Spider] Parsing response...
// 05 — performance killers

Where middleware
chokes the pipeline.

Because Scrapy runs on Twisted's single-threaded event loop, blocking operations in middleware will stall all concurrent requests. These are the most common culprits across client pipelines.

Pipelines monitored: 300+ active
Throughput: 100M req/day
Updated: 2026-05-19
01 Synchronous DB/Redis calls
reactor blocked · Waiting on network I/O without async/await

02 Heavy regex on response bodies
CPU bound · Parsing HTML in downloader middleware instead of spider

03 Misconfigured retry loops
infinite loop · Retrying persistent 503s without exponential backoff

04 Memory leaks in stateful hooks
OOM risk · Storing request history in dicts without TTL

05 Conflicting middleware weights
logic error · Auth headers overwritten by default Scrapy middleware
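
For item 04 above, a common remedy is to bound any per-request state by both age and size. The following is a minimal sketch of such a store; the `TTLStore` name, its defaults, and the injectable clock are all hypothetical choices made for illustration.

```python
import time
from collections import OrderedDict


class TTLStore:
    """Hypothetical bounded store: entries expire by age and by count."""

    def __init__(self, ttl_seconds=300.0, max_entries=10_000,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._data = OrderedDict()  # key -> (timestamp, value), oldest first

    def __len__(self):
        return len(self._data)

    def put(self, key, value):
        now = self._clock()
        self._evict_expired(now)
        self._data[key] = (now, value)
        self._data.move_to_end(key)  # keep oldest-first ordering on update
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        self._evict_expired(self._clock())
        entry = self._data.get(key)
        return default if entry is None else entry[1]

    def _evict_expired(self, now):
        # Entries are ordered by write time, so stop at the first fresh one.
        while self._data:
            oldest_key = next(iter(self._data))
            if now - self._data[oldest_key][0] < self._ttl:
                break
            del self._data[oldest_key]
```

A middleware that keeps, say, per-URL failure counts in a store like this cannot grow without limit over a long crawl.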
// 06 — DataFlirt's architecture

Bypassing Python,

for infrastructure-level speed.

Standard Scrapy middleware is written in Python, which is fine for simple header injection but falls apart when you need to perform complex cryptographic signing or real-time proxy health scoring at 5,000 requests per second. DataFlirt replaces the standard Scrapy middleware stack with a custom Rust-based extension that hooks directly into the Twisted reactor. We offload proxy rotation, JA3 spoofing, and anti-bot token generation to compiled code, keeping the Python event loop entirely free for what it does best: executing your spider's extraction logic.

df_middleware_stack.rs

Live metrics from our Rust-backed Scrapy middleware.

module df_proxy_rotator
execution_time 0.12ms
proxy_pool_size 45,000 IPs
auth_signing ed25519_hmac
reactor_block_time 0.00ms
retry_rate 1.4%
status active

// 07 — FAQ

Common
questions.

About Scrapy middleware architecture, performance tuning, anti-bot integration, and how DataFlirt manages infrastructure logic at scale.

What is the difference between downloader and spider middleware?
Downloader middleware intercepts requests before they are sent to the network and responses before they reach the spider. It's used for proxies, headers, and retries. Spider middleware intercepts what goes into the spider (Responses) and what comes out of it (Items and Requests), and handles exceptions raised in callbacks. If you are dealing with HTTP, use downloader middleware; if you are dealing with extracted data, use spider middleware.
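
For contrast with the downloader side, a spider middleware can be sketched as below. The class name and the rule it applies (dropping dict items with an empty `title` field) are hypothetical, and the example assumes synchronous spider callbacks; it would be enabled through the SPIDER_MIDDLEWARES setting.

```python
class DropEmptyItemsMiddleware:
    """Hypothetical spider middleware: filters what the spider yields."""

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and Requests produced by the
        # spider callback for this response.
        for obj in result:
            if isinstance(obj, dict) and not obj.get("title"):
                continue  # drop items with a missing or empty title
            yield obj  # pass everything else (items, Requests) through
```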
Why is my Scrapy scraper running so slowly even with high concurrency?
You are likely making synchronous calls in your middleware. Scrapy runs on Twisted, an event-driven networking framework with a single reactor thread. If you use the standard requests library or a synchronous database driver inside process_request, the entire reactor thread blocks. Use Twisted's deferToThread or native async/await for any I/O in middleware.
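Can I use Playwright or Puppeteer inside Scrapy middleware?
Playwright, yes, usually through the scrapy-playwright plugin. Note that scrapy-playwright plugs in as a download handler (registered under DOWNLOAD_HANDLERS) rather than as a downloader middleware. Requests flagged for it are routed to a headless browser instance instead of Scrapy's default HTTP downloader, and the rendered HTML comes back as a standard Scrapy Response object, so your downloader middleware and spider handle it as usual. Puppeteer is a Node.js library, so using it from Scrapy means going through a Python port or a separate rendering service.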
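
For reference, scrapy-playwright's documented setup registers it through the DOWNLOAD_HANDLERS setting and requires the asyncio reactor, as in the settings fragment below. The `PLAYWRIGHT_REQUEST_META` name is a hypothetical convenience constant; the `playwright` meta key is what the plugin looks for on each request.

```python
# settings.py fragment for scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Per-request opt-in: pass this as `meta=` when building a Request so the
# handler renders the page in a browser (hypothetical constant name).
PLAYWRIGHT_REQUEST_META = {"playwright": True}
```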
How do I handle CAPTCHAs in middleware?
Detect the 403 or CAPTCHA page in process_response. Instead of returning the response to the spider, send the URL to a solver API asynchronously and return a new Request with the solved token attached; Scrapy reschedules it and discards the blocked response. Cap the number of attempts so a page that never solves cannot loop forever. The spider remains completely unaware that a CAPTCHA ever existed.
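
A rough sketch of that flow is shown below. The solver is passed in as a hypothetical async callable, the `captcha_attempts` and `captcha_token` meta keys are invented for illustration, and how a real token gets attached (header, cookie, form field) depends entirely on the target site. The only Scrapy API relied on is `Request.replace`, which returns a copy of the request with the given attributes changed.

```python
class CaptchaRetryMiddleware:
    """Hypothetical downloader middleware: solve, then retry with a token."""

    def __init__(self, solve, max_attempts=2):
        self.solve = solve  # hypothetical async callable: url -> token
        self.max_attempts = max_attempts

    async def process_response(self, request, response, spider):
        if response.status != 403:
            return response  # not blocked; pass the response on

        attempts = request.meta.get("captcha_attempts", 0)
        if attempts >= self.max_attempts:
            return response  # give up and pass the 403 on

        token = await self.solve(request.url)
        meta = {
            **request.meta,
            "captcha_attempts": attempts + 1,
            "captcha_token": token,  # placeholder for site-specific use
        }
        # Returning a Request makes Scrapy reschedule it; dont_filter=True
        # keeps the duplicate filter from dropping the repeated URL.
        return request.replace(meta=meta, dont_filter=True)
```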
How does DataFlirt manage middleware across hundreds of spiders?
We don't put infrastructure logic in individual spider repositories. We inject our compiled Rust middleware via environment variables at container launch. This ensures uniform proxy routing, header rotation, and anti-bot handling across the entire fleet without requiring spider developers to maintain infrastructure code.
Is it legal to strip tracking headers in middleware?
Generally, yes. Modifying your own client's outbound headers (removing tracking cookies, spoofing User-Agents, or altering referrers) is standard HTTP client behavior and generally lawful, provided you aren't using those modifications to bypass authenticated access controls or commit fraud.