
What are Scrapy settings?

Scrapy settings form the configuration layer that dictates how a Scrapy spider interacts with the network, manages memory, and processes extracted items. Defined in settings.py or injected at runtime, these key-value pairs control everything from concurrency limits and download delays to middleware activation and retry policies. Misconfigure them, and your pipeline will either crawl too slowly to be useful or aggressively DDoS the target and get your IP pool permanently banned.

Scrapy · Concurrency · Configuration · Python · Tuning
// 02 — definitions

The crawler's
control plane.

Settings dictate the physics of your crawler. They define how fast it moves, how much memory it consumes, and how it reacts to failure.


TL;DR

Scrapy settings are a dictionary of parameters that govern the framework's core components. They control the Downloader (concurrency, timeouts), the Scheduler (queues, priorities), and the Item Pipeline. In production, hardcoded settings are an anti-pattern; they must be injected dynamically based on target health, proxy pool size, and anti-bot sensitivity.

01 Definition & structure

Scrapy settings are a dictionary of key-value pairs that configure the behavior of the Scrapy framework. They control everything from the core engine (concurrency, logging, memory limits) to the HTTP downloader (timeouts, retries, proxies) and the item pipeline (export formats, database connections).

Settings can be defined globally in a project's settings.py file, scoped to a specific spider using the custom_settings attribute, or injected dynamically via command-line arguments (-s KEY=VALUE) when starting the crawl process.
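As a rough illustration of those three entry points, the values might look like the fragment below. The setting names are real Scrapy settings; the spider name `products` and every number are placeholders chosen for the example.

```python
# settings.py: project-wide values (numbers are illustrative only)
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
RETRY_TIMES = 2

# A dict like this would be assigned to a spider's `custom_settings`
# class attribute to override the project values for that spider only.
PRODUCTS_SPIDER_CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,
    "COOKIES_ENABLED": False,
}

# Command-line override for a single run (hypothetical spider name):
#   scrapy crawl products -s DOWNLOAD_DELAY=2.0 -s RETRY_TIMES=5
```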

02 Concurrency vs. Delay

The two most critical settings for pipeline throughput are CONCURRENT_REQUESTS and DOWNLOAD_DELAY. They interact directly:

  • CONCURRENT_REQUESTS defines the maximum number of in-flight HTTP requests the Twisted reactor will manage at once.
  • DOWNLOAD_DELAY forces the engine to wait a specified number of seconds before dispatching the next request to the same domain.

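If you set a delay of 1.0 seconds, your maximum throughput to that domain is roughly 1 request per second, regardless of how high you set your concurrency limits. It is not exactly 1 because RANDOMIZE_DOWNLOAD_DELAY is enabled by default and varies each wait between 0.5× and 1.5× of DOWNLOAD_DELAY.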
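The interaction can be sketched as a simple upper bound. This is a back-of-the-envelope model, not Scrapy's scheduler: the function name is made up for the example, `concurrency` is assumed to be the effective per-domain limit, and the jitter from RANDOMIZE_DOWNLOAD_DELAY is ignored.

```python
def max_requests_per_second(concurrency: int, avg_latency: float,
                            download_delay: float = 0.0) -> float:
    """Upper bound on requests per second to a single domain."""
    # With no delay, throughput is capped by how many requests fit in flight.
    concurrency_bound = concurrency / avg_latency
    if download_delay <= 0:
        return concurrency_bound
    # With a delay, at most one request is dispatched per delay interval.
    return min(concurrency_bound, 1.0 / download_delay)


print(max_requests_per_second(32, 0.5, download_delay=1.0))  # 1.0
print(max_requests_per_second(8, 0.5))                       # 16.0
```

With a 1.0 second delay the concurrency value stops mattering, which is the point of the paragraph above.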

03 Middleware and Pipeline activation

Scrapy uses settings dictionaries to activate and order its processing hooks. DOWNLOADER_MIDDLEWARES, SPIDER_MIDDLEWARES, and ITEM_PIPELINES are dictionaries where the key is the class path and the value is an integer that sets the execution order. For middlewares, lower numbers sit closer to the engine and higher numbers closer to the downloader (or, for spider middlewares, the spider). For item pipelines, items pass through lower-valued pipelines first.

To disable a built-in middleware (like the default UserAgentMiddleware), you must explicitly declare it in your settings and assign it a value of None.
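A minimal settings fragment along these lines, with a small helper that mimics how an order dict is read. The `myproject.*` class paths are hypothetical, and the helper is a simplification: it ignores the merge with Scrapy's built-in default middleware list.

```python
# settings.py fragment. "myproject.*" paths are hypothetical placeholders.
DOWNLOADER_MIDDLEWARES = {
    # None disables the built-in middleware entirely.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "myproject.middlewares.RotatingUserAgentMiddleware": 400,
    "myproject.middlewares.ProxyRotationMiddleware": 610,
}

ITEM_PIPELINES = {
    "myproject.pipelines.ValidateItemPipeline": 100,
    "myproject.pipelines.ExportPipeline": 800,
}


def enabled_in_order(components: dict) -> list:
    """Drop entries set to None, then sort the rest by their integer order."""
    active = {path: order for path, order in components.items() if order is not None}
    return sorted(active, key=active.get)


for path in enabled_in_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```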

04 How DataFlirt handles it

We treat Scrapy settings as ephemeral state. Our spiders contain zero hardcoded configuration. When our orchestration layer spins up a Scrapy worker, it queries a central Redis configuration store for the target's current profile. It then injects the optimal CONCURRENT_REQUESTS, proxy rotation rules, and retry policies via the Scrapyd API.

If our monitoring detects an uptick in 403 Forbidden responses, the control plane updates the settings payload. The next worker to spin up automatically inherits the more conservative crawl rate, preventing cascading IP bans across the fleet.
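A hedged sketch of what that flow could look like, not DataFlirt's actual control plane. Scrapyd's schedule.json endpoint accepts repeated `setting=KEY=VALUE` form fields; everything else here is an assumption made for illustration: the function names, the 5% threshold for 403s, and the "halve the per-domain concurrency" rule. The profile is a plain dict standing in for whatever the configuration store returns.

```python
from urllib.parse import urlencode


def conservative_profile(profile: dict, forbidden_rate: float,
                         threshold: float = 0.05) -> dict:
    """Halve per-domain concurrency when the share of 403s exceeds the threshold.

    Both the threshold and the halving rule are assumptions for this sketch.
    """
    updated = dict(profile)
    if forbidden_rate > threshold:
        current = profile["CONCURRENT_REQUESTS_PER_DOMAIN"]
        updated["CONCURRENT_REQUESTS_PER_DOMAIN"] = max(1, current // 2)
    return updated


def schedule_form_body(project: str, spider: str, settings: dict) -> str:
    """Build the form body for a POST to Scrapyd's schedule.json endpoint."""
    pairs = [("project", project), ("spider", spider)]
    # Scrapyd reads each repeated "setting" field as KEY=VALUE.
    pairs += [("setting", f"{key}={value}") for key, value in settings.items()]
    return urlencode(pairs)


profile = {"CONCURRENT_REQUESTS_PER_DOMAIN": 8, "RETRY_TIMES": 5}
payload = schedule_form_body("dataflirt", "ecom", conservative_profile(profile, 0.12))
print(payload)
```

The sketch stops at building the request body; sending it, and reading the profile from a store, are left out on purpose.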

05 The DOWNLOAD_TIMEOUT trap

Scrapy's default DOWNLOAD_TIMEOUT is 180 seconds. In a high-concurrency scraping environment using residential proxies, this is disastrous. Residential nodes frequently drop connections silently. If you have a concurrency limit of 32, and 32 proxy requests hang, your spider will sit completely idle for 3 minutes waiting for timeouts.

Production pipelines should aggressively lower this to 10–15 seconds. It is always faster to fail quickly, drop the bad proxy, and retry the request than to wait for a dead connection to resolve.
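To put numbers on the cost, here is a small worst-case calculation. It assumes every attempt for a URL hangs until the timeout fires and that retries follow Scrapy's default RETRY_TIMES of 2; the function name is invented for the example.

```python
def worst_case_wait_seconds(download_timeout: float, retry_times: int) -> float:
    """Total time spent waiting on one URL if every attempt hangs until timeout."""
    attempts = 1 + retry_times  # the first try plus each retry
    return download_timeout * attempts


print(worst_case_wait_seconds(180, 2))  # defaults: 540 seconds on one dead URL
print(worst_case_wait_seconds(15, 2))   # lowered timeout: 45 seconds
```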

// 03 — tuning physics

Calculating
throughput limits.

Scrapy's actual request rate is a function of concurrency settings, network latency, and explicit delays. DataFlirt's scheduler calculates these dynamically per target to maximize throughput without triggering rate limits.

Theoretical max throughput
$R_{\max} = \text{CONCURRENT\_REQUESTS} / \text{Latency}_{\text{avg}}$
Assumes DOWNLOAD_DELAY is 0. Bound by Twisted reactor CPU usage. Source: Scrapy Downloader mechanics

AutoThrottle target delay
$\text{Delay} = \text{Latency}_{\text{avg}} / \text{AUTOTHROTTLE\_TARGET\_CONCURRENCY}$
AutoThrottle adjusts delays dynamically based on rolling latency averages. Source: scrapy.extensions.autothrottle

Memory usage limit
$\text{Max\_Bytes} = \text{MEMUSAGE\_LIMIT\_MB} \times 1024^{2}$
Triggers a graceful shutdown if the spider leaks memory beyond this threshold. Source: scrapy.extensions.memusage

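The AutoThrottle rule can be sketched in a few lines. This is a simplified reading of the algorithm described in the Scrapy documentation, not the extension's code; the function and parameter names are chosen for the example.

```python
def next_autothrottle_delay(current_delay: float, latency: float,
                            target_concurrency: float,
                            min_delay: float, max_delay: float,
                            response_ok: bool = True) -> float:
    """One AutoThrottle-style update of the download delay."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the current delay toward the target...
    new_delay = (current_delay + target_delay) / 2.0
    # ...but never settle below the target itself.
    new_delay = max(target_delay, new_delay)
    # Clamp to the configured bounds.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are not allowed to speed the crawl up.
    if not response_ok and new_delay <= current_delay:
        return current_delay
    return new_delay


print(next_autothrottle_delay(5.0, latency=2.0, target_concurrency=1.0,
                              min_delay=0.0, max_delay=60.0))  # 3.5
```

Because the input is observed latency, anything that inflates latency (a slow proxy, for instance) pushes the delay up even when the target itself is healthy.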
// 04 — startup trace

Injecting settings
at runtime.

A DataFlirt worker initializing a Scrapy spider. Settings are pulled from the central registry, overriding the project defaults to match the target's current anti-bot threshold.

Scrapy 2.11 · dynamic config · Twisted reactor
// fetching dynamic config for target: ecom_in_04
config.source: "redis://control-plane:6379/settings"
overrides.applied: 14

// scrapy startup sequence
[scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: dataflirt_worker)
[scrapy.crawler] INFO: Overridden settings:
{'CONCURRENT_REQUESTS': 32,
'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
'DOWNLOAD_DELAY': 0.25,
'COOKIES_ENABLED': False,
'RETRY_TIMES': 5,
'DOWNLOADER_MIDDLEWARES': {...}}

// middleware initialization
[scrapy.middleware] INFO: Enabled downloader middlewares:
['dataflirt.middlewares.ProxyRotationMiddleware',
'dataflirt.middlewares.JA3SpoofingMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware']

spider.status: running
reactor: epoll
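One simple way an orchestrator could hand overrides to a worker is to turn a dict into `-s KEY=VALUE` arguments for `scrapy crawl`. The sketch below only builds the argument list; the helper name and the spider name `ecom_in_04` are assumptions for illustration.

```python
def crawl_command(spider_name: str, overrides: dict) -> list:
    """Build an argv list that starts a crawl with per-run setting overrides."""
    argv = ["scrapy", "crawl", spider_name]
    for key, value in overrides.items():
        # Each override becomes its own "-s KEY=VALUE" pair.
        argv += ["-s", f"{key}={value}"]
    return argv


overrides = {
    "CONCURRENT_REQUESTS": 32,
    "DOWNLOAD_DELAY": 0.25,
    "COOKIES_ENABLED": False,
}
print(" ".join(crawl_command("ecom_in_04", overrides)))
```

The resulting list can be passed to a process launcher such as `subprocess.run`; values arrive in Scrapy as strings and are converted when the setting is read.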
// 05 — impact radius

Settings that break
pipelines.

Ranked by their contribution to pipeline failures, timeouts, and target bans across our managed Scrapy fleet. Misunderstanding concurrency and cookie state accounts for the vast majority of issues.

Pipelines analyzed: 1,200+
Framework: Scrapy 2.x
Updated: 2026-05-19

01 CONCURRENT_REQUESTS (rate limit trigger): Setting this too high without proxy rotation all but guarantees IP bans.
02 COOKIES_ENABLED (state leakage): Default is True. Leaks session state across requests, ruining proxy anonymity.
03 DOWNLOAD_TIMEOUT (worker starvation): Default 180s is too long. Slow or dead proxies hold concurrency slots open until the timeout fires.
04 RETRY_HTTP_CODES (infinite loops): Retrying 403s without rotating fingerprints wastes bandwidth and gets you blocked harder.
05 MEMUSAGE_LIMIT_MB (OOM crashes): Failing to set this allows memory leaks to kill the entire container.
// 06 — fleet configuration

No hardcoded settings,
configuration as a dynamic state.

In a distributed scraping fleet, a static settings.py is a liability. Target latency fluctuates, proxy pools degrade, and anti-bot thresholds shift hourly. DataFlirt treats Scrapy settings as a dynamic payload injected at spider startup. If a target starts throwing 429 Too Many Requests, our control plane automatically reduces CONCURRENT_REQUESTS_PER_DOMAIN and pushes the updated config to all active workers without restarting the spiders. We decouple the spider logic from the operational physics.

worker-config-payload.json

Runtime settings injected into a Scrapy worker via our orchestration API.

CONCURRENT_REQUESTS: 64 (high throughput)
DOWNLOAD_DELAY: 0.0 (proxy-managed)
COOKIES_ENABLED: False (stateless)
DOWNLOAD_TIMEOUT: 15 (fail fast)
RETRY_TIMES: 3
HTTPCACHE_ENABLED: False
AUTOTHROTTLE_ENABLED: False (disabled)


// 07 — FAQ

Common
questions.

Common questions about Scrapy settings resolution, concurrency tuning, and managing configuration at scale.

What is the resolution order for Scrapy settings?
Scrapy resolves settings by priority, from lowest to highest: built-in defaults → per-command defaults → project settings.py → spider custom_settings → command-line arguments (-s). Command-line arguments always win, which is how orchestrators inject dynamic configs at runtime.
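A toy resolver makes the precedence concrete. The priority numbers mirror the values Scrapy assigns to these sources, but the resolver itself is a sketch written for this example, not Scrapy's Settings class.

```python
# Higher number wins. Names and values follow Scrapy's priority levels.
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


def resolve(layers):
    """Merge (source, settings_dict) pairs; the highest-priority source wins per key."""
    resolved = {}
    for source, values in layers:
        priority = PRIORITIES[source]
        for key, value in values.items():
            if key not in resolved or priority >= resolved[key][1]:
                resolved[key] = (value, priority)
    return {key: value for key, (value, _) in resolved.items()}


layers = [
    ("cmdline", {"DOWNLOAD_DELAY": 2.0}),
    ("default", {"DOWNLOAD_DELAY": 0, "RETRY_TIMES": 2}),
    ("spider", {"DOWNLOAD_DELAY": 1.0}),
    ("project", {"DOWNLOAD_DELAY": 0.5, "RETRY_TIMES": 5}),
]
print(resolve(layers))  # {'DOWNLOAD_DELAY': 2.0, 'RETRY_TIMES': 5}
```

The order in which layers are supplied does not matter; only the priority of the source does.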
Why should I disable COOKIES_ENABLED when using proxies?
By default, Scrapy tracks cookies across requests. If you route requests through a rotating proxy pool, the target server will see the same session cookie coming from 50 different IP addresses. This is a massive red flag for anti-bot systems. Set COOKIES_ENABLED = False to ensure each request is truly stateless.
Is the AutoThrottle extension good for production scraping?
Usually, no. AutoThrottle adjusts the download delay based on observed response latency. Behind a rotating residential proxy pool, latency fluctuates with the proxy node rather than the target server, so AutoThrottle misreads slow proxies as a struggling target and slows the crawl unnecessarily. We disable it and manage concurrency explicitly.
How do I fix memory leaks in long-running Scrapy spiders?
First, set JOBDIR to persist state to disk instead of keeping the request queue in memory. Second, configure MEMUSAGE_LIMIT_MB to gracefully shut down the spider before the OS OOM-killer terminates the process. Finally, check your custom middleware — storing references to Response objects is the most common cause of Scrapy memory leaks.
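The first two steps might look like this in settings.py. The setting names are real; the directory name and the limits are placeholders, and `limit_bytes` is only there to show the conversion (Scrapy ignores lowercase names in a settings module).

```python
# settings.py fragment; the path and limits are illustrative
JOBDIR = "crawls/products-run-1"  # keep the scheduler queue and seen-request state on disk

MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536        # log a warning above this
MEMUSAGE_LIMIT_MB = 2048          # close the spider above this

# Not a Scrapy setting: just the byte value the limit corresponds to.
limit_bytes = MEMUSAGE_LIMIT_MB * 1024 ** 2
print(limit_bytes)  # 2147483648
```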
How does DataFlirt tune CONCURRENT_REQUESTS?
We don't guess. We run a calibration phase for new targets, starting at 2 concurrent requests and ramping up until we hit a 5% failure rate (timeouts or 429s). We then back off by 20% and lock that as the dynamic CONCURRENT_REQUESTS_PER_DOMAIN for that specific target. This is recalculated weekly.
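A minimal sketch of such a calibration loop, under stated assumptions: the answer above gives the start value, the 5% threshold, and the 20% back-off, but not the ramp step or an upper ceiling, so `step` and `ceiling` are invented here. `measure_failure_rate` is a hypothetical callable that runs a probe crawl at a given concurrency and returns the observed failure fraction.

```python
def calibrate(measure_failure_rate, start: int = 2, step: int = 2,
              ceiling: int = 64, max_failure_rate: float = 0.05,
              backoff_percent: int = 20) -> int:
    """Ramp concurrency until failures reach the threshold, then back off."""
    concurrency = start
    while concurrency <= ceiling:
        if measure_failure_rate(concurrency) >= max_failure_rate:
            # Integer arithmetic keeps the result a whole number of requests.
            return max(1, concurrency * (100 - backoff_percent) // 100)
        concurrency += step
    # The target never pushed back within the tested range.
    return ceiling


# Fake target for demonstration: starts failing 8% of requests at 20 concurrent.
def fake_target(concurrency: int) -> float:
    return 0.0 if concurrency < 20 else 0.08


print(calibrate(fake_target))  # 16
```

In practice each probe would run long enough for the failure rate to be meaningful; the fake target here only exists to make the example deterministic.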
What is the difference between CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN?
CONCURRENT_REQUESTS is the absolute maximum number of active requests the Scrapy engine will process simultaneously across all domains. CONCURRENT_REQUESTS_PER_DOMAIN limits how many of those requests can hit a single specific domain. If you are crawling one site, the domain limit is the bottleneck.