
What is a Scrapy Spider?

A Scrapy Spider is the core Python class in the Scrapy framework that defines how a specific site should be crawled and parsed. It issues the initial requests, follows pagination links, and extracts structured data from each response. A poorly written spider can block the reactor thread, leak memory, and turn a high-throughput async pipeline into a fragile, single-threaded bottleneck.

Python · Async · Crawling · Scrapy · ETL
// 02 — definitions

The extraction
engine.

The logic layer where URL discovery meets data extraction, running asynchronously on top of Twisted.


TL;DR

A Scrapy Spider is a Python class that yields Requests and Items. It is the only part of a Scrapy project that has to contain site-specific logic. Everything else (concurrency, retries, proxy rotation, and export) is handled by the engine, middlewares, and pipelines, so the spider can focus purely on parsing.

01 Definition & structure
A spider inherits from scrapy.Spider. It requires a name, start_urls (or a start_requests method), and a parse callback. The parse method receives an HTTP response, extracts data using XPath or CSS selectors, and yields either a dictionary/Item (the data) or a new scrapy.Request (the next page to crawl).
02 How it works in practice
When you run a spider, the Scrapy engine schedules the initial requests. As responses return, they are routed back to the spider's callbacks. Because Scrapy uses Twisted for asynchronous I/O, a single spider can handle thousands of concurrent requests without threading overhead, provided you don't write blocking code in the parse method.
03 The blocking code trap
The most common way engineers break Scrapy spiders is by putting synchronous operations inside the callback. All callbacks run on the single Twisted reactor thread, so a requests.get() call, a heavy pandas transform, or a synchronous database insert inside parse() stalls every other in-flight request until it returns. Throughput can collapse by orders of magnitude, for example from 500 req/s to roughly 1 req/s.
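The effect of a blocking call on a single-threaded event loop can be demonstrated without Scrapy. The sketch below uses the standard library's asyncio as a stand-in for the Twisted reactor (an assumption made so the example runs anywhere); `time.sleep` plays the role of a synchronous call such as `requests.get()`, and all function names are hypothetical.

```python
import asyncio
import time


async def blocking_callback(delay):
    # time.sleep stands in for requests.get() or a synchronous DB insert:
    # it holds the event loop, so no other callback can run meanwhile.
    time.sleep(delay)


async def cooperative_callback(delay):
    # asyncio.sleep stands in for non-blocking I/O: it hands control
    # back to the loop while waiting.
    await asyncio.sleep(delay)


async def run_batch(callback, count, delay):
    # Run `count` callbacks concurrently and return the wall-clock time.
    started = time.perf_counter()
    await asyncio.gather(*(callback(delay) for _ in range(count)))
    return time.perf_counter() - started


def compare(count=10, delay=0.05):
    # Returns (seconds with blocking callbacks, seconds with cooperative ones).
    blocked = asyncio.run(run_batch(blocking_callback, count, delay))
    cooperative = asyncio.run(run_batch(cooperative_callback, count, delay))
    return blocked, cooperative


if __name__ == "__main__":
    blocked, cooperative = compare()
    print(f"blocking: {blocked:.2f}s, cooperative: {cooperative:.2f}s")
```

With blocking callbacks the batch takes roughly `count × delay`, because the waits run one after another; with cooperative callbacks the waits overlap. In a real spider, one common way out is to hand unavoidable synchronous work to a thread (Twisted provides `deferToThread` for this) or to a separate process.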
04 How DataFlirt handles it
We maintain a library of over 4,000 active Scrapy spiders. To manage this scale, our spiders are strictly stateless and contain zero delivery logic. They yield raw items to our Kafka-backed item pipelines. Proxy rotation, fingerprinting, and retries are entirely abstracted into custom middlewares, keeping the spider code under 100 lines on average.
05 Did you know: CrawlSpider pitfalls
The CrawlSpider subclass is often misused. It uses a ruleset to automatically follow links, which seems convenient for broad crawls. But for targeted data extraction, it obscures the crawl path and makes debugging pagination failures significantly harder. We mandate the base scrapy.Spider for all production pipelines.
// 03 — spider throughput

What limits a
spider's speed?

A spider's throughput is a function of network latency, concurrency settings, and CPU time spent in the parse callback. DataFlirt monitors callback latency to prevent reactor blocking.

Theoretical Throughput = CONCURRENT_REQUESTS / Avg_Latency
Max requests per second assuming zero CPU overhead. (Source: Scrapy Architecture)
Callback Latency = T_end - T_start
Time spent executing the parse method. Must stay under 10 ms. (Source: Twisted Reactor Limits)
Memory Footprint = Base_RAM + (Queue_Size × Req_Size)
Unbounded queues will OOM the spider process. (Source: DataFlirt Infrastructure SLO)
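The three formulas above can be worked through with concrete numbers. The inputs below (32 concurrent requests, 250 ms latency, a 2 KiB request, a one-million-entry queue) are illustrative assumptions, not measurements from any real crawl, and the function names are made up for this sketch.

```python
import time


def theoretical_throughput(concurrent_requests, avg_latency_s):
    """Upper bound on requests per second, assuming zero CPU overhead."""
    return concurrent_requests / avg_latency_s


def timed_call(func, *args):
    """Return (result, T_end - T_start) for one call to func.

    For generator callbacks, func must drain the generator itself,
    otherwise only the generator's creation is timed.
    """
    t_start = time.perf_counter()
    result = func(*args)
    t_end = time.perf_counter()
    return result, t_end - t_start


def memory_footprint(base_ram_bytes, queue_size, req_size_bytes):
    """Base_RAM + (Queue_Size x Req_Size), in bytes."""
    return base_ram_bytes + queue_size * req_size_bytes


if __name__ == "__main__":
    # 32 requests in flight, each taking 0.25 s on average.
    print(theoretical_throughput(32, 0.25), "req/s")

    # Time a stand-in for a parse callback.
    _, latency = timed_call(sorted, range(10_000))
    print(f"callback latency: {latency * 1000:.2f} ms")

    # 100 MiB base process plus 1,000,000 queued requests of 2 KiB each.
    total = memory_footprint(100 * 1024**2, 1_000_000, 2048)
    print(f"{total / 1024**2:.1f} MiB")
```

The memory line is the one to watch: under these assumed sizes, a million pending requests already cost about 2 GiB before any page has been parsed.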
// 04 — spider execution log

A spider's lifecycle
in production.

Standard Scrapy log output for a targeted e-commerce spider, showing the async engine handling discovery, extraction, and item routing.

Twisted reactor · async I/O · item pipeline
2026-05-19 14:22:01 [scrapy.utils.log] INFO: Scrapy 2.11.1 started
2026-05-19 14:22:01 [scrapy.core.engine] INFO: Spider opened
spider_name: "in_amazon_laptops"
2026-05-19 14:22:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://target.com/laptops>
callback: "parse_listing" latency: 142ms
2026-05-19 14:22:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://target.com/laptops>
item.sku: "B09X6789" item.price: 84999
2026-05-19 14:22:02 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://target.com/laptops?page=2>
middleware: "DataFlirtProxyRetry" action: "rotating session, retrying"
2026-05-19 14:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://target.com/laptops?page=2>
callback: "parse_listing" latency: 188ms
2026-05-19 14:25:10 [scrapy.core.engine] INFO: Closing spider (finished)
item_scraped_count: 14,205
request_count: 612
finish_reason: "finished"
// 05 — failure modes

Why spiders crash
and burn.

Scrapy is incredibly stable, but the code written inside the spider often isn't. These are the most common reasons a spider fails in production, ranked by frequency across our fleet.

SPIDERS MONITORED: 4,200+ active
FRAMEWORK: Scrapy 2.11
UPDATED: 2026-05-19
01 Selector Rot
DOM changes · Site structure updates break XPath/CSS extraction

02 Reactor Blocking
Sync code · Using requests.get or heavy CPU tasks in callbacks

03 Memory Leaks
Unbounded queues · Yielding too many requests without yielding items

04 Unhandled Exceptions
Missing try/except · A single bad field parse crashes the callback

05 Infinite Pagination
Logic error · Spider fails to detect the last page and loops
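Two of the failure modes above, unhandled exceptions and infinite pagination, can be guarded against with small helpers called from the callback. This is a minimal sketch under stated assumptions: `parse_price`, `should_follow`, and the `max_pages` budget are hypothetical names invented for illustration, not part of Scrapy.

```python
import re

_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw):
    """Return the price as a float, or None if the field is missing or malformed.

    Returning None instead of raising keeps one bad field from
    aborting the whole callback.
    """
    if not raw:
        return None
    match = _PRICE_RE.search(raw.replace(",", ""))
    return float(match.group()) if match else None


def should_follow(next_url, seen_urls, pages_crawled, max_pages=500):
    """Decide whether to request the next page.

    Stops on a missing link, on a URL already visited (a pagination
    loop), or once the hypothetical max_pages budget is spent.
    """
    if not next_url or next_url in seen_urls:
        return False
    if pages_crawled >= max_pages:
        return False
    seen_urls.add(next_url)
    return True
```

Scrapy's default duplicate filter already drops requests for URLs it has seen, so the page budget matters most on sites that keep generating new page numbers past the real last page.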
// 06 — DataFlirt's spider architecture

Dumb spiders,

smart infrastructure.

At DataFlirt, we treat Scrapy spiders as disposable parsing functions. They do not know about proxies, they do not handle retries, and they do not write to databases. All complex logic is pushed down into the middlewares and item pipelines. This strict separation of concerns allows us to deploy thousands of spiders that are easy to read, trivial to test, and highly resilient to infrastructure changes.

spider-runtime-context

Live context of a DataFlirt spider running in our Kubernetes cluster.

spider.class DataFlirtBaseSpider
middleware.proxy enabled · residential_pool
middleware.fingerprint enabled · ja3_spoofing
pipeline.kafka connected · topic: raw_items
reactor.latency 4ms avg
memory.usage 142 MB
status running
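The separation described above, where the spider only yields items and delivery happens elsewhere, could look roughly like the item pipeline below. It is a sketch and not DataFlirt's actual code: the `producer` object and its `send(topic, payload)` method are assumptions standing in for a real broker client, and only the `raw_items` topic name is taken from the runtime context above.

```python
import json


class QueuePublishPipeline:
    """Hypothetical item pipeline that forwards items to a message producer.

    `producer` is any object with a send(topic, payload_bytes) method;
    that interface is an assumption of this sketch, not a specific
    Kafka client API.
    """

    def __init__(self, producer, topic="raw_items"):
        self.producer = producer
        self.topic = topic

    def process_item(self, item, spider):
        # Scrapy calls process_item for every item the spider yields.
        payload = json.dumps(dict(item), sort_keys=True).encode("utf-8")
        self.producer.send(self.topic, payload)
        return item  # hand the item on to any later pipeline


class InMemoryProducer:
    """Test double standing in for a real broker client."""

    def __init__(self):
        self.sent = []

    def send(self, topic, payload):
        self.sent.append((topic, payload))


if __name__ == "__main__":
    producer = InMemoryProducer()
    pipeline = QueuePublishPipeline(producer)
    pipeline.process_item({"sku": "B09X6789", "price": 84999}, spider=None)
    print(producer.sent)
```

Because the pipeline receives its producer as a constructor argument, it can be tested without a broker. In a real project the pipeline would be registered in `ITEM_PIPELINES` and built from settings; that wiring is omitted here.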


// 07 — FAQ

Common
questions.

About Scrapy spiders, async programming, memory management, and how DataFlirt scales extraction.

What is the difference between a Scrapy Spider and a Scrapy Project?
A project is the entire directory structure containing settings, middlewares, pipelines, and spiders. A spider is just one Python class inside that project that defines how to scrape a specific target. You can have hundreds of spiders in a single project sharing the same infrastructure.
Can I use Selenium or Playwright inside a Scrapy spider?
Yes, but you shouldn't run them synchronously inside the parse method, or you'll block the reactor. Use scrapy-playwright or a custom downloader middleware to handle browser automation asynchronously, allowing the spider to remain fast and non-blocking.
Is it legal to run high-concurrency spiders against public sites?
Concurrency itself isn't illegal, but ignoring a site's robots.txt Crawl-delay or causing a denial of service can lead to legal action under statutes like the CFAA or equivalent laws. Tune CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY, or enable AutoThrottle, so the crawl stays within what the target's infrastructure can serve.
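Why does my spider consume gigabytes of RAM and crash?
The usual cause is an unbounded request queue rather than a leak in the strict sense. If your spider yields requests much faster than the downloader can process them, the scheduler queue keeps growing and every pending request is held in memory. Yield requests incrementally rather than all at once, tune CONCURRENT_REQUESTS, and use request priorities or a pausable job (JOBDIR, which keeps pending requests in a disk queue) to manage backpressure.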
How does DataFlirt deploy and scale Scrapy spiders?
We containerize our Scrapy projects and deploy them on Kubernetes. Instead of running one massive spider, we partition the workload (e.g., by category or region) and spin up dozens of identical spider pods, coordinating the crawl state via Redis using a custom scrapy-redis implementation.
Should I use XPath or CSS selectors in my spider?
Scrapy supports both natively. CSS selectors are generally easier to read for simple extractions, but XPath is far more powerful for complex DOM traversal (e.g., selecting an element based on the text of its sibling). Under the hood, Scrapy converts CSS to XPath anyway using cssselect.
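The concurrency and memory answers above come down to a handful of project settings. The fragment below is a sketch of a `settings.py` with illustrative values only; the numbers and the `JOBDIR` path are assumptions, not recommendations for any particular site.

```python
# settings.py (fragment); all values are illustrative.

# Politeness: limit parallelism and pace requests per domain.
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5  # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Memory: keep pending requests on disk and stop the crawl
# before the process exhausts the host's RAM.
JOBDIR = "crawls/example-run"  # hypothetical path; also enables pause/resume
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
```

`ROBOTSTXT_OBEY` controls which URLs are fetched; pacing is configured separately through the delay and throttling settings, so both groups are usually set together.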