
What is a Scrapy Item Pipeline?

The Scrapy Item Pipeline is the sequential processing layer in the Scrapy framework where raw extracted records are cleansed, validated, deduplicated, and routed to storage. It sits at the boundary between the asynchronous fetch-and-parse engine and the downstream data sink. If you block the reactor thread here with synchronous database writes, the whole crawler process stalls: no requests go out and no responses are parsed until the write returns.

Scrapy · Data Cleaning · ETL · Async I/O · Validation
// 02 — definitions

Cleanse, validate,
deliver.

The assembly line that takes raw dictionaries from your spiders and turns them into production-ready datasets.


TL;DR

The Item Pipeline is a series of Python classes that receive scraped items one by one. Each class can modify the item, drop it if it fails validation, or write it to a database. Because Scrapy is built on Twisted, pipeline operations must be strictly non-blocking—otherwise, a slow database insert will stall the entire crawler.

01 Definition & structure

A Scrapy Item Pipeline is a Python class that implements a process_item(self, item, spider) method. When a spider yields an item, it is sent to the pipeline for processing. Multiple pipeline classes can be chained together; they execute in ascending order of the integer priority assigned to each class in the ITEM_PIPELINES setting (usually in settings.py).

Pipelines are typically used for:

  • Cleansing HTML data (e.g., stripping tags, normalizing whitespace)
  • Validating scraped data (checking for missing fields)
  • Checking for duplicates (and dropping them)
  • Storing the scraped item in a database or exporting to a file
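
As a minimal sketch of the first use case, a cleansing pipeline can be a plain class with a process_item method. The class name TextCleaningPipeline and the sample fields are hypothetical, the tag-stripping regex is deliberately naive, and the code assumes dict-like items.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")      # naive tag matcher, adequate for short fragments
WHITESPACE_RE = re.compile(r"\s+")   # any run of whitespace, including newlines


class TextCleaningPipeline:
    """Strip tags and normalize whitespace on every string field (hypothetical example)."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                text = unescape(TAG_RE.sub(" ", value))
                item[key] = WHITESPACE_RE.sub(" ", text).strip()
        return item  # returning the item hands it to the next pipeline


# Called directly here; inside a crawl, Scrapy makes this call for each item.
cleaned = TextCleaningPipeline().process_item(
    {"title": "<b>Blue   Kettle</b>\n", "sku": 42}, spider=None
)
print(cleaned)
```

Non-string fields such as sku pass through untouched, so the class can sit at the front of the chain without knowing the full item schema.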

02 How it works in practice

As the spider parses responses, it yields dictionaries or Item objects. These items enter the pipeline chain. If a pipeline class successfully processes the item, it must return the item object so the next class in the chain can receive it. If the item is invalid, the class raises a DropItem exception, which halts further processing for that specific item.
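
The return-or-drop contract described above can be sketched with two small classes. The field names, the class names, and the run_chain helper are assumptions made for illustration; run_chain only imitates what Scrapy does internally, and the fallback DropItem class exists solely so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    REQUIRED = ("sku", "price")  # hypothetical required fields

    def process_item(self, item, spider):
        missing = [f for f in self.REQUIRED if item.get(f) is None]
        if missing:
            raise DropItem(f"missing fields: {missing}")
        return item


class DuplicateFilterPipeline:
    def __init__(self):
        self.seen = set()  # keep only IDs, never references to whole items

    def process_item(self, item, spider):
        if item["sku"] in self.seen:
            raise DropItem(f"duplicate sku: {item['sku']}")
        self.seen.add(item["sku"])
        return item


def run_chain(item, pipelines):
    """Imitate the chain: pass the item along until one pipeline drops it."""
    try:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider=None)
    except DropItem:
        return None
    return item


chain = [RequiredFieldsPipeline(), DuplicateFilterPipeline()]
first = run_chain({"sku": "A1", "price": 10}, chain)    # passes both stages
repeat = run_chain({"sku": "A1", "price": 10}, chain)   # dropped as a duplicate
incomplete = run_chain({"sku": "B2"}, chain)            # dropped: no price
```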

03 The Twisted Reactor constraint

Because Scrapy runs on the Twisted asynchronous networking framework, it operates on a single main thread (the reactor). If you write a pipeline that makes a synchronous blocking call, such as a standard PostgreSQL insert or a time.sleep(), you block the entire reactor. No new requests are sent, and no responses are processed until that call finishes. To fix this, database pipelines must either run queries in a background thread pool through Twisted's adbapi, or otherwise return a Deferred from process_item that fires once the I/O completes, so the reactor keeps running in the meantime.
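
The paragraph above names adbapi; the sketch below swaps in a related technique that needs only the standard library. It assumes a Scrapy version that accepts coroutine pipelines (async def process_item) with the asyncio reactor enabled, and it uses sqlite3 purely as a stand-in for a real database driver. The class name and table layout are hypothetical.

```python
import asyncio
import sqlite3
import threading


class ThreadedSqlitePipeline:
    """Offload a blocking insert to a worker thread (sqlite3 is a stand-in)."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False lets worker threads reuse this connection;
        # the lock serializes access to it.
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.lock = threading.Lock()
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (sku TEXT, price REAL)"
        )

    def _insert(self, item):
        # Blocking work: runs in a worker thread, not on the event loop.
        with self.lock:
            self.conn.execute(
                "INSERT INTO products VALUES (?, ?)", (item["sku"], item["price"])
            )
            self.conn.commit()

    async def process_item(self, item, spider):
        await asyncio.to_thread(self._insert, item)  # the loop stays free meanwhile
        return item


async def demo():
    pipeline = ThreadedSqlitePipeline()
    items = [{"sku": f"S{i}", "price": i * 1.5} for i in range(3)]
    await asyncio.gather(*(pipeline.process_item(it, None) for it in items))
    return pipeline.conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]


row_count = asyncio.run(demo())
print(row_count)
```

Outside a crawl the sketch is driven with asyncio.run; inside Scrapy the framework would await process_item itself. With adbapi the shape is similar: the pool runs the query in a thread and process_item returns the resulting Deferred.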

04 How DataFlirt handles it

We treat Scrapy purely as an extraction engine. Our item pipelines do almost zero processing. Instead, we use a custom asynchronous pipeline that serializes the raw item and pushes it to a Kafka topic. This guarantees the Scrapy reactor is never blocked by heavy validation logic or database latency. All data cleaning, schema enforcement, and storage routing happen in a separate, horizontally scalable stream processing cluster.

05 Did you know: the open_spider and close_spider hooks

Pipeline classes can optionally implement open_spider(self, spider) and close_spider(self, spider) methods. Each is called once: open_spider when the spider is opened and close_spider when it is closed. This is the correct place to open and close database connections or file handles, rather than opening a new connection for every single item processed.
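
A small sketch of the two hooks, using a JSON Lines file as the resource that is opened once and closed once. The class name and default file name are hypothetical, and the hooks are invoked by hand at the bottom only to show the lifecycle; in a crawl, Scrapy calls them.

```python
import json
import os
import tempfile


class JsonLinesWriterPipeline:
    """Open the output file once per spider run, not once per item."""

    def __init__(self, path="items.jsonl"):  # hypothetical default file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Manual walk through the lifecycle that Scrapy would normally drive.
out_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
pipeline = JsonLinesWriterPipeline(out_path)
pipeline.open_spider(spider=None)
for record in ({"sku": "A1"}, {"sku": "B2"}):
    pipeline.process_item(record, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
```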

// 03 — pipeline throughput

How fast can you
process items?

Pipeline throughput must exceed the spider's extraction rate, or memory queues will bloat until the process OOMs. DataFlirt monitors pipeline latency per item to prevent backpressure.

Pipeline latency: $T_{\text{pipe}} = \sum t_{\text{component}} + t_{\text{io\_wait}}$
Total time an item spends traversing all enabled pipeline classes. (Source: Scrapy Architecture)

Max throughput: $R_{\max} = \dfrac{1}{T_{\text{pipe}}} \times \texttt{CONCURRENT\_ITEMS}$
Theoretical max items processed per second before queue buildup. (Source: Twisted Reactor Model)

DataFlirt drop rate: $D = \dfrac{\text{items\_dropped}}{\text{items\_scraped}}$
Maintained < 0.02% across our production pipelines. (Source: Internal SLO)
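
To make the three formulas concrete, here is a short worked calculation. Every number in it is an illustrative assumption rather than a measurement, except that 100 is Scrapy's documented default for CONCURRENT_ITEMS.

```python
def pipeline_latency(component_seconds, io_wait_seconds):
    """T_pipe: sum of per-component times plus time spent waiting on I/O."""
    return sum(component_seconds) + io_wait_seconds


def max_throughput(t_pipe, concurrent_items):
    """R_max = (1 / T_pipe) * CONCURRENT_ITEMS, in items per second."""
    return concurrent_items / t_pipe


def drop_rate(items_dropped, items_scraped):
    """D = items_dropped / items_scraped (0.0 when nothing was scraped)."""
    return items_dropped / items_scraped if items_scraped else 0.0


# Assumed figures: 1 ms cleaning, 2 ms validation, 2 ms waiting on I/O.
t_pipe = pipeline_latency([0.001, 0.002], io_wait_seconds=0.002)
r_max = max_throughput(t_pipe, concurrent_items=100)
d = drop_rate(items_dropped=3, items_scraped=20_000)

print(f"T_pipe = {t_pipe * 1000:.1f} ms")
print(f"R_max  = {r_max:,.0f} items/s")
print(f"D      = {d:.3%}")
```

Under these assumptions the latency is about 5 ms per item, which puts the ceiling near 20,000 items per second, and 3 drops in 20,000 items stays under the 0.02% target.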
// 04 — pipeline execution trace

From raw dict
to validated record.

A live trace of a single item passing through a standard e-commerce pipeline: cleaning, schema validation, deduplication, and async database insertion.

Twisted Deferred · Schema Validation · Asyncpg
// item yielded by spider
spider: "amazon_in_catalog"
item.raw_price: "₹ 1,299.00"

// PriceCleaningPipeline (priority: 100)
action: clean_price
item.price: 1299.00

// SchemaValidationPipeline (priority: 200)
check: "ProductSchema"
status: passed

// DuplicateFilterPipeline (priority: 300)
redis.sismember: "seen:B08L5TNJHG"
result: 0 // new item

// AsyncPostgresPipeline (priority: 800)
db.insert: "INSERT INTO products..."
latency: 12ms
pipeline.status: ITEM_PROCESSED
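
The first stage of the trace can be sketched as follows. The clean_price helper is hypothetical and assumes the format shown in the trace: a currency symbol, comma thousands separators, and a dot as the decimal mark. Decimal is used instead of float so the two decimal places survive.

```python
import re
from decimal import Decimal


def clean_price(raw):
    """Turn a display price like '₹ 1,299.00' into a Decimal (assumed format)."""
    # Keep digits, commas and dots, then drop the thousands separators.
    digits = re.sub(r"[^\d.,]", "", raw).replace(",", "")
    return Decimal(digits)


class PriceCleaningPipeline:
    def process_item(self, item, spider):
        item["price"] = clean_price(item["raw_price"])
        return item


item = PriceCleaningPipeline().process_item({"raw_price": "₹ 1,299.00"}, spider=None)
print(item["price"])
```

Locales that use a comma as the decimal mark would need a different rule; this sketch does not try to detect them.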
// 05 — pipeline bottlenecks

Where pipelines
choke and die.

The most common failure modes in Scrapy item pipelines, ranked by frequency across DataFlirt's managed infrastructure.

PIPELINES MONITORED: 850+ active
AVG LATENCY: < 5ms per item
UPDATED: 2026-05-19
01 Synchronous DB inserts: ~85% of stalls · Blocks the Twisted reactor thread
02 Memory leaks: OOM crashes · Keeping references to processed items
03 Slow external API calls: High latency · Enriching items synchronously
04 Regex CPU hogging: Reactor starvation · Complex text cleaning on large fields
05 Schema validation overhead: CPU bound · Heavy JSON schema checks per item
// 06 — DataFlirt's architecture

Decouple extraction,
scale processing independently.

While Scrapy's built-in pipelines are great for simple tasks, running heavy validation and database inserts inside the crawler process limits scalability. DataFlirt uses a lightweight Scrapy pipeline that simply pushes raw items to a Kafka topic asynchronously. All heavy lifting—schema validation, deduplication, and normalization—happens in a separate, horizontally scaled stream processing layer. This keeps the crawler fast, memory-efficient, and focused entirely on fetching and parsing.

KafkaProducerPipeline.py

Metrics from our decoupled pipeline architecture.

pipeline.class: DataFlirtKafkaPipeline
reactor.status: non-blocking
items.queued: 14,200/sec
avg.latency: 2.4ms
dropped.items: 0
memory.profile: stable


// 07 — FAQ

Common
questions.

About pipeline architecture, asynchronous I/O, validation strategies, and how DataFlirt scales Scrapy processing.

What happens if an item pipeline raises a DropItem exception?
The item is immediately discarded and stops processing through any subsequent pipeline classes. This is the standard way to filter out invalid, incomplete, or duplicate records before they reach your database. Scrapy logs the drop, allowing you to monitor data quality issues.
Why is my Scrapy crawler slowing down over time?
You are likely making synchronous blocking calls (like requests.get() or standard psycopg2 inserts) inside your pipeline. Scrapy runs its engine on a single reactor thread, so a blocking call in a pipeline halts all fetching and parsing until it returns. Move the work off the reactor with Twisted's adbapi, or use asyncio libraries such as aiohttp and asyncpg from an async def process_item, which requires Scrapy's asyncio reactor to be enabled.
How does DataFlirt handle schema changes in the pipeline?
We don't validate inside the Scrapy process. Our Scrapy pipelines act as dumb forwarders to Kafka. Schema validation happens downstream in a dedicated microservice, allowing us to quarantine bad records without crashing or slowing down the crawler fleet.
Can I use pipelines to download images?
Yes, Scrapy provides a built-in ImagesPipeline and FilesPipeline specifically for this. They asynchronously download media files associated with the scraped items and attach the local file paths back to the item dictionary, all without blocking the main reactor.
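
A settings fragment for enabling the built-in ImagesPipeline might look like this. The storage path and the sample item are hypothetical; the pipeline reads URLs from the item's image_urls field and writes its download results to an images field, and it needs the Pillow library installed.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "/data/scrapy/images"  # hypothetical directory for downloaded files

# Shape of an item the spider would yield (hypothetical values).
item = {
    "title": "Blue Kettle",
    "image_urls": ["https://example.com/kettle.jpg"],  # read by ImagesPipeline
    # "images": [...]  is filled in by the pipeline after download
}
```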
What is the ITEM_PIPELINES setting priority?
It's an integer (typically 0-1000) that determines the execution order. Lower numbers run first. You should order them logically: clean data first (100), validate schema second (200), deduplicate third (300), and write to storage last (800).
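
The ordering described in this answer corresponds to a settings fragment like the one below. The myproject.pipelines module path is hypothetical; the class names are taken from the execution trace earlier on this page. The last two lines only show that dictionary order is irrelevant and the integers decide.

```python
# settings.py (fragment); "myproject.pipelines" is a hypothetical module path
ITEM_PIPELINES = {
    "myproject.pipelines.AsyncPostgresPipeline": 800,     # store last
    "myproject.pipelines.PriceCleaningPipeline": 100,     # clean first
    "myproject.pipelines.DuplicateFilterPipeline": 300,   # deduplicate third
    "myproject.pipelines.SchemaValidationPipeline": 200,  # validate second
}

# Lower numbers run first, regardless of the order the entries are written in.
ordered = sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])
execution_order = [path.rsplit(".", 1)[-1] for path, _ in ordered]
print(execution_order)
```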
Is it legal to deduplicate data in the pipeline?
Deduplication itself is just a technical operation. However, ensuring you only store what you need aligns with the GDPR data minimization principle. Dropping PII or irrelevant fields early in the pipeline reduces your compliance risk surface and storage costs.
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h