
What is Scrapy Feed Export?

Scrapy Feed Export is the built-in serialization and delivery mechanism within the Scrapy framework that writes extracted item records to persistent storage. Instead of writing custom I/O logic for every spider, feed exports let you serialize items to JSON, JSON Lines, CSV, or XML files and push them directly to S3, GCS, or local disk via configuration alone (other formats, such as Parquet, need a custom exporter). It is the standard boundary where scraping ends and data engineering begins.

Scrapy · Data Delivery · Serialization · S3 Integration · ETL
// 02 — definitions

From memory
to disk.

The configuration-driven layer that serializes scraped items and routes them to cloud storage without custom pipeline code.


TL;DR

Scrapy Feed Export handles the serialization (JSON, CSV, XML) and storage (S3, FTP, local) of scraped items. It decouples extraction logic from delivery logic, allowing you to change output formats or destinations by modifying a single settings dictionary rather than rewriting spider code.

01 Definition & structure
Scrapy Feed Export is an out-of-the-box extension that serializes scraped items and stores them. It is configured entirely via the FEEDS dictionary in your settings.py. A feed definition requires two things: a URI (where to save it, e.g., s3://my-bucket/data.json) and a format (how to serialize it, e.g., json, csv). It supports local file systems, FTP, S3, and Google Cloud Storage natively.
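As a minimal sketch of such a feed definition, the settings.py fragment below declares a single feed. The bucket name and path are the placeholder from the example above, and the encoding option is an optional extra shown only for illustration.

```python
# settings.py (fragment): one feed, defined by its URI and its format.
FEEDS = {
    # Key: the URI that says where the feed is stored.
    "s3://my-bucket/data.json": {
        # Value: options for this feed; "format" selects the serializer.
        "format": "json",
        # Optional: output encoding for this feed.
        "encoding": "utf8",
    },
}
```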
02 How it works in practice
When a spider yields an item, it passes through the Item Pipelines. Once all pipelines finish processing the item, it is handed to the Feed Export extension. The extension uses an Exporter (like JsonLinesItemExporter) to convert the item into bytes, and a Storage Backend (like S3FeedStorage) to deliver those bytes to the destination. For remote backends such as S3, the bytes are first written to a local temporary file, and the upload runs in a separate thread once the feed (or batch) is closed, so a slow network upload won't block the spider from fetching more pages.
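The split between exporter and storage backend can be illustrated with a toy, standard-library-only sketch. The two classes below are hypothetical stand-ins written for this example; they are not Scrapy's JsonLinesItemExporter or S3FeedStorage, and only mimic the division of labor between them.

```python
import io
import json


class ToyJsonLinesExporter:
    """Stand-in for an exporter: turns each item into one line of bytes."""

    def __init__(self, file):
        self.file = file  # any binary file-like object

    def export_item(self, item):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line.encode("utf-8"))


class ToyMemoryStorage:
    """Stand-in for a storage backend: hands out a file, then 'delivers' it."""

    def __init__(self):
        self.delivered = None

    def open(self):
        return io.BytesIO()

    def store(self, file):
        # A real backend would upload or move the file here.
        self.delivered = file.getvalue()


storage = ToyMemoryStorage()
feed_file = storage.open()
exporter = ToyJsonLinesExporter(feed_file)

for item in [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.5}]:
    exporter.export_item(item)  # called once per scraped item

storage.store(feed_file)  # called when the feed is closed
print(storage.delivered.decode("utf-8"), end="")
```

Because the exporter only sees a file-like object and the storage only sees finished bytes, either side can be swapped without touching the other.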
03 Feed Export vs Item Pipelines
A common anti-pattern is writing custom Item Pipelines to save files to S3 or write CSVs. This reinvents the wheel. Pipelines should be used for mutation (cleaning text, dropping invalid records, looking up foreign keys). Feed Exports should be used for delivery. Separating these concerns means you can switch your output from a local CSV to an S3 JSON Lines file just by changing one line of config, without touching your Python code.
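A pipeline that sticks to mutation might look like the sketch below. The class name and the title and price fields are hypothetical; the point is that the pipeline returns a cleaned item and contains no file or S3 logic at all. In a real project it would be enabled through the ITEM_PIPELINES setting.

```python
class CleanPricePipeline:
    """Hypothetical pipeline: normalizes fields, leaves delivery to FEEDS."""

    def process_item(self, item, spider=None):
        item["title"] = item["title"].strip()
        # "$1,299.00" -> 1299.0
        item["price"] = float(item["price"].replace("$", "").replace(",", ""))
        return item


cleaned = CleanPricePipeline().process_item({"title": "  Laptop ", "price": "$1,299.00"})
print(cleaned)
```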
04 How DataFlirt handles it
We rely heavily on Scrapy's Feed Export architecture, but we extend it. We maintain custom Exporters for Parquet and Avro to support our enterprise data lake clients. We also use custom Storage Backends that stream items directly into Kafka topics or Snowflake stages, bypassing intermediate file storage entirely. This allows us to offer sub-minute data latency on high-frequency pricing pipelines.
05 Did you know?
Since Scrapy 2.3, Feed Exports support batching. By setting batch_item_count, Scrapy will automatically close the current file and start a new one once the threshold is reached; the feed URI must then contain %(batch_id)d or %(batch_time)s so that each batch gets its own file name. This is critical for long-running or continuous spiders, as it allows downstream ETL processes to start ingesting data chunks while the spider is still running, rather than waiting days for a single massive file to close.
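A batched feed could be configured roughly as follows. The bucket and path are placeholders, and the threshold of 50,000 is an arbitrary illustration. Scrapy expects a batch placeholder such as %(batch_id)d or %(batch_time)s in the URI so that batches do not overwrite one another.

```python
# settings.py (fragment): start a new file every 50,000 items.
FEEDS = {
    # %(batch_id)d is filled in by Scrapy with the sequence number of the batch.
    "s3://my-bucket/raw/%(name)s/batch-%(batch_id)d.jl": {
        "format": "jsonlines",
        "batch_item_count": 50000,
    },
}
```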
// 03 — feed performance

How fast can
you serialize?

Feed export performance is bounded by disk I/O, network egress to cloud storage, and the serialization overhead of the chosen format. DataFlirt monitors these metrics to prevent the export layer from bottlenecking the crawl.

Serialization Latency: $L = \text{items} \times T_{\text{serialize}}$
JSON Lines is fast and streams; XML is slow and CPU-heavy. Source: Scrapy Exporter Profiling
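Batch Upload Threshold: $B = \min(\text{max\_items}, \text{max\_bytes})$
A batch is uploaded when either limit is hit. Stock Scrapy (2.3+) only implements the item-count limit, via batch_item_count; a byte-size limit needs custom batching logic. Source: Scrapy FEEDS configuration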
DataFlirt Delivery SLO: $T_{\text{delivery}} = T_{\text{scrape}} + 45\,\text{s}$
Time from spider close to S3 object availability across our managed fleet. Source: Internal SLO
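The serialization-latency formula can be turned into a rough estimate by timing the serializer on a representative item. The sketch below uses only the standard library's json module rather than a Scrapy exporter, and the sample item and the 50,000-item count are assumptions for illustration; actual figures depend on item size and hardware.

```python
import json
import timeit

# Hypothetical sample item; real items will differ in size and shape.
sample_item = {"id": 1, "title": "Example product", "price": 19.99, "tags": ["a", "b"]}

runs = 10_000
# Average seconds to serialize one item as a JSON Lines record.
t_serialize = timeit.timeit(
    lambda: (json.dumps(sample_item) + "\n").encode("utf-8"), number=runs
) / runs

items = 50_000
estimated_latency = items * t_serialize  # L = items × T_serialize
print(f"T_serialize ≈ {t_serialize * 1e6:.2f} µs, L ≈ {estimated_latency:.3f} s for {items} items")
```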
// 04 — feed export trace

Flushing 50k items
to an S3 bucket.

A live trace of a Scrapy spider closing and the Feed Export extension flushing the final batch of JSON Lines to AWS S3.

Scrapy 2.11 · JSON Lines · S3 Boto3
// spider closing sequence
[scrapy.core.engine] INFO: Closing spider (finished)
[scrapy.extensions.feedexport] INFO: Feed export batch started

// serialization and storage
[feedexport] DEBUG: Storing jsonlines feed (50000 items)
[botocore.credentials] DEBUG: Found credentials in environment
[s3_storage] INFO: Uploading s3://df-client-042/raw/2026-05-19/batch_01.jl
[s3_storage] DEBUG: Part 1 uploaded (5.2 MB)
[s3_storage] DEBUG: Part 2 uploaded (5.2 MB)
[s3_storage] DEBUG: Part 3 uploaded (1.4 MB)

// stats collection
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
item_scraped_count: 50000
feedexport/success_count/S3FeedStorage: 1
feedexport/byte_count: 12384912
[scrapy.core.engine] INFO: Spider closed (finished)
// 05 — serialization formats

Choosing the right
export format.

Ranked by adoption across DataFlirt's managed Scrapy fleet. JSON Lines dominates due to its streaming nature and compatibility with modern data lakes.

Pipelines: 300+ active
Volume: 4B+ records/mo
Updated: 2026-05-19
01 JSON Lines (.jl / .jsonl) · streaming · Append-only, lake-ready, low memory footprint
02 CSV · flat schema · Legacy systems, poor nested data support
03 Parquet (Custom) · columnar · Analytics-ready, highly compressed, requires custom exporter
04 JSON (Array) · memory heavy · Requires full memory load, bad for large scrapes
05 XML · legacy · Enterprise integrations, high serialization overhead
// 06 — custom storage backends

Beyond the default,

streaming directly to the data warehouse.

While Scrapy's native S3 and FTP backends are excellent for file-based delivery, modern pipelines often require streaming inserts. DataFlirt extends the Feed Export architecture with custom storage backends that buffer items in memory and flush them directly to Snowflake, BigQuery, or Kafka topics. This bypasses the intermediate file stage entirely, bringing end-to-end latency down from minutes to under a minute.

FEEDS configuration

A production Scrapy settings dictionary mapping outputs to multiple destinations.

s3://bucket/raw/%(name)s.jl · format: jsonlines · primary
s3://bucket/raw/%(name)s.csv · format: csv · fields: [id, price]
batch_item_count · 50000 · memory limit
overwrite · False · append mode
store_empty · False · drop empty runs
urllib.parse.quote · enabled · safe URIs
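Written out as a settings.py fragment, a two-destination configuration along these lines might look like the sketch below. Three choices here are assumptions rather than part of the listing above: the %(batch_id)d suffix on the JSON Lines URI is added because batched feeds need a batch placeholder; overwrite is left out because, as far as we are aware, the S3 backend replaces objects rather than appending to them; and the urllib.parse.quote entry is not represented because it is unclear which setting it maps to.

```python
# settings.py (fragment): two feeds written from the same crawl.
FEEDS = {
    # Primary feed: JSON Lines, split into batches of 50,000 items.
    "s3://bucket/raw/%(name)s-%(batch_id)d.jl": {
        "format": "jsonlines",
        "batch_item_count": 50000,
        "store_empty": False,  # skip the upload when nothing was scraped
    },
    # Secondary feed: CSV restricted to two columns.
    "s3://bucket/raw/%(name)s.csv": {
        "format": "csv",
        "fields": ["id", "price"],
        "store_empty": False,
    },
}
```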


// 07 — FAQ

Common
questions.

About Scrapy Feed Exports, serialization formats, cloud storage integration, and how DataFlirt manages data delivery at scale.

What is the difference between Feed Exports and Item Pipelines?
Item Pipelines are for processing: cleaning data, dropping duplicates, validating schemas, or making database lookups per item. Feed Exports are strictly for serialization and storage: taking the final, processed item and writing it to a file (JSON, CSV) on a storage backend (S3, local disk).
How do I export to multiple destinations at once?
Since Scrapy 2.1, the FEEDS setting is a dictionary. You can define multiple URIs as keys, each with its own format and configuration. Scrapy serializes every item to each configured feed during the crawl and delivers each feed to its destination when that feed (or batch) is closed.
Why is my JSON export consuming all my RAM?
The json format produces a single JSON array. Scrapy writes it incrementally, but the file only becomes valid JSON once the closing bracket is written at the end of the crawl, and most parsers must load the entire array into memory to read it, which is where memory bloat on massive crawls usually comes from. Switch your format to jsonlines (.jl). It writes one independent JSON object per line, so both the writer and downstream readers can stream it with near-zero memory overhead.
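The practical difference shows up when a file is cut short. The standard-library sketch below simulates an interrupted write for both formats using three made-up items: the truncated array no longer parses at all, while the JSON Lines file loses only its final, partial record.

```python
import json

items = [{"id": 1}, {"id": 2}, {"id": 3}]

json_array = json.dumps(items)                              # one document
json_lines = "".join(json.dumps(i) + "\n" for i in items)   # one document per line

# Simulate a crawl that died mid-write by cutting off the last 5 characters.
broken_array = json_array[:-5]
broken_lines = json_lines[:-5]

try:
    json.loads(broken_array)
    array_ok = True
except json.JSONDecodeError:
    array_ok = False

recovered = []
for line in broken_lines.splitlines():
    try:
        recovered.append(json.loads(line))
    except json.JSONDecodeError:
        break  # only the final, partial line is lost

print(array_ok, recovered)
```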
Does Scrapy support Parquet natively?
No. Scrapy natively supports JSON, JSON Lines, CSV, and XML (plus Pickle and Marshal). To export to Parquet, you must write a custom exporter, typically a BaseItemExporter subclass registered under FEED_EXPORTERS, that uses pandas or pyarrow to buffer items into columnar chunks and write them to disk. We use custom Parquet exporters extensively for our analytics-ready data feeds.
How does DataFlirt handle failed S3 uploads?
Network blips happen. We wrap Scrapy's native S3 storage backend with custom retry logic, exponential backoff, and a dead-letter queue. If an S3 multipart upload fails after 5 retries, the batch is written to a persistent local volume and an alert is fired for asynchronous recovery. No scraped data is ever dropped due to a delivery failure.
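The general pattern of retrying with exponential backoff and falling back to local disk can be sketched with the standard library alone. This is a generic illustration, not DataFlirt's implementation: the function name, the upload callable, the delay schedule, and the fallback file naming are all hypothetical, and alerting is left out.

```python
import os
import tempfile
import time


def deliver_with_retry(upload, payload, fallback_dir, max_retries=5,
                       base_delay=1.0, sleep=time.sleep):
    """Try upload(payload); on repeated failure, park the payload on disk.

    Returns "uploaded" or the path of the fallback file.
    """
    for attempt in range(max_retries + 1):  # 1 initial try + max_retries retries
        try:
            upload(payload)
            return "uploaded"
        except Exception:
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    # All attempts failed: keep the batch locally for later recovery.
    fd, path = tempfile.mkstemp(prefix="failed-batch-", suffix=".jl", dir=fallback_dir)
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    return path


# Usage with a hypothetical uploader that fails twice, then succeeds.
calls = {"n": 0}


def flaky_upload(data):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("simulated network blip")


delays = []
result = deliver_with_retry(flaky_upload, b'{"id": 1}\n', tempfile.gettempdir(),
                            sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting.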
Can I partition exports by date or category?
Yes. Feed URIs support dynamic parameters. You can use %(time)s to append timestamps, %(name)s for the spider name, or even custom spider attributes like %(category)s. The placeholders are resolved when the feed is opened rather than per item, so each run (or batch) of a spider lands in a single partition; this is enough to build Hive-style directory structures on S3.
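Feed URI placeholders use Python's printf-style named formatting, so the effect of a template can be previewed in isolation. In the sketch below the bucket layout, spider name, category value, and timestamp are all made up for illustration; in a real crawl Scrapy supplies the values.

```python
# Feed URI template with Hive-style partition folders.
uri_template = "s3://bucket/category=%(category)s/dt=%(time)s/%(name)s.jl"

# Values Scrapy would supply at run time; hard-coded here for illustration.
params = {
    "category": "laptops",            # hypothetical spider attribute
    "time": "2026-05-19T10-00-00",    # example timestamp
    "name": "prices",                 # hypothetical spider name
}

print(uri_template % params)
```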