
What is Batch Processing?

Batch processing is the execution of data extraction, transformation, and load operations on a bounded, finite set of records at scheduled intervals, rather than continuously. In scraping infrastructure, it is the standard paradigm for catalog crawls, historical backfills, and daily pricing updates where completeness and consistency matter more than sub-second latency. Getting batch architecture wrong leads to memory exhaustion, silent data drops, and pipelines that fail to complete before the next scheduled run.

Data Engineering · ETL · Scheduling · Throughput · Airflow
// 02 — definitions

Bounded data, scheduled runs.

Why the vast majority of commercial web scraping pipelines operate on a schedule rather than a continuous stream.


TL;DR

Batch processing groups scraping and extraction tasks into discrete jobs that run at fixed intervals (e.g., daily at midnight). It optimizes for throughput and resource efficiency over latency. For 95% of data buyers — who ingest data into data warehouses like Snowflake or BigQuery — batch delivery is the most stable and cost-effective integration pattern.

01 Definition & structure
Batch processing in the context of web scraping means executing a pipeline over a known, finite list of targets at a specific time. A batch job has a distinct start and end. It typically involves three phases: Discovery (finding all URLs to scrape), Fetch & Extract (downloading HTML and parsing fields), and Load (writing the final dataset to storage). Because the dataset is bounded, you can calculate completion percentages, guarantee deduplication, and validate the entire dataset's schema before delivering it to the client.
02 How it works in practice
A scheduler (like Apache Airflow or a cron job) triggers the pipeline at a set interval (say, 02:00 UTC daily). The orchestrator spins up a cluster of worker nodes, and a central queue distributes the URLs among them. The workers fetch the pages, extract the data, and write intermediate results to a temporary store. Once the queue is empty, a final aggregation step deduplicates the records, runs schema validation, and writes the output as date-partitioned Parquet files to an S3 bucket. The workers are then spun down to save compute costs.
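The flow described above can be sketched with the Python standard library alone. This is a minimal illustration, not DataFlirt's orchestrator: `fetch_and_extract` is a hypothetical stand-in for the real HTTP fetch and parse, threads stand in for worker nodes, and an in-memory list stands in for the temporary store (a real job would write intermediate results to disk or a database).

```python
import queue
import threading


def fetch_and_extract(url):
    # Hypothetical stand-in for the real fetch + HTML parse step.
    return {"url": url, "title": f"item at {url}"}


def worker(url_queue, intermediate, lock):
    # Pull URLs until the central queue is drained, then exit.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        record = fetch_and_extract(url)
        with lock:
            intermediate.append(record)


def run_batch(urls, n_workers=4):
    # Central queue that distributes the bounded URL list.
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)

    # In-memory list for brevity only; assume a temporary store in practice.
    intermediate, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, intermediate, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Final aggregation: deduplicate by URL, then validate required fields.
    deduped = {rec["url"]: rec for rec in intermediate}
    required = {"url", "title"}
    valid = [rec for rec in deduped.values() if required <= rec.keys()]
    return sorted(valid, key=lambda rec: rec["url"])
```

Calling `run_batch` with a list that contains the same URL twice returns one record for it, because deduplication happens once, after every worker has finished.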
03 The idempotency requirement
Batch jobs must be resumable and idempotent. If a job processing 5 million URLs crashes at 4.9 million, you cannot afford to start over: a full re-run wastes compute, and hitting the target site with another 4.9 million requests risks an IP ban. Resumability comes from maintaining a persistent record of successfully processed URLs (often in Redis) so that a restart skips them. Idempotency comes from ensuring that writing the same extracted record twice does not corrupt the final dataset.
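A minimal sketch of both halves of that requirement, assuming SQLite as a stand-in for the persistent state store (the text mentions Redis). The table names, the `price` field, and the `scrape` callback are hypothetical.

```python
import sqlite3


def open_state(path=":memory:"):
    # Use a file path in practice so the state survives a crash.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS records (url TEXT PRIMARY KEY, price REAL)"
    )
    return db


def process(db, url, scrape):
    # Resume logic: skip URLs that an earlier run already completed.
    if db.execute("SELECT 1 FROM done WHERE url = ?", (url,)).fetchone():
        return False

    record = scrape(url)

    # One transaction: the record and its "done" marker commit together.
    with db:
        # Keyed upsert: writing the same record twice leaves one row.
        db.execute(
            "INSERT OR REPLACE INTO records (url, price) VALUES (?, ?)",
            (url, record["price"]),
        )
        db.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
    return True
```

Because the record and the completion marker are committed in the same transaction, a crash between the two cannot leave a URL marked as done without its data.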
04 How DataFlirt handles it
We treat batch pipelines as mission-critical ETL workloads. Our orchestrator dynamically scales worker concurrency based on the target's response times and our SLA window. If a target site slows down, we don't just blindly push more requests and cause a DoS; we throttle back and alert our on-call engineers if the SLA margin drops below 15%. Every batch delivery is atomic — clients never see partial or mid-run data.
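One way to picture the throttling behaviour is an additive-increase, multiplicative-decrease (AIMD) rule. This is an assumed technique chosen for illustration; the text does not say which control policy the orchestrator uses. The function names, the latency target, and the concurrency bounds are hypothetical. Only the 15% alert threshold comes from the paragraph above.

```python
def next_concurrency(current, p95_latency_s, target_latency_s,
                     floor=1, ceiling=500):
    # Back off sharply when the target slows down, ramp up gently otherwise.
    if p95_latency_s > target_latency_s:
        proposed = current // 2   # multiplicative decrease
    else:
        proposed = current + 1    # additive increase
    return max(floor, min(ceiling, proposed))


def should_alert(p99_run_time_s, sla_window_s, threshold=0.15):
    # SLA margin = 1 - (run time / SLA window); alert when it gets thin.
    margin = 1 - p99_run_time_s / sla_window_s
    return margin < threshold
```

With these rules, a worker pool at 100 that sees latency above target drops to 50 on the next adjustment, while a healthy target lets it grow by one worker per adjustment.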
05 The memory exhaustion trap
The most common mistake junior engineers make when writing batch scrapers is accumulating the extracted data in memory (e.g., appending to a Python list) and writing it to disk only at the end of the script. This works for 10,000 records but causes an Out-Of-Memory (OOM) crash at 1,000,000 records. Production batch jobs stream their output to disk or a database continuously, keeping the worker's memory footprint flat regardless of the batch size.
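A minimal sketch of the streaming pattern, assuming newline-delimited JSON as the output format. `extract_records` is a hypothetical placeholder for the real fetch-and-parse step. The generator yields one record at a time, so only a single record is held in memory regardless of batch size.

```python
import json


def extract_records(urls):
    # Generator: produces records lazily instead of building a list.
    for url in urls:
        yield {"url": url, "price": 0.0}  # placeholder for real extraction


def write_jsonl(records, path):
    # Each record is written as soon as it is produced.
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The anti-pattern would be `results.append(record)` inside the loop followed by a single write at the end; the version above produces the same file without the list.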
// 03 — batch metrics

How to size a batch job.

Batch pipelines are constrained by the processing window. If a daily job takes 25 hours to run, the pipeline is fundamentally broken. DataFlirt models concurrency to guarantee completion within the SLA.

Required Concurrency: C = (Total_URLs / Target_Window_Seconds) / Effective_RPS_per_Worker
Determines how many parallel workers are needed to finish the batch on time. (Source: DataFlirt capacity planner)

Batch Completion Time: T = Overhead + N / (Workers × Throughput_per_Worker)
Total time from job trigger to final S3 delivery, where N is the number of URLs in the batch. (Source: standard ETL model)

DataFlirt SLA Margin: M = 1 − (P99_Run_Time / SLA_Window)
We target M > 0.3 to absorb target site latency spikes or proxy retries. (Source: internal SLO)
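The three formulas translate directly into code. The inputs in the usage lines (1,000,000 URLs, a 4-hour window, 2 requests per second per worker, 300 seconds of overhead) are illustrative assumptions, not DataFlirt figures, and the modelled completion time is used here as a stand-in for a measured P99 run time.

```python
import math


def required_concurrency(total_urls, window_s, rps_per_worker):
    # C = (Total_URLs / Target_Window_Seconds) / Effective_RPS_per_Worker
    # Rounded up because a fraction of a worker cannot be provisioned.
    return math.ceil((total_urls / window_s) / rps_per_worker)


def completion_time_s(overhead_s, n_urls, workers, rps_per_worker):
    # T = Overhead + N / (Workers x per-worker throughput)
    return overhead_s + n_urls / (workers * rps_per_worker)


def sla_margin(run_time_s, sla_window_s):
    # M = 1 - (run time / SLA window)
    return 1 - run_time_s / sla_window_s


window_s = 4 * 3600
workers = required_concurrency(1_000_000, window_s, 2.0)   # 35
run_time = completion_time_s(300, 1_000_000, 60, 2.0)      # about 8,633 s
margin = sla_margin(run_time, window_s)                    # about 0.40
```

Note that 35 workers is the minimum that finishes inside the window with no slack. Provisioning 60 workers instead is what lifts the margin above the 0.3 target in this example.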
// 04 — batch execution trace

A 2M-record daily batch run.

Trace of a scheduled Airflow DAG executing a daily catalog extraction job. Shows discovery, distributed fetching, and final aggregation.

Airflow DAG · Distributed workers · S3 Export
// [00:00:00] trigger: schedule_daily_catalog
dag_id: "batch_ecommerce_in_v4"
task: "sitemap_discovery" SUCCESS
urls_queued: 2,145,890

// [00:05:12] task: distributed_fetch
workers_provisioned: 150
proxy_pool: "residential_IN"
progress: 45% // ETA: 2h 14m
retries.429_too_many_requests: 1,204 // handled by backoff

// [02:21:45] task: extract_and_validate
records.extracted: 2,142,011
records.quarantined: 3,879 // schema validation failed

// [02:35:10] task: load_to_s3
file_format: "parquet"
partitioning: "date=2026-05-19"
status: COMPLETED // SLA met (margin: 42%)
// 05 — failure modes

Why batch jobs fail to finish.

Ranked by frequency across DataFlirt's historical incident logs. In batch processing, a job that runs out of memory at 99% completion is a total failure.

PIPELINES: 300+ active
JOBS/DAY: 1,200+
UPDATED: 2026-05-19
01 Target site rate limiting: causes massive retry loops
02 Worker Out-Of-Memory (OOM): accumulating state in memory
03 Schema drift during run: quarantine queue overflows
04 Proxy pool exhaustion: concurrency starves
05 Upstream site downtime: maintenance windows overlap
// 06 — DataFlirt's batch architecture

Stateless workers, idempotent tasks, guaranteed delivery.

DataFlirt builds batch pipelines on a strictly decoupled architecture. URL discovery, fetching, extraction, and delivery are isolated steps connected by persistent message queues. If a worker dies mid-run, the task is safely reassigned. If the target site goes down for an hour, the job pauses and resumes. We never hold the entire dataset in memory, allowing us to scale batch jobs to billions of records without vertical scaling bottlenecks.
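Reassigning a dead worker's task is commonly done with leases (visibility timeouts): a claimed task returns to the queue if it is not acknowledged in time. The sketch below is an in-memory illustration of that idea only. The text describes persistent message queues, and the `LeaseQueue` class and its method names are hypothetical.

```python
import time


class LeaseQueue:
    def __init__(self, tasks, lease_seconds=30.0, clock=time.monotonic):
        self._pending = list(tasks)
        self._leased = {}              # task -> lease expiry time
        self._lease_seconds = lease_seconds
        self._clock = clock

    def claim(self):
        now = self._clock()
        # Expired leases mean the worker presumably died: requeue the task.
        for task, expiry in list(self._leased.items()):
            if expiry <= now:
                del self._leased[task]
                self._pending.append(task)
        if not self._pending:
            return None
        task = self._pending.pop(0)
        self._leased[task] = now + self._lease_seconds
        return task

    def ack(self, task):
        # Called by a worker once the task's output is safely written.
        self._leased.pop(task, None)

    def is_drained(self):
        return not self._pending and not self._leased
```

A task can be handed out more than once under this scheme (a slow worker may finish after its lease expires), which is why the tasks themselves need to be idempotent.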

Batch Job Telemetry

Live metrics from a high-volume real estate batch pipeline.

job.id batch-re-us-092
state RUNNING
records.processed 14,205,112
memory.per_worker 412 MB
dead_letter_queue 0
throughput 1,850 req/s
sla.status on track


// 07 — FAQ

Common questions.

About batch architecture, consistency, idempotency, and how DataFlirt guarantees delivery at scale.

What is the difference between batch processing and stream processing in scraping?
Batch processing handles a bounded dataset at scheduled intervals (e.g., scraping 100k products every night). Stream processing handles unbounded data continuously (e.g., listening to a WebSocket for live sports scores). Batch optimizes for throughput and cost; stream optimizes for latency.
How do you handle a target site updating its data while a 5-hour batch job is running?
This is known as read consistency skew. For most use cases, the skew is acceptable. If strict point-in-time consistency is required, we use snapshot isolation techniques — such as capturing the site's internal version IDs or relying on historical sitemap timestamps — to ensure the extracted batch represents a single logical state.
Is it legal to scrape millions of records in a single batch?
Volume alone does not dictate legality. Accessing publicly available data is generally lawful, provided the batch job respects the target's infrastructure. We enforce strict concurrency limits and respect robots.txt Crawl-delay directives to ensure our batch jobs do not constitute a Denial of Service (DoS) or trespass to chattels.
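Reading a Crawl-delay value can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `examplebot` user agent, and the one-second fallback are assumptions for illustration. Crawl-delay is a non-standard extension that not every site publishes, which is why the sketch falls back to a default.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real job would fetch it from the target.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, default_delay=1.0):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when the directive is absent.
    delay = parser.crawl_delay(user_agent)
    return parser, (float(delay) if delay is not None else default_delay)


parser, delay_s = crawl_policy(ROBOTS_TXT, "examplebot")
allowed = parser.can_fetch("examplebot", "https://example.com/products/1")
blocked = not parser.can_fetch("examplebot", "https://example.com/private/x")
```

A worker would then wait `delay_s` seconds between requests to the same host and skip any URL for which `can_fetch` returns `False`.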
What happens if a batch job fails halfway through?
DataFlirt pipelines are resumable and idempotent. We track progress in a persistent store or queue (such as Redis or Kafka). If a job fails at 50%, the restart doesn't begin from zero; it resumes from the last recorded state and skips URLs that were already processed. This prevents duplicate requests to the target and keeps delivery SLAs achievable even if infrastructure hiccups occur.
Why not just run the scraper continuously instead of in batches?
Cost and target impact. Continuous crawling requires maintaining persistent infrastructure and constantly hitting the target site, which increases proxy costs and detection risk. Batching allows us to spin up 500 workers, extract the data in two hours, and spin them down, delivering a clean, deduplicated dataset to your warehouse.
How does DataFlirt deliver the final batch data?
We typically write the extracted records to an object store (Amazon S3, GCS) in columnar formats like Parquet or ORC, partitioned by date. Once the batch is complete and schema validation passes, we trigger a webhook or an Airflow sensor in your environment to initiate your downstream ETL processes.
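The date-partitioned, all-or-nothing delivery can be sketched on a local filesystem. Several things here are assumptions made to keep the example free of dependencies: JSON lines stand in for Parquet, a local directory stands in for the object store, and the `_SUCCESS` marker file is one common convention for signalling completion, not something the text specifies.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_partition(records, root, run_date):
    # Hive-style partition directory, e.g. <root>/date=2026-05-19/
    partition = Path(root) / f"date={run_date}"
    partition.mkdir(parents=True, exist_ok=True)
    final_path = partition / "part-0000.jsonl"

    # Write to a temporary file first so readers never see partial output.
    fd, tmp_name = tempfile.mkstemp(dir=partition, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        # Rename is atomic when source and target share a filesystem.
        os.replace(tmp_name, final_path)
    finally:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)

    # Marker written last: downstream jobs can wait for it before reading.
    (partition / "_SUCCESS").touch()
    return final_path
```

A downstream sensor that waits for the marker file before starting its own ETL never reads a partition that is still being written.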