
What is Micro-Batch Processing?

Micro-batch processing is a data ingestion architecture that processes incoming records in small, frequent chunks—typically every few seconds or minutes—rather than streaming them individually or waiting for a massive nightly run. For scraping pipelines, it bridges the gap between the high cost of true real-time streaming and the unacceptable latency of daily batch jobs, allowing downstream systems to consume fresh data while maintaining the efficiency of bulk inserts.

Data Engineering · ETL · Latency · Throughput · Pipeline Architecture
// 02 — definitions

Streaming speed,
batch economics.

The architectural compromise behind many modern data pipelines, balancing data freshness with database write efficiency.


TL;DR

Micro-batching groups incoming scraped records into small time-windows (e.g., 60 seconds) or size thresholds (e.g., 5,000 records) before processing and writing them to the warehouse. It provides near real-time latency while avoiding the massive I/O overhead and connection exhaustion caused by single-record inserts.

01 Definition & structure

Micro-batch processing is an ingestion strategy where incoming data is collected into small groups based on a time window (e.g., every 30 seconds) or a volume threshold (e.g., every 10,000 records). Once the threshold is met, the group is processed, transformed, and loaded into the destination as a single atomic unit.
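The time-or-volume trigger described above can be sketched in a few lines. This is a minimal illustration, not DataFlirt's implementation: the `MicroBatcher` class, its parameter names, and the injectable `clock` are all hypothetical, and the thresholds simply mirror the examples in the text.

```python
import time
from typing import Any, Callable, List, Optional


class MicroBatcher:
    """Buffers records and flushes when a time window or a size threshold is hit."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_seconds: float = 30.0, max_records: int = 10_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.flush = flush
        self.max_seconds = max_seconds
        self.max_records = max_records
        self.clock = clock
        self.buffer: List[Any] = []
        self.window_start: Optional[float] = None

    def add(self, record: Any) -> None:
        if self.window_start is None:
            self.window_start = self.clock()  # the window opens on the first record
        self.buffer.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Call this periodically too, so a quiet stream still flushes on time."""
        if not self.buffer:
            return
        too_old = self.clock() - self.window_start >= self.max_seconds
        too_big = len(self.buffer) >= self.max_records
        if too_old or too_big:
            # Hand the whole group to the sink as one unit and start a new window.
            batch, self.buffer, self.window_start = self.buffer, [], None
            self.flush(batch)


flushed: List[List[int]] = []
batcher = MicroBatcher(flushed.append, max_seconds=30.0, max_records=3)
for i in range(7):
    batcher.add(i)
# Two batches of three were flushed by volume; the seventh record is still buffered.
```

Passing the clock in as a parameter is a design choice for testability: a fake clock lets the time-based path be exercised without sleeping.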

It sits between traditional batch processing (high latency, high efficiency) and true stream processing (low latency, high overhead). For web scraping, where data arrives continuously but doesn't strictly require sub-second delivery, it is usually the best-fit architecture.

02 How it works in practice

As scraper workers extract data, they push individual JSON records to a message broker like Kafka. A consumer application reads from this broker, holding records in memory. When the 60-second timer fires, the consumer pauses reading, applies schema validation and deduplication to the collected records, writes them to an S3 bucket as a Parquet file, and triggers a COPY INTO command on the data warehouse. Only after the load succeeds does it commit the Kafka offset and start the next window.
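The per-window steps above (validate, deduplicate, load, then commit) can be sketched without a real broker. In this illustration the window is a plain list of `(offset, record)` pairs, and `is_valid`, `process_window`, and the `load` callback are hypothetical stand-ins for the schema check, the consumer logic, and the S3 staging plus warehouse load.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def is_valid(record: Record) -> bool:
    """Hypothetical schema check: a string url and a numeric price."""
    return isinstance(record.get("url"), str) and isinstance(record.get("price"), (int, float))


def process_window(records: List[Tuple[int, Record]],
                   load: Callable[[List[Record]], None]) -> int:
    """Validate, deduplicate, and load one non-empty window.

    Returns the next offset to commit (last consumed offset + 1).
    """
    seen = set()
    clean: List[Record] = []
    for _offset, record in records:
        if not is_valid(record):
            continue  # a real pipeline would route this to a dead-letter queue
        if record["url"] in seen:
            continue  # keep the first record per url within the window
        seen.add(record["url"])
        clean.append(record)
    load(clean)  # stage the file and run the warehouse load here
    # The offset is returned only after the load succeeded, so a crash replays the window.
    return records[-1][0] + 1


window = [
    (100, {"url": "https://example.com/a", "price": 9.99}),
    (101, {"url": "https://example.com/a", "price": 9.99}),   # duplicate
    (102, {"url": "https://example.com/b", "price": "n/a"}),  # fails validation
    (103, {"url": "https://example.com/c", "price": 4.50}),
]
loaded: List[Record] = []
next_offset = process_window(window, loaded.extend)
```

Committing after the load, rather than before, trades possible replays for no data loss, which is why the write itself needs to tolerate being repeated.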

03 Micro-batch vs. True Streaming

True streaming handles each event individually as it arrives, which requires complex state management and makes the pipeline highly sensitive to out-of-order data. Micro-batching treats a 1-minute window of streaming data as a tiny, static dataset. This allows you to use standard batch-processing logic (like SQL joins and aggregations) on streaming data, drastically simplifying the pipeline code at the cost of added latency of up to one window length.
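To illustrate the "tiny, static dataset" idea, the sketch below loads one closed window into an in-memory SQLite table and runs an ordinary aggregation over it. SQLite stands in for the warehouse here, and the table name and sample prices are made up for the example.

```python
import sqlite3

# One closed window of (sku, price) observations; values are illustrative.
window = [
    ("sku-1", 19.99), ("sku-2", 5.00), ("sku-1", 18.49), ("sku-2", 5.25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE window_prices (sku TEXT, price REAL)")
conn.executemany("INSERT INTO window_prices VALUES (?, ?)", window)

# Ordinary batch SQL over the closed window: no streaming state to manage.
summary = conn.execute(
    "SELECT sku, COUNT(*), MIN(price) FROM window_prices GROUP BY sku ORDER BY sku"
).fetchall()
```

Because the window is closed before the query runs, the aggregation never has to reason about records that have not arrived yet.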

04 How DataFlirt handles it

We use adaptive micro-batching across our delivery infrastructure. Instead of rigid time windows, our workers monitor the backpressure from the client's destination sink. If a client's Snowflake instance is busy, we automatically increase the batch window from 15 seconds to 5 minutes, reducing the number of concurrent transactions. This ensures we never overwhelm a client's database while still delivering data as fast as their infrastructure can safely ingest it.
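One simple way to express backpressure-driven windows is a bounded multiplicative policy. The sketch below is an assumption, not DataFlirt's actual algorithm: the text only gives the 15-second and 5-minute bounds, while the doubling and halving rule and the `next_window` function are hypothetical.

```python
from typing import List


def next_window(current_s: float, sink_busy: bool,
                min_s: float = 15.0, max_s: float = 300.0) -> float:
    """Double the window while the sink is busy, halve it when it recovers."""
    proposed = current_s * 2 if sink_busy else current_s / 2
    # Clamp to the allowed range so latency never grows without bound.
    return max(min_s, min(max_s, proposed))


window_s = 15.0
history: List[float] = []
for busy in [True, True, True, True, True, False]:
    window_s = next_window(window_s, busy)
    history.append(window_s)
# The window grows from 15 s toward the 300 s cap, then backs off once the sink recovers.
```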

05 The "Small Files" problem

The biggest operational risk of micro-batching into a data lake (like S3 or GCS) is the "small files" problem. Writing a new Parquet file every 10 seconds creates 8,640 files a day per pipeline. When an analytics engine like Athena or Presto tries to query a month of data, the overhead of listing and opening roughly 260,000 tiny files can slow the query to a crawl or cause it to fail. The solution is a secondary background process (compaction) that periodically merges these tiny micro-batch files into larger, optimized files of around 1 GB.
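The file-count arithmetic and the grouping step of compaction can be checked with a short sketch. The `plan_compaction` function is hypothetical and only decides which files to merge together; actually rewriting Parquet files is left to whatever engine the pipeline uses.

```python
from typing import List

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(flush_interval_s: int) -> int:
    """Number of files produced per day when one file is written per flush."""
    return SECONDS_PER_DAY // flush_interval_s


def plan_compaction(file_sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group consecutive file sizes into merge sets of at most target_bytes each."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in file_sizes:
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


daily = files_per_day(10)   # one file every 10 seconds
monthly = daily * 30        # files accumulated over a 30-day month
groups = plan_compaction([400, 400, 400, 400, 400], target_bytes=1000)
```

Grouping consecutive files keeps each merged file covering a contiguous time range, which tends to suit time-partitioned queries.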

// 03 — pipeline math

Optimising the
batch window.

Setting the right micro-batch interval is a balancing act between latency SLAs and database write capacity. DataFlirt dynamically tunes these windows based on target throughput.

Optimal Batch Size: B_opt = min(T_max × R_in, S_max)
Bounded by maximum acceptable latency (T_max) and max payload size (S_max). Source: DataFlirt ingestion model

Write Overhead: O_w = C_conn + (N_records × T_insert)
Connection overhead (C_conn) dominates if batches are too small. Source: Database tuning principles

Delivery Latency: L_total = T_scrape + T_extract + T_window + T_write
The batch window (T_window) is usually the largest controllable variable. Source: Pipeline SLA definitions
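A worked example may make the batch size, write overhead, and delivery latency formulas above more concrete. All input numbers below are hypothetical, the function names are invented for the sketch, and the inbound term is assumed to be a record rate in records per second with the payload cap expressed in records.

```python
def optimal_batch_size(t_max_s: float, r_in: float, s_max: int) -> int:
    """Batch size in records: latency budget times inbound rate, capped at s_max."""
    return int(min(t_max_s * r_in, s_max))


def write_overhead(c_conn_s: float, n_records: int, t_insert_s: float) -> float:
    """Seconds per write: fixed connection cost plus per-record insert time."""
    return c_conn_s + n_records * t_insert_s


def delivery_latency(t_scrape_s: float, t_extract_s: float,
                     t_window_s: float, t_write_s: float) -> float:
    """End-to-end seconds from scrape start to warehouse commit."""
    return t_scrape_s + t_extract_s + t_window_s + t_write_s


# Latency-bound case: 60 s budget at 70 records/s stays under the 10,000 cap.
b_opt = optimal_batch_size(t_max_s=60, r_in=70, s_max=10_000)
# Size-bound case: at 500 records/s the payload cap wins.
capped = optimal_batch_size(t_max_s=60, r_in=500, s_max=10_000)

# Per-record cost with a 50 ms connection cost and 0.1 ms per insert.
per_record_single = write_overhead(0.05, 1, 0.0001) / 1
per_record_batched = write_overhead(0.05, b_opt, 0.0001) / b_opt

total_s = delivery_latency(2.0, 0.5, 60.0, 1.5)
```

With these assumed inputs the fixed connection cost is amortised over thousands of records in the batched case, which is the effect the write overhead formula describes.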
// 04 — pipeline trace

A 60-second
ingestion window.

Trace of a micro-batch worker aggregating scraped product pricing data before committing it to a Snowflake warehouse.

Kafka consumer · 60s window · Snowflake COPY
edge.dataflirt.io — live
CAPTURED
// window start: 14:02:00 UTC
kafka.consume: topic="raw_prices" offset=849201
buffer.status: collecting...
buffer.records: 1,204 // t+15s
buffer.records: 2,840 // t+30s
buffer.records: 3,912 // t+45s

// window close: 14:03:00 UTC
buffer.size: 4,218 records
buffer.bytes: 1.8 MB

// transform & validate
transform.apply: schema_v4
transform.errors: 2 dropped (type_mismatch)

// load
s3.stage: s3://df-stage/batch_849201.parquet
snowflake.copy: table="stg_prices"
snowflake.status: 200 OK
latency.end_to_end: 62.4s
// 05 — failure modes

Where micro-batches
break down.

Ranked by frequency of occurrence across high-throughput data pipelines. The most common issues stem from downstream database limits rather than the scraping layer itself.

Pipelines monitored: 240+ active
Avg window: 15–60 seconds
Updated: 2026-05-19
01 Downstream lock contention
Too many concurrent batch writes block the DB

02 Small file syndrome
Data lakes choked by millions of tiny Parquet files

03 Out-of-memory on spikes
Buffer exceeds RAM during high-throughput bursts

04 Schema drift mid-batch
Target site changes format halfway through a window

05 Late-arriving data
Scrape jobs delayed by retries miss their window
// 06 — DataFlirt's architecture

Adaptive batching,

scaling with the scrape rate.

Fixed time windows fail when scrape rates fluctuate. If a target blocks us and throughput drops, a fixed 60-second window might write batches of 5 records, destroying warehouse efficiency. DataFlirt uses adaptive micro-batching: we flush based on time or volume, whichever hits first, and dynamically adjust the thresholds based on the downstream sink's health. If Snowflake is under heavy load, we automatically widen the window to reduce connection overhead.

worker-04.metrics

Live telemetry from an adaptive micro-batch ingestion worker.

worker.status active
trigger.mode adaptive · time OR volume
threshold.time 60s
threshold.volume 10,000 records
last_flush.reason volume_exceeded
last_flush.size 10,000 records
sink.backpressure low


// 07 — FAQ

Common
questions.

About ingestion architectures, latency trade-offs, and how DataFlirt delivers fresh data without breaking your warehouse.

What's the difference between micro-batch and true streaming?
True streaming (like Apache Flink) processes and emits records one by one as they arrive. Micro-batching (like Spark Structured Streaming) collects records for a short period, then processes them as a group. Streaming offers millisecond latency but is complex and expensive. Micro-batching offers second-to-minute latency with vastly superior throughput and lower infrastructure costs.
Why not just insert scraped records into the database one by one?
Single-record inserts destroy database performance. Every insert carries connection, transaction, and indexing overhead. If you scrape 5,000 pages a second and insert each record individually, you risk exhausting your database connections and causing heavy lock contention on your tables. Grouping those 5,000 records into a single 1-second micro-batch cuts 5,000 transactions down to one, a reduction of roughly 99.98% in the number of transactions.
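The difference is easy to demonstrate with SQLite from the Python standard library, used here only as a stand-in for a real warehouse. The table name and record contents are illustrative.

```python
import sqlite3

# 5,000 scraped records, as in the example above.
records = [(f"https://example.com/p/{i}", float(i)) for i in range(5000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (url TEXT, price REAL)")

# One transaction for the whole micro-batch instead of 5,000 separate ones.
# The connection context manager commits on success and rolls back on error.
with conn:
    conn.executemany("INSERT INTO prices VALUES (?, ?)", records)

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
reduction = 1 - 1 / len(records)  # fraction of transactions avoided
```

The same pattern applies to server databases, where each avoided transaction also saves a network round trip and a commit to durable storage.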
How does micro-batching handle schema changes?
If a target site changes its layout mid-batch, half the records might have a new schema. A robust pipeline validates the entire batch against a schema registry before writing. If a mismatch is detected, the batch is split: valid records are written, and invalid records are routed to a dead-letter queue for quarantine and review, preventing pipeline halts.
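A minimal sketch of the split step, assuming a hard-coded schema in place of a real schema registry. `EXPECTED_SCHEMA` and `split_batch` are hypothetical names, and the dead-letter queue is represented by a plain list.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"sku": str, "price": float}


def split_batch(batch: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (valid, dead_letter) so one bad record never blocks the whole batch."""
    valid: List[Dict[str, Any]] = []
    dead_letter: List[Dict[str, Any]] = []
    for record in batch:
        ok = (record.keys() == EXPECTED_SCHEMA.keys()
              and all(isinstance(record[k], t) for k, t in EXPECTED_SCHEMA.items()))
        (valid if ok else dead_letter).append(record)
    return valid, dead_letter


batch = [
    {"sku": "a", "price": 1.0},
    {"sku": "b", "amount": "2.00"},  # field renamed after a site redesign
    {"sku": "c", "price": "3"},      # price arrived as a string
]
valid, dead_letter = split_batch(batch)
```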
What happens if a micro-batch fails halfway through processing?
Micro-batch writes must be idempotent. If a batch fails during the write phase, the pipeline re-processes the same offset range from the message queue (like Kafka). Because the batch is written as a single atomic transaction, there are no partial writes. Replaying from the queue on its own only guarantees at-least-once delivery; combined with an idempotent write (for example, one keyed on the offset range), the result is effectively exactly-once.
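One common way to make a replayed batch harmless is to record a batch identifier in the same transaction as the data. The sketch below uses SQLite as a stand-in for the warehouse; the `loaded_batches` ledger table, the `load_batch` function, and the `topic:start-end` identifier format are assumptions for illustration.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (sku TEXT, price REAL)")
conn.execute("CREATE TABLE loaded_batches (batch_id TEXT PRIMARY KEY)")


def load_batch(batch_id: str, rows: List[Tuple[str, float]]) -> bool:
    """Load rows and record batch_id atomically; return False if already loaded."""
    try:
        with conn:  # commits on success, rolls back on any exception
            # The primary key rejects a batch_id that was committed before.
            conn.execute("INSERT INTO loaded_batches VALUES (?)", (batch_id,))
            conn.executemany("INSERT INTO prices VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False  # this offset range was already committed


rows = [("sku-1", 19.99), ("sku-2", 5.00)]
first = load_batch("raw_prices:849201-849202", rows)
replay = load_batch("raw_prices:849201-849202", rows)  # simulated retry after a crash
total = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

The retry is rejected by the ledger, so the table holds each record once even though the batch was submitted twice.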
How does DataFlirt handle late-arriving scraped data?
Scraping is inherently unpredictable—a proxy timeout might delay a record by 30 seconds. We use event-time processing rather than processing-time. The record is stamped with the exact time it was extracted. If it misses its primary micro-batch window, it is included in the next available batch, and downstream systems use the event timestamp to place it in the correct chronological order.
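The effect of event-time stamping can be shown with two small batches. The timestamps, field names, and the `at` helper are invented for this sketch; the point is only that ordering by extraction time restores the sequence regardless of which batch carried each record.

```python
from datetime import datetime, timezone


def at(hour: int, minute: int, second: int) -> datetime:
    """Build a UTC timestamp on an arbitrary example date."""
    return datetime(2026, 1, 1, hour, minute, second, tzinfo=timezone.utc)


batch_1 = [
    {"sku": "a", "event_time": at(14, 2, 10)},
    {"sku": "c", "event_time": at(14, 2, 50)},
]
# "b" was extracted at 14:02:30, but a retry delayed it past the window close,
# so it travels in the next batch.
batch_2 = [
    {"sku": "d", "event_time": at(14, 3, 5)},
    {"sku": "b", "event_time": at(14, 2, 30)},
]

# Downstream ordering uses the extraction timestamp, not the batch it arrived in.
ordered = sorted(batch_1 + batch_2, key=lambda r: r["event_time"])
order = [r["sku"] for r in ordered]
```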
Are there compliance benefits to micro-batching?
Yes. Micro-batching provides a natural checkpoint for data minimization and PII scrubbing. Instead of trying to redact sensitive data on the fly per record, the batch transformation layer can apply masking rules, drop non-compliant fields, and verify consent flags across the entire payload before it ever touches persistent storage.
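A batch-level scrubbing pass might look like the sketch below. The `consent` flag, the `DROP_FIELDS` set, and the simple email pattern are all assumptions for illustration; real masking rules would be considerably stricter than a single regular expression.

```python
import re
from typing import Any, Dict, List

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # deliberately simple pattern
DROP_FIELDS = {"phone"}  # hypothetical fields that must never be stored


def scrub_batch(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop records without consent, remove banned fields, and mask emails."""
    cleaned: List[Dict[str, Any]] = []
    for record in batch:
        if not record.get("consent"):
            continue  # no consent flag: the record never reaches storage
        kept = {k: v for k, v in record.items() if k not in DROP_FIELDS}
        cleaned.append({
            k: EMAIL.sub("[redacted]", v) if isinstance(v, str) else v
            for k, v in kept.items()
        })
    return cleaned


batch = [
    {"consent": True, "bio": "Reach me at jane@example.com", "phone": "555-0100"},
    {"consent": False, "bio": "No consent given"},
]
scrubbed = scrub_batch(batch)
```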