
What is Log Aggregation?

Log aggregation is the process of centralizing, parsing, and indexing telemetry data from distributed scraping fleets into a single searchable sink. When you run 5,000 concurrent workers across 40 proxy zones, local stdout is useless. Aggregation turns millions of isolated HTTP events, parser exceptions, and proxy timeouts into queryable streams, allowing engineers to diagnose silent pipeline failures, track block rates, and audit data lineage before bad records hit the warehouse.

Observability · ELK / Datadog · Telemetry · Distributed Systems · Debugging
// 02 — definitions

Centralise the noise.

How distributed scraping fleets turn millions of isolated worker events into a coherent, queryable timeline of pipeline health.


TL;DR

Log aggregation collects stdout, stderr, and structured JSON payloads from hundreds of ephemeral scraper nodes and ships them to a central datastore like Elasticsearch, ClickHouse, or Datadog. It is the only way to debug a 403 block rate spike when the workers that experienced it have already been destroyed.

01 Definition & structure
Log aggregation is the architectural pattern of collecting log files from multiple distributed sources, parsing them into a uniform schema, and storing them in a centralized, searchable database. In a scraping context, a typical pipeline involves:
  • Emitters — Scrapy spiders, Playwright scripts, or proxy gateways writing JSON to stdout.
  • Shippers — Daemon agents (like Fluentd, Vector, or Filebeat) that tail the local logs and forward them over the network.
  • Brokers — Message queues (like Kafka) that buffer the logs to prevent data loss during traffic spikes.
  • Sinks — The final database (Elasticsearch, Datadog, ClickHouse) where logs are indexed for querying.
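As a rough illustration of the emitter stage, the sketch below has a Python worker write one JSON object per line to stdout, which is the shape a shipper can tail. The `JsonFormatter` class and the field names (`status`, `proxy_zone`, `worker_id`) are assumptions for illustration, not DataFlirt's actual schema.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object (hypothetical schema)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "scraper") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "blocked by edge",
        extra={"fields": {"status": 403, "proxy_zone": "residential_US",
                          "worker_id": "node-eu-west-42"}},
    )
```

Writing to stdout rather than to a network client keeps the worker unaware of the shipper, broker, and sink, so those stages can change without touching scraper code.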
02 How it works in practice
When a scraper encounters a Cloudflare 403, it emits a structured error log containing the target URL, the proxy IP used, the Ray ID, and the exact headers sent. The local shipper picks this up, enriches it with the worker's node ID and region, and sends it to the central index. An engineer investigating a spike in block rates can then query the index for status: 403 AND proxy_zone: residential_US to instantly see if a specific proxy subnet has been burned, without ever SSHing into a server.
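The investigation described above can be sketched in plain Python over already-fetched events. This is not the query language of any particular sink; the `proxy_ip` field name and the choice of a /24 prefix as the "subnet" are assumptions.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def blocked_subnets(events: Iterable[dict], zone: str, prefix: int = 24) -> Counter:
    """Count 403 events per proxy subnet within one proxy zone."""
    counts: Counter = Counter()
    for event in events:
        if event.get("status") != 403 or event.get("proxy_zone") != zone:
            continue
        # strict=False derives the enclosing network from a host address.
        network = ipaddress.ip_network(f"{event['proxy_ip']}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts
```

Calling `blocked_subnets(events, "residential_US").most_common(5)` would surface the subnets with the most blocks, which is the signal that a range may have been burned.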
03 The cardinality problem
High cardinality occurs when a log field has millions of unique values — like a dynamically generated URL or a unique session ID. Traditional inverted-index databases (like Elasticsearch) consume massive amounts of RAM to index high-cardinality fields. Modern scraping observability stacks often shift to columnar databases (like ClickHouse) which handle high-cardinality telemetry much more efficiently, allowing engineers to log exact URLs and Ray IDs without bankrupting the infrastructure budget.
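One way to find which fields are high-cardinality before choosing what to index is to count distinct values per field over a sample of events. The sketch below is a minimal exact counter; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def field_cardinality(events: Iterable[dict]) -> dict[str, int]:
    """Return the number of distinct values seen for each top-level field."""
    seen: dict[str, set] = defaultdict(set)
    for event in events:
        for key, value in event.items():
            # repr() makes unhashable values (lists, dicts) countable.
            seen[key].add(repr(value))
    return {key: len(values) for key, values in seen.items()}
```

Exact sets grow with the data, so this only suits a bounded sample. A field whose count tracks the sample size (a URL or session ID) is the kind the paragraph above warns about, while `status` or `proxy_zone` will plateau at a handful of values.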
04 How DataFlirt handles it
We treat logs as data products. Our fleet generates billions of events daily. We use Vector at the edge to parse and route telemetry. Metrics (success rates, latencies) are stripped out and sent to Prometheus. Raw logs are evaluated: errors and schema drifts are routed to a hot ClickHouse cluster for immediate alerting, while the vast majority of successful request logs are batched into Parquet files and written to S3. This gives us sub-second query times for active incidents and cheap, effectively unlimited retention for historical audits.
05 The "Log Everything" misconception
A common mistake in early-stage scraping teams is logging the full HTML response body of every request to the aggregator. At 100 requests per second, a 200 KB HTML payload generates roughly 1.7 TB of logs per day (100 × 200 KB × 86,400 s). The aggregator will choke, and the storage bill will be catastrophic. HTML bodies should only be logged conditionally (e.g., when a selector fails or a CAPTCHA is detected) and should ideally be written to object storage (S3) with only a reference ID sent to the log aggregator.
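A minimal sketch of the conditional pattern follows, assuming hypothetical `selector_failed` and `captcha_detected` flags on the event. A local directory stands in for the object store, and a SHA-256 digest of the body is used as the reference ID; both choices are illustrative.

```python
import hashlib
import json
from pathlib import Path


def log_response(event: dict, html: str, dump_dir: Path) -> str:
    """Return the JSON log line; persist the body only for failed extractions."""
    record = dict(event)
    body = html.encode("utf-8")
    if event.get("selector_failed") or event.get("captcha_detected"):
        body_ref = hashlib.sha256(body).hexdigest()
        dump_dir.mkdir(parents=True, exist_ok=True)
        # Local directory stands in for an object store such as S3.
        (dump_dir / f"{body_ref}.html").write_bytes(body)
        record["body_ref"] = body_ref
    # The size is always logged; the body itself never enters the log line.
    record["body_bytes"] = len(body)
    return json.dumps(record, sort_keys=True)
```

Using a content hash as the reference means identical challenge pages are stored once, which may matter when thousands of workers hit the same block page.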
// 03 — the math

The cost of observability.

Logging everything is easy; paying to index it is hard. DataFlirt models log volume and retention to ensure observability costs don't eclipse the compute costs of the actual scraping pipeline.

Daily log volume: $V = \text{workers} \times \text{req/s per worker} \times \text{bytes per log} \times 86{,}400$
10k req/s fleet-wide at 1 KB per log ≈ 864 GB/day. Storage scales faster than compute. · Infrastructure capacity planning
Ingestion latency: $L = t_{\text{indexed}} - t_{\text{emitted}}$
Time from a proxy timeout occurring to it appearing in a Grafana dashboard. · Observability SLO
Signal-to-noise ratio: $S = \text{actionable\_errors} / \text{total\_logs}$
DataFlirt drops 200 OK debug logs at the edge to keep S > 0.15. · Log routing rules
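The three quantities above can be written as small Python helpers for capacity planning. The function and parameter names are assumptions; decimal units (1 KB = 1,000 bytes) are used to match the 864 GB/day figure.

```python
SECONDS_PER_DAY = 86_400


def daily_log_bytes(workers: int, req_per_sec_per_worker: float, bytes_per_log: int) -> float:
    """Daily log volume V in bytes."""
    return workers * req_per_sec_per_worker * bytes_per_log * SECONDS_PER_DAY


def ingestion_latency(t_indexed: float, t_emitted: float) -> float:
    """Ingestion latency L in seconds, from two epoch timestamps."""
    return t_indexed - t_emitted


def signal_to_noise(actionable_errors: int, total_logs: int) -> float:
    """Signal-to-noise ratio S; defined as 0.0 when nothing was logged."""
    return actionable_errors / total_logs if total_logs else 0.0


if __name__ == "__main__":
    # 10k req/s fleet-wide at 1 KB per log, expressed as a single aggregate worker.
    print(daily_log_bytes(1, 10_000, 1_000) / 1e9, "GB/day")
```

The same helper reproduces the full-HTML case: 100 req/s at 200 KB per log comes to about 1.73 TB/day.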
// 04 — log shipping trace

From ephemeral worker to persistent index.

A live trace of a Vector agent tailing a scraper container, filtering out noise, and shipping a structured error payload to a ClickHouse observability cluster.

Vector.dev · JSON parsing · ClickHouse sink
// agent startup
source.docker: "tailing /var/lib/docker/containers/*/*.log"
transform.parse: "json_extract"

// log stream processing
event.in: { "level": "INFO", "status": 200, "url": "..." }
filter.rule: "drop if status == 200 and level == 'INFO'" // dropped

event.in: { "level": "ERROR", "type": "SelectorNotFound", "target": ".price" }
enrich.metadata: applied
event.out.worker_id: "node-eu-west-42"
event.out.proxy_zone: "residential_FR"

// sink delivery
batch.size: 500 events
sink.clickhouse: "INSERT INTO scraper_logs"
delivery.status: 200 OK 14ms
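The stages in the trace above (parse, drop, enrich, batch) can be approximated in Python to make the data flow concrete. This is not Vector configuration; the function names are hypothetical, and silently skipping unparseable lines is a simplification.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line as a JSON object, or return None if it is not one."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    return event if isinstance(event, dict) else None


def should_drop(event: dict) -> bool:
    """Mirror the trace's rule: drop if status == 200 and level == 'INFO'."""
    return event.get("status") == 200 and event.get("level") == "INFO"


def process(lines: Iterable[str], worker_id: str, proxy_zone: str) -> Iterator[dict]:
    """Parse, filter, and enrich a stream of raw log lines."""
    for line in lines:
        event = parse_line(line)
        if event is None or should_drop(event):
            continue
        yield {**event, "worker_id": worker_id, "proxy_zone": proxy_zone}


def batches(events: Iterable[dict], size: int = 500) -> Iterator[list]:
    """Group events into lists of at most `size` for bulk inserts."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Filtering before enrichment and batching means dropped events cost nothing downstream, which is the point of doing this work at the edge rather than in the sink.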
// 05 — log volume drivers

What fills up the index.

Ranked by share of total indexed bytes across DataFlirt's observability stack. Errors and HTML dumps consume vastly more storage than successful request metadata.

DAILY INGEST · 4.2 TB
RETENTION · 14 days hot
UPDATED · 2026-05-19
01 · Anti-bot challenge HTML dumps · Full DOM captures for forensic debugging
02 · Proxy connection timeouts · High volume during residential pool rotation
03 · Schema validation failures · Includes the quarantined record payload
04 · Browser crash stack traces · Playwright OOM and context destroyed errors
05 · HTTP 429 / 403 metadata · Headers and Ray IDs from edge blocks
// 06 — our architecture

Log everything, index only what matters.

DataFlirt uses a decoupled observability stack to manage costs without sacrificing visibility. Workers emit structured JSON to local buffers. A daemonset ships these to a Kafka topic, which acts as a shock absorber during massive block events. We route critical errors and schema validation failures to hot storage (ClickHouse) for real-time alerting, while raw HTTP traces and 200 OKs go straight to cold S3 buckets for forensic audits. If a pipeline breaks, we have the error in milliseconds; if we need to audit a successful run from last month, we query the lake.

Log routing rules

How a single worker's telemetry is split across storage tiers.

level: ERROR · ClickHouse · hot index
type: SchemaDrift · ClickHouse · alert trigger
level: INFO · S3 / Iceberg · cold storage
type: HTML_Dump · S3 · 7-day TTL
metric: req_count · Prometheus · time-series
delivery.guarantee · at-least-once via Kafka
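The routing rules above can be read as a small decision function. The sketch below is one possible encoding; the tier labels are hypothetical, and the precedence (metrics first, then HTML dumps, then errors and schema drift, then everything else) is an assumption, since the rules do not say which wins when an event matches more than one.

```python
def route(event: dict) -> str:
    """Return the storage tier for one telemetry event (assumed precedence)."""
    if "metric" in event:
        return "prometheus"          # time-series
    if event.get("type") == "HTML_Dump":
        return "s3_ttl_7d"           # object storage with a 7-day TTL
    if event.get("type") == "SchemaDrift" or event.get("level") == "ERROR":
        return "clickhouse_hot"      # hot index and alert trigger
    return "s3_cold"                 # INFO and anything unclassified
```

Defaulting unclassified events to cold storage keeps a new, unexpected log type from inflating the hot index.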


// 07 — FAQ

Common questions.

Common questions about telemetry, storage costs, PII handling, and how DataFlirt monitors distributed scraping fleets.

Why not just use stdout and grep for debugging?
Because scraping infrastructure is ephemeral. When a Kubernetes pod running a headless browser hits an Out-Of-Memory error, the container is killed and replaced. If you rely on local stdout, the evidence dies with the container. Log aggregation ensures the autopsy data survives the death of the worker.
What's the difference between structured logging and log aggregation?
Structured logging is the format (emitting logs as JSON rather than plain text strings). Log aggregation is the transport and storage (shipping those JSON objects to a central database). You need structured logging to make log aggregation actually queryable — otherwise you're just paying to store unsearchable text.
How do you handle PII or sensitive data in scraped logs?
We enforce strict data masking at the edge. The log shipping agent (Vector or Fluent Bit) is configured to redact authorization headers, session cookies, and specific JSON paths before the log ever leaves the worker node. The central index never receives the sensitive payload.
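The masking step can be illustrated in Python, though in practice it would live in the shipper's own transform language. The header list, the `[REDACTED]` marker, and the tuple-based path format are all assumptions made for this sketch.

```python
import copy

SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie", "set-cookie"}
REDACTED = "[REDACTED]"


def redact(event: dict, json_paths: tuple = ()) -> dict:
    """Return a copy of the event with sensitive headers and paths masked."""
    clean = copy.deepcopy(event)
    headers = clean.get("headers")
    if isinstance(headers, dict):
        for name in headers:
            # Header names are case-insensitive, so compare in lowercase.
            if name.lower() in SENSITIVE_HEADERS:
                headers[name] = REDACTED
    for path in json_paths:
        node = clean
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and path and path[-1] in node:
            node[path[-1]] = REDACTED
    return clean
```

For example, `redact(event, json_paths=(("payload", "user", "email"),))` masks one nested field and leaves the original event untouched, so the worker can keep using the unmasked values locally.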
How does DataFlirt manage the cost of log storage?
By aggressively sampling success paths and decoupling storage tiers. We don't need to index every 200 OK in a hot database. We route metrics (counts, latencies) to Prometheus, errors to ClickHouse, and raw request logs to S3. This keeps our hot storage footprint small and our queries lightning fast.
Can log aggregation detect silent extraction failures?
Yes, if your extractors log their validation scores. A silent failure means the HTTP request succeeded but the data was missing. By logging the schema completeness score of every record, we can trigger an alert in Grafana when the moving average of a specific field's completeness drops below 95%, catching selector rot within the averaging window rather than after bad records reach the warehouse.
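A minimal sketch of that moving-average check is shown below. The class name and window size are assumptions, and a real deployment would express this as a dashboard alert rule rather than application code.

```python
from collections import deque


class CompletenessMonitor:
    """Track a moving average of per-record completeness scores (0.0 to 1.0)."""

    def __init__(self, window: int = 100, threshold: float = 0.95) -> None:
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, score: float) -> bool:
        """Record one score; return True when the full-window average breaches the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.threshold
```

Waiting for a full window before alerting avoids firing on the first few records of a run, at the cost of a short detection delay.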
What happens if the central log aggregator goes down?
Workers buffer logs to local disk, and our Kafka brokers queue the streams. If the ClickHouse sink goes offline, Kafka retains the logs for up to 72 hours. Once the sink recovers, the backlog is drained. No telemetry is lost as long as the outage is shorter than that retention window.
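The buffer-then-drain behaviour can be sketched with an in-memory queue standing in for local disk and Kafka. The `BufferedSink` class and its `send` callback are hypothetical; an event is removed from the backlog only after a successful send, which gives at-least-once delivery.

```python
from collections import deque
from typing import Callable


class BufferedSink:
    """Hold events while the sink is offline and drain them in order on recovery."""

    def __init__(self, send: Callable[[dict], None], max_events: int = 10_000) -> None:
        self._send = send
        # Bounded buffer: once full, the oldest events are discarded.
        self._backlog = deque(maxlen=max_events)

    @property
    def pending(self) -> int:
        return len(self._backlog)

    def emit(self, event: dict) -> None:
        self._backlog.append(event)
        self.drain()

    def drain(self) -> int:
        """Send queued events oldest-first; stop at the first failure."""
        sent = 0
        while self._backlog:
            try:
                self._send(self._backlog[0])
            except ConnectionError:
                break
            # Remove only after the send succeeded (at-least-once).
            self._backlog.popleft()
            sent += 1
        return sent
```

The bounded queue plays the role of the retention window: an outage that outlasts the buffer's capacity loses the oldest events first.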