
What is Structured Logging?

Structured logging is the practice of emitting application logs as machine-readable data objects — typically JSON — rather than free-text strings. In a scraping pipeline, it transforms debugging from a grep-based text search into a queryable database operation. When a pipeline processes millions of requests across thousands of proxies, structured logs allow you to instantly aggregate block rates by ASN, track schema drift by target, and isolate memory leaks to specific worker nodes.

Observability · JSON · Debugging · Telemetry · Infrastructure
// 02 — definitions

Stop grepping,
start querying.

Why emitting plain text logs in a distributed scraping pipeline is a fast path to operational blindness.


TL;DR

Structured logging makes every log event a set of key-value pairs. Instead of writing "Failed to fetch product 12345: 403 Forbidden", you emit a JSON object with event: fetch_failed, status: 403, target_id: 12345, and proxy_asn: 7922. Log aggregators like Elasticsearch or Datadog can then index each field, enabling complex alerting and root-cause analysis at scale.

01 Definition & structure
Structured logging is the architectural decision to format application logs as structured data (usually JSON) instead of plain text. A structured log separates the human-readable message from the machine-readable context. It typically includes standard fields like timestamp, level, and event_name, alongside domain-specific payloads like proxy_ip, target_url, or http_status.
02 How it works in practice
Instead of using print() or basic logging modules, developers use libraries like Pino (Node.js), Structlog (Python), or Logrus (Go). These libraries automatically serialize context dictionaries into JSON. The logs are written to stdout or a file, picked up by a forwarder (like Fluentd or Vector), and shipped to a central index (like Elasticsearch, Datadog, or Splunk). Engineers then write queries like event: "captcha_hit" AND proxy_asn: 7922 to diagnose issues.
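The emit step can be sketched with nothing but the Python standard library. This is a minimal illustration, not how Pino, Structlog, or Logrus work internally: the JsonFormatter class, the "context" key passed through `extra`, and the logger name are all assumptions made for the example, and a real worker would write to stdout rather than an in-memory stream.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Serialize each log record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Context passed as extra={"context": {...}} becomes an attribute on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("scraper.sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# An in-memory stream stands in for stdout so the output can be inspected.
stream = io.StringIO()
log = build_logger(stream)
log.error(
    "fetch_failed",
    extra={"context": {"status": 403, "target_id": 12345, "proxy_asn": 7922}},
)
event = json.loads(stream.getvalue())
```

The message string carries only the event name; everything a query would filter on travels as a separate field.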
03 The cardinality trap
The most common failure mode in structured logging is high key cardinality: putting dynamic values into field names. If you log {"user_12345_status": "failed"}, the aggregator adds a new field to its index mapping for every user ID. This causes a "mapping explosion" that can crash the entire logging cluster. The correct structure is {"user_id": "12345", "status": "failed"}, which uses exactly two fields regardless of how many users exist.
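The difference is easy to see by counting distinct field names. The sketch below uses made-up user IDs and a hypothetical distinct_fields helper; it only counts keys in memory and does not talk to any aggregator.

```python
def distinct_fields(events):
    """Count how many different top-level field names a batch of events uses."""
    keys = set()
    for event in events:
        keys.update(event.keys())
    return len(keys)


user_ids = range(10_000)  # illustrative IDs, not real data

# Anti-pattern: the user ID is baked into the key name.
dynamic_key_events = [{f"user_{uid}_status": "failed"} for uid in user_ids]

# Correct shape: the user ID is a value under a fixed key.
fixed_key_events = [{"user_id": str(uid), "status": "failed"} for uid in user_ids]

dynamic_field_count = distinct_fields(dynamic_key_events)
fixed_field_count = distinct_fields(fixed_key_events)
```

With dynamic keys the field count grows with the number of users; with fixed keys it stays at two.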
04 How DataFlirt handles it
We enforce a strict JSON schema across our entire fleet. Every log event must include a trace_id, job_id, and worker_id. We use tail-based sampling at the edge: workers buffer logs locally and only forward the granular network-level JSON if the scrape fails or encounters an anomaly. This gives us 100% visibility into failures while reducing our Elasticsearch ingest costs by over 90%.
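The buffer-then-decide idea can be sketched as follows. This is an illustration under stated assumptions, not DataFlirt's implementation: the TailSampler class, the `forward` callback, and the "scrape.summary" event name are hypothetical, and the buffer lives in process memory rather than on local disk.

```python
from collections import defaultdict


class TailSampler:
    """Buffer granular events per trace; decide what to forward when the scrape ends."""

    def __init__(self, forward):
        self._forward = forward            # callable that ships one event upstream
        self._buffers = defaultdict(list)  # trace_id -> buffered events

    def record(self, trace_id, event):
        self._buffers[trace_id].append(event)

    def finish(self, trace_id, succeeded):
        events = self._buffers.pop(trace_id, [])
        if succeeded:
            # Success: drop the granular events and forward only a summary.
            self._forward({
                "trace_id": trace_id,
                "event": "scrape.summary",
                "status": "ok",
                "events_buffered": len(events),
            })
        else:
            # Failure: forward everything that led up to the error.
            for event in events:
                self._forward({"trace_id": trace_id, **event})


forwarded = []
sampler = TailSampler(forwarded.append)

sampler.record("t1", {"event": "http.request", "status_code": 200})
sampler.finish("t1", succeeded=True)

sampler.record("t2", {"event": "http.request", "status_code": 403})
sampler.record("t2", {"event": "http.response.blocked", "status_code": 403})
sampler.finish("t2", succeeded=False)
```

The successful trace costs one forwarded event; the failed trace arrives in full.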
05 Did you know?
Logging is often the hidden bottleneck in high-concurrency scrapers. Synchronous disk I/O or blocking network calls to a logging API can stall the event loop, artificially limiting how many requests a worker can handle. High-performance pipelines always use asynchronous, non-blocking loggers that write to a local buffer or memory queue.
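One way to keep log I/O off the hot path in Python is the standard library's QueueHandler and QueueListener pair. The sketch below assumes a threaded worker and uses an in-memory stream in place of a slow disk or network handler; the logger name is arbitrary.

```python
import io
import logging
import logging.handlers
import queue

# Unbounded in-memory queue between the scraper and the slow handler.
log_queue = queue.Queue(-1)

# Stand-in for a slow sink (disk, network); an in-memory stream keeps the sketch testable.
sink = io.StringIO()
slow_handler = logging.StreamHandler(sink)

# The listener drains the queue on a background thread and calls the slow handler there.
listener = logging.handlers.QueueListener(log_queue, slow_handler)

logger = logging.getLogger("scraper.async_sketch")
logger.handlers.clear()
logger.setLevel(logging.INFO)
logger.propagate = False
# The request path only pays for an enqueue, not for the I/O itself.
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener.start()
logger.info("fetch_ok")
listener.stop()  # processes the remaining queued records, then joins the thread

written = sink.getvalue()
```

An unbounded queue trades stall risk for memory growth if the sink falls behind; a bounded queue with a drop policy is the usual alternative.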
// 03 — log economics

How much telemetry
can you afford?

Structured logs are verbose. Emitting a 2KB JSON object for every HTTP request in a 10M req/day pipeline generates 20GB of daily log volume. DataFlirt uses dynamic sampling to control ingest costs.

Log Volume (GB/day) = req_rate (req/s) × bytes_per_log × 86400 / 10^9
Unsampled request-level logging scales linearly with crawl volume. · Infrastructure capacity planning
Sampling Rate = 1.0 for errors, 0.01 for 200 OKs
Keep all failures, sample successes to prove the pipeline is alive. · DataFlirt observability SLO
Query Latency = O(log N) on indexed fields vs O(N) for scanning unindexed message text
Why we extract proxy_ip as a top-level key instead of burying it in a message string. · Database indexing fundamentals
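The first two formulas can be checked with a few lines of arithmetic. The function names are hypothetical, 2KB is taken as 2,000 bytes to match the 20GB figure above, and treating every non-200 response as an error is an assumption made for the sketch.

```python
import random


def log_volume_gb_per_day(req_rate, bytes_per_log):
    """req_rate in requests per second, bytes_per_log in bytes."""
    return req_rate * bytes_per_log * 86_400 / 1e9


def sample_rate(status_code):
    """1.0 for errors, 0.01 for 200 OKs (assumption: any non-200 counts as an error)."""
    return 0.01 if status_code == 200 else 1.0


def should_forward(status_code, rng=random.random):
    """Forward the event if a uniform draw in [0, 1) falls under the sampling rate."""
    return rng() < sample_rate(status_code)


# 10M requests/day expressed as requests per second, with 2,000-byte events.
req_rate = 10_000_000 / 86_400
daily_gb = log_volume_gb_per_day(req_rate, 2_000)
```

Because the error rate is 1.0, every draw from random.random() falls below it, so failures are always forwarded.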
// 04 — the payload

A single 403,
fully contextualised.

A structured log event from a DataFlirt worker hitting a Cloudflare block. Notice how the context (proxy, target, fingerprint) is separated from the message.

JSON · Elasticsearch · Datadog
// log.error("fetch failed", context)
{
  "timestamp": "2026-05-19T14:22:10.045Z",
  "level": "ERROR",
  "event": "http.response.blocked",
  "worker_id": "df-node-us-east-4a",
  "target": {
    "domain": "target-ecommerce.com",
    "url": "/product/sku-99281"
  },
  "network": {
    "proxy_pool": "residential_US",
    "exit_ip": "198.51.100.42",
    "asn": 7922,
    "status_code": 403
  },
  "fingerprint": {
    "ja4": "t13d1516h2_8daaf6152771"
  }
}
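Once events have this shape, aggregating block rates by ASN is a short loop over JSON lines. The sketch below is a local stand-in for what an aggregator query would do: the block_rate_by_asn helper and the three sample events are invented for illustration, and a 403 is assumed to mean "blocked".

```python
import json
from collections import Counter


def block_rate_by_asn(json_lines):
    """Return {asn: share of events with status 403} from newline-delimited JSON events."""
    total = Counter()
    blocked = Counter()
    for line in json_lines:
        event = json.loads(line)
        asn = event["network"]["asn"]
        total[asn] += 1
        if event["network"]["status_code"] == 403:
            blocked[asn] += 1
    return {asn: blocked[asn] / total[asn] for asn in total}


# Invented sample events; ASN 64500 is from the range reserved for documentation.
sample_lines = [
    json.dumps({"network": {"asn": 7922, "status_code": 403}}),
    json.dumps({"network": {"asn": 7922, "status_code": 200}}),
    json.dumps({"network": {"asn": 64500, "status_code": 200}}),
]
rates = block_rate_by_asn(sample_lines)
```

No regex is involved: the fields are addressed by name, so adding a new field to the event does not break the aggregation.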
// 05 — log cardinality

What breaks
log aggregators.

Structured logging fails when developers treat JSON keys like free text. Creating unique keys for dynamic data, a form of high cardinality, bloats the index mapping and can take down your Elasticsearch cluster.

MAX KEYS · 1000 per index
RETENTION · 14 days hot
UPDATED · 2026-05-19
01 Dynamic keys
Mapping explosion · e.g., {"user_123": "failed"} instead of {"user": "123"}
02 Unbounded string lengths
Storage bloat · Dumping full HTML bodies into a log field
03 Deeply nested JSON
Query latency · Exceeds index depth limits, slows queries
04 Type coercion conflicts
Dropped logs · Field 'status' is int in one log, string in another
05 Missing correlation IDs
Orphaned traces · Cannot trace a request across microservices
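Several of these failure modes can be caught at the emitter before an event leaves the worker. The validator below is a sketch under assumptions: the depth and length limits and the EXPECTED_TYPES table are invented, and only the three required IDs come from the section on how DataFlirt handles it. It covers items 02 to 05; dynamic keys (01) would need an allow-list of field names on top.

```python
REQUIRED_IDS = ("trace_id", "job_id", "worker_id")
MAX_DEPTH = 3                                        # assumed limit for the sketch
MAX_STRING_LEN = 1024                                # assumed limit for the sketch
EXPECTED_TYPES = {"status_code": int, "asn": int}    # assumed schema fragment


def nesting_depth(value):
    """A flat dict has depth 1; each nested dict adds one level."""
    if not isinstance(value, dict):
        return 0
    return 1 + max((nesting_depth(v) for v in value.values()), default=0)


def iter_leaves(obj):
    """Yield (key, value) for every non-dict value, at any nesting level."""
    for key, value in obj.items():
        if isinstance(value, dict):
            yield from iter_leaves(value)
        else:
            yield key, value


def validate(event):
    """Return a list of human-readable problems; an empty list means the event passes."""
    problems = [f"missing {key}" for key in REQUIRED_IDS if key not in event]
    if nesting_depth(event) > MAX_DEPTH:
        problems.append("too deeply nested")
    for key, value in iter_leaves(event):
        if isinstance(value, str) and len(value) > MAX_STRING_LEN:
            problems.append(f"{key} exceeds {MAX_STRING_LEN} chars")
        expected = EXPECTED_TYPES.get(key)
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


good_event = {"trace_id": "t-1", "job_id": "j-1", "worker_id": "w-1",
              "network": {"asn": 7922, "status_code": 403}}
bad_event = {"trace_id": "t-2", "job_id": "j-2",
             "network": {"status_code": "403"}, "html": "x" * 5000}

good_problems = validate(good_event)
bad_problems = validate(bad_event)
```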
// 06 — telemetry stack

Log everything locally,
sample aggressively at the edge.

DataFlirt workers emit 100% of events as structured JSON to local disk. A sidecar agent tails these files, applies dynamic sampling rules, and forwards the filtered stream to our central observability cluster. If a pipeline's error rate spikes above 1%, the sidecar automatically disables sampling for that specific job, giving us full-fidelity debug data exactly when we need it, without paying for it when we don't.

Log Forwarder State

Live metrics from a Fluent Bit sidecar on a scraping worker.

node.id worker-eu-west-12
events.emitted 1,442/sec
events.dropped 1,420/sec
events.forwarded 22/sec
sampling.strategy dynamic_error_biased
buffer.usage 12%
destination elasticsearch-prod
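The error-biased rule the sidecar applies can be sketched as a small per-job sampler. This is not Fluent Bit configuration and not DataFlirt's code: the DynamicSampler class is hypothetical, and it tracks a cumulative error rate per job where a real sidecar would more likely use a sliding time window.

```python
import random


class DynamicSampler:
    """Forward all errors; sample successes unless the job's error rate exceeds a threshold."""

    def __init__(self, success_rate=0.01, error_threshold=0.01, rng=random.random):
        self.success_rate = success_rate
        self.error_threshold = error_threshold
        self._rng = rng
        self._totals = {}   # job_id -> events seen
        self._errors = {}   # job_id -> error events seen

    def error_rate(self, job_id):
        total = self._totals.get(job_id, 0)
        return self._errors.get(job_id, 0) / total if total else 0.0

    def should_forward(self, job_id, is_error):
        self._totals[job_id] = self._totals.get(job_id, 0) + 1
        if is_error:
            self._errors[job_id] = self._errors.get(job_id, 0) + 1
        # Errors always pass; so does everything from a job above the error threshold.
        if is_error or self.error_rate(job_id) > self.error_threshold:
            return True
        return self._rng() < self.success_rate


# A fixed draw of 0.5 makes the sampled path deterministic for the demonstration.
sampler = DynamicSampler(rng=lambda: 0.5)

healthy = [sampler.should_forward("job-healthy", is_error=False) for _ in range(100)]

first = sampler.should_forward("job-spiking", is_error=True)
second = sampler.should_forward("job-spiking", is_error=False)
```

The healthy job forwards nothing at this draw, while the spiking job forwards its successes too, which is the full-fidelity behaviour described above.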


// 07 — FAQ

Common
questions.

About log aggregation, schema enforcement, cost management, and how DataFlirt handles telemetry at scale.

Why not just use regex to parse plain text logs?
Because log formats change, and regex is brittle. When a developer adds a new variable to a print statement, your regex breaks. Structured logging pushes the schema definition to the emitter, ensuring the parser never has to guess what a string means.
What is a correlation ID and why do I need it?
A unique identifier (like a UUID) generated at the start of a scraping job and passed down to every sub-task, HTTP request, and extraction step. It allows you to query trace_id: 123 and see the exact sequence of events across multiple microservices that led to a failure.
How do you handle logging full HTML bodies for debugging?
We don't put them in the structured log. We write the HTML to an S3 bucket using the trace ID as the filename, and log the S3 URI in the structured event. Logging multi-megabyte strings destroys aggregator performance and inflates storage costs.
What happens if a log field changes type?
In systems like Elasticsearch, this can cause a mapping conflict, and the conflicting log event is rejected at index time. If proxy_port is an integer in one worker and a string in another, the index can reject values it cannot reconcile with the mapped type. Enforce strict schemas in your logger configuration.
How does DataFlirt manage log volume costs?
We use tail-based sampling. We buffer logs locally for 60 seconds. If the scrape succeeds, we drop the granular network logs and only forward a summary metric. If it fails, we forward the entire buffer, giving us the full context leading up to the error.
Should I log proxy credentials?
Never. Structured logging makes it dangerously easy to accidentally serialize an entire configuration object. Implement a global redaction filter in your logging library that masks keys matching password, secret, token, or auth before the JSON is serialized.
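A redaction pass of this kind can be sketched as a recursive filter applied just before serialization. The redact and serialize helpers, the "[REDACTED]" placeholder, and the sample config are assumptions for illustration; real logging libraries expose their own hooks for this step.

```python
import json

# Key fragments named in the answer above; matching is case-insensitive and by substring.
SENSITIVE_MARKERS = ("password", "secret", "token", "auth")


def redact(value):
    """Return a copy of `value` with any sensitive-looking keys masked, at any depth."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if any(marker in str(key).lower() for marker in SENSITIVE_MARKERS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def serialize(event):
    """Redact first, then serialize, so secrets never reach the JSON string."""
    return json.dumps(redact(event))


# Invented config object of the kind that gets serialized by accident.
config = {
    "proxy": {"host": "proxy.example", "proxy_password": "hunter2", "auth_token": "abc123"},
    "status": 403,
}
line = serialize(config)
parsed = json.loads(line)
```

Substring matching over-redacts on purpose: a key like author would also be masked, which is a cheaper mistake than leaking a credential.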