
What is Scraper Logging?

Scraper logging is the systematic capture of state, network, and extraction events emitted during a scraping job's lifecycle. In a production pipeline, logs are not just debug trails; they are the primary telemetry for detecting silent schema drift, proxy pool exhaustion, and anti-bot classifier shifts. Without structured, high-cardinality logging, a pipeline failure is a black box; with it, the same failure is a queryable incident.

Telemetry · Observability · Structured Logs · Debugging · Infrastructure
// 02 — definitions

Visibility into
the void.

When a pipeline drops from 10,000 records to 400, your logs are the only thing standing between a quick fix and a multi-day forensic nightmare.


TL;DR

Scraper logging captures the exact sequence of events — HTTP requests, proxy assignments, DOM parsing outcomes, and schema validations — that occur during a run. At scale, logs must be structured (JSON), centralized, and indexed. Grepping through flat text files does not work when you have 400 concurrent workers distributed across a Kubernetes cluster.

01 Definition & structure
Scraper logging is the practice of recording discrete events during a web scraping operation. A comprehensive log payload typically includes:
  • context — worker ID, timestamp, trace ID, target URL
  • network — proxy IP, ASN, TLS fingerprint, HTTP status, TTFB
  • extraction — schema version, fields matched, fields missing
  • outcome — success, soft block, hard ban, timeout
These logs form the foundation of pipeline observability, allowing engineers to diagnose failures without having to reproduce them locally.
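The four groups above can be sketched as a single typed record. This is a minimal illustration, not DataFlirt's actual schema: the class name `ScrapeEvent`, the field names, the `to_ndjson_line` helper, and the sample schema version are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ScrapeEvent:
    # context
    trace_id: str
    worker_id: str
    timestamp: str
    target_url: str
    # network
    proxy_ip: str
    proxy_asn: int
    tls_fingerprint: str
    http_status: int
    ttfb_ms: int
    # extraction
    schema_version: str
    fields_matched: list[str] = field(default_factory=list)
    fields_missing: list[str] = field(default_factory=list)
    # outcome: "success", "soft_block", "hard_ban", or "timeout"
    outcome: str = "success"


def to_ndjson_line(event: ScrapeEvent) -> str:
    """Serialise one event as a single JSON line (no embedded newlines)."""
    return json.dumps(asdict(event), separators=(",", ":"))


# Hypothetical event; schema version and field names are made up.
example = ScrapeEvent(
    trace_id="req_8f72b1a9",
    worker_id="node-us-east-14",
    timestamp="2026-05-19T08:14:22Z",
    target_url="https://target.com/p/1234",
    proxy_ip="203.0.113.42",
    proxy_asn=7922,
    tls_fingerprint="t13d1516h2_8daaf6152771",
    http_status=403,
    ttfb_ms=842,
    schema_version="v1",
    fields_missing=["price"],
    outcome="soft_block",
)
print(to_ndjson_line(example))
```

Keeping every group in one flat record means a single row answers "which worker, through which proxy, with what result" without joining across log streams.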
02 The shift to structured logging
Legacy scrapers print strings like "Fetched page 5 successfully". Modern infrastructure emits structured JSON objects. Structured logging separates the data from the message, allowing log aggregators (like Elasticsearch, Datadog, or ClickHouse) to index the fields. This means you can instantly query for http.status: 403 AND proxy.asn: 7922 to see if a specific ISP is being blocked by the target.
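As a rough sketch of that separation, the standard `logging` module can be given a formatter that emits JSON instead of a flat string. The `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name are assumptions for illustration; a real pipeline would ship these lines to an aggregator rather than filter them in memory.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object: message plus indexed fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # "fields" is a hypothetical attribute supplied via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("scraper.demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("page fetched", extra={"fields": {"http.status": 200, "proxy.asn": 7922}})
log.warning("request blocked", extra={"fields": {"http.status": 403, "proxy.asn": 7922}})
log.warning("request blocked", extra={"fields": {"http.status": 403, "proxy.asn": 3320}})

# Stand-in for the aggregator query: http.status: 403 AND proxy.asn: 7922
events = [json.loads(line) for line in stream.getvalue().splitlines()]
blocked = [e for e in events if e["http.status"] == 403 and e["proxy.asn"] == 7922]
print(blocked)
```

Note that Python's standard level name is WARNING rather than WARN; map it at the formatter if your aggregator expects the shorter form.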
03 Log levels in scraping
Proper use of log levels prevents storage exhaustion. TRACE is for local debugging (full HTML dumps, raw headers). DEBUG tracks granular steps (CSS selector evaluation). INFO records major milestones (job started, batch completed). WARN flags recoverable issues (proxy timeout, retry triggered, optional field missing). ERROR is reserved for terminal failures (schema drift, permanent IP ban) that require human intervention.
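One way to keep level usage consistent is to decide it in a single place. The sketch below maps the example events named above to levels; the event-type strings and the `level_for` function are hypothetical names, and defaulting unknown events to DEBUG is an assumption.

```python
# Event-type strings are illustrative; use whatever your workers emit.
TERMINAL = {"schema_drift", "permanent_ip_ban"}
RECOVERABLE = {"proxy_timeout", "retry_triggered", "optional_field_missing"}
MILESTONES = {"job_started", "batch_completed"}
LOCAL_ONLY = {"html_dump", "raw_headers"}


def level_for(event_type: str) -> str:
    """Return the log level name for a scraper event type."""
    if event_type in TERMINAL:
        return "ERROR"   # needs human intervention
    if event_type in RECOVERABLE:
        return "WARN"    # pipeline recovers on its own
    if event_type in MILESTONES:
        return "INFO"
    if event_type in LOCAL_ONLY:
        return "TRACE"   # local debugging only, never shipped
    return "DEBUG"       # granular steps such as selector evaluation


for name in ("schema_drift", "proxy_timeout", "batch_completed", "selector_evaluated"):
    print(name, level_for(name))
```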
04 How DataFlirt handles it
We treat logs as first-class data products. Every worker in our fleet emits NDJSON directly to a Kafka buffer, which drains into ClickHouse. We use aggressive dynamic sampling: 200 OKs are sampled at 1%, while 403s, CAPTCHAs, and extraction errors are captured at 100% with full context. This ensures our engineers have perfect visibility into pipeline degradation without paying to store billions of identical success messages.
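The sampling rule can be sketched as a per-event decision. This is an illustration under stated assumptions, not the production implementation: the function name `should_emit`, its parameters, and the use of a simple random draw are all made up for the example.

```python
import random

SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of clean successes


def should_emit(http_status: int, captcha: bool, extraction_ok: bool,
                rng: random.Random) -> bool:
    """Decide whether a worker ships this event to the log pipeline."""
    clean_success = http_status == 200 and not captcha and extraction_ok
    if not clean_success:
        # 403s, CAPTCHAs, and extraction errors are always captured.
        return True
    return rng.random() < SUCCESS_SAMPLE_RATE


rng = random.Random(0)  # seeded only to make the demo repeatable
kept = sum(should_emit(200, False, True, rng) for _ in range(100_000))
print(f"kept {kept} of 100000 success events")
print(should_emit(403, False, False, rng))  # True: failures are never sampled out
```

A random draw is the simplest choice. Hashing the trace ID instead would keep or drop every event of a given request together, which can matter when one request emits several log lines.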
05 The cost of over-logging
A common failure mode for junior data teams is logging the raw HTML response of every page to their central logging server. At 100 requests per second, with an average page size of 200 KB, you are generating over 1.7 TB of logs per day. This causes disk I/O bottlenecks on the workers, spikes cloud egress costs, and makes the logging cluster too slow to query when an actual incident occurs.
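The arithmetic behind that figure, as a quick check. The function name is hypothetical and the calculation assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes).

```python
SECONDS_PER_DAY = 86_400


def raw_html_bytes_per_day(requests_per_second: int, avg_page_bytes: int) -> int:
    """Bytes of raw HTML logged per day if every response is stored."""
    return requests_per_second * avg_page_bytes * SECONDS_PER_DAY


daily_bytes = raw_html_bytes_per_day(100, 200_000)  # 100 req/s at 200 KB each
print(f"{daily_bytes / 1e12:.3f} TB/day")
```

That is 20 MB of log writes per second before any compression, replication, or indexing overhead is counted.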
// 03 — telemetry math

Calculating log
volume and cost.

High-concurrency pipelines generate terabytes of telemetry. DataFlirt models log ingestion rates to balance debuggability against storage costs, aggressively sampling success paths while retaining 100% of failure traces.

Log Volume per Job: V = R × (S_req + S_ext + S_err)
Records fetched (R) multiplied by the combined average byte size of the request, extraction, and error logs emitted per record. (Infrastructure sizing model)
Dynamic Sampling Ratio: P_sample = 1.0 − (Successes / Total) × 0.99
The overall fraction of events retained when you keep 1% of successful request logs but 100% of errors and retries. (DataFlirt telemetry config)
Log Retention Cost: C = (V_hot × $0.08) + (V_cold × $0.01)
Hot storage (Elasticsearch/ClickHouse) versus cold storage (S3 archive), priced per GB. (Cloud economics)
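The three formulas translate directly into code. The function names and the sample inputs below are assumptions chosen for illustration; only the formulas and the per-GB prices come from the text above.

```python
def log_volume_bytes(records: int, s_req: float, s_ext: float, s_err: float) -> float:
    """V = R x (S_req + S_ext + S_err), with sizes in bytes per record."""
    return records * (s_req + s_ext + s_err)


def sampling_ratio(successes: int, total: int) -> float:
    """P_sample = 1.0 - (Successes / Total) x 0.99."""
    return 1.0 - (successes / total) * 0.99


def retention_cost_usd(v_hot_gb: float, v_cold_gb: float) -> float:
    """C = (V_hot x $0.08) + (V_cold x $0.01), with volumes in GB."""
    return v_hot_gb * 0.08 + v_cold_gb * 0.01


# Hypothetical job: 10,000 records with 300 + 200 + 50 bytes of logs each.
print(log_volume_bytes(10_000, 300, 200, 50))   # bytes before sampling
# 9,000 of 10,000 requests succeed: about 11% of events are retained.
print(round(sampling_ratio(9_000, 10_000), 3))
# 1,000 GB hot and 5,000 GB cold.
print(round(retention_cost_usd(1_000, 5_000), 2))
```

The sampling ratio shows why healthy pipelines are cheap to log: the higher the success rate, the closer the retained fraction falls toward 1%.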
// 04 — structured trace

A single request,
fully contextualised.

A parsed log payload from a DataFlirt worker encountering a soft block. Notice the cardinality: proxy IP, target URL, schema version, and fingerprint hash are all queryable dimensions.

JSON · ClickHouse · Trace ID
// log.timestamp: 2026-05-19T08:14:22Z
level: "WARN"
trace_id: "req_8f72b1a9"
worker_id: "node-us-east-14"
target.url: "https://target.com/p/1234"

// network context
proxy.exit_ip: "203.0.113.42"
proxy.asn: "AS7922 · Comcast"
tls.ja4: "t13d1516h2_8daaf6152771"

// execution context
http.status: 403
http.ttfb_ms: 842
anti_bot.detected: true
anti_bot.vendor: "Cloudflare"

// action taken
action: "quarantine_proxy_ip"
retry.queued: true
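The dotted keys in the trace above are what make each dimension a queryable column. Assuming a worker holds the event as a nested dictionary (an assumption, since the internal representation is not shown here), a small helper can flatten it into that dotted form. The `flatten` function is hypothetical.

```python
def flatten(event: dict, prefix: str = "") -> dict:
    """Turn {"http": {"status": 403}} into {"http.status": 403}."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


nested = {
    "level": "WARN",
    "trace_id": "req_8f72b1a9",
    "http": {"status": 403, "ttfb_ms": 842},
    "anti_bot": {"detected": True, "vendor": "Cloudflare"},
    "retry": {"queued": True},
}
for key, value in flatten(nested).items():
    print(f"{key}: {value!r}")
```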
// 05 — log volume drivers

Where the gigabytes
actually come from.

The components of a scraping pipeline that generate the most telemetry. Unbounded logging in these areas causes disk exhaustion and ingestion bottlenecks.

PIPELINES: 300+ active
RETENTION: 14-day hot
UPDATED: 2026-05-19
01 Raw HTML payload dumps
highest volume · saving the DOM on extraction failure
02 Network trace / HAR files
high frequency · capturing all XHRs in headless mode
03 Proxy rotation events
high frequency · handshake and connection timings
04 DOM selector evaluation
verbose · tracing which XPath matched what
05 Schema validation warnings
moderate · type coercion and missing fields
// 06 — DataFlirt's telemetry stack

Log everything on failure,

sample aggressively on success.

DataFlirt's logging architecture is built on ClickHouse for high-throughput ingestion. We do not log flat strings; every worker emits strongly typed JSON. When a request succeeds, we log a minimal 120-byte summary. When a request fails, we capture the full context: the proxy exit node, the JA4 TLS fingerprint, the exact CSS selector that missed, and a compressed snapshot of the DOM. This asymmetric logging strategy gives us perfect forensic visibility without bankrupting our infrastructure.

worker-telemetry-node-04

Live ingestion metrics for a single scraping worker.

ingest.rate: 4.2 MB/s
events.success: sampled at 1% · ok
events.error: sampled at 100% · verbose
format: NDJSON
destination: ClickHouse cluster
buffer.queue: 12 MB · healthy

// 07 — FAQ

Common
questions.

About logging strategies, structured data, compliance, and how DataFlirt manages telemetry at scale.

Why not just use standard console.log or print statements?
Flat text logs are effectively unqueryable at scale. If you have 50 workers scraping a site and you want to find every 403 that occurred on a specific proxy ASN in the last hour, grepping across 50 servers is slow, brittle, and impractical during an incident. Structured logging (JSON) lets you index fields like http.status and proxy.asn in a database, so the same question becomes a single indexed query.
Should we log the raw HTML of every page we scrape?
No. Logging the raw HTML of every successful request will exhaust your storage budget within days. You should only log the raw HTML (or a compressed snapshot) when an extraction fails. This gives your engineers the exact DOM state needed to fix the broken selector, without storing terabytes of redundant data.
Is it legal to log PII if we scrape it by accident?
Under GDPR and CCPA, logging PII (Personally Identifiable Information) subjects your log storage to the same compliance requirements as your main database. If your scraper accidentally ingests PII, and you dump the raw response into your logs, your logs are now toxic. You must implement log masking or redaction at the worker level before the payload is shipped to centralized storage.
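Worker-level redaction can be as simple as a masking pass applied before the payload leaves the process. The sketch below is deliberately narrow: the two regular expressions and the `redact` function are illustrative assumptions that only catch common email and phone formats, and a production system would need broader detection and review.

```python
import re

# Illustrative patterns only; they do not cover every PII format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask emails and phone-like numbers before the text is logged."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)


raw = "Contact jane.doe@example.com or +1 (415) 555-0100"
print(redact(raw))
```

Running the mask on the worker, rather than in the log store, means unredacted text never crosses the network or lands on a shared disk.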
How does DataFlirt handle logging for headless browser sessions?
Headless browsers generate massive amounts of noise — every image request, font load, and tracking pixel emits an event. We filter out static asset requests at the Playwright interception layer. We only log the main document request, XHR/fetch calls that return JSON, and console errors emitted by the target page's own JavaScript.
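The filtering rule can be expressed as a small predicate. This is a sketch, not the actual interception code: `should_log_request` is a hypothetical function, and the resource-type strings are assumed to follow the values Playwright reports for a request ("document", "xhr", "fetch", "image", "font", and so on). Wiring it into a route or response handler, and capturing console errors, is left out.

```python
def should_log_request(resource_type: str, content_type: str = "") -> bool:
    """Return True only for requests worth a log line."""
    if resource_type == "document":
        # Note: sub-frame documents share this type; restricting to the
        # main frame would need an extra check in the caller.
        return True
    if resource_type in ("xhr", "fetch"):
        return "json" in content_type.lower()
    # Images, fonts, stylesheets, media, scripts, tracking pixels: noise.
    return False


samples = [
    ("document", "text/html"),
    ("fetch", "application/json; charset=utf-8"),
    ("xhr", "text/plain"),
    ("image", "image/png"),
    ("font", "font/woff2"),
]
for rtype, ctype in samples:
    print(rtype, ctype, should_log_request(rtype, ctype))
```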
How do you manage log volume on pipelines doing 10M+ requests a day?
Through dynamic sampling and tiered retention. We log 100% of errors, retries, and schema validation failures. For successful 200 OK requests that pass extraction, we log a 1% sample just to prove the pipeline is flowing. Hot logs are kept in ClickHouse for 14 days for immediate debugging, then rolled into compressed S3 archives for 90 days.
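The retention tiers can be sketched as a function of log age. The `storage_tier` name is hypothetical, and the sketch assumes the 90 archive days are counted after the log leaves the 14-day hot tier; if the 90 days are measured from ingestion instead, adjust the second boundary.

```python
HOT_DAYS = 14    # queryable in ClickHouse
COLD_DAYS = 90   # compressed S3 archive (assumed to start after the hot window)


def storage_tier(age_days: int) -> str:
    """Return where a log of the given age should live."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= HOT_DAYS + COLD_DAYS:
        return "cold"
    return "expired"


for age in (0, 14, 15, 104, 105):
    print(age, storage_tier(age))
```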
What is a trace ID and why is it important in scraping?
A trace ID is a unique identifier generated when a URL is pulled from the queue. It is attached to the HTTP request, the proxy assignment, the extraction event, and the database write. If a record looks wrong in the final dataset, the trace ID allows you to pull up the exact sequence of events that produced it, across multiple microservices.
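A small sketch of that propagation: the ID is minted once at dequeue time and stamped on every later event. The function names, the stage names, and the `req_` prefix format are assumptions for illustration.

```python
import uuid


def new_trace_id() -> str:
    """Mint an ID when a URL is pulled from the queue."""
    return f"req_{uuid.uuid4().hex[:8]}"


def stage_event(trace_id: str, stage: str, **fields) -> dict:
    """Build one log event that carries the shared trace ID."""
    return {"trace_id": trace_id, "stage": stage, **fields}


def timeline(events: list[dict], trace_id: str) -> list[str]:
    """Reconstruct the ordered stages recorded for one trace."""
    return [e["stage"] for e in events if e["trace_id"] == trace_id]


tid_a, tid_b = new_trace_id(), new_trace_id()
events = [
    stage_event(tid_a, "http_request", status=200),
    stage_event(tid_b, "http_request", status=403),
    stage_event(tid_a, "proxy_assignment", asn=7922),
    stage_event(tid_a, "extraction", fields_missing=["price"]),
    stage_event(tid_a, "db_write", rows=1),
]
print(timeline(events, tid_a))
```

In a multi-service pipeline the same ID would travel in a message header or queue payload so that each service can stamp it on its own events.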