
What is a Kubernetes Scraping Cluster?

A Kubernetes scraping cluster is a distributed orchestration environment tuned specifically for high-throughput data extraction workloads. Unlike standard microservices, scraping clusters must handle massive egress bandwidth, volatile memory spikes from headless browsers, and rapid IP rotation at the pod level. It's the standard infrastructure pattern for scaling beyond a single machine, allowing you to run thousands of concurrent spiders while isolating browser crashes and proxy failures from the rest of your pipeline.

Orchestration · K8s · Auto-scaling · Headless Browsers · Egress

// 02 — definitions

Orchestrating
the fleet.

How containerised extraction workers are scheduled, scaled, and isolated to survive the chaos of the modern web.


TL;DR

A Kubernetes scraping cluster manages the lifecycle of extraction pods — scheduling lightweight HTTP fetchers alongside heavy Playwright containers. It handles auto-scaling based on queue depth, restarts crashed browser instances automatically, and routes egress traffic through proxy sidecars to maintain IP diversity at scale.

01 Definition & structure

A Kubernetes scraping cluster is a container orchestration system configured to run distributed web scraping workloads. The architecture typically consists of a control plane (managing state), worker nodes (providing compute), and pods (running the actual scraper code). Key components include message queues (like Redis or Kafka) for distributing URLs, proxy sidecars for network routing, and persistent volumes for temporary data storage before it is shipped to a data lake.

02 Handling browser volatility

Headless browsers driven by Puppeteer or Playwright are notorious for memory leaks and zombie processes. In a traditional VM setup, a leaked browser process requires manual intervention or complex supervisor scripts. Kubernetes handles this natively through cgroups. When you set a strict resources.limits.memory on the container, the kernel's OOM killer terminates it as soon as the browser exceeds its budget, Kubernetes records the termination as OOMKilled, and the container is restarted (or the pod is replaced by its controller) so that a fresh, clean browser takes its place.
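A minimal sketch of what such a memory ceiling looks like in a pod manifest, built as a Python dict and printed as JSON (which kubectl accepts alongside YAML). The pod name, image path, 2Gi limit, and CPU request are illustrative assumptions, not values taken from DataFlirt's configuration.

```python
import json


def browser_pod_manifest(name: str, image: str, memory: str = "2Gi",
                         cpu_request: str = "500m") -> dict:
    """Build a Pod manifest whose browser container has a hard memory ceiling."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "playwright-worker"}},
        "spec": {
            # Restart the container in place after it is OOM-killed.
            "restartPolicy": "Always",
            "containers": [{
                "name": "browser",
                "image": image,  # hypothetical image path
                "resources": {
                    # The request is what the scheduler reserves on the node.
                    "requests": {"memory": memory, "cpu": cpu_request},
                    # The limit is the cgroup ceiling; exceeding it triggers the OOM kill.
                    "limits": {"memory": memory},
                },
            }],
        },
    }


manifest = browser_pod_manifest(
    "playwright-worker-demo",
    "registry.example.com/playwright-worker:latest",
)
print(json.dumps(manifest, indent=2))
```

Setting the memory request equal to the limit means the scheduler reserves exactly the amount the container is allowed to use, so the node is never over-committed on memory by this pod.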

03 Network routing & IP management

In a scraping cluster, egress traffic is more complex than in standard microservices. If thousands of pods leave the cluster through the same node or NAT gateway IP, target sites will quickly block that address. Clusters handle this by routing pod traffic through proxy sidecars or egress gateways that tunnel requests to external residential or datacenter proxy pools, so the target site sees a diverse set of IP addresses rather than the cluster's own egress addresses.

04 How DataFlirt handles it

We operate multi-tenant Kubernetes clusters optimized purely for extraction. We use Custom Resource Definitions (CRDs) to define "ScrapeJobs", which automatically provision the exact ratio of HTTP fetchers to headless browsers needed for a specific target. Our entire worker fleet runs on spot instances, driven by KEDA autoscalers that watch our Kafka topic depths. If a node is preempted, the unacknowledged messages are instantly reassigned, resulting in zero data loss.

05 The "noisy neighbor" problem

A common pitfall in scraping clusters is the "noisy neighbor" effect, where a CPU-intensive pod (e.g., parsing a massive JSON payload or executing complex XPath queries) starves other pods on the same node. This leads to cascading timeout errors across the cluster. Mitigation requires setting accurate resources.requests.cpu to ensure Kubernetes schedules pods across nodes evenly, rather than packing them tightly based only on memory availability.
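A rough sketch of why the CPU request matters for placement: the scheduler fits pods onto a node by whichever requested resource runs out first. The node size and per-pod figures below are assumptions for illustration, and the sketch ignores system-reserved capacity.

```python
def pods_per_node(node_cpu_m: int, node_mem_mi: int,
                  pod_cpu_request_m: int, pod_mem_request_mi: int) -> int:
    """How many identical pods fit on one node, judged by requests alone."""
    by_cpu = node_cpu_m // pod_cpu_request_m
    by_mem = node_mem_mi // pod_mem_request_mi
    return min(by_cpu, by_mem)


# Hypothetical node: 8 cores (8000 millicores) and 32 GiB (32768 MiB).
NODE_CPU_M, NODE_MEM_MI = 8000, 32768

# Understated CPU request (100m): memory becomes the only effective constraint.
packed = pods_per_node(NODE_CPU_M, NODE_MEM_MI, 100, 2048)

# Accurate CPU request (1000m): CPU now limits how many pods land on the node.
spread = pods_per_node(NODE_CPU_M, NODE_MEM_MI, 1000, 2048)

print(f"understated request: {packed} pods/node")
print(f"accurate request:    {spread} pods/node")
```

With the understated request, 16 pods that each really need a full core are packed onto an 8-core node, which is the starvation scenario described above. With the accurate request, the scheduler stops at 8 and places the rest elsewhere.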

// 03 — cluster sizing

How many nodes
do you need?

Sizing a scraping cluster requires balancing CPU for parsing against memory for rendering. DataFlirt's auto-scaler uses these baseline formulas to provision spot instances dynamically based on pipeline demands.

Pod memory budget (container resource limits)
M = base_os + (tabs × tab_overhead)
A Playwright-driven browser needs ~200MB per active tab. Under-provisioning guarantees OOM kills.

Cluster concurrency limit (capacity planning model)
C = (node_count × node_mem) / M
Memory is almost always the binding constraint for headless workloads, not CPU.

DataFlirt scale factor (internal HPA configuration)
N = queue_depth / (C × target_time)
Drives our Horizontal Pod Autoscaler (HPA) to add pods before the queue backs up.
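The three formulas can be worked through directly. The sketch below uses the ~200MB per-tab figure from the text. Every other input (base OS overhead, tab count, node count, node memory, queue depth, and a target time expressed in seconds) is an assumed value chosen to keep the arithmetic easy to follow.

```python
def pod_memory_budget(base_os_mb: float, tabs: int,
                      tab_overhead_mb: float = 200.0) -> float:
    """M = base_os + (tabs × tab_overhead)"""
    return base_os_mb + tabs * tab_overhead_mb


def cluster_concurrency(node_count: int, node_mem_mb: float, m: float) -> float:
    """C = (node_count × node_mem) / M"""
    return (node_count * node_mem_mb) / m


def scale_factor(queue_depth: int, c: float, target_time: float) -> float:
    """N = queue_depth / (C × target_time)"""
    return queue_depth / (c * target_time)


# Assumed inputs: 450 MB base overhead, 5 tabs, 10 nodes of 14,500 MB each,
# 60,000 queued URLs, and a 600-second target time.
M = pod_memory_budget(base_os_mb=450, tabs=5)
C = cluster_concurrency(node_count=10, node_mem_mb=14_500, m=M)
N = scale_factor(queue_depth=60_000, c=C, target_time=600)

print(f"M = {M} MB per pod")
print(f"C = {C} concurrent pods")
print(f"N = {N}")
```

Doubling the queue depth with everything else fixed doubles N, which is the signal to add capacity before the backlog grows.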

// 04 — kubectl get events

A browser pod's
short, violent life.

Headless browsers leak memory and crash. Kubernetes expects this. Here is a trace of a Playwright worker hitting its memory limit and being seamlessly replaced by its ReplicaSet.

OOMKilled · HPA scaling · Node affinity


// pod lifecycle events
pod/playwright-worker-7b8f: Scheduled on node-pool-spot-4a
container/browser: Started PID 1402

// memory consumption tracking
usage.ram: 450Mi // baseline
usage.ram: 1.2Gi // 5 tabs open
usage.ram: 1.9Gi // memory leak detected in SPA target

// limit reached
event: OOMKilled
reason: "Memory cgroup out of memory"
action: Killing container browser

// orchestration recovery
replicaset/playwright-worker: State: 39/40 ready
pod/playwright-worker-9c2d: Created // replacement spawned
queue.requeue: Success "5 URLs returned to Redis"
cluster.status: Nominal

// 05 — failure domains

Where clusters
break down.

The most common failure modes in distributed scraping infrastructure, ranked by frequency across DataFlirt's managed Kubernetes fleets.

Nodes monitored: 1,200+ instances
Pod churn: 40k restarts/day
Updated: 2026-05-19

01 OOMKilled (Out of Memory)
container limit · Browser memory leaks exceeding pod requests/limits

02 DNS resolution timeouts
network layer · CoreDNS overwhelmed by high-concurrency fetchers

03 SNAT port exhaustion
egress limit · Too many outbound connections from a single node IP

04 CPU throttling
cgroup limit · Parsing heavy JSON/DOMs without adequate CPU shares

05 Spot instance preemption
node lifecycle · Cloud provider reclaiming cheap compute nodes

// 06 — DataFlirt's control plane

Ephemeral by design,

resilient by orchestration.

We run our extraction fleet on a heavily customized Kubernetes distribution. Browser pods are treated as highly disposable — they live for exactly 100 requests before being cycled to prevent memory fragmentation and fingerprint staleness. Network routing is handled by a custom CNI plugin that binds specific proxy exit nodes to individual pods, ensuring that a single IP ban never cascades across the cluster.
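The fixed-lifetime idea can be sketched in-process: a wrapper counts requests and replaces the browser session once the budget is spent. This is an analogy only. The text describes cycling whole pods, and the `FakeSession` class and `launch` callable here are hypothetical stand-ins for a real browser.

```python
from typing import Callable


class FakeSession:
    """Hypothetical stand-in for a browser session."""

    def __init__(self) -> None:
        self.closed = False

    def fetch(self, url: str) -> str:
        return f"fetched {url}"

    def close(self) -> None:
        self.closed = True


class RecyclingWorker:
    """Replaces its session after a fixed number of requests."""

    def __init__(self, launch: Callable[[], FakeSession], max_requests: int = 100) -> None:
        self._launch = launch
        self._max_requests = max_requests
        self._session = None
        self._served = 0
        self.launches = 0  # how many sessions have been started so far

    def fetch(self, url: str) -> str:
        # Start a session on first use, or swap it once the budget is spent.
        if self._session is None or self._served >= self._max_requests:
            self._recycle()
        self._served += 1
        return self._session.fetch(url)

    def _recycle(self) -> None:
        if self._session is not None:
            self._session.close()
        self._session = self._launch()
        self._served = 0
        self.launches += 1


worker = RecyclingWorker(FakeSession, max_requests=100)
for i in range(250):
    worker.fetch(f"https://example.com/item/{i}")
print(f"sessions launched for 250 requests: {worker.launches}")
```

A hard request budget trades a little startup overhead for predictable memory use, since no session lives long enough for a slow leak to reach the limit.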

Cluster telemetry snapshot

Live metrics from a DataFlirt extraction cluster running a retail catalog pipeline.

cluster.nodes 142 spot instances

pods.active 3,400 workers

hpa.target queue_depth_per_pod < 50

dns.cache_hit 99.4% (nodelocal)

egress.bandwidth 4.2 Gbps

oom_kills.1h 127 pods (auto-recovered)

pipeline.status healthy


// 07 — FAQ

Common
questions.

About cluster architecture, auto-scaling, handling browser volatility, and how DataFlirt manages distributed extraction at scale.


Why use Kubernetes instead of simple EC2 instances or Celery workers?

Kubernetes provides declarative state and process isolation. If a Playwright worker leaks memory and crashes on a bare EC2 instance, it might take down the whole machine. In K8s, the pod is isolated via cgroups, killed cleanly, and instantly replaced. It also allows us to bin-pack lightweight HTTP scrapers alongside heavy browser scrapers on the same nodes to maximise compute utilisation.

What is the best metric for auto-scaling a scraping cluster?

Queue depth, not CPU or memory. CPU usage is a lagging indicator in scraping: if you scale on CPU, the queue backs up badly before new capacity comes online. We use KEDA (Kubernetes Event-driven Autoscaling), which feeds the length of our Kafka/Redis extraction queues to the Horizontal Pod Autoscaler as the scaling metric.
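At its core, queue-based scaling is a ceiling division of backlog by a per-pod target, clamped to a replica range. The sketch below shows only that core calculation. The real Horizontal Pod Autoscaler also applies tolerances and stabilization windows, and the replica bounds here are assumptions.

```python
import math


def desired_replicas(queue_depth: int, target_per_pod: int,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Replicas needed so each pod holds at most target_per_pod queued items."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_depth / target_per_pod)
    # Clamp to the configured range so the fleet never scales to zero or runs away.
    return max(min_replicas, min(max_replicas, wanted))


# 170,000 queued URLs at a target of 50 per pod, with an assumed cap of 5,000 pods.
print(desired_replicas(170_000, 50, min_replicas=1, max_replicas=5_000))
```

Because the input is backlog rather than utilization, the replica count rises as soon as URLs arrive instead of waiting for workers to saturate.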

How do you handle proxy rotation within a Kubernetes cluster?

We deploy proxy routing as a sidecar container or via a DaemonSet. The scraper code simply makes requests to localhost:8080. The sidecar intercepts the traffic, attaches the correct authentication headers, and routes it through the external proxy pool. This decouples proxy logic from extraction logic.
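From the scraper's side, the decoupling amounts to pointing the HTTP client at the local proxy address and nothing more. A minimal sketch with Python's standard library follows. The port matches the localhost:8080 address mentioned above, and the assumption that the sidecar speaks plain HTTP proxy protocol is ours.

```python
import urllib.request

# Address the sidecar listens on inside the pod (from the text above).
SIDECAR_PROXY = "http://localhost:8080"


def proxy_settings(proxy_url: str = SIDECAR_PROXY) -> dict:
    """Send both HTTP and HTTPS requests to the same local proxy."""
    return {"http": proxy_url, "https": proxy_url}


def build_sidecar_opener(proxy_url: str = SIDECAR_PROXY) -> urllib.request.OpenerDirector:
    """Create an opener that routes every request through the sidecar."""
    handler = urllib.request.ProxyHandler(proxy_settings(proxy_url))
    return urllib.request.build_opener(handler)


opener = build_sidecar_opener()

# In a pod with the sidecar running, a fetch would look like this:
# with opener.open("https://example.com/", timeout=10) as resp:
#     html = resp.read()
```

The scraper holds no proxy credentials and no rotation logic, so swapping proxy providers means redeploying the sidecar rather than touching extraction code.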

How does DataFlirt manage DNS resolution issues at high concurrency?

Standard CoreDNS deployments collapse under the DNS query volume of a high-throughput scraping cluster. We deploy NodeLocal DNSCache on every node. This intercepts DNS queries locally, caches them, and prevents the cluster's central DNS service from becoming a bottleneck that causes ERR_NAME_NOT_RESOLVED failures.

Can I run a scraping cluster entirely on spot/preemptible instances?

Yes, and you should. Scraping is an inherently fault-tolerant, asynchronous workload. If a spot node is reclaimed by AWS/GCP, the pods die, the unacknowledged URLs return to the message queue, and they are picked up by other nodes. Running on spot instances reduces compute costs by 60–80%.
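The recovery path relies on acknowledgement semantics: a URL is removed from the queue only once a worker confirms it is done. Here is an in-memory sketch of that contract, with hypothetical class and method names. A real deployment would get this behaviour from its message broker.

```python
from collections import deque
from typing import Optional


class AckQueue:
    """Toy work queue: leased URLs return to the queue if their worker disappears."""

    def __init__(self, urls) -> None:
        self._ready = deque(urls)
        self._in_flight = {}  # url -> worker currently holding it

    def lease(self, worker: str) -> Optional[str]:
        if not self._ready:
            return None
        url = self._ready.popleft()
        self._in_flight[url] = worker
        return url

    def ack(self, url: str) -> None:
        # Only an explicit ack removes the URL for good.
        self._in_flight.pop(url, None)

    def worker_lost(self, worker: str) -> int:
        """Requeue everything the lost worker had leased but not acknowledged."""
        lost = [u for u, w in self._in_flight.items() if w == worker]
        for url in lost:
            del self._in_flight[url]
            self._ready.append(url)
        return len(lost)

    def pending(self) -> int:
        return len(self._ready) + len(self._in_flight)


queue = AckQueue(f"https://example.com/p/{i}" for i in range(5))

# A worker on a spot node leases three URLs, finishes one, then the node is reclaimed.
leased = [queue.lease("spot-a") for _ in range(3)]
queue.ack(leased[0])
requeued = queue.worker_lost("spot-a")

# A surviving worker drains whatever is left.
while (url := queue.lease("spot-b")) is not None:
    queue.ack(url)

print(f"requeued after preemption: {requeued}, still pending: {queue.pending()}")
```

This gives at-least-once delivery: a URL that was fetched but not yet acknowledged when the node died will be fetched again, so downstream storage should deduplicate.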

How do you prevent 'noisy neighbor' problems between pods?

Strict resource requests and limits. We set memory limits equal to memory requests for browser pods so they cannot burst past their reservation and starve other pods. For CPU, we allow bursting above the request, which places these pods in the Burstable QoS class (Guaranteed requires CPU and memory requests to equal their limits on every container). We also use node affinity rules to ensure heavy Playwright pods aren't scheduled on the same nodes as latency-sensitive control plane services.
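Which QoS class a pod receives depends on both its CPU and its memory settings, so it helps to check a spec against the rules. The function below is a simplified sketch of the classification Kubernetes documents. It compares quantity strings literally (so "1" and "1000m" would not match), and the example values are assumptions.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' {"requests": ..., "limits": ...} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Memory pinned, CPU free to burst (no CPU limit).
bursting_cpu = [{"requests": {"memory": "2Gi", "cpu": "500m"},
                 "limits": {"memory": "2Gi"}}]

# Both resources pinned.
fully_pinned = [{"requests": {"memory": "2Gi", "cpu": "1"},
                 "limits": {"memory": "2Gi", "cpu": "1"}}]

print(qos_class(bursting_cpu))
print(qos_class(fully_pinned))
print(qos_class([{}]))
```

The class matters under node memory pressure: BestEffort pods are evicted first and Guaranteed pods last, so pinning both resources is the strongest protection for a pod you cannot afford to lose.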
