← Glossary / Apache Spark

What is Apache Spark?

Apache Spark is a distributed, in-memory compute engine used to process massive datasets that exceed the memory capacity of a single machine. In scraping pipelines, it sits downstream of the extraction layer, handling deduplication, schema validation, and complex joins across billions of raw JSON records. When your daily crawl output hits the terabyte scale, single-node Python scripts choke; Spark distributes that workload across a cluster to deliver clean, queryable data within your SLA.

Distributed ComputeETLIn-Memory ProcessingBig DataPySpark
// 02 — definitions

Scale beyond
the single node.

When scraping yields billions of records, processing them requires a paradigm shift from sequential scripts to distributed, fault-tolerant compute graphs.

Ask a DataFlirt engineer →

TL;DR

Apache Spark processes large-scale data by distributing tasks across a cluster of worker nodes. It uses lazy evaluation to build an optimized execution plan (DAG) before running any calculations. For enterprise scraping operations, it is the standard engine for transforming raw, messy S3 dumps into clean, structured Parquet or Delta tables.

01Definition & structure
Apache Spark is an open-source, distributed computing system designed for fast computation on large-scale data. It operates on a master-worker architecture: a central Driver program coordinates the execution, while multiple Executor nodes process data in parallel. Data is abstracted into resilient distributed datasets (RDDs) or DataFrames, which are partitioned across the cluster. This allows operations like filtering, joining, and aggregating to happen simultaneously across hundreds of machines.
02How it works in practice
In a data pipeline, Spark reads raw files (JSON, CSV, HTML) from distributed storage like S3 or HDFS. Because of lazy evaluation, applying transformations doesn't trigger immediate computation. Spark builds a Directed Acyclic Graph (DAG) of the operations. Once an action (like write.parquet()) is called, the Catalyst Optimizer analyzes the DAG, minimizes data movement, and pushes the compiled tasks to the executors. The executors process their assigned partitions in memory, spilling to disk only if RAM is exhausted.
03The shuffle bottleneck
The most expensive operation in Spark is the shuffle. Narrow transformations (like map or filter) happen locally on a single partition. Wide transformations (like groupBy, join, or dropDuplicates) require data with the same key to be co-located on the same executor. This forces massive amounts of data to be serialized, sent across the network, and deserialized. Poorly optimized shuffles are the primary cause of slow jobs and cluster crashes.
04How DataFlirt handles it
We use Spark exclusively for the downstream ETL of our high-volume pipelines. When a client requests a daily crawl of 50 million product listings, the raw JSON is dumped into an S3 bucket. We spin up an ephemeral EMR cluster, run a PySpark job to enforce schema contracts, deduplicate records, and normalize currencies, then write the output to a Delta table. The cluster is terminated immediately after the job succeeds, ensuring we only pay for the exact compute seconds required to process the batch.
05Did you know?
Spark is not a database. It has no native storage engine. It relies entirely on external storage systems (like S3, GCS, HDFS, or Cassandra) to read and write data. This separation of compute and storage is what makes it so cost-effective for cloud architectures—you can scale your storage infinitely without having to pay for always-on compute nodes.
// 03 — cluster sizing

How to tune
Spark for ETL.

Spark performance is dictated by memory management and partition sizing. The math below reflects how DataFlirt provisions ephemeral EMR clusters for daily batch processing of scraped datasets.

Target partition count = P = Total_Data_Size / 128 MB
Aim for 128–200MB per partition to avoid the small files problem and optimize HDFS/S3 reads. Spark Tuning Best Practices
Executor memory overhead = Moverhead = max(Mexecutor · 0.10, 384)
Off-heap memory allocated for VM overhead and interned strings. Crucial to prevent YARN/Kubernetes from killing containers. Apache Spark Configuration
DataFlirt deduplication SLA = Tdedup = (Nrecords · Chash) / (Ecores · Enodes)
Time to deduplicate scales inversely with total executor cores, assuming data is evenly partitioned. Internal Pipeline SLO
// 04 — job execution trace

Deduplicating 1.4B
records in 12 minutes.

A PySpark job trace running on a 20-node cluster. The job reads raw JSON from a daily real estate crawl, drops duplicates based on listing IDs, and writes compressed Parquet to the silver layer.

PySparkDAG SchedulerS3 I/O
edge.dataflirt.io — live
CAPTURED
// spark-submit --deploy-mode cluster --num-executors 20 dedup_job.py
job.id: "app-20260519-0814"
dag.scheduler: built execution plan (3 stages)

// stage 0: read & parse
task.read: "s3://df-raw/real-estate/2026-05-19/*.json.gz"
records.input: 1,482,931,044
partitions.initial: 4,200

// stage 1: shuffle & deduplicate
transform: drop_duplicates(subset=["listing_id", "price"])
shuffle.write: 412.4 GB // network I/O peak
shuffle.read: 412.4 GB
spill.disk: 0 bytes // memory tuned correctly

// stage 2: write output
task.write: "s3://df-clean/real-estate/parquet/"
records.output: 1,204,811,902
dropped.duplicates: 278,119,142

job.status: SUCCESS duration: 12m 41s
// 05 — performance bottlenecks

Where Spark jobs
choke and die.

Spark is powerful but unforgiving of poorly structured data. Ranked by frequency of job failures across DataFlirt's data engineering pipelines.

JOBS ANALYZED ·  ·  ·  ·  14,200+
CLUSTER TYPE ·  ·  ·  ·   AWS EMR
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Data Skew

OOM errors · One partition has 100x more data than others, crashing its executor
02

Excessive Shuffling

Network bound · Wide transformations (joins, groupBys) saturating cluster bandwidth
03

Small Files Problem

S3 overhead · Reading 100,000 tiny JSON files destroys driver metadata memory
04

Driver OOM

Memory limit · Calling collect() on a massive DataFrame instead of writing to storage
05

Garbage Collection

CPU pauses · JVM GC pauses freezing executors and triggering timeout disconnects
// 06 — our architecture

Compute decoupled from storage,

scaling the transform layer independently.

DataFlirt uses Spark to process our largest daily crawls. By decoupling the compute cluster from our S3 data lake, we spin up massive ephemeral clusters just long enough to deduplicate, normalize, and write out Delta tables, then tear them down immediately. This architecture keeps our cost-per-record low while guaranteeing strict delivery SLAs for enterprise data buyers. We don't pay for idle compute, and we never hit single-node memory ceilings.

Spark Job Metrics

Live telemetry from an active ETL job processing scraped product catalogs.

cluster.state active · 24 nodes
executor.memory 32 GB per node
executor.cores 8 per node192 total
data.skew_factor 1.04balanced
shuffle.spill 0 bytes
task.success_rate 99.9%2 retries

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About distributed processing, PySpark, cluster tuning, and how DataFlirt handles terabyte-scale scraping outputs.

Ask us directly →
Why use Spark instead of Pandas for scraped data? +
Pandas is strictly single-node and requires the entire dataset to fit in RAM. If you have a 64GB machine and a 100GB JSON file, Pandas will crash with an OutOfMemory error. Spark distributes that 100GB across multiple machines and processes it in chunks (partitions), allowing you to scale horizontally to petabytes.
Does Spark actually do the web scraping? +
No. Spark is a data processing engine, not a web crawler. The scraping pipeline (using Playwright, HTTP clients, and proxies) fetches the data and writes raw JSON/HTML to an S3 bucket. Spark then reads that bucket, parses the data, cleans it, and writes it to a data warehouse. Spark is the ETL layer, not the fetch layer.
What is lazy evaluation in Spark? +
When you write a Spark transformation (like filtering or selecting columns), Spark doesn't execute it immediately. Instead, it builds a logical execution plan called a Directed Acyclic Graph (DAG). Execution only happens when you call an "action" (like writing to disk or counting rows). This allows Spark's Catalyst Optimizer to reorganize the query for maximum efficiency before running it.
How do you handle data skew in scraped datasets? +
Data skew happens when one key dominates the dataset (e.g., grouping by "country" when 90% of scraped records are from the US). The executor handling the US partition will run out of memory while others sit idle. We handle this via "salting"—appending a random integer to the skewed key to force Spark to distribute the records across multiple partitions, then aggregating again.
Is Spark cost-effective for small datasets? +
No. Spark has significant overhead: JVM startup, DAG scheduling, and network coordination between the driver and executors. If your scraped dataset is under 10GB, a single-node Python script using Polars or DuckDB will almost always be faster and cheaper. Spark's ROI only materializes when data volume exceeds single-node capacity.
How does DataFlirt integrate Spark with Delta Lake? +
We use Spark as the compute engine to write directly to Delta Lake tables on S3. This gives our scraped datasets ACID transactions, schema enforcement, and time-travel capabilities. If a bad scrape job introduces malformed schemas, Delta Lake rejects the write, and Spark rolls back the transaction, ensuring our clients never query corrupted data.
$ dataflirt scope --new-project --target=apache-spark READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h