← Glossary / PySpark

What is PySpark?

PySpark is the Python API for Apache Spark, an open-source distributed computing framework used to process massive datasets across a cluster of machines. In scraping pipelines, it's the engine that takes over after extraction — handling deduplication, schema validation, and complex joins across hundreds of millions of records where single-node tools like Pandas would immediately run out of memory.

Data EngineeringDistributed ETLBig DataLazy EvaluationApache Spark
// 02 — definitions

Beyond
single-node.

When your daily scrape yields 500GB of JSON, you can't process it on one machine. PySpark distributes the workload across a fleet of workers.

Ask a DataFlirt engineer →

TL;DR

PySpark allows data engineers to write Python code that executes in parallel across a cluster of worker nodes. It uses lazy evaluation to optimize execution plans before running them, making it the industry standard for transforming raw scraped data into clean, query-ready Parquet or Iceberg tables at scale.

01Definition & structure

PySpark is the Python interface for Apache Spark. It allows developers to write Python code that is translated into JVM instructions and executed across a distributed cluster of machines. The core abstraction is the DataFrame — conceptually similar to a Pandas DataFrame, but distributed across many nodes.

A PySpark application consists of a Driver program (which runs the main function and creates the SparkContext) and multiple Executors (worker nodes that run the actual tasks and store data in memory or on disk).

02Transformations vs Actions

PySpark operations are strictly divided into two categories:

  • Transformations (e.g., select, filter, join): These are lazy. They don't compute anything immediately; they just add a step to the execution plan.
  • Actions (e.g., count, collect, write): These trigger the actual execution of the DAG.

This separation is what makes PySpark fast. If you filter a billion-row dataset and then ask for the top 10 rows, Spark's optimizer knows it doesn't need to process the entire dataset — it just needs to find 10 rows that match the filter.

03The Shuffling Problem

Operations like map or filter are "narrow" — they can be computed on a single partition without talking to other nodes. Operations like groupBy or join are "wide" — they require data with the same key to be on the same node. Moving this data across the network is called a Shuffle.

Shuffling is the most expensive operation in PySpark. It involves disk I/O, data serialization, and network I/O. Optimizing a PySpark job almost always comes down to minimizing the amount of data being shuffled.

04How DataFlirt uses it

We use PySpark as the heavy-lifting engine for our post-scrape ETL pipelines. When a distributed crawl finishes, it leaves behind thousands of raw JSONL files in S3. We spin up an ephemeral EMR or Databricks cluster, run a PySpark job to read the raw files, enforce schema contracts, drop duplicates, and merge the new data into a historical Delta Lake table.

By decoupling the scraping infrastructure (which is I/O bound and needs proxies) from the processing infrastructure (which is CPU/Memory bound), we can scale both independently and keep costs strictly proportional to volume.

05Did you know: UDFs kill performance

Writing a custom Python function and applying it to a PySpark DataFrame via a User Defined Function (UDF) is a common anti-pattern. Because Spark runs on the JVM, using a Python UDF forces Spark to serialize the data, send it to a Python process, run the function, and serialize it back to the JVM for every single row.

This serialization overhead can make a job 10x to 100x slower. Always use native PySpark SQL functions (pyspark.sql.functions) whenever possible, as they execute directly in the JVM.

// 03 — cluster sizing

How much compute
do you need?

Sizing a PySpark cluster for daily ETL jobs requires balancing memory per executor against the volume of shuffled data. Here is how DataFlirt provisions transient clusters for post-scrape processing.

Total Executor Memory = M = data_size × 3
Rule of thumb: allocate 3x the raw data size to account for deserialization and shuffle overhead. Spark Tuning Guidelines
Target Partition Count = P = data_size_mb / 128
Aim for ~128MB per partition to align with HDFS/S3 block sizes and prevent OOMs. DataFlirt ETL SLO
DataFlirt Job Cost = C = (worker_nodes × hourly_rate) × job_duration
Ephemeral clusters mean you only pay for the minutes the transformation is actively running. Internal FinOps model
// 04 — the execution plan

Deduplicating 40M
records in 3 minutes.

A PySpark job trace processing a daily e-commerce scrape. The job reads raw JSONL from S3, drops duplicates based on URL and timestamp, and writes partitioned Parquet.

Spark 3.4Lazy EvaluationParquet Output
edge.dataflirt.io — live
CAPTURED
// spark-submit --deploy-mode cluster etl_job.py
INFO SparkContext: Running Spark version 3.4.1
INFO DAGScheduler: Registering RDD 1 (read JSON)

// lazy evaluation: building the physical plan
plan.step_1: FileScan json s3a://df-raw/2026-05-19/
plan.step_2: HashAggregate(keys=[url#12, timestamp#15]) // wide transformation
plan.step_3: Project [url#12, price#18, stock#19]

// action triggered: execution begins
Stage 0: [====================> ] 85% (170/200 tasks)
Stage 1: [> ] 2% (4/200 tasks) // shuffle read

// job completion
records.read: 42,105,992
records.written: 39,881,004 // 2.2M duplicates dropped
output.path: s3a://df-clean/products/dt=2026-05-19/
job.status: SUCCEEDED in 184s
// 05 — failure modes

Why PySpark jobs
crash and burn.

Distributed computing introduces distributed failure modes. Ranked by frequency across DataFlirt's daily ETL pipelines, these are the most common reasons a Spark job fails mid-flight.

PIPELINES MONITORED ·   140+ active
AVG DATA VOLUME ·  ·  ·   1.2 TB / job
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Executor OOM (Out of Memory)

% of failures · Partitions too large or data skew overloading one node
02

Driver OOM

% of failures · Calling .collect() on a massive DataFrame
03

Shuffle Fetch Failed

% of failures · Network timeout during wide transformations (joins/groupBys)
04

Data Skew

% of failures · One task takes 10x longer than the rest, stalling the stage
05

UDF Serialization Error

% of failures · Passing unpicklable Python objects to worker nodes
// 06 — our architecture

Process where the data lands,

not where it was scraped.

DataFlirt separates the scraping runtime from the data processing runtime. Scrapers write raw, unvalidated JSONL directly to S3. Once a crawl finishes, an Airflow DAG spins up an ephemeral PySpark cluster. The cluster reads the raw data, applies schema validation, deduplicates against historical records, and writes the final dataset to Delta Lake. Compute is only paid for while the data is actively transforming.

spark-submit --deploy-mode cluster

Live metrics from a post-scrape ETL job on a real estate pipeline.

job.id etl-re-IN-042
cluster.size 1 Driver, 8 Executors
memory.per_node 32 GBmemory-optimized
data.input 184 GB JSONL
shuffle.read 42 GBspill-free
data.output 22 GB Parquetsnappy
job.status COMPLETED · 14m 22s

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About PySpark performance, memory management, alternatives, and how DataFlirt handles massive datasets.

Ask us directly →
Why use PySpark instead of Pandas? +
Pandas is restricted to a single machine and requires the entire dataset to fit in RAM. If you have 64GB of RAM, a 70GB CSV will crash your script. PySpark distributes the data across multiple machines (or spills gracefully to disk) and processes it in chunks, allowing you to handle terabytes or petabytes of data without memory limits.
What is lazy evaluation in PySpark? +
When you apply transformations in PySpark (like .filter() or .select()), it doesn't execute them immediately. Instead, it builds a Directed Acyclic Graph (DAG) of the operations. Execution only happens when you call an "action" (like .write() or .count()). This allows Spark's Catalyst Optimizer to reorganize and optimize the query plan before running it.
How do you handle data skew in scraped datasets? +
Data skew happens when one partition is vastly larger than others — for example, grouping by "category" when 80% of products are "Electronics". We handle this by salting the keys (adding a random integer to the key before grouping) to distribute the heavy partition across multiple workers, then performing a second aggregation to combine the salted results.
Do I need PySpark if I'm only scraping 100,000 pages a day? +
Probably not. For 100,000 records, a single-node tool like Pandas, Polars, or DuckDB is faster, cheaper, and easier to maintain. PySpark's overhead (JVM startup, network shuffling) makes it slower for small datasets. We typically introduce PySpark when daily volumes cross the 10-50 million record threshold.
How does DataFlirt deliver data processed by PySpark? +
Our PySpark jobs typically write the final output to S3 or GCS in columnar formats like Apache Parquet or as Delta Lake tables. This allows clients to query the data directly using Athena, BigQuery, or Snowflake without needing to load it into a traditional relational database first.
What's the difference between PySpark and Apache Flink? +
PySpark is fundamentally a batch processing engine (Spark Structured Streaming is just micro-batching). Apache Flink is a true event-driven stream processing engine. We use PySpark for daily or hourly pipeline runs, and Flink for pipelines that require sub-second latency (like live odds or spot pricing feeds).
$ dataflirt scope --new-project --target=pyspark READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h