
What is Data Partitioning?

Data partitioning is the physical division of a massive dataset into smaller, discrete directories or tables based on the values of one or more columns. In scraping pipelines, it's the difference between scanning 10 terabytes to find yesterday's pricing changes and scanning 10 gigabytes. By aligning storage layout with downstream query patterns, partitioning drastically reduces compute costs and query latency.

Data Engineering · Storage Optimization · Query Performance · Data Lake · ETL
// 02 — definitions

Divide and
conquer.

How physical storage layout dictates query performance, and why dumping scraped JSONs into a single S3 bucket is a ticking time bomb.


TL;DR

Data partitioning splits massive datasets into manageable chunks, typically by date, region, or target domain. It allows query engines like Athena, BigQuery, or Snowflake to skip irrelevant data entirely (partition pruning). Without it, every query becomes a full table scan, so cloud costs grow in step with the total size of your scraped dataset rather than with the slice you actually need.

01 Definition & structure

Data partitioning is the technique of dividing a large logical table into smaller physical parts based on the values of specific columns (the partition keys). Instead of storing 100 million scraped records in one massive file, you store them in separate directories based on scrape_date or domain.

To the analyst querying the data, it looks like one single table. To the query engine, it's a structured directory tree. This physical separation is what lets modern data architectures keep growing without every query slowing down in proportion to total data volume.
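A minimal sketch of how partition-key values turn into a directory tree, assuming Hive-style `key=value` paths. The helper names, the `s3://my-lake/products` prefix, and the record fields are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def partition_path(base, record, keys):
    """Build a Hive-style directory path (key=value/...) for one record."""
    parts = [f"{key}={record[key]}" for key in keys]
    return "/".join([base.rstrip("/")] + parts)


def group_by_partition(base, records, keys):
    """Group records by the directory they would be written to."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(base, record, keys)].append(record)
    return dict(groups)


# Hypothetical scraped records
records = [
    {"scrape_date": "2026-05-19", "domain": "example.com", "price": 10.0},
    {"scrape_date": "2026-05-19", "domain": "example.org", "price": 12.5},
    {"scrape_date": "2026-05-20", "domain": "example.com", "price": 11.0},
]

# One logical table, two physical directories (one per scrape_date)
layout = group_by_partition("s3://my-lake/products", records, ["scrape_date"])
for directory, rows in sorted(layout.items()):
    print(directory, len(rows))
```

The analyst still sees one `products` table; only the writer and the query engine care that the rows live in separate directories.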

02 Partition Pruning

The primary benefit of partitioning is partition pruning. When a user runs a query with a WHERE scrape_date = '2026-05-19' clause, the query engine looks at the partition metadata, identifies the exact directory containing that date, and completely ignores all other directories.

If you have 3 years of daily data (roughly 1,095 partitions), a single-day query lets the engine read about 1/1000th of the total dataset. This reduces I/O, speeds up the query, and drastically lowers compute costs in serverless warehouses that bill by data scanned.
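To make the pruning step concrete, here is a rough sketch of what an engine does conceptually: compare the predicate against partition paths and keep only the matches. Real engines consult catalog metadata rather than scanning strings; `prune_partitions` and the date range are illustrative assumptions.

```python
from datetime import date, timedelta


def prune_partitions(partitions, key, value):
    """Keep only partition directories whose path has a key=value segment."""
    wanted = f"{key}={value}"
    return [p for p in partitions if wanted in p.split("/")]


# Three years of hypothetical daily partitions ending on 2026-05-19
start, end = date(2023, 5, 20), date(2026, 5, 19)
n_days = (end - start).days + 1
all_partitions = [
    f"products/scrape_date={start + timedelta(days=i)}" for i in range(n_days)
]

# Equivalent of: WHERE scrape_date = '2026-05-19'
selected = prune_partitions(all_partitions, "scrape_date", "2026-05-19")
fraction = len(selected) / len(all_partitions)
print(f"{len(selected)} of {len(all_partitions)} partitions read ({fraction:.4%})")
```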

03 Choosing a Partition Key

A good partition key must balance cardinality (number of unique values) and query patterns. If you partition by a boolean field (e.g., is_active), you only get 2 partitions, so each query still scans about half the table. If you partition by product_id, you get millions of partitions, which overwhelms the metadata catalog and leaves each partition with a handful of tiny files.

For scraped data, scrape_date (YYYY-MM-DD) is the gold standard. It naturally groups data by ingestion time, keeps partitions roughly equal in size when scrape volume is steady, and matches how data scientists typically query historical trends.
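A quick cardinality check along these lines can catch a bad key before any data is written. This is a sketch only: the 10,000 ceiling mirrors the per-table guidance in the FAQ further down this page, while the lower bound of 10 and the function name are assumptions made for the example.

```python
def assess_partition_key(values, min_partitions=10, max_partitions=10_000):
    """Classify a candidate key by how many distinct partitions it would create.

    Thresholds are illustrative assumptions, not hard engine limits.
    """
    distinct = len(set(values))
    if distinct < min_partitions:
        return "too_coarse"
    if distinct > max_partitions:
        return "too_fine"
    return "ok"


# Hypothetical column samples
is_active = [True, False] * 1000
product_ids = [f"sku-{i}" for i in range(50_000)]
scrape_dates = [f"2025-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)]

print(assess_partition_key(is_active))     # boolean: only 2 partitions
print(assess_partition_key(product_ids))   # one partition per SKU
print(assess_partition_key(scrape_dates))  # one partition per day
```

Cardinality is only half the test; the key also has to appear in the WHERE clauses people actually write.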

04 How DataFlirt handles it

We treat storage layout as a first-class deliverable. When we push data to a client's S3 or GCS bucket, we automatically partition it using Hive-style directory structures. We monitor the size of the output files; if a high-frequency pipeline is generating 5MB files every hour, our delivery layer buffers them and flushes them as optimal 256MB Parquet blocks.

This ensures that when the client's data engineering team mounts the bucket in Snowflake or Databricks, the data is immediately ready for high-performance querying with zero ETL required on their end.
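The buffer-and-flush idea can be sketched as follows. This is not DataFlirt's delivery layer, just an illustration of the technique under stated assumptions: the `FlushBuffer` class is hypothetical, and a real implementation would write Parquet on flush instead of recording file names.

```python
MB = 1024 * 1024


class FlushBuffer:
    """Accumulate small outputs and flush once a target size is reached."""

    def __init__(self, target_bytes=256 * MB):
        self.target_bytes = target_bytes
        self.pending = []        # names of buffered small outputs
        self.pending_bytes = 0
        self.flushed = []        # (names, total_bytes) per flushed block

    def add(self, name, size_bytes):
        self.pending.append(name)
        self.pending_bytes += size_bytes
        if self.pending_bytes >= self.target_bytes:
            self.flush()

    def flush(self):
        """Emit whatever is buffered as one block (no-op when empty)."""
        if self.pending:
            self.flushed.append((list(self.pending), self.pending_bytes))
            self.pending, self.pending_bytes = [], 0


buffer = FlushBuffer()
for i in range(52):  # hypothetical 5 MB outputs from a high-frequency pipeline
    buffer.add(f"chunk-{i:03d}.json", 5 * MB)

print(len(buffer.flushed), buffer.flushed[0][1] // MB)
```

A production version would also flush on a timer or at partition boundaries so the final partial block is not held back indefinitely.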

05 The Small File Problem

The most common partitioning mistake is creating too many small files. Distributed systems like Hadoop, Spark, and Athena are optimized for reading large, contiguous blocks of data (128MB to 1GB). If your partitions result in 10KB files, the overhead of opening the file, reading the header, and closing it takes longer than reading the actual data.

If your daily scrape volume is very small, partition by month instead of day. Always optimize for file size over partition granularity.
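One way to apply that rule of thumb is to pick the finest granularity whose partitions still reach the minimum file size. The sketch below assumes a 128 MB floor (the lower bound quoted above) and a 30-day month; the function name and the yearly fallback are illustrative.

```python
MB = 1024 * 1024


def choose_granularity(daily_bytes, min_file_bytes=128 * MB):
    """Pick the finest time granularity that still yields large enough files."""
    if daily_bytes >= min_file_bytes:
        return "day"
    if daily_bytes * 30 >= min_file_bytes:  # approximate month
        return "month"
    return "year"


print(choose_granularity(500 * MB))  # busy pipeline
print(choose_granularity(10 * MB))   # small daily volume
print(choose_granularity(1 * MB))    # tiny daily volume
```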

// 03 — the math

How partitioning
saves money.

Cloud data warehouses charge by data scanned. Partition pruning is the primary mechanism for controlling those costs. Here is how DataFlirt models storage efficiency for client delivery.

Data Scanned (Unpartitioned): $S_{\text{total}} = N_{\text{records}} \times \text{AvgRecordSize}$
A full table scan. Cost scales linearly with total historical data. (Standard OLAP cost model)

Data Scanned (Partitioned): $S_{\text{pruned}} = S_{\text{total}} \times \dfrac{\text{Partitions}_{\text{queried}}}{\text{Partitions}_{\text{total}}}$
Querying one day of a 3-year dataset scans ~0.1% of the data. (Partition pruning logic)

Optimal Partition Size: $128\ \text{MB} \le \text{Size} \le 1\ \text{GB}$
Target file size to avoid the 'small file problem' in HDFS/S3. (Parquet/Iceberg best practices)
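The formulas above translate directly into a few lines of arithmetic. The sketch below assumes evenly sized partitions, decimal terabytes, and a rate of $5 per TB scanned; the rate and the 10 TB / 1,000-partition figures are assumptions for illustration, so check your warehouse's current pricing.

```python
GB = 10**9
TB = 10**12


def scanned_bytes(total_bytes, partitions_queried, partitions_total):
    """S_pruned = S_total * (partitions queried / partitions total)."""
    return total_bytes * partitions_queried / partitions_total


def scan_cost(bytes_scanned, usd_per_tb=5.0):
    """Cost of a scan under a flat per-terabyte rate (assumed, not quoted)."""
    return bytes_scanned / TB * usd_per_tb


total = 10 * TB                           # hypothetical unpartitioned table
pruned = scanned_bytes(total, 1, 1000)    # one partition out of ~1,000

print(f"full scan:   {total / TB:.0f} TB -> ${scan_cost(total):.2f}")
print(f"pruned scan: {pruned / GB:.0f} GB -> ${scan_cost(pruned):.2f}")
```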
// 04 — storage layout

Writing scraped data
to an S3 data lake.

A trace of a delivery worker writing a batch of scraped e-commerce records into a Hive-style partitioned S3 bucket.

S3 · Parquet · Hive-style
edge.dataflirt.io — live
CAPTURED
// batch received from extraction layer
records: 4,281,900
schema: "ecommerce_products_v4"

// partition resolution (scrape_date, region)
partition_keys: ["scrape_date=2026-05-19", "region=IN", "region=US"]
shuffle_phase: complete // records grouped by key

// writing to object storage
write: "s3://df-lake/products/scrape_date=2026-05-19/region=IN/part-001.parquet"
file_size: 214 MB // optimal
write: "s3://df-lake/products/scrape_date=2026-05-19/region=US/part-001.parquet"
file_size: 381 MB // optimal

// catalog update
glue_catalog: "ALTER TABLE products ADD PARTITION..."
status: partitions registered
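The catalog step in the trace is elided, so the exact statement is not shown. As one plausible shape, a Hive/Athena-style `ALTER TABLE ... ADD PARTITION` statement can be generated from the partition values like this. The helper is hypothetical and does no SQL escaping; it is a sketch, not a production DDL builder.

```python
def add_partition_ddl(table, location_root, partition):
    """Render a Hive-style ADD PARTITION statement for one partition.

    `partition` maps key -> value in directory order, e.g.
    {"scrape_date": "2026-05-19", "region": "IN"}.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{root}/{path}/'"
    )


ddl = add_partition_ddl(
    "products",
    "s3://df-lake/products",
    {"scrape_date": "2026-05-19", "region": "IN"},
)
print(ddl)
```

`IF NOT EXISTS` keeps the statement safe to re-run when a delivery job retries.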
// 05 — failure modes

Where partitioning
goes wrong.

Partitioning is a double-edged sword. Choosing the wrong key or granularity creates severe performance bottlenecks. These are the most common architectural mistakes we see in client data lakes.

PIPELINES AUDITED: 150+
AVG COST REDUCTION: 68%
UPDATED: 2026-05-19
01 Over-partitioning (Small Files)

High overhead · Partitioning by hour/minute creates thousands of 10KB files, choking the catalog.

02 Data Skew

Uneven compute · Partitioning by domain where one domain is 90% of the data causes straggler tasks.

03 Wrong Partition Key

Full scans · Partitioning by scrape_date but querying by product_id forces full table scans.

04 Unregistered Partitions

Missing data · Files exist in S3 but aren't added to the Hive/Glue catalog, making them invisible.

05 Deep Nesting

Path limits · Too many partition levels (date/country/category/brand) multiply the partition count, fragment data into small files, and push object keys toward S3's 1,024-byte key length limit.
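Two of these failure modes, small files and skew, are easy to spot from partition sizes alone. The audit below is a sketch with assumed thresholds (128 MB for "small", more than half of all bytes for "skewed"); the function and the sample sizes are hypothetical.

```python
MB = 1024 * 1024


def audit_partitions(sizes, small_bytes=128 * MB, skew_share=0.5):
    """Flag partitions that are too small or hold a dominant share of the data."""
    total = sum(sizes.values())
    small = sorted(name for name, size in sizes.items() if size < small_bytes)
    skewed = sorted(
        name for name, size in sizes.items() if total and size / total > skew_share
    )
    return {"small": small, "skewed": skewed}


# Hypothetical partition sizes for a table partitioned by domain
sizes = {
    "domain=a.com": 9000 * MB,  # roughly 90% of the data
    "domain=b.com": 600 * MB,
    "domain=c.com": 400 * MB,
    "domain=d.com": 10 * 1024,  # a 10 KB partition
}
report = audit_partitions(sizes)
print(report)
```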
// 06 — DataFlirt's architecture

Write optimized,

read pruned.

At DataFlirt, we deliver scraped data directly into client data lakes. We default to Hive-style partitioning by scrape_date and target_domain, writing compressed Parquet files sized between 128MB and 512MB. For high-frequency pipelines, we run background compaction jobs to merge small intra-day files into optimal daily partitions, ensuring downstream analytics teams never inherit a small-file problem.

Delivery Partition Config

Standard partition configuration for a daily e-commerce scraping pipeline.

format: Apache Parquet (snappy)
partition_style: Hive (key=value)
level_1_key: scrape_date (YYYY-MM-DD)
level_2_key: domain
target_file_size: 256 MB
compaction: daily at 00:00 UTC
catalog_sync: AWS Glue
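The compaction step can be planned with a simple greedy pass: collect the files below the target size in a partition and pack them into merge groups that stay at or under the target. This is a sketch of the general technique, not DataFlirt's compaction job; the function name and the hourly file names are assumptions.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_bytes=256 * MB):
    """Group small files into merge batches no larger than target_bytes.

    Files already at or above the target are left alone, and groups that
    would contain a single file are dropped (nothing to merge).
    """
    small = sorted(
        (size, name) for name, size in file_sizes.items() if size < target_bytes
    )
    groups, current, current_bytes = [], [], 0
    for size, name in small:
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


# Hypothetical daily partition: 24 hourly 20 MB files plus one large file
files = {f"hour-{h:02d}.parquet": 20 * MB for h in range(24)}
files["backfill.parquet"] = 300 * MB

plan = plan_compaction(files)
for group in plan:
    print(len(group), "files ->", sum(files[n] for n in group) // MB, "MB")
```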

// 07 — FAQ

Common
questions.

Common questions about partition strategies, file sizes, and managing scraped data at scale.

What is the 'small file problem'?
When you partition too granularly (e.g., by hour or by a high-cardinality ID like product SKU), you generate millions of tiny files. Query engines like Spark or Athena spend more time reading file metadata and opening network connections than actually processing data. The fix is background compaction: merging small files into 128MB+ chunks.
Should I partition by scrape date or publication date?
Almost always partition by scrape_date (ingestion time). Scrape date is immutable and monotonically increasing, making it perfect for append-only data lakes. Publication date can change, be null, or require backfilling, which forces you to rewrite historical partitions and complicates pipeline logic.
How does Hive-style partitioning work?
Hive-style partitioning encodes the partition key and value directly in the directory path, like s3://bucket/data/year=2026/month=05/. This allows query engines to infer the partition schema from the file path alone, making partition discovery and pruning highly efficient without needing to read the files themselves.
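Inferring the partition schema from a path is simple enough to sketch in a few lines. The parser below is an illustration, not what any particular engine runs: it treats every `key=value` path segment as a partition column, does not URL-decode values, and would be confused by a file name that itself contains `=`.

```python
def parse_hive_partitions(path):
    """Extract partition columns from key=value segments of a path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


print(parse_hive_partitions("s3://bucket/data/year=2026/month=05/part-001.parquet"))
```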
What happens if I don't partition my scraped data?
Initially, nothing. But as your dataset grows to terabytes, every query will scan the entire dataset. In AWS Athena or Google BigQuery, where on-demand queries are billed per terabyte scanned, a simple SELECT count(*) filtered to a single day could cost around $50 instead of $0.05 (for example, a 10 TB full scan versus a 10 GB partition at roughly $5 per terabyte). Partitioning is mandatory for cost control at scale.
How does DataFlirt handle partition updates for late-arriving data?
If a scrape job fails and retries the next day, we write the data to the original scrape_date partition, not the execution date. We use Apache Iceberg for clients who need ACID transactions, allowing us to safely upsert late-arriving records without locking the partition for readers.
Is there a limit to how many partitions I can have?
Yes. S3 itself places no practical limit on the number of prefixes, but the metadata catalog (like AWS Glue or Hive Metastore) slows down badly once a table has hundreds of thousands of partitions, and query planning times spike. Keep total partitions under 10,000 per table by avoiding high-cardinality partition keys.
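Because partition levels multiply, it is worth estimating the count before committing to a layout. A minimal sketch, assuming every combination of values actually occurs; the 10,000 budget is the guideline from the answer above, and the region and domain counts are made up.

```python
import math


def estimate_partition_count(days, values_per_level=()):
    """Upper bound on partitions: days times the cardinality of each extra level."""
    return days * math.prod(values_per_level)


BUDGET = 10_000
three_years = 3 * 365

by_date_region = estimate_partition_count(three_years, [5])       # 5 regions
by_date_region_domain = estimate_partition_count(three_years, [5, 20])

print(by_date_region, by_date_region <= BUDGET)
print(by_date_region_domain, by_date_region_domain <= BUDGET)
```

When the estimate blows the budget, drop a level or coarsen the date key (for example, month instead of day) rather than adding more keys.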