← Glossary / Data Sharding

What is Data Sharding?

Data sharding is the architectural practice of horizontally partitioning a single logical dataset across multiple independent database nodes. For scraping pipelines generating terabytes of raw HTML and structured JSON daily, a monolithic database quickly hits I/O and compute ceilings. Sharding distributes the write-heavy load of continuous ingestion and the read-heavy load of downstream analytics, ensuring that a spike in crawl concurrency doesn't cause a database lockup that cascades back to the scraping fleet.

Data EngineeringHorizontal ScalingDistributed SystemsHigh ThroughputDatabase Architecture
// 02 — definitions

Split the
load.

Why writing 50 million scraped records a day to a single Postgres instance is a recipe for pipeline collapse, and how distributed storage solves it.

Ask a DataFlirt engineer →

TL;DR

Data sharding breaks a massive database into smaller, manageable pieces called shards, each hosted on a separate server. It's the primary scaling mechanism for high-throughput scraping operations, allowing write capacity to scale linearly by routing incoming records to specific nodes based on a shard key like target domain or timestamp.

01Definition & structure
Data sharding is a database architecture pattern where a single logical dataset is broken into smaller, independent chunks called shards. Each shard is hosted on a separate physical or virtual server. Unlike replication (which copies the same data everywhere), sharding distributes unique rows across the cluster. A routing layer sits in front of the database nodes, inspecting incoming queries and directing them to the correct shard based on a deterministic shard key.
02How it works in practice
When a scraping pipeline extracts a record, it sends an INSERT query to the database proxy. The proxy evaluates the record's shard key (e.g., a hash of the URL), calculates which node owns that hash range, and forwards the write. For reads, if the query specifies the URL, the proxy fetches it directly from the owning node. If the query asks for "all records from yesterday," the proxy broadcasts the query to all nodes, waits for their partial results, merges them, and returns the final dataset to the client.
03The hotspot problem
The most critical decision in sharding is the shard key. If you shard by country, and 80% of your scraped targets are in the US, the node holding the US shard will be overwhelmed while the others sit idle. This is called a hotspot. Effective sharding requires a key with high cardinality and uniform distribution, which is why composite keys (like hash(domain_id + timestamp)) are standard practice in data engineering.
04How DataFlirt handles it
We abstract the storage complexity entirely. Our ingestion pipelines write to a distributed message queue (Kafka), which acts as a buffer. Dedicated consumer workers then batch-insert records into our sharded ClickHouse clusters using consistent hashing. This architecture allows our scraping fleet to operate at maximum concurrency without ever waiting on database locks, while ensuring client data is immediately available for low-latency analytical queries.
05Did you know?
The term "shard" in the context of databases was popularized by the MMORPG Ultima Online in the late 1990s. To handle player load, the developers created parallel copies of the game world and justified it in the game's lore by saying the world had been shattered into "shards" by an evil wizard. The terminology stuck and became a foundational concept in distributed systems.
// 03 — the routing math

How records find
their shard.

The shard key determines data distribution. A poor key creates hotspots where one node takes 90% of the writes. DataFlirt uses consistent hashing to ensure uniform distribution across our storage clusters.

Hash-based routing = Shard_ID = hash(shard_key) % N
Distributes load evenly, but makes cluster resizing complex when N changes. Standard distributed systems
Consistent hashing = Node = ring_lookup(hash(shard_key))
Minimizes data movement when adding or removing database nodes. Dynamo paper, 2007
Write throughput = Wtotal = Σ (Wshard_i) − network_overhead
Linear scaling of write capacity, bounded only by the routing proxy. DataFlirt cluster sizing model
// 04 — shard router trace

Routing 10k inserts/sec
across the cluster.

A live trace from DataFlirt's ingestion gateway, routing incoming scraped product records to a 16-node ClickHouse cluster using a composite shard key.

ClickHousehash routingwrite-heavy
edge.dataflirt.io — live
CAPTURED
// inbound batch
batch.size: 5,000 records
batch.source: "pipeline-ecommerce-IN"

// shard key evaluation
key.composite: [domain_id, crawl_timestamp]
hash.algorithm: "cityhash64"
router.strategy: "consistent_hash_ring"

// distribution map
shard_04: 1,240 records // hash range 0x3...
shard_07: 1,180 records // hash range 0x5...
shard_11: 1,310 records // hash range 0x9...
shard_15: 1,270 records // hash range 0xE...

// execution
node_04.status: 200 OK latency: 12ms
node_07.status: 200 OK latency: 14ms
node_11.status: 200 OK latency: 11ms
node_15.status: 200 OK latency: 15ms

// commit
transaction.state: COMMITTED
throughput.current: 14,200 rows/sec
// 05 — sharding failure modes

Where distributed
storage breaks.

Ranked by frequency of occurrence in large-scale scraping infrastructure. The hardest part of sharding isn't splitting the data — it's querying it back together without scanning every node.

CLUSTERS MONITORED ·  ·   300+ active
WINDOW ·  ·  ·  ·  ·  ·   30d trailing
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Hotspotting (uneven distribution)

% of incidents · Shard key skew causes one node to hit 100% CPU
02

Cross-shard joins

% of incidents · Querying across nodes causes massive network I/O
03

Cluster resizing overhead

% of incidents · Rebalancing data when adding nodes takes hours
04

Distributed transaction failure

% of incidents · Partial commits leave data in an inconsistent state
05

Router bottleneck

% of incidents · The proxy directing traffic becomes a single point of failure
// 06 — DataFlirt's storage layer

Write locally,

query globally.

DataFlirt handles petabytes of scraped data by decoupling compute from storage and using aggressive time-based and domain-based sharding. We don't force clients to worry about shard keys. Our ingestion layer automatically routes raw HTML to blob storage and structured records to a sharded columnar database. When you query your dataset via our API, a distributed query engine pushes the compute down to the specific shards holding your data, returning aggregated results in milliseconds.

Cluster health metrics

Live telemetry from a dedicated client storage cluster.

cluster.nodes 16 active
total.volume 42.8 TB
write.throughput 18,500 rows/sec
shard.skew 4.2%
cross_shard.queries 12%
rebalance.status idle
node_09.cpu 88%

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About database scaling, shard key selection, query performance, and how DataFlirt manages distributed storage for scraping pipelines.

Ask us directly →
What is the difference between sharding and partitioning? +
Partitioning splits a large table into smaller tables within the same database instance (e.g., by month). Sharding splits the data across multiple independent database servers. Partitioning helps with query performance and maintenance; sharding scales raw compute and I/O capacity.
How do I choose a good shard key for scraped data? +
A good shard key has high cardinality and an even distribution. For scraping, target_domain is often a terrible key because you might scrape 1M pages from Amazon and 100 from a local shop, creating a massive hotspot. A composite key like hash(target_domain + product_id) ensures uniform distribution.
Does sharding make querying slower? +
It depends on the query. If your query includes the shard key, the router sends it to exactly one node — it's incredibly fast. If your query doesn't include the shard key (a "scatter-gather" query), the router must query every shard and aggregate the results, which introduces network latency and overhead.
How does DataFlirt handle cross-shard joins? +
We avoid them at the ingestion layer by denormalizing scraped data before it hits the sharded cluster. Every record is written as a wide, flat row containing all necessary context. If joins are absolutely required for downstream analytics, we use distributed query engines like Trino to handle the shuffle efficiently.
What happens if a shard goes down during a scrape? +
High-availability sharded clusters use replication. Each shard has a primary node and at least one replica. If a primary fails, the router automatically fails over to the replica. The scraping pipeline's retry queue holds any in-flight writes during the 2-3 second failover window, ensuring zero data loss.
When should I actually implement sharding? +
As late as possible. Modern single-node databases like Postgres can handle terabytes of data and thousands of writes per second with proper tuning and NVMe drives. Shard only when you have exhausted vertical scaling (buying a bigger server) and your write throughput or storage volume physically exceeds single-node limits.
$ dataflirt scope --new-project --target=data-sharding READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h