
What is DynamoDB?

DynamoDB is a fully managed, serverless NoSQL database provided by AWS, engineered for single-digit millisecond latency at virtually any scale. In scraping infrastructure, it acts as a high-throughput ingestion buffer, a deduplication ledger, or a distributed URL frontier. Because it scales horizontally based on partition keys, getting the schema wrong does more than degrade query performance: it creates hot partitions that throttle your writes, and any worker that does not retry throttled requests loses scraped records.

AWS · NoSQL · Key-Value Store · Ingestion Buffer · Serverless

// 02 — definitions

Scale without servers.

A managed key-value and document database that trades complex querying capabilities for absolute, predictable performance at massive write volumes.


TL;DR

DynamoDB is AWS's flagship NoSQL database. It requires you to define your access patterns before you write data, using Partition Keys and Sort Keys. For scraping pipelines, it's ideal for state tracking and deduplication, but terrible for analytical queries, aggregations, or full-text search.

01 Definition & structure

DynamoDB is a fully managed NoSQL database service that supports key-value and document data structures. Data is organized into tables, items (rows), and attributes (columns). Unlike relational databases, DynamoDB is schema-less, except for the primary key. The primary key can be a simple Partition Key (used to distribute data across physical nodes) or a composite Partition Key and Sort Key (used to group and order data within a partition).
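As a rough illustration of the composite key described above, the sketch below builds the arguments for a table whose partition key is a URL hash and whose sort key is a fetch timestamp. The table and attribute names (`example_crawl_results`, `url_hash`, `fetched_at`) are hypothetical; only the key attributes are declared, since everything else is schema-less.

```python
def build_table_definition(table_name: str) -> dict:
    """Return keyword arguments shaped for boto3's DynamoDB create_table call.

    Attribute names are illustrative assumptions, not a real schema.
    """
    return {
        "TableName": table_name,
        "KeySchema": [
            # HASH = partition key: decides which partition stores the item.
            {"AttributeName": "url_hash", "KeyType": "HASH"},
            # RANGE = sort key: orders items that share a partition key.
            {"AttributeName": "fetched_at", "KeyType": "RANGE"},
        ],
        # Only key attributes are declared; other attributes need no schema.
        "AttributeDefinitions": [
            {"AttributeName": "url_hash", "AttributeType": "S"},
            {"AttributeName": "fetched_at", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


definition = build_table_definition("example_crawl_results")
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("dynamodb").create_table(**definition)`.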

02 How it works in practice

When you write an item, DynamoDB hashes the partition key to determine which partition, and therefore which physical server, stores the data. This architecture lets a table scale horizontally by adding partitions as data volume and traffic grow. The trade-off is that you cannot perform SQL-like JOIN operations or complex aggregations. You must know exactly how you intend to read the data before you design the table, often relying on Global Secondary Indexes (GSIs) to support alternative access patterns.

03 The hot partition problem

The most common failure mode in DynamoDB is the "hot partition." If your partition key has low cardinality (for example, a status field like "PENDING" or a date like "2026-05-19"), all write traffic is routed to a single partition. Each partition has a hard limit of 1,000 Write Capacity Units (WCUs) per second. Once you hit that limit, DynamoDB throttles your requests, even if your table is provisioned for 100,000 WCUs overall.
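The effect can be seen in a toy simulation. DynamoDB's real hash function and partition map are internal, so the sketch below substitutes MD5 and a fixed count of 8 partitions purely as stand-ins; only the shape of the result matters.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a bucket (MD5 is a stand-in for DynamoDB's internal hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def write_distribution(keys, num_partitions: int = 8) -> Counter:
    """Count how many writes land on each simulated partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


urls = [f"https://example.com/page/{i}" for i in range(10_000)]

# Low cardinality: every write carries the same partition key.
low = write_distribution(["PENDING"] * len(urls))

# High cardinality: the partition key is the SHA-256 of each URL.
high = write_distribution(hashlib.sha256(u.encode("utf-8")).hexdigest() for u in urls)

print("status key:  ", len(low), "partition(s) used, busiest gets", max(low.values()))
print("url-hash key:", len(high), "partition(s) used, busiest gets", max(high.values()))
```

With the status key, every write lands on one simulated partition; with the URL hash, the same 10,000 writes spread across all of them.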

04 How DataFlirt handles it

We use DynamoDB primarily as a distributed state machine and deduplication ledger. Our partition keys are always high-entropy cryptographic hashes (e.g., SHA-256 of the target URL). We strictly enforce the "claim check" pattern: large payloads go to S3, and DynamoDB only holds the S3 URI and pipeline metadata. This ensures our item sizes stay under 1KB, maximizing our WCU efficiency and preventing throttling during high-concurrency crawls.
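A minimal sketch of what a claim-check item could look like, assuming a hypothetical bucket name and S3 key layout (`raw/<hash>.html`). It is not DataFlirt's actual schema; the size estimate is a rough approximation that sums attribute names and string values.

```python
import hashlib


def build_claim_check_item(url: str, bucket: str, status: str = "FETCHED") -> dict:
    """Build a small metadata item; the raw payload is assumed to live in S3."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return {
        "url_hash": url_hash,  # high-entropy partition key
        "url": url,
        "status": status,
        # Hypothetical key layout; only the pointer is stored, not the HTML.
        "s3_uri": f"s3://{bucket}/raw/{url_hash}.html",
    }


def approx_item_bytes(item: dict) -> int:
    """Rough item size: UTF-8 length of each attribute name plus its value."""
    return sum(len(k.encode("utf-8")) + len(str(v).encode("utf-8")) for k, v in item.items())


item = build_claim_check_item("https://example.com/products/42", "example-bucket")
print(approx_item_bytes(item), "bytes (approx.)")
```

Keeping the estimate under 1,024 bytes means each write consumes a single write unit.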

05 Did you know: DynamoDB Streams

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any table that has a stream enabled. In a scraping pipeline, you can use Streams to trigger an AWS Lambda function every time a new record is inserted. This creates a powerful, event-driven Change Data Capture (CDC) architecture, allowing you to push newly scraped data to downstream analytical stores like Snowflake or Elasticsearch in near real time, without polling.
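A Lambda function subscribed to a stream receives batches of records, each carrying an `eventName` and, when the stream view type includes new images, a `NewImage` in DynamoDB's typed attribute format. The handler below is a simplified sketch: it keeps only `INSERT` events and decodes only string (`S`) and number (`N`) attributes. Forwarding to a downstream store is left out.

```python
def _plain(attr: dict):
    """Convert one typed attribute value, e.g. {"S": "abc"} or {"N": "42"}."""
    type_tag, raw = next(iter(attr.items()))
    if type_tag == "N":
        # Numbers arrive as strings in stream records.
        return int(raw) if raw.lstrip("-").isdigit() else float(raw)
    return raw  # other types are passed through undecoded in this sketch


def handler(event, context=None):
    """Collect newly inserted items from a DynamoDB Streams batch."""
    inserted = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # skip MODIFY and REMOVE events
        image = record["dynamodb"].get("NewImage", {})
        inserted.append({name: _plain(value) for name, value in image.items()})
    # A real pipeline would forward `inserted` downstream here.
    return inserted
```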

// 03 — capacity math

How throughput is provisioned.

DynamoDB bills on throughput and storage. Throughput is measured in Capacity Units. Understanding this math is critical to preventing a high-volume scraping pipeline from generating a five-figure AWS bill.

Write Capacity Units: WCU = items_per_sec × ceil(item_size_KB / 1)

1 WCU = 1 write per second for an item up to 1 KB. A 2.5 KB item costs 3 WCUs. (Source: AWS DynamoDB Pricing Model)

Read Capacity Units: RCU = reads_per_sec × ceil(item_size_KB / 4)

This applies to strongly consistent reads; eventually consistent reads cost half as much. (Source: AWS DynamoDB Pricing Model)

Hard Partition Limit: Max_Throughput = 1,000 WCU and 3,000 RCU per second

The maximum throughput a single partition, and therefore a single partition key value, can sustain before throttling. (Source: AWS Service Quotas)
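The formulas above translate directly into a small calculator. This is a sketch for back-of-envelope estimates; it assumes 1 KB = 1,024 bytes and ignores transactional writes, which cost more.

```python
import math


def write_capacity_units(items_per_sec: int, item_size_bytes: int) -> int:
    """WCU = items_per_sec x ceil(item_size_KB / 1), with a minimum of 1 unit per item."""
    per_item = max(1, math.ceil(item_size_bytes / 1024))
    return items_per_sec * per_item


def read_capacity_units(reads_per_sec: int, item_size_bytes: int,
                        strongly_consistent: bool = True) -> int:
    """RCU = reads_per_sec x ceil(item_size_KB / 4); eventually consistent reads cost half."""
    per_item = max(1, math.ceil(item_size_bytes / 4096))
    units = reads_per_sec * per_item
    return units if strongly_consistent else math.ceil(units / 2)


# The 2.5 KB example from the text: one write per second costs 3 WCUs.
print(write_capacity_units(1, 2560))
```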

// 04 — write operations

A hot partition throttling event.

A trace of a scraping worker attempting to batch-write extracted records to DynamoDB. The schema uses a timestamp as the partition key, causing all concurrent workers to hit the exact same physical partition.

BatchWriteItem · WCU limits · Throttling

edge.dataflirt.io — live

CAPTURED

// batch write initiation
table: "df_extracted_records_v2"
operation: "BatchWriteItem"
items_count: 25
partition_key: "2026-05-19T14:00" // anti-pattern ⚠

// network request
req.bytes: 18,420
req.latency: 42ms

// response
status: 400 Bad Request
error.code: "ProvisionedThroughputExceededException"
error.message: "The level of configured provisioned throughput for the table was exceeded."

// sdk automatic retry
retry.attempt: 1
retry.delay: 50ms // exponential backoff
unprocessed_items: 25
pipeline.status: backpressure applied
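The retry step in the trace can be sketched as a loop that resubmits whatever `BatchWriteItem` reports back under `UnprocessedItems`, waiting with exponential backoff and jitter between attempts. The function is duck-typed: `client` is assumed to expose `batch_write_item(RequestItems=...)` as boto3's DynamoDB client does. The attempt count and delay are illustrative defaults, and the case where the whole batch is rejected with an exception is left to the SDK's own retry logic.

```python
import random
import time


def batch_write_with_backoff(client, request_items: dict, max_attempts: int = 5,
                             base_delay: float = 0.05, sleep=time.sleep) -> dict:
    """Retry unprocessed items with exponential backoff plus jitter.

    Returns the items still unprocessed after max_attempts ({} on full success),
    so the caller can apply backpressure or requeue them.
    """
    pending = request_items
    for attempt in range(max_attempts):
        response = client.batch_write_item(RequestItems=pending)
        pending = response.get("UnprocessedItems", {})
        if not pending:
            return {}
        if attempt < max_attempts - 1:
            # Delay ceiling doubles each attempt: 50 ms, 100 ms, 200 ms, ...
            ceiling = base_delay * (2 ** attempt)
            sleep(random.uniform(0, ceiling))
    return pending
```

Backoff only spaces the retries out. If every worker writes to the same partition key, the throttling returns as soon as the load does; the fix is the key design, not the retry policy.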

// 05 — failure modes

Where DynamoDB pipelines break.

Ranked by frequency of occurrence in scaling data pipelines. DynamoDB rarely fails at the infrastructure level; it fails because the schema design clashes with the physical realities of distributed storage.

COMMON ERROR: Throttling

COST DRIVER: Item Size

UPDATED: 2026-05-19

01 Hot partitions · throughput killer · Low-cardinality values such as statuses or coarse timestamps as partition keys

02 Large item sizes · cost multiplier · Storing raw HTML instead of S3 pointers

03 Scan operations · latency spike · Treating it like SQL and scanning the table

04 GSI write amplification · cost multiplier · Too many Global Secondary Indexes

05 Connection exhaustion · client side · Not reusing HTTP keep-alive in the AWS SDK

// 06 — our architecture

Fast ingestion, offloaded storage.

We never store raw scraped HTML in DynamoDB. Writes are billed in 1 KB units, so a 100 KB payload consumes 100 write request units per write; at $1.25 per million write request units (rates vary by region), that destroys unit economics and burns through WCU budgets. Instead, we use the claim check pattern: raw payloads are streamed directly to S3, and DynamoDB stores only the metadata, extraction status, and the S3 URI. This keeps item sizes under 1 KB, maximizing write throughput while keeping the pipeline state accessible to every worker in single-digit milliseconds.
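To put numbers on that, here is a back-of-envelope comparison for a 10M-page crawl. It uses the $1.25 per million write request units quoted above as a default parameter and covers DynamoDB write charges only; S3 storage and request costs for the claim-check variant are not included, and actual rates depend on region and date.

```python
import math


def on_demand_write_cost(num_writes: int, item_size_bytes: int,
                         usd_per_million_wru: float = 1.25) -> float:
    """Estimate on-demand write cost in USD; each 1 KB (or part) of an item is one WRU."""
    wru_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return num_writes * wru_per_write * usd_per_million_wru / 1_000_000


raw_html = on_demand_write_cost(10_000_000, 100 * 1024)  # 100 KB payload per page
claim_check = on_demand_write_cost(10_000_000, 340)      # 340-byte metadata item

print(f"raw HTML in DynamoDB: ${raw_html:,.2f}")
print(f"claim check metadata: ${claim_check:,.2f}")
```

Under these assumptions the raw-HTML design costs 100 times more in write charges, because every write is billed as 100 units instead of one.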

dynamo_ingest_state

Live metrics from a URL frontier table managing a 10M-page crawl.

table.name: df_url_frontier_prod
billing.mode: PAY_PER_REQUEST (on-demand)
item.avg_size: 340 bytes (< 1 KB)
wcu.consumed: 4,200 / sec
throttled_requests: 0.00%
stream.status: ENABLED (triggering CDC)


// 07 — FAQ

Common questions.

About NoSQL schema design, throughput provisioning, cost management, and how DataFlirt uses DynamoDB at scale.


Should I use DynamoDB or MongoDB for my scraping pipeline?

If you need flexible querying, ad-hoc aggregations, or complex document structures, use MongoDB. If you need absolute predictable latency at massive scale, zero maintenance, and your access patterns are strictly key-value (like a URL deduplication ledger), use DynamoDB. DynamoDB forces you to design your schema around your queries; MongoDB is more forgiving but requires infrastructure management.

Why am I getting ProvisionedThroughputExceededException errors?

You are either exceeding your table's total provisioned capacity, or you have a "hot partition." If your partition key is a date (e.g., "2026-05-19"), all writes for that day hit the same physical server, which hard-caps at 1,000 WCUs. Use a high-cardinality partition key like a URL hash to distribute writes evenly.

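Can I query DynamoDB by date if my partition key is a URL hash?

Not directly without a full table Scan, which is slow and expensive. You must create a Global Secondary Index (GSI) where the partition key is a known bucket (e.g., year-month) and the sort key is the exact timestamp. This allows you to query the GSI efficiently. Keep in mind that a coarse bucket is itself a low-cardinality key: at high write rates, append a shard suffix to the bucket so the index does not become a hot partition.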
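A single year-month bucket funnels every write for that month into one GSI partition key. One common mitigation is write sharding: suffix the bucket with a shard number derived from the item's own key, then query each shard and merge the results. The sketch below assumes hypothetical attribute names (`gsi_pk`, `gsi_sk`), a shard count of 10, a hex-encoded `url_hash`, and timezone-aware timestamps.

```python
from datetime import datetime, timezone

NUM_SHARDS = 10  # illustrative; size it to the expected write rate


def gsi_keys(url_hash: str, fetched_at: datetime, num_shards: int = NUM_SHARDS) -> dict:
    """Derive sharded GSI key attributes for an item (url_hash must be hex)."""
    shard = int(url_hash[:8], 16) % num_shards
    utc = fetched_at.astimezone(timezone.utc)
    return {
        "gsi_pk": f"{utc:%Y-%m}#{shard}",             # bucket plus shard suffix
        "gsi_sk": utc.strftime("%Y-%m-%dT%H:%M:%SZ"),  # sorts chronologically
    }


def partitions_to_query(year_month: str, num_shards: int = NUM_SHARDS) -> list:
    """List every GSI partition key a reader must query for one month."""
    return [f"{year_month}#{shard}" for shard in range(num_shards)]


keys = gsi_keys("a3f1c29b" + "0" * 56, datetime(2026, 5, 19, 14, 0, tzinfo=timezone.utc))
print(keys["gsi_pk"], keys["gsi_sk"])
```

The cost of sharding falls on reads: a date-range query becomes one Query per shard, whose results the caller merges and sorts.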

How does DataFlirt handle deduplication with DynamoDB?

We use conditional writes. When a worker discovers a URL, it attempts a PutItem with a ConditionExpression: attribute_not_exists(url_hash). If the item already exists, DynamoDB rejects the write with a ConditionalCheckFailedException, and the worker drops the URL. This guarantees distributed, atomic deduplication without locking.
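A minimal sketch of that conditional write, assuming `table` behaves like a boto3 DynamoDB `Table` resource. To stay importable without boto3, it inspects the error code on the exception's `response` attribute instead of catching `botocore.exceptions.ClientError` by name, which is what production code would normally do. The `url` and `status` attributes are illustrative.

```python
def claim_url(table, url_hash: str, url: str) -> bool:
    """Try to register a URL; return True if this worker claimed it first."""
    try:
        table.put_item(
            Item={"url_hash": url_hash, "url": url, "status": "PENDING"},
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(url_hash)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error under .response.
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False  # already seen: another worker got there first
        raise  # throttling and other failures are not duplicates
```

Because the condition is evaluated atomically on the server, two workers racing on the same URL cannot both receive `True`.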

Is it safe to store PII in DynamoDB?

Yes, provided you configure it correctly. DynamoDB encrypts all data at rest by default using AWS KMS. For GDPR/CCPA compliance, you can use customer-managed keys and implement item-level or attribute-level encryption in your application layer before writing to the database.

Should I use On-Demand or Provisioned capacity?

For unpredictable, spiky scraping workloads, use On-Demand. It needs no capacity planning and you pay per request, although a sudden jump far above the table's previous peak can still be throttled briefly while capacity catches up. For continuous, steady-state pipelines (like a 24/7 news crawler), use Provisioned capacity with Auto-Scaling. Provisioned is significantly cheaper per request, but only if your utilization remains consistently high.
