
What is a Read Replica?

A read replica is a read-only copy of your primary database that continuously synchronises data from the primary node. In scraping infrastructure, it separates the heavy write workloads of ingestion pipelines from the complex read queries of downstream analytics and data delivery. Without it, a massive batch insert from a concurrent crawl can contend for locks and disk I/O, exhaust connection pools, and bring your entire client-facing API to a halt.

Databases · PostgreSQL · Replication Lag · Query Offloading · High Availability

// 02 — definitions

Split the workload.

How separating writes from reads prevents high-throughput scraping pipelines from suffocating your downstream data consumers.


TL;DR

A read replica receives a continuous stream of transaction logs from the primary database and applies them locally. It handles SELECT queries exclusively, freeing the primary to focus on INSERTs and UPDATEs. In AWS RDS or self-hosted PostgreSQL, this is the standard architecture for scaling out read-heavy workloads without sharding the entire cluster.

01 Definition & structure

A read replica is a database instance that maintains a continuous, read-only copy of a primary database. The primary node handles all write operations (INSERT, UPDATE, DELETE) and streams its Write-Ahead Log (WAL) to the replica. The replica applies these changes locally and serves SELECT queries. This architecture allows you to scale read capacity horizontally by adding more replicas, without adding query load to the primary node.
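As a minimal sketch of that split, assuming a hypothetical `Router` class and placeholder node names (neither is part of PostgreSQL or of DataFlirt's stack), application code can send anything that is not a plain SELECT to the primary and rotate SELECTs across the replicas:

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical read/write splitter; node names are placeholders."""
    primary: str
    replicas: list
    _cycle: object = field(init=False, repr=False)

    def __post_init__(self):
        # Round-robin iterator over the replica names.
        self._cycle = itertools.cycle(self.replicas)

    def route(self, sql: str) -> str:
        """Return the node that should run this statement."""
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        # Only plain SELECTs go to replicas; everything else
        # (INSERT, UPDATE, DELETE, COPY, DDL, unknown) goes to the primary.
        if verb != "SELECT" or not self.replicas:
            return self.primary
        return next(self._cycle)


router = Router(primary="primary", replicas=["replica-1", "replica-2"])
print(router.route("INSERT INTO scraped_records VALUES (1)"))
print(router.route("SELECT count(*) FROM scraped_records"))
```

Sniffing the first keyword is deliberately naive: a `SELECT ... FOR UPDATE`, or a SELECT that calls a function with side effects, still needs the primary, so a real router would more likely let the caller declare read or write intent explicitly.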

02 Synchronous vs Asynchronous

Replication can be synchronous or asynchronous. Asynchronous is the default for scraping workloads: the primary commits the write immediately and streams it to the replica in the background. This keeps ingestion fast but introduces a slight delay (lag) before the data is visible on the replica. Synchronous replication makes the primary wait for the replica's acknowledgment before confirming the commit to the client. That protects against data loss on failover, but it adds a network round trip to every commit and can sharply reduce write throughput.

03 The replication lag problem

Replication lag is the time difference between a record being written to the primary and becoming available on the replica. During heavy scraping batch inserts, the primary can generate WAL faster than the replica's single-threaded replay process can apply it. If an API client queries the replica during this spike, it will receive stale data. Managing this lag is critical for data consistency.

04 How DataFlirt handles it

We strictly enforce Command Query Responsibility Segregation (CQRS) at the infrastructure level. Our scraping workers write exclusively to the primary PostgreSQL nodes. All client-facing APIs, internal Grafana dashboards, and data export jobs are routed through PgBouncer to a pool of read replicas. If a replica's lag exceeds our 500ms SLO, our load balancer automatically removes it from the read pool until it catches up.
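The pool eviction rule described above can be sketched as a simple filter. This is an illustration only: `ReplicaStatus`, `healthy_replicas`, and the replica names are hypothetical, and the lag values would come from whatever monitoring feeds the load balancer.

```python
from dataclasses import dataclass
from typing import Optional

LAG_SLO_SECONDS = 0.5  # the 500ms SLO from the text


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: Optional[float]  # None means the lag could not be measured


def healthy_replicas(statuses, slo=LAG_SLO_SECONDS):
    """Return the names of replicas that may stay in the read pool.

    A replica with unknown lag is treated as unhealthy (an assumption of
    this sketch: serving possibly stale data is worse than losing capacity).
    """
    return [
        s.name
        for s in statuses
        if s.lag_seconds is not None and s.lag_seconds <= slo
    ]


pool = [
    ReplicaStatus("replica-1", 0.015),
    ReplicaStatus("replica-2", 4.21),   # lagging during a bulk insert
    ReplicaStatus("replica-3", None),   # health check failed
]
print(healthy_replicas(pool))
```

Re-running the filter on every health-check interval means a lagging replica rejoins the pool as soon as it has caught up.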

05 The split-brain misconception

A common fear is "split-brain," where a network partition causes a replica to think the primary is dead and promote itself, resulting in two primaries accepting writes. High-availability managers such as Patroni prevent this by keeping a leader lock in a consensus store such as etcd. A replica cannot promote itself unless it acquires that lock, and the store only grants it when a majority (quorum) of its members agree, so a node on the minority side of a partition cannot become a second primary.
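The majority rule behind that guarantee is small enough to show directly. The function below is a hypothetical illustration of the arithmetic, not Patroni's or etcd's actual implementation:

```python
def has_quorum(votes_received: int, cluster_size: int) -> bool:
    """True only if strictly more than half of the members agree."""
    return votes_received > cluster_size // 2


# A 3-member configuration store split by a network partition into 2 + 1:
majority_side = has_quorum(votes_received=2, cluster_size=3)
minority_side = has_quorum(votes_received=1, cluster_size=3)
print(majority_side, minority_side)
```

Because two disjoint groups can never both hold a strict majority, at most one side of a partition can obtain the lock. This is also why such stores are usually deployed with an odd number of members: with four members, a 2 + 2 split leaves neither side with a quorum.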

// 03 — replication math

How stale is the replica?

Replication lag is the time difference between a commit on the primary and its availability on the replica. DataFlirt monitors this continuously to ensure data delivery feeds don't export partial or stale records during heavy ingestion spikes.

Replication lag (time): L = T_{replica_apply} − T_{primary_commit}

Time delay between a commit on the primary and its replay on the replica. Usually in the low milliseconds when idle, but it spikes during heavy batch inserts. Source: PostgreSQL pg_stat_replication.

Byte lag (WAL): B = LSN_{primary} − LSN_{replica}

Log Sequence Number difference. Measures how much WAL data is in flight. Source: PostgreSQL pg_wal_lsn_diff().

Max read throughput: Q = N_{replicas} × Capacity_{node}

Read capacity scales roughly linearly with the number of replicas N, assuming reads are load-balanced evenly. Source: System Design 101.
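The three quantities above can be worked through in a few lines. This is a sketch with made-up inputs: the helper names, the LSN values, and the per-node capacity are assumptions for illustration. The only PostgreSQL-specific fact it relies on is that an LSN is printed as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit WAL position).

```python
from datetime import datetime, timedelta


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(primary_lsn: str, replica_lsn: str) -> int:
    """B = LSN_primary - LSN_replica, in bytes."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


def time_lag_seconds(primary_commit: datetime, replica_apply: datetime) -> float:
    """L = T_replica_apply - T_primary_commit, in seconds."""
    return (replica_apply - primary_commit).total_seconds()


def max_read_throughput(n_replicas: int, capacity_per_node: float) -> float:
    """Q = N_replicas x Capacity_node (assumes even load balancing)."""
    return n_replicas * capacity_per_node


commit = datetime(2026, 1, 1, 12, 0, 0)
print(byte_lag("0/3000000", "0/1000000"))                        # 33554432 bytes
print(time_lag_seconds(commit, commit + timedelta(seconds=2)))   # 2.0
print(max_read_throughput(3, 2000))                              # 6000 (hypothetical queries/s)
```

Inside the database, pg_wal_lsn_diff() performs the same subtraction as `byte_lag`, so the Python version is mainly useful when LSNs are scraped from logs or exported metrics.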

// 04 — pg_stat_replication

Monitoring lag during a batch insert.

A live trace of PostgreSQL replication stats during a 500k record bulk insert from a scraping pipeline. The replica falls behind momentarily before catching up.

PostgreSQL 16 · walreceiver · async


// primary node status
select client_addr, state, sync_state from pg_stat_replication;
client_addr: 10.0.1.42 state: streaming sync_state: async

// pipeline initiates bulk insert (500k rows)
query: COPY scraped_records FROM STDIN;
wal_write_rate: 142 MB/s

// checking byte lag (run on the primary)
select pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as byte_lag from pg_stat_replication;
byte_lag: 45,812,904 // ~45 MB behind

// checking time lag (run on the replica)
select now() - pg_last_xact_replay_timestamp() as time_lag;
time_lag: 00:00:04.210 // 4.2 seconds stale

// insert completes, replica catches up
wal_write_rate: 1.2 MB/s
time_lag: 00:00:00.015 // 15ms stale
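The shape of this trace (fall behind, then catch up) follows from simple rate arithmetic. The sketch below uses the 142 MB/s and 1.2 MB/s write rates from the trace, but the replica's replay rate of 130 MB/s and the 4-second burst duration are assumed values chosen for illustration, not measurements.

```python
def backlog_bytes(wal_rate: float, replay_rate: float, seconds: float) -> float:
    """Unreplayed WAL that accumulates while the primary writes faster
    than the replica can replay. Rates are in bytes per second."""
    return max(0.0, (wal_rate - replay_rate) * seconds)


def catch_up_seconds(backlog: float, replay_rate: float, wal_rate_after: float) -> float:
    """Time to drain the backlog once the burst ends. If the replica has
    no spare replay capacity it never catches up."""
    spare = replay_rate - wal_rate_after
    if spare <= 0:
        return float("inf")
    return backlog / spare


MB = 1_000_000
REPLAY_RATE = 130 * MB  # hypothetical replica replay rate

backlog = backlog_bytes(wal_rate=142 * MB, replay_rate=REPLAY_RATE, seconds=4.0)
drain = catch_up_seconds(backlog, REPLAY_RATE, wal_rate_after=1.2 * MB)
print(f"backlog: {backlog / MB:.1f} MB, drained in {drain:.2f}s")
```

Under these assumptions the backlog grows by 12 MB for every second of the burst and drains quickly once the insert completes, which matches the qualitative behaviour in the trace: lag is driven by the gap between WAL generation and replay rates, not by the size of the insert alone.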

// 05 — lag factors

What causes replication lag.

Ranked by frequency of occurrence in high-throughput scraping databases. Network latency is rarely the bottleneck; disk I/O and lock contention dominate.

AVG LAG (IDLE): < 20ms
AVG LAG (LOAD): 2–5s
UPDATED: 2026-05-19

01 Long-running read queries
blocks apply · Analytical queries on the replica block WAL replay

02 Massive batch UPDATEs
WAL bloat · Updating 1M rows generates massive WAL traffic

03 Disk I/O saturation
hardware · Replica disk cannot write as fast as primary

04 Network bandwidth
throughput · Cross-region replication hits bandwidth caps

05 CPU exhaustion
compute · Single-threaded WAL replay maxes out a core

// 06 — our architecture

Write heavy, read instantly.

DataFlirt's storage layer isolates ingestion from extraction. Our primary PostgreSQL nodes handle raw HTML metadata and parsed record inserts exclusively. All downstream transformations, schema validations, and client API requests are routed to a pool of read replicas via PgBouncer. This guarantees that a sudden spike in crawl concurrency never degrades the performance of our data delivery endpoints.

pg_stat_replication

Live replication status of a DataFlirt production cluster.

primary.node db-main-01

replica.count 3 active

sync_state async · quorum

wal_write_rate 42.5 MB/s

max_time_lag 18ms

max_byte_lag 2.1 MB

cluster.status healthy


// 07 — FAQ

Common questions.

About replication lag, synchronous vs asynchronous modes, and how DataFlirt scales database reads for high-volume scraping pipelines.


What is the difference between synchronous and asynchronous replication?

In synchronous replication, the primary waits for the replica to confirm it has written the data before acknowledging the commit to the client. This guarantees zero data loss but adds latency to every write. Asynchronous replication (the default) acknowledges the commit immediately and sends the data to the replica in the background. It's faster, but a primary crash can lose the most recent commits that had not yet reached the replica: typically a few milliseconds of data, more if the replica was lagging.

Can I write data to a read replica?

No. By definition, a read replica is in read-only mode. If you attempt an INSERT, UPDATE, or DELETE, the database engine will reject the query. If you need to scale writes, you need a multi-master setup, sharding, or a distributed database like CockroachDB or Citus.

How does DataFlirt handle replication lag during data delivery?

We monitor the pg_stat_replication lag metrics continuously. Before a client's scheduled data delivery job (e.g., an S3 export) starts, the job runner checks the replica's lag. If the lag exceeds 500ms, the job pauses until the replica catches up, ensuring no partial or stale records are exported.
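A pause-until-fresh gate of this kind can be sketched as a polling loop. Everything here is hypothetical: `wait_for_fresh_replica` and its parameters are illustrative names, and `get_lag_seconds` stands in for whatever query or metrics call reports the replica's current lag.

```python
import time


def wait_for_fresh_replica(
    get_lag_seconds,
    max_lag=0.5,          # the 500ms threshold from the text
    poll_interval=1.0,
    timeout=300.0,        # assumed upper bound; not from the source
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Block until the replica's lag is within max_lag.

    Returns True when the replica is fresh enough to read from,
    or False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        lag = get_lag_seconds()
        # A lag of None (measurement failed) counts as "not fresh".
        if lag is not None and lag <= max_lag:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


# Simulated lag readings while a bulk insert finishes.
readings = iter([4.21, 1.3, 0.015])
ok = wait_for_fresh_replica(lambda: next(readings), sleep=lambda s: None)
print("export may start:", ok)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. Returning False on timeout, rather than exporting anyway, leaves the decision about a persistently lagging replica to the caller.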

When should I add a read replica instead of scaling up the primary?

Scale up (more CPU/RAM) when write throughput is the bottleneck or your working set no longer fits in memory. Add a read replica when the primary's CPU is maxed out by complex SELECT queries (like aggregations, joins, or API reads) while your write volume is relatively stable.

What happens if the primary database goes down?

In a managed or orchestrated environment (such as AWS RDS, or self-hosted PostgreSQL under Patroni), a read replica can be promoted to become the new primary. This failover process typically takes 30–60 seconds. Once promoted, it begins accepting writes, and new replicas must be spun up to replace it.

Does a read replica help with database backups?

Yes. Running a heavy pg_dump or taking a snapshot on the primary can cause severe I/O spikes and degrade ingestion performance. We run all our daily logical backups against a dedicated read replica, completely isolating the backup load from the production write path.
