← Glossary / Upsert (INSERT ON CONFLICT)

What is Upsert (INSERT ON CONFLICT)?

Upsert (INSERT ON CONFLICT) is a database operation that attempts to insert a new record, but if a conflict occurs on a unique constraint (like a primary key), it updates the existing record instead. In web scraping pipelines, it is the fundamental mechanism for maintaining stateful datasets without duplicating rows or requiring expensive read-before-write checks. Without upserts, incremental scraping jobs quickly degrade into a mess of primary key violations or bloated append-only tables.

DatabasesIdempotencyPostgreSQLData EngineeringStateful Pipelines
// 02 — definitions

Write once,
update if needed.

The atomic operation that makes incremental scraping pipelines idempotent, preventing duplicate records without the latency of checking existence first.

Ask a DataFlirt engineer →

TL;DR

An upsert combines an INSERT and an UPDATE into a single atomic transaction. In PostgreSQL, this is handled via the INSERT ... ON CONFLICT DO UPDATE syntax. It allows scraping pipelines to blindly push extracted records to the database, trusting the storage layer to handle deduplication and state updates efficiently.

01Definition & structure
An upsert is a database operation that conditionally inserts a row if it doesn't exist, or updates it if it does. In PostgreSQL, this is implemented using the INSERT ... ON CONFLICT DO UPDATE syntax. It requires a unique constraint or index on the target table to determine what constitutes a "conflict." When the database engine detects a constraint violation during the insert phase, it seamlessly pivots to executing the update logic defined in the query.
02How it works in practice
When a scraper pushes a batch of records, the database evaluates each row against the unique index. If the key (e.g., a product SKU) is new, the row is inserted. If the key already exists, the database exposes the incoming data via the special EXCLUDED table alias. The query then maps the EXCLUDED values to the existing row's columns. This entire process happens atomically within the database engine, ensuring no race conditions occur between concurrent scraping workers.
03The Read-Modify-Write Anti-Pattern
Novice pipeline builders often implement idempotency in application code: query the database to see if the record exists, then issue either an INSERT or an UPDATE. This is the read-modify-write anti-pattern. It doubles the network latency per record and introduces race conditions if two workers process the same entity simultaneously. Upserts push this logic down to the storage layer, where it belongs, utilizing row-level locks to guarantee consistency at maximum throughput.
04How DataFlirt handles it
We treat every data delivery as an idempotent event. Our delivery connectors automatically map your schema's primary keys to the appropriate upsert dialect for your destination warehouse. To optimize performance and reduce dead tuple bloat, our generated queries include WHERE clauses that skip the update entirely if the incoming data is identical to the existing row (e.g., the price hasn't changed). This drastically reduces disk I/O on the target database.
05Did you know: sequence exhaustion
In PostgreSQL, if your table uses a SERIAL or IDENTITY column for its primary key, an upsert that results in an update still consumes a sequence number. The database increments the sequence in anticipation of the insert, hits the conflict, and discards the number. On high-volume scraping pipelines doing millions of updates a day, this can rapidly exhaust a 32-bit integer sequence, causing the database to halt when it hits the 2.1 billion limit. Always use BIGINT for auto-incrementing IDs on upsert-heavy tables.
// 03 — the performance model

Why read-before-write
kills pipelines.

Upserts eliminate the network round-trip of checking if a record exists before writing it. DataFlirt's delivery layer relies entirely on upsert semantics to stream millions of records per hour into client warehouses.

Read-Modify-Write Latency = L = Tread + Tapp + Twrite
Two network round-trips per record. Unscalable for high-throughput scraping. Anti-pattern
Upsert Latency = L = Tupsert
One network round-trip. The database engine handles the conflict resolution locally. Standard practice
Pipeline Throughput = R = Batch_Size / L
Batching upserts yields 10x to 50x throughput gains over single-row operations. DataFlirt delivery metrics
// 04 — the query trace

Resolving conflicts
at the storage layer.

A trace of a batch upsert operation pushing scraped product pricing into a PostgreSQL database. The unique constraint is on the product SKU.

PostgreSQL 16Batch ExecutionIdempotent
edge.dataflirt.io — live
CAPTURED
-- Incoming scraped batch: 3 records
BEGIN;
INSERT INTO products (sku, price, last_seen)
VALUES
('SKU-101', 49.99, '2026-05-19T10:00:00Z'),
('SKU-102', 89.50, '2026-05-19T10:00:00Z'),
('SKU-103', 12.00, '2026-05-19T10:00:00Z')
ON CONFLICT (sku)
DO UPDATE SET
price = EXCLUDED.price,
last_seen = EXCLUDED.last_seen
WHERE products.price != EXCLUDED.price;
COMMIT;

-- Execution result
INSERT 0 3 // 1 inserted, 1 updated, 1 skipped (no price change)
// 05 — failure modes

Where upserts
break down.

Upserts are powerful but require strict schema discipline. Ranked by frequency of occurrence in client delivery pipelines.

DELIVERY JOBS ·  ·  ·  ·  12,000+ daily
DB DIALECTS ·  ·  ·  ·    Postgres, Snowflake, BQ
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Missing unique constraints

Syntax error · Without a constraint, ON CONFLICT throws an immediate error.
02

Dead tuple bloat

Storage cost · Frequent updates bloat Postgres tables; requires aggressive autovacuuming.
03

Partial data overwrites

Data loss · Nulling out existing fields because the scraper missed them.
04

Lock contention

Deadlocks · Concurrent upserts on the same rows cause transaction deadlocks.
05

Sequence exhaustion

ID limits · Failed inserts still increment serial IDs, burning through integers.
// 06 — delivery architecture

Idempotent by design,

streaming updates without duplicates.

DataFlirt's delivery layer assumes every scrape job might run twice, fail halfway, or overlap with another worker. We never use standard inserts. Every record pushed to a client's Snowflake, BigQuery, or Postgres instance is an upsert keyed on a deterministic hash of the target URL and entity ID. This guarantees that no matter how chaotic the extraction environment gets, the final dataset remains perfectly deduplicated and accurate.

Delivery Job Status

Metrics from a live batch upsert into a client's PostgreSQL warehouse.

target.table public.real_estate_listings
batch.size 10,000 records
constraint.key listing_id
rows.inserted 1,240new
rows.updated 8,760refreshed
deadlocks 0clean
latency.p95 142ms

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About upsert semantics, database performance, partial updates, and how DataFlirt handles stateful data delivery.

Ask us directly →
What is the difference between UPSERT and MERGE? +
MERGE is the ANSI SQL standard command for conditional inserts and updates, widely used in Snowflake, BigQuery, and SQL Server. UPSERT is a colloquial term, usually referring to PostgreSQL's specific INSERT ... ON CONFLICT syntax. They achieve the same idempotent result, but the syntax and underlying locking mechanisms differ by database engine.
Why not just delete the old records and insert the new ones? +
The "delete-and-replace" pattern destroys foreign key relationships, causes massive index churn, and ruins transaction isolation for concurrent readers. If a read query hits the table between the delete and the insert, it sees missing data. Upserts maintain row identity and ensure readers always see a consistent state.
How does an upsert handle fields the scraper failed to extract? +
If you blindly upsert a record with missing fields, you will overwrite good historical data with NULLs. To prevent this, use COALESCE(EXCLUDED.field, table.field) in your update clause, or add a WHERE EXCLUDED.field IS NOT NULL condition. This ensures partial scrapes only update the fields they successfully extracted.
Does DataFlirt handle the upsert logic for me? +
Yes. Our delivery connectors automatically generate the correct dialect for your destination — whether that's a Postgres ON CONFLICT, a Snowflake MERGE, or a BigQuery update. You define the primary key in the schema contract, and we handle the idempotency layer.
Why is my Postgres database growing so fast when I only do upserts? +
PostgreSQL uses Multi-Version Concurrency Control (MVCC). An update doesn't overwrite the row in place; it writes a new version of the row and marks the old one as dead. High-frequency upserts create massive "dead tuple" bloat. You must tune your autovacuum settings to run more aggressively on tables receiving constant scraper updates.
Can I upsert based on multiple columns? +
Yes. You can define a composite unique constraint across multiple columns (e.g., domain and sku). The ON CONFLICT (domain, sku) clause will perfectly handle compound keys, which is highly recommended for multi-tenant scraping datasets where IDs might overlap across different target sites.
$ dataflirt scope --new-project --target=upsert-(insert-on-conflict) READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h