← Glossary / Upsert Operation

What is Upsert Operation?

Upsert Operation is a database command that either inserts a new record if it doesn't exist, or updates an existing record if a matching primary key is found. In scraping pipelines, it is the foundational mechanism for maintaining stateful datasets without duplicating rows or dropping historical context. When your pipeline runs daily against a catalog of millions of items, an efficient upsert strategy is the difference between a clean, versioned dataset and a bloated data lake full of redundant records.

Data EngineeringETLState ManagementIdempotencyDatabase Operations
// 02 — definitions

Merge without
duplication.

The mechanics of reconciling fresh scraped data against an existing historical dataset, ensuring idempotency across pipeline runs.

Ask a DataFlirt engineer →

TL;DR

An upsert (update + insert) guarantees that running the same extraction job twice won't duplicate data. It relies on a deterministic primary key — like a product ID, SKU, or canonical URL — to decide whether to append a new row or overwrite an existing one. It is the core operation of incremental scraping.

01Definition & structure

An upsert operation is a hybrid database command that merges new data into an existing table. It requires a unique constraint (a primary key). When the command executes, the database engine checks the index for the key. If the key is missing, the row is inserted. If the key exists, the row is updated with the new values.

In SQL, this is typically implemented as INSERT ... ON CONFLICT DO UPDATE (PostgreSQL/SQLite) or MERGE INTO (Snowflake/BigQuery). It is the fundamental operation that allows scraping pipelines to run continuously without corrupting the destination dataset.

02How it works in practice

When a scraper extracts a product page, it yields a JSON record. The pipeline batches these records (e.g., 10,000 at a time) and sends an upsert command to the data warehouse. The database uses the product_id to reconcile the batch. If the price changed, the existing row's price column is updated. If it's a newly discovered product, a fresh row is created.

Crucially, a well-designed upsert also includes a hash comparison (e.g., WHERE target.hash != EXCLUDED.hash) to prevent the database from writing an update when the scraped data is identical to the stored data, saving massive amounts of I/O.

03The cost of naive inserts

Many junior developers start by simply appending scraped data to a CSV or a database table. If you scrape 100,000 products daily, an append-only strategy results in 3 million rows by the end of the month, where 95% of the data is redundant. Querying the "current state" requires complex window functions (ROW_NUMBER() OVER (PARTITION BY id ORDER BY scraped_at DESC)), which cripples analytical query performance. Upserts solve this by maintaining a clean, materialized view of the current state.

04How DataFlirt handles it

We decouple extraction from reconciliation. Our edge workers extract the data and compute a deterministic SHA-256 hash of the payload. This hash is checked against a high-speed Redis cluster. If the hash hasn't changed since the last scrape, the record is discarded immediately. We only forward actual state changes (inserts and updates) to the client's data warehouse. This "edge-merge" architecture reduces client database compute costs by orders of magnitude.

05Soft deletes vs hard deletes

Upserts cannot delete data. If a product is removed from a website, the scraper simply won't extract it, meaning no upsert command is generated for that ID. To handle this, pipelines use a last_seen_at timestamp. Every upsert updates this timestamp. A post-processing job then flags any records where last_seen_at is older than the current crawl window as inactive. This preserves the historical data while accurately reflecting the current site state.

// 03 — the math

The cost of
reconciliation.

Upserts are computationally heavier than append-only inserts. DataFlirt models the I/O cost of reconciliation to determine whether to upsert at the database layer or merge in memory before writing.

Upsert I/O Cost = C = Index_Seek + (Pupdate × Writeupdate) + (Pinsert × Writeinsert)
Every upsert requires a read (index seek) before a write. Missing indexes cause full table scans. Database Engine Mechanics
Dataset Churn Rate = Rchurn = Recordsupdated / Recordstotal
If churn is < 5%, pushing full datasets to the warehouse wastes compute. DataFlirt pipeline metrics
DataFlirt Merge Efficiency = E = 1 − (Byteswritten / Bytesscraped)
We drop unchanged records at the edge. E > 0.95 on typical e-commerce targets. Internal SLO
// 04 — pipeline trace

Reconciling 2.4M
product records.

A live trace of a daily incremental scrape merging into a PostgreSQL target via an ON CONFLICT DO UPDATE clause.

PostgreSQLON CONFLICTBatch Upsert
edge.dataflirt.io — live
CAPTURED
// load batch
batch.size: 50,000
batch.keys: ["sku_10492", "sku_10493", "sku_10494"...]

// execute upsert
sql: "INSERT INTO products (sku, price, stock, hash) VALUES ... ON CONFLICT (sku) DO UPDATE SET price = EXCLUDED.price, stock = EXCLUDED.stock WHERE products.hash != EXCLUDED.hash"

// results
rows.inserted: 1,240 // new products discovered
rows.updated: 8,412 // price or stock changed
rows.unchanged: 40,348 // skipped via hash comparison

// performance
duration: 1.42s
throughput: 35,211 rows/sec
status: COMMITTED
// 05 — bottlenecks

Where upserts
choke at scale.

Ranked by frequency of performance degradation in client data warehouses. Naive row-by-row upserts fail quickly; batching and indexing are mandatory for scraping workloads.

PIPELINES MONITORED ·   300+ active
AVG BATCH SIZE ·  ·  ·    10,000 rows
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Missing or fragmented indexes

I/O bound · Causes full table scans on every conflict check
02

Row-by-row execution

Network bound · The N+1 query problem instead of bulk merging
03

Unnecessary updates

Compute bound · Overwriting rows where data hasn't actually changed
04

Dead tuple bloat (MVCC)

Storage bound · PostgreSQL bloat from high-frequency updates
05

Lock contention

Concurrency · Parallel workers trying to upsert the same keys
// 06 — our architecture

Never write the same byte twice,

hashing at the edge to save warehouse compute.

DataFlirt doesn't blindly push upserts to your warehouse. We compute a deterministic hash of the extracted record at the edge and compare it against a fast Redis cache of the previous state. If the hash matches, the record is dropped before it ever hits your database. We only send the true delta — the inserts and the actual updates — reducing your warehouse compute costs by up to 90% on low-churn targets.

delta-merge.job

Metrics from a daily catalog reconciliation job.

records.scraped 2,450,000
edge.cache_hits 2,310,400unchanged
delta.inserts 42,100
delta.updates 97,500
warehouse.writes 139,600 rows94% reduction
compute.saved ~$42.00 / run

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About database reconciliation, primary key selection, handling deleted items, and how DataFlirt optimises incremental delivery.

Ask us directly →
What is the difference between an upsert and an insert? +
An insert blindly appends a new row. If you run a scraper twice and use insert, you get duplicate records. An upsert checks if a record with the same primary key already exists. If it does, it updates the existing row; if it doesn't, it inserts a new one. Upserts make your pipeline idempotent.
How do you handle items that disappear from the target site? +
Upserts only handle inserts and updates. To handle deletions, you need a "soft delete" strategy. We track the last_seen_at timestamp during the upsert. After the crawl finishes, any record where last_seen_at is older than the current crawl start time is marked as is_active = false. We never hard-delete historical data.
What makes a good primary key for scraping upserts? +
A stable, deterministic identifier provided by the target. SKUs, product IDs, or article IDs are ideal. If those aren't available, the canonical URL is the next best choice. Never use auto-incrementing integers or volatile attributes (like titles that might fix a typo) as your upsert key, or you will create duplicates.
How does DataFlirt handle schema changes during an upsert? +
If a target adds a new field, our extraction layer flags the schema drift. If the client approves the new schema, we run an `ALTER TABLE` to add the column, then resume the upsert. The new field will populate for updated rows, while older rows will hold a null value until they are re-scraped.
Why is my database slowing down with daily upserts? +
In MVCC databases like PostgreSQL, an update is actually a soft-delete of the old row and an insert of a new row. If you update millions of rows daily without changing the data (because you aren't hashing to check for actual diffs), you create massive "dead tuple" bloat. This destroys index performance until a vacuum runs.
Can I upsert directly into a data lake? +
Yes, using modern table formats like Apache Iceberg, Delta Lake, or Apache Hudi. They support ACID transactions and `MERGE INTO` commands on top of raw Parquet files in S3. DataFlirt natively delivers incremental upserts to these formats, bypassing the need for an intermediate relational database.
$ dataflirt scope --new-project --target=upsert-operation READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h