← Glossary / Soft Delete

What is Soft Delete?

Soft delete is a database pattern where records are marked as inactive rather than physically removed from the disk. In scraping pipelines, it's the standard mechanism for handling products that go out of stock, listings that expire, or articles that are taken down. Instead of issuing a destructive DELETE statement that breaks downstream foreign keys and historical analytics, the pipeline updates a deleted_at timestamp, preserving the data's lineage while filtering it from active views.

Data EngineeringState ManagementHistorical DataETLData Retention
// 02 — definitions

Hide it,
don't drop it.

The mechanics of preserving historical scraped data while keeping active datasets clean and performant.

Ask a DataFlirt engineer →

TL;DR

Soft deleting updates a boolean flag or timestamp (like <code>deleted_at</code>) instead of running a SQL <code>DELETE</code>. For scraping pipelines, this is critical: a 404 on a product page doesn't mean the product never existed, it just means it's gone today. Soft deletes preserve that history for pricing models and audit trails.

01Definition & structure
A soft delete is a logical operation that marks a database record as inactive without executing a physical DELETE command. Structurally, this is implemented by adding a deleted_at timestamp column to the table. When a record needs to be removed, the application updates this column with the current time. All read queries are then modified to append WHERE deleted_at IS NULL, effectively hiding the record from the active application state while preserving it on disk.
02How it works in scraping pipelines
In a web scraping context, a soft delete is triggered when a previously known URL returns a 404, or when an item disappears from a listing page. Instead of dropping the item from the database, the pipeline issues an UPDATE. This is crucial because scraped data is inherently volatile. A product might be temporarily out of stock, or a server might be misconfigured. Soft deleting ensures that if the item returns, its historical data (like past price changes) remains intact and linked.
03The foreign key problem
Hard deletes are destructive to relational integrity. If you hard-delete a product, you must also delete (or nullify) every price history record, review, and inventory log associated with that product's ID. Soft deletes solve this by keeping the parent record alive. The foreign keys remain valid, allowing analysts to run historical queries (e.g., "what was the average price of all products, active or inactive, in Q3?") without encountering orphaned data.
04How DataFlirt handles it
We enforce soft deletes by default on all managed delivery sinks. Our ETL pipelines never issue a DELETE statement to a client's production table. Instead, we manage state via deleted_at and last_seen_at timestamps. To prevent downstream analysts from having to write complex queries, we automatically provision materialized views that filter out soft-deleted records, providing a clean "active only" interface while keeping the raw historical data available in the base tables.
05Did you know: the performance trap
Soft deletes can silently destroy database performance. If a table grows to 100 million rows, but 90 million are soft-deleted, a simple SELECT COUNT(*) WHERE deleted_at IS NULL will still scan all 100 million rows unless properly indexed. The solution is a partial index (e.g., in PostgreSQL: CREATE INDEX active_idx ON table (id) WHERE deleted_at IS NULL), which keeps the index size small and query times fast regardless of how large the "graveyard" gets.
// 03 — the logic

How views filter
the graveyard.

Soft deletes require every downstream query to be aware of the deletion state. DataFlirt uses materialized views to abstract this complexity away from the client's reporting layer.

Active Record Query = SELECT * FROM table WHERE deleted_at IS NULL
The standard filter. Requires a partial index on deleted_at for performance. Standard SQL pattern
Pipeline Churn Rate = soft_deleted_records / total_records
High churn indicates a highly dynamic target site (e.g., fast-fashion inventory). DataFlirt pipeline metrics
Hard Purge Threshold = current_datedeleted_at > 90 days
The retention window after which a soft delete becomes a physical hard delete. Data retention policy
// 04 — pipeline execution

Handling a 404
in the delivery layer.

Simulated trace of a daily catalog sync. The scraper encounters a missing product page and the delivery layer issues a soft delete to the client's data warehouse.

PostgreSQLETL SyncState Management
edge.dataflirt.io — live
CAPTURED
// sync job: catalog_update_v4
target.url: "https://target.com/p/sku-992"
http.status: 404 Not Found

// pipeline decision engine
action: "soft_delete"
reason: "target_removed"

// database execution
db.execute: "UPDATE products SET deleted_at = NOW() WHERE sku = 'sku-992'"
db.status: ok // 1 row updated

// downstream view refresh
view.active_products.count: 142,804
view.historical_products.count: 18,291

// data quality check
fk_integrity: pass // historical price points preserved
pipeline.status: running nominally
// 05 — failure modes

Where soft deletes
cause friction.

While safer than hard deletes, soft deletes introduce their own operational overhead. Ranked by frequency of issues observed in client data warehouses.

OBSERVED ISSUES ·  ·  ·   Trailing 12 months
IMPACT ·  ·  ·  ·  ·  ·   Query performance
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Query performance degradation

table bloat · Full table scans if deleted_at is unindexed
02

Forgotten WHERE clauses

zombie data · Downstream analysts forget to filter NULLs
03

Unique constraint violations

schema error · Re-inserting the same SKU fails on unique index
04

GDPR/CCPA compliance

legal risk · Failing to hard-delete PII when requested
05

Storage cost bloat

infrastructure · Paying for SSD storage on dead records
// 06 — our architecture

Never drop data,

until the retention policy demands it.

DataFlirt's delivery layer never issues a hard delete to a client's active dataset. When a target site removes a listing, we update the deleted_at timestamp and append a deletion reason code (e.g., 404, out of stock, schema mismatch). This ensures your pricing models can still train on historical data without polluting your live inventory feeds.

Record state transition

A live snapshot of a soft-deleted record in a managed PostgreSQL delivery sink.

record.id prod_8821a
status inactive
deleted_at 2026-05-19T08:14:22Z
deletion_reason HTTP 404
historical_prices 42 data points
hard_purge_scheduled 2026-08-17

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About state management, database performance, legal compliance, and how DataFlirt handles data lifecycle in scraping pipelines.

Ask us directly →
What's the difference between deleted_at and is_deleted? +
is_deleted is a boolean flag; deleted_at is a timestamp. Always use the timestamp. A boolean tells you a record is gone, but a timestamp tells you when it disappeared, which is critical for time-series analysis and debugging pipeline state changes.
Does a soft delete satisfy GDPR right to erasure? +
No. If the scraped data contains Personally Identifiable Information (PII), a soft delete is legally insufficient for an erasure request. The data is still on your disk and accessible to admins. PII requires a hard delete or cryptographic anonymisation to comply with GDPR/CCPA.
How do you handle unique constraints with soft deletes? +
If you have a unique constraint on sku, soft deleting a product and then re-scraping it later will throw a constraint violation. The standard fix is to change the unique index to a composite of (sku, deleted_at), or use a partial unique index: CREATE UNIQUE INDEX ON products (sku) WHERE deleted_at IS NULL;.
How does DataFlirt handle items that come back online? +
Our delivery layer uses upsert logic (INSERT ON CONFLICT). If a scraper rediscovers a URL that was previously soft-deleted, the pipeline updates the record, setting deleted_at = NULL and refreshing the scraped fields. The record is resurrected without losing its historical foreign key relationships.
Won't soft deletes slow down my database over time? +
Yes, if left unchecked. Table bloat forces the database to read more pages from disk even if most rows are filtered out. Mitigate this by using partial indexes on active records, partitioning tables by date, and implementing a hard-purge job that moves soft-deleted rows to cold storage (like S3) after 90 days.
When should I actually hard delete data? +
Hard deletes should be reserved for three scenarios: legal compliance (PII erasure requests), data pipeline errors (accidentally inserting garbage data that has no historical value), and automated archiving (moving 1-year-old soft-deleted records to a data lake to reclaim expensive SSD space).
$ dataflirt scope --new-project --target=soft-delete READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h