
What is Right to Erasure?

The right to erasure (often called the right to be forgotten) is a legal mandate under frameworks like GDPR and CCPA that allows individuals to demand the deletion of their personal data. For scraping pipelines, it turns a simple append-only data lake into an architectural nightmare: you must locate the specific entity, purge it from all storage tiers, propagate the deletion to downstream consumers, and ensure your crawler doesn't simply re-ingest the exact same data on its next run.

GDPR Art. 17 · Compliance · Data Deletion · PII · Data Governance

// 02 — definitions

Delete it,
keep it deleted.

The legal mandate that forces scraping operations to build mutable, highly tracked data architectures instead of simple append-only logs.


TL;DR

The right to erasure requires organizations to delete personal data upon request within a statutory deadline: one month under GDPR (extendable in some cases) and 45 days under CCPA. In the context of web scraping, compliance means building pipelines capable of targeted physical deletions, emitting tombstone events to clients, and maintaining cryptographic blocklists so the crawler ignores the entity in the future.

01 Definition & scope

The right to erasure is a data subject right enshrined in Article 17 of the GDPR and similar privacy laws (like the CCPA). It dictates that individuals can request the deletion of their personal data if it is no longer necessary for its original purpose, if they withdraw consent, or if the data was processed unlawfully. For data brokers and scraping operations, this applies directly to scraped personal information, regardless of whether the source was a public website.

02 The re-ingestion problem

Deleting a record from a database is trivial. The complexity in scraping pipelines is that the source website still hosts the data. If you delete a profile on Tuesday, your incremental crawler will discover it as a "new" record on Wednesday and re-ingest it. Compliance requires a mechanism to remember not to scrape an entity, without actually storing the entity's personal data.

03 Cryptographic blocklists

To solve the re-ingestion problem, modern pipelines use cryptographic blocklists. When a deletion request is processed, the system generates a one-way hash (e.g., SHA-256) of the unique identifier (like the profile URL or email address). The actual PII is purged, but the hash is stored in an edge cache. During extraction, the crawler hashes incoming identifiers and drops any record that matches the blocklist before it ever reaches the data lake.
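The hash-and-drop step described above can be sketched in a few lines of Python. This is a minimal illustration, not DataFlirt's implementation: `ErasureBlocklist`, `fingerprint`, `filter_records`, and the `profile_url` field are hypothetical names, and an in-memory set stands in for the edge cache. Lowercasing the identifier is also an assumption; some sites serve case-sensitive URL paths.

```python
import hashlib


def normalize(identifier):
    """Canonicalize so trivially different spellings hash alike (assumed policy)."""
    return identifier.strip().lower()


def fingerprint(identifier):
    """One-way SHA-256 digest of the normalized identifier, as a hex string."""
    return hashlib.sha256(normalize(identifier).encode("utf-8")).hexdigest()


class ErasureBlocklist:
    """In-memory stand-in for the edge cache; holds digests, never raw identifiers."""

    def __init__(self):
        self._digests = set()

    def add(self, identifier):
        # Called once when the erasure request is processed.
        self._digests.add(fingerprint(identifier))

    def is_blocked(self, identifier):
        # Called by the crawler for every incoming identifier.
        return fingerprint(identifier) in self._digests


def filter_records(records, blocklist, key="profile_url"):
    """Drop extracted records whose identifier matches the blocklist."""
    return [r for r in records if not blocklist.is_blocked(r[key])]
```

One design caveat: an unkeyed hash of a low-entropy identifier such as an email address can be reversed by hashing candidate values, so some teams use a keyed digest (for example HMAC with a secret) instead. That is a hardening option, not something the text above prescribes.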

04 How DataFlirt handles it

We avoid scraping PII by default, focusing on business and product data. However, for enterprise pipelines that require public professional data, we implement a strict CDC (Change Data Capture) architecture. Deletions trigger a tombstone event that instantly filters the record from API endpoints and client deliveries. Physical deletion from our S3 data lakes occurs during automated weekly compaction runs, ensuring compliance without degrading pipeline read performance.

05 Downstream propagation

If you sell or distribute scraped data, your compliance burden doesn't end at your own database. You must also inform downstream consumers that the data has been erased. This is typically handled by delivering delta files or webhook payloads containing deletion instructions. A client that fails to process these tombstones keeps holding data it has been told to erase, and carries the compliance risk for doing so.
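On the consumer side, processing a delta file can be as simple as the sketch below. It assumes newline-delimited JSON and a local store keyed by record id. The delete shape mirrors the `"op": "d"` payload used elsewhere on this page; the upsert shape (`"op": "u"` with a `data` field) and the function name `apply_delta` are hypothetical.

```python
import json


def apply_delta(store, delta_lines):
    """Apply one delta batch to a dict keyed by record id.

    Returns the ids that were tombstoned so the caller can acknowledge them.
    """
    deleted = []
    for line in delta_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the delta file
        event = json.loads(line)
        if event.get("op") == "d":
            # Tombstone: remove the record if present. Acknowledge even when
            # it is already gone, so replaying a delta is idempotent.
            store.pop(event["id"], None)
            deleted.append(event["id"])
        else:
            # Hypothetical upsert shape: {"id": ..., "op": "u", "data": {...}}
            store[event["id"]] = event.get("data", {})
    return deleted
```

Returning the deleted ids gives the consumer something concrete to send back as an acknowledgement receipt.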

// 03 — compliance metrics

Measuring deletion
efficacy.

A deletion request isn't complete until the data is purged from active storage, backups, and downstream feeds. DataFlirt tracks propagation latency across all managed sinks to ensure strict SLA compliance.

Deletion Latency = T_purge = T_{cleared_all_sinks} − T_request

GDPR requires erasure without undue delay, and within one month at the latest. We target < 24 hours for active storage. · Compliance SLA

Re-ingestion Block Rate = B = blocked_fetches / total_erasure_hashes

Measures how often the crawler attempts to re-fetch legally deleted entities. · Crawler Edge Metrics

Downstream Propagation = P = ack_receipts / active_consumers

Ensuring clients have successfully processed the tombstone record. · CDC Delivery Pipeline
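The three metrics reduce to a few lines of arithmetic. The sketch below takes deletion latency as the time elapsed from the request until the slowest sink confirms the purge. The function names, the sink dictionary, and the zero-denominator defaults are illustrative assumptions.

```python
from datetime import datetime, timedelta


def deletion_latency(t_request, sink_cleared_at):
    """Elapsed time from the request until the last sink confirmed the purge."""
    return max(sink_cleared_at.values()) - t_request


def block_rate(blocked_fetches, total_erasure_hashes):
    """Blocked re-fetch attempts per blocklisted hash (0.0 if the list is empty)."""
    if total_erasure_hashes == 0:
        return 0.0
    return blocked_fetches / total_erasure_hashes


def propagation(ack_receipts, active_consumers):
    """Share of active consumers that acknowledged the tombstone.

    Assumption: with no consumers there is nothing left to propagate, so 1.0.
    """
    if active_consumers == 0:
        return 1.0
    return ack_receipts / active_consumers
```

Using `max` over the sink timestamps reflects the point made above: a request is only complete once every sink has been cleared.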

// 04 — the deletion trace

Executing an
erasure request.

Trace of an automated deletion workflow triggered by a data subject request. The system purges the record, issues a tombstone to consumers, and updates the crawler blocklist.

GDPR Art. 17 · tombstone event · hash blocklist

edge.dataflirt.io — live

CAPTURED

// inbound request
req.type: "erasure_request"
req.subject_id: "usr_8847291a"

// locate and purge
db.lakehouse: purged 1 record
db.backups: scheduled for compaction (7d TTL)

// prevent re-ingestion
blocklist.append: sha256("usr_8847291a")
crawler.rule: ignore_hash_match

// downstream notification
kafka.topic: "cdc_deletes"
event.payload: {"id": "usr_8847291a", "op": "d", "reason": "erasure"}
status: 200 OK // compliance SLA met

// 05 — failure modes

Where erasure
pipelines fail.

Ranked by frequency of compliance breaches in large-scale scraping operations. The hardest part isn't the initial delete — it's the architectural side-effects.

PIPELINES AUDITED · 140+ active

SLA TARGET · < 72 hours

UPDATED · 2026-05-19

01 Re-scraping deleted entities: 88% of breaches · Crawler lacks blocklist awareness

02 Orphaned data in backups: 72% of breaches · Immutable storage prevents targeted deletes

03 Downstream consumer sync: 65% of breaches · Clients fail to process tombstone events

04 Identity resolution failure: 41% of breaches · Unable to find all records for a subject

05 Log file leakage: 29% of breaches · PII inadvertently stored in scraper debug logs
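Log file leakage is often the cheapest of these to mitigate. As a minimal sketch, a `logging.Filter` from Python's standard library can mask email addresses before a record reaches any handler. The regex, the `[REDACTED_EMAIL]` placeholder, and the class name `RedactPII` are illustrative assumptions; names, phone numbers, and profile URLs would need their own patterns.

```python
import logging
import re

# Deliberately simple email pattern; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class RedactPII(logging.Filter):
    """Masks email addresses in the fully rendered log message."""

    def filter(self, record):
        # Render msg % args first so PII passed as an argument is caught too,
        # then freeze the redacted string as the final message.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, only its content changes
```

Attaching the filter to a handler (`handler.addFilter(RedactPII())`) rather than to a single logger means it also covers records propagated from child loggers.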

// 06 — architecture

Append-only lakes,
meet mutable compliance.

DataFlirt handles erasure through a tombstone-and-compact architecture. When a deletion request is verified, we don't run expensive DELETE statements across petabytes of Parquet files. We emit a tombstone event that immediately filters the record from all read paths and downstream feeds. Background compaction jobs then physically scrub the data from disk during off-peak hours, while a one-way cryptographic hash of the entity's identifier is added to the crawler's edge blocklist to guarantee it is never re-ingested.
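The tombstone-and-compact pattern can be illustrated with a toy store. This is a sketch under stated assumptions, not DataFlirt's code: lists of dicts stand in for Parquet files, and `TombstoneStore` with its methods is a hypothetical name.

```python
class TombstoneStore:
    """Toy append-only store: immutable segments plus a set of tombstoned ids."""

    def __init__(self, segments):
        # Each segment is a stand-in for one immutable file on disk.
        self.segments = [list(seg) for seg in segments]
        self.tombstones = set()

    def erase(self, record_id):
        """Logical delete: cheap, and effective on the very next read."""
        self.tombstones.add(record_id)

    def read(self):
        """Read path filters tombstoned ids before returning anything."""
        for seg in self.segments:
            for rec in seg:
                if rec["id"] not in self.tombstones:
                    yield rec

    def compact(self):
        """Physical delete: rewrite segments without tombstoned records.

        Returns the number of records removed from storage.
        """
        before = sum(len(seg) for seg in self.segments)
        rewritten = [
            [rec for rec in seg if rec["id"] not in self.tombstones]
            for seg in self.segments
        ]
        self.segments = [seg for seg in rewritten if seg]  # drop empty segments
        # No copy remains, so the raw ids held as tombstones can go as well.
        self.tombstones.clear()
        return before - sum(len(seg) for seg in self.segments)
```

Between `erase` and `compact` the record is still on disk but invisible to readers. In this sketch, re-ingestion after the tombstones are cleared is expected to be prevented by the crawler blocklist, not by the store.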

Erasure Job Status

Live state of a GDPR deletion request across the pipeline.

request.id erq-2026-091

read_path.status filtered

physical_storage pending compaction

crawler.blocklist hash appended

downstream.sync tombstone emitted

sla.deadline 28 days remaining

compliance.state satisfied


// 07 — FAQ

Common
questions.

About legal obligations, re-ingestion prevention, backup handling, and how DataFlirt manages compliant data deletion at scale.


Does the right to erasure apply to publicly available data?

Yes. Under GDPR, personal data is protected regardless of whether it was scraped from a public directory, a social media profile, or submitted privately. Public availability does not negate privacy rights or the obligation to delete the data upon request.

How do you stop the scraper from just fetching the deleted person again?

We use cryptographic blocklists. We hash the identifying URL or primary key and store only the hash. Before the crawler writes a new record to the lake, it checks the hash against the blocklist. If it matches, the record is dropped in memory. This prevents re-ingestion without storing the actual PII.

Do we have to delete data from our backups immediately?

Generally, no. Most Data Protection Authorities (DPAs) accept that purging immutable backups instantly is technically infeasible. The standard practice is to put the data "beyond use" in active systems and ensure it is deleted when the backup is naturally overwritten or compacted, provided it isn't restored in the meantime.

How does DataFlirt notify clients when a record is deleted?

We use Change Data Capture (CDC) streams. Deletions are emitted as tombstone events (e.g., op: "d") in the delivery payload. Clients must configure their ingestion pipelines to process these deletes and mirror the erasure in their own downstream systems to remain compliant.

What if the data subject requests erasure, but we need the data for legal reasons?

GDPR Article 17 includes exemptions, such as exercising the right of freedom of expression and information, complying with a legal obligation, or establishing, exercising, or defending legal claims. However, commercial data brokers and standard scraping operations rarely qualify for these exemptions and must comply with the request.

Can we just anonymize the data instead of deleting it?

Yes. If data is irreversibly anonymized such that the individual can no longer be identified (even via mosaic effects or cross-referencing), it falls outside the scope of GDPR and CCPA. True anonymization satisfies the erasure requirement while allowing you to keep aggregate statistical value.


hello@dataflirt.com · Bengaluru · IST · typical reply < 4h