
What is Data Retention Policy?

A data retention policy is the formal protocol dictating how long scraped or ingested data is stored, where it resides during its lifecycle, and exactly when it is permanently destroyed. In scraping pipelines, retention isn't just about managing AWS S3 costs; it is a critical compliance boundary. Holding PII or copyrighted material longer than legally justified transforms a benign data lake into a massive liability.

Lifecycle Management · Compliance · Storage Costs · GDPR · Data Deletion
// 02 — definitions

Keep it,
then kill it.

The mechanics of moving data from hot storage to cold archives, and eventually to the incinerator, without breaking downstream analytics.


TL;DR

A data retention policy automates the lifecycle of ingested records. It defines the transition from high-cost, low-latency storage to cheaper archival storage, and enforces hard deletion deadlines to comply with privacy frameworks like GDPR and CCPA. Without it, storage costs compound endlessly and legal risk scales linearly with database size.

01 Definition & structure
A data retention policy is a documented, automated set of rules that governs the lifecycle of data within a system. It specifies how long data is kept in active databases, when it is moved to cheaper archival storage, and the exact date it must be permanently destroyed. In data engineering, these policies are implemented as code, using cloud provider lifecycle rules, database partition dropping, or scheduled Airflow DAGs.
02 The lifecycle stages
A standard scraping pipeline retention lifecycle has four phases:
  • Hot — Raw HTML and fresh structured records. Kept on fast SSDs or S3 Standard for 7–14 days for immediate querying and extraction debugging.
  • Warm — Structured data moved to cheaper tiers (S3 Infrequent Access) for 30–90 days. Accessible, but with retrieval latency.
  • Cold — Deep archive (Glacier). Kept for 1–7 years for legal compliance or historical ML baselines. Retrieval takes hours.
  • Dead — Hard deletion. The data is cryptographically wiped from all systems and backups.
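The four stages above can be sketched as a simple age-based classifier. This is a minimal illustration, not DataFlirt's implementation: the thresholds are assumptions taken from the upper end of each range quoted above, and `lifecycle_stage` is a hypothetical helper name.

```python
from datetime import date

# Hypothetical thresholds in days, taken from the upper ends of the ranges above.
HOT_MAX_DAYS = 14        # hot: up to 14 days
WARM_MAX_DAYS = 90       # warm: up to 90 days
COLD_MAX_DAYS = 7 * 365  # cold: up to roughly 7 years


def lifecycle_stage(created: date, today: date) -> str:
    """Return the stage a record should be in, based only on its age."""
    age_days = (today - created).days
    if age_days < 0:
        raise ValueError("created date is in the future")
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    if age_days <= COLD_MAX_DAYS:
        return "cold"
    return "dead"  # past every retention window: hard-delete


if __name__ == "__main__":
    today = date(2026, 5, 19)
    for created in (date(2026, 5, 10), date(2026, 3, 1), date(2024, 1, 1), date(2015, 1, 1)):
        print(created, lifecycle_stage(created, today))
```

In practice the thresholds would come from the contract or the applicable regulation rather than from constants in code.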
03 Legal and compliance triggers
Retention policies are not just cost-saving measures. Privacy laws such as GDPR and CCPA restrict how long personal data may be held: GDPR's storage limitation principle requires that it be kept no longer than necessary for its stated purpose. You must define a specific business purpose for the data and delete it when that purpose is fulfilled. If a user exercises their Right to Erasure, the scheduled retention timeline no longer applies to their records: they must be purged without undue delay across all hot, warm, and cold storage tiers.
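An erasure request that cuts across every tier might look like the sketch below. The tiers are modelled as in-memory dictionaries purely for illustration; `erase_subject` and the `subject_id` field are hypothetical names, and a real pipeline would issue deletes against object storage, databases, and backups instead.

```python
from typing import Dict, List

# Hypothetical stand-in for a storage tier: record_id -> record.
Tier = Dict[str, dict]


def erase_subject(tiers: Dict[str, Tier], subject_id: str) -> List[str]:
    """Hard-delete every record for subject_id from every tier.

    Returns "tier/record_id" for each removed record so the purge can be audited.
    """
    removed = []
    for tier_name, store in tiers.items():
        # Collect keys first so the dict is not mutated while iterating.
        doomed = [rid for rid, rec in store.items() if rec.get("subject_id") == subject_id]
        for rid in doomed:
            del store[rid]
            removed.append(f"{tier_name}/{rid}")
    return removed


if __name__ == "__main__":
    tiers = {
        "hot": {"r1": {"subject_id": "u-7"}, "r2": {"subject_id": "u-9"}},
        "warm": {"r3": {"subject_id": "u-7"}},
        "cold": {"r4": {"subject_id": "u-9"}},
    }
    print(erase_subject(tiers, "u-7"))
```

The point of the sketch is that the erasure path ignores record age entirely: it does not wait for the scheduled sweep.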
04 How DataFlirt handles it
We operate as a transient pipeline, not a permanent data warehouse. Our default retention policy holds raw fetched payloads for 14 days to allow for selector repair and backfilling. Structured output is held on our delivery edge for 30 days. After 30 days, all client-specific data is hard-deleted from our infrastructure. We enforce this via immutable AWS S3 lifecycle rules, ensuring zero risk of stale data exposure.
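As a rough illustration of the 14-day and 30-day windows described above, the following builds a lifecycle configuration in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The prefixes, rule IDs, and the `build_lifecycle_config` helper are assumptions for this sketch; the actual DataFlirt rules are not published here.

```python
import json


def build_lifecycle_config(raw_prefix: str, structured_prefix: str) -> dict:
    """Build an S3 lifecycle configuration with two expiration rules."""
    return {
        "Rules": [
            {
                "ID": "expire-raw-payloads-14d",  # hypothetical rule name
                "Filter": {"Prefix": raw_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": 14},
            },
            {
                "ID": "expire-structured-output-30d",  # hypothetical rule name
                "Filter": {"Prefix": structured_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    }


if __name__ == "__main__":
    config = build_lifecycle_config("raw_html/", "structured/")
    print(json.dumps(config, indent=2))
    # Applying it requires boto3 and AWS credentials, for example:
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket",  # hypothetical bucket name
    #       LifecycleConfiguration=config,
    #   )
```

On a versioned bucket, an `Expiration` rule only adds a delete marker to the current version; a `NoncurrentVersionExpiration` rule is also needed before the older versions are actually removed.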
05 The "Soft Delete" trap
Many teams implement retention by adding an is_deleted = true flag to their database rows. This is a soft delete. It hides the data from the application layer but leaves it physically on the disk. During a compliance audit or a data breach, soft-deleted data is fully exposed. A true retention policy requires hard deletion: executing a DELETE statement or dropping the storage partition entirely.
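The difference is easy to demonstrate with an in-memory SQLite table. This is a toy sketch: the `records` table, its columns, and the `soft_vs_hard_delete` helper are made up for the example.

```python
import sqlite3


def soft_vs_hard_delete():
    """Return (app_view, rows_after_soft_delete, rows_after_hard_delete)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, email TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO records (id, email) VALUES (?, ?)",
        [(1, "a@example.com"), (2, "b@example.com")],
    )

    # Soft delete: the row is only flagged, so the application stops seeing it.
    conn.execute("UPDATE records SET is_deleted = 1 WHERE id = 1")
    app_view = conn.execute(
        "SELECT id FROM records WHERE is_deleted = 0 ORDER BY id"
    ).fetchall()
    # Anyone querying the table directly still gets the flagged row and its PII.
    after_soft = conn.execute("SELECT id, email FROM records ORDER BY id").fetchall()

    # Hard delete: the row is removed from the table.
    conn.execute("DELETE FROM records WHERE id = 1")
    after_hard = conn.execute("SELECT id, email FROM records ORDER BY id").fetchall()
    conn.close()
    return app_view, after_soft, after_hard


if __name__ == "__main__":
    for label, rows in zip(("app view", "after soft", "after hard"), soft_vs_hard_delete()):
        print(label, rows)
```

Even a `DELETE` may leave bytes in freed database pages until the engine reclaims them (SQLite, for instance, offers `VACUUM` and the `secure_delete` pragma), so the exact hard-deletion procedure depends on the storage engine in use.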
// 03 — the math

Why infinite storage
is a trap.

Unbounded retention means costs keep growing without bound even if your ingestion rate is flat. DataFlirt models retention policies to balance debuggability against storage bloat and compliance exposure.

Storage cost compound: $C = V_{\text{in}} \cdot T \cdot R_{\text{storage}}$
Without deletion, costs grow linearly with time (T) even if volume (V) is constant.
Cloud economics baseline

Liability exposure: $L = \text{Records}_{\text{PII}} \times \text{Days}_{\text{overdue}} \times F_{\text{penalty}}$
Regulatory fines scale with the volume and duration of improperly retained data.
GDPR penalty modeling

DataFlirt archival ratio: $A = \dfrac{\text{Cold\_Storage}}{\text{Hot\_Storage} + \text{Cold\_Storage}}$
Target > 0.85 for pipeline data older than 30 days.
Internal infrastructure SLO
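The three formulas can be worked through with a few lines of arithmetic. All input figures below (ingest volume, storage rate, penalty factor, tier sizes) are invented for illustration, and the function names are hypothetical; only the formulas themselves come from the text above.

```python
def storage_cost(v_in: float, t: float, r_storage: float) -> float:
    """C = V_in * T * R_storage: cost for one period after T periods of ingest."""
    return v_in * t * r_storage


def liability_exposure(records_pii: int, days_overdue: int, f_penalty: float) -> float:
    """L = Records_PII * Days_overdue * F_penalty."""
    return records_pii * days_overdue * f_penalty


def archival_ratio(cold_storage: float, hot_storage: float) -> float:
    """A = Cold / (Hot + Cold)."""
    total = hot_storage + cold_storage
    if total == 0:
        raise ValueError("no data stored")
    return cold_storage / total


if __name__ == "__main__":
    # Hypothetical inputs: 2 TB ingested per month at 23 currency units per TB-month.
    print(storage_cost(2.0, 12, 23.0))   # monthly bill after 12 months of flat ingest
    print(storage_cost(2.0, 24, 23.0))   # doubles after 24 months with no deletion
    # Hypothetical penalty factor per record-day.
    print(liability_exposure(1000, 30, 0.01))
    # 9 TB cold vs 1 TB hot meets the > 0.85 target.
    print(archival_ratio(9.0, 1.0))
```

With a flat ingest rate, the second call returns exactly twice the first, which is the linear growth in T that the storage formula describes.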
// 04 — lifecycle execution

Automated purge
at the bucket level.

A nightly Airflow DAG triggering an S3 lifecycle transition and compliance purge on a raw HTML payload bucket.

S3 Lifecycle · Airflow · Hard Deletion
edge.dataflirt.io — live
CAPTURED
// init retention sweep
job.id: "retention_policy_enforcer_v3"
target.prefix: "s3://df-client-042/raw_html/"

// phase 1: hot to cold transition
rule.match: "age > 30d"
action: "transition_to_glacier_deep_archive"
records.moved: 14,204,811
storage.freed: 4.2 TB // hot storage reduced

// phase 2: compliance hard deletion
rule.match: "age > 90d AND contains_pii == true"
action: "permanent_delete"
records.deleted: 2,104,550
status: ok

// verification
audit.log: "written to s3://df-audit/retention/2026-05-19/"
compliance.status: verified
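The two rules in the sweep above ("age > 30d" and "age > 90d AND contains_pii == true") can be expressed as a small decision function. This sketch only plans the action per object; the `StoredObject` type and `plan_action` name are hypothetical, and the deletion rule is checked first on the assumption that it should win when both rules match.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredObject:
    key: str
    last_modified: datetime
    contains_pii: bool


def plan_action(obj: StoredObject, now: datetime) -> str:
    """Return the action the sweep would take for one object."""
    age = now - obj.last_modified
    # Phase 2 rule takes priority: old PII is deleted, not archived.
    if age > timedelta(days=90) and obj.contains_pii:
        return "permanent_delete"
    # Phase 1 rule: anything older than 30 days leaves hot storage.
    if age > timedelta(days=30):
        return "transition_to_glacier_deep_archive"
    return "keep"


if __name__ == "__main__":
    now = datetime(2026, 5, 19, tzinfo=timezone.utc)
    samples = [
        StoredObject("raw_html/a.html", now - timedelta(days=5), True),
        StoredObject("raw_html/b.html", now - timedelta(days=45), True),
        StoredObject("raw_html/c.html", now - timedelta(days=120), True),
        StoredObject("raw_html/d.html", now - timedelta(days=120), False),
    ]
    for obj in samples:
        print(obj.key, plan_action(obj, now))
```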
// 05 — policy drivers

What dictates
retention limits.

The factors that force data engineering teams to implement strict lifecycle rules, ranked by their impact on pipeline architecture.

AVG RAW RETENTION · 14 days
AVG COLD RETENTION · 365 days
UPDATED · 2026-05-19
01 Regulatory mandates
GDPR / CCPA · Hard legal limits on holding personal data

02 Storage cost optimization
S3 Standard · Moving terabytes of raw HTML to Glacier

03 Contractual embargoes
Client SLAs · Agreements dictating when data must be destroyed

04 ML training requirements
Historical · Need for longitudinal baselines delays deletion

05 Schema evolution
Format rot · Old data becomes unparseable and useless
// 06 — infrastructure enforcement

Automate the purge,

because humans will forget to delete it.

DataFlirt enforces retention policies at the infrastructure layer, not the application layer. We use cloud-native lifecycle rules bound to specific S3 prefixes and database table partitions. When a client's contract specifies a 14-day retention limit for raw HTML payloads, the bucket itself enforces the deletion. No engineer has access to override a compliance purge. This guarantees that stale, potentially sensitive data is cryptographically shredded exactly when the policy dictates.
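For the database side, partition-level enforcement can be sketched as picking the daily partitions that have aged out and emitting drop statements for them. The one-table-per-day naming scheme (`raw_html_YYYY_MM_DD`) and both helper names are assumptions made for this example, not a description of DataFlirt's schema.

```python
from datetime import date, timedelta
from typing import Iterable, List


def partitions_to_drop(partition_days: Iterable[date], today: date, retention_days: int) -> List[date]:
    """Return the partition dates whose data is older than retention_days."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)


def drop_statements(table: str, days: Iterable[date]) -> List[str]:
    """Render one DROP statement per partition (hypothetical naming scheme)."""
    return [f"DROP TABLE IF EXISTS {table}_{d:%Y_%m_%d};" for d in days]


if __name__ == "__main__":
    existing = [date(2026, 5, 1), date(2026, 5, 5), date(2026, 5, 10)]
    expired = partitions_to_drop(existing, today=date(2026, 5, 19), retention_days=14)
    for stmt in drop_statements("raw_html", expired):
        print(stmt)
```

Dropping a whole partition avoids row-by-row deletes, which is why the text above treats it as a form of hard deletion.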

Retention Job Status

Live output from a compliance purge on a European data pipeline.

policy.id ret-gdpr-eu-14d
target.prefix s3://df-raw-eu-central/
records.scanned 45.2M
action.glacier 12.1M records
action.delete 3.4M records
audit.trail cryptographically signed
job.status completed


// 07 — FAQ

Common
questions.

Common questions about data lifecycles, compliance boundaries, and how DataFlirt manages storage at scale.

What is the difference between archiving and retention?
Archiving is the act of moving data from hot, expensive storage to cold, cheap storage for long-term keeping. A retention policy dictates the entire timeline, including archiving, but crucially defines the final end-of-life: when the data must be permanently deleted.
How does GDPR affect scraping retention?
The GDPR's storage limitation principle mandates that personal data be kept no longer than is necessary for the purposes for which it is processed. You cannot hoard scraped PII "just in case" you need it later. You must define a justifiable timeframe and enforce it.
Can we just soft-delete records to comply with retention policies?
No. Soft deletes flag a record as inactive so it doesn't appear in queries, but the data remains on disk. For regulatory compliance and true cost savings, hard deletion (physical removal from the storage medium) is required.
How does DataFlirt handle client data retention?
We hold raw fetched payloads for 14 days to allow for extraction debugging and schema backfilling. Structured output data is held for 30 days on our delivery edge. After that, it is purged. Clients are responsible for their own long-term data warehousing.
What if we need historical scraped data for machine learning?
Anonymise it. Once data is stripped of PII and identifiers, it generally falls outside privacy retention mandates and can be kept indefinitely for training baselines. Separate your raw PII storage from your anonymised ML feature store, and apply aggressive retention to the former.
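One way to keep the two stores separate is to derive feature rows by dropping identifying fields before anything reaches the ML store. The field list and `to_feature_row` helper below are hypothetical; which fields count as identifying depends on the dataset.

```python
# Hypothetical list of direct identifiers; adjust to the actual schema.
PII_FIELDS = frozenset({"name", "email", "phone", "ip_address", "user_id"})


def to_feature_row(record: dict) -> dict:
    """Return a copy of record without direct identifiers, for the ML feature store."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}


if __name__ == "__main__":
    raw = {
        "user_id": "u-7",
        "email": "a@example.com",
        "price": 19.99,
        "category": "books",
    }
    print(to_feature_row(raw))
```

Dropping direct identifiers is not always enough on its own: combinations of remaining fields can still single out a person, and hashing an identifier is pseudonymisation rather than anonymisation, so the result should be reviewed before it is treated as out of scope.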
How do you prove that data was actually deleted?
Through automated audit logs generated by the cloud provider. When an S3 lifecycle rule expires an object, AWS can record the event in S3 server access logs or publish it as an S3 event notification, provided those features are enabled on the bucket. We aggregate these logs and provide cryptographic proof of deletion for enterprise clients undergoing compliance audits.
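The text does not say what form the cryptographic proof takes. One plausible shape, shown here purely as an assumption, is an HMAC-SHA256 signature over each deletion record so that an auditor holding the key can detect tampering. The record fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_deletion_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding of record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_deletion_record(record: dict, signature: str, key: bytes) -> bool:
    """Check a signature in constant time."""
    expected = sign_deletion_record(record, key)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # hypothetical; use a managed secret in practice
    record = {
        "object": "raw_html/page-001.html",  # hypothetical object key
        "action": "permanent_delete",
        "deleted_at": "2026-05-19T00:00:00Z",
    }
    sig = sign_deletion_record(record, key)
    print(verify_deletion_record(record, sig, key))
    tampered = dict(record, action="keep")
    print(verify_deletion_record(tampered, sig, key))
```

A signature like this proves the log entry has not been altered since signing; it does not by itself prove the underlying bytes are gone, which is why it is paired with the provider's own deletion logs.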