
What is Data Archiving?

Data archiving is the automated process of moving historical, infrequently accessed scraped data from expensive hot storage (like PostgreSQL or Elasticsearch) to cost-effective cold storage (like S3 Glacier). It is not a backup — it is a lifecycle transition for data that must be retained for compliance, longitudinal analysis, or model training, but no longer requires millisecond query latency. Without a strict archiving policy, a high-volume scraping pipeline will eventually crush its own database under the weight of its historical success.

Cold Storage · Data Lifecycle · S3 Glacier · Cost Optimization · FinOps

// 02 — definitions

Cold storage,
warm access.

The mechanics of moving terabytes of historical scraped data out of the query path without losing the ability to retrieve it when the data science team asks for a five-year backtest.


TL;DR

Data archiving shifts inactive records from high-performance databases to low-cost object storage. It reduces active database size, speeds up daily queries, and cuts infrastructure costs by up to 80%, while preserving the raw historical data needed for machine learning and compliance audits.

01 Definition & structure

Data archiving is the systematic relocation of data that is no longer actively used to a separate storage system for long-term retention. In a scraping context, this usually means moving raw HTML, JSON payloads, or parsed records older than 30–90 days out of the primary transactional database and into object storage.

02 How it works in practice

A scheduled job (often orchestrated by Airflow or cron) queries the active database for records past the retention threshold. It extracts these records, transforms them into a compressed format (like Parquet), uploads them to a cold storage tier (like S3 Glacier), verifies the checksum, and finally deletes the original records from the active database to reclaim disk space.
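The sequence above (select, export, upload, verify, delete) can be sketched with the standard library alone. This is a minimal illustration, not DataFlirt's implementation: an in-memory SQLite table stands in for the active database, a local directory stands in for cold storage, and gzip-compressed JSON Lines stands in for Parquet. The `records` table, its columns, and the file names are all hypothetical.

```python
import gzip
import hashlib
import json
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_old_records(conn: sqlite3.Connection, cold_dir: Path, cutoff: str) -> int:
    """Export rows scraped before `cutoff`, verify the copy, then delete them.

    `cutoff` is an ISO date string such as "2026-02-18". Returns the number
    of rows archived.
    """
    rows = conn.execute(
        "SELECT id, scraped_at, payload FROM records "
        "WHERE scraped_at < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    if not rows:
        return 0

    # 1. Extract and compress into a local staging file.
    staged = cold_dir / "staging.jsonl.gz"
    with gzip.open(staged, "wt", encoding="utf-8") as fh:
        for row_id, scraped_at, payload in rows:
            record = {"id": row_id, "scraped_at": scraped_at, "payload": payload}
            fh.write(json.dumps(record) + "\n")
    expected = sha256_of(staged)

    # 2. "Upload": a local copy standing in for the object store.
    archived = cold_dir / f"records-before-{cutoff}.jsonl.gz"
    archived.write_bytes(staged.read_bytes())

    # 3. Verify the checksum before touching the source rows.
    if sha256_of(archived) != expected:
        raise RuntimeError("checksum mismatch; source rows left in place")
    staged.unlink()

    # 4. Delete only after verification succeeded; `with conn` commits.
    with conn:
        conn.execute("DELETE FROM records WHERE scraped_at < ?", (cutoff,))
    return len(rows)
```

The ordering is the point of the sketch: the delete runs only after the archived copy's checksum matches, so a failed upload never costs you data.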

03 The FinOps angle

Archiving is fundamentally a financial operation. Hot database storage (SSD-backed, highly available) is expensive. Cold object storage is incredibly cheap but carries retrieval latency and egress fees. The goal is to match the storage cost to the data's business value over time, ensuring you aren't paying premium database rates to store three-year-old product prices that are queried once a quarter.

04 How DataFlirt handles it

We use time-based table partitioning in our active databases. Instead of running expensive DELETE queries to remove old data, we simply detach and drop the oldest partition once its contents have been successfully written to our S3 archive tier. This zero-downtime approach keeps our ingestion pipelines running at maximum throughput without locking tables.
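As a rough illustration of the detach-and-drop step, the helper below builds the two PostgreSQL statements for one monthly partition. The partition naming scheme (`<parent>_yYYYYmMM`) and the function names are assumptions made for this sketch; the source does not say how partitions are named.

```python
from datetime import date


def partition_name(parent: str, month: date) -> str:
    """Hypothetical naming scheme: <parent>_yYYYYmMM."""
    return f"{parent}_y{month.year:04d}m{month.month:02d}"


def drop_partition_sql(parent: str, month: date, concurrently: bool = False) -> list[str]:
    """Return the statements that detach and then drop one partition.

    Identifiers are interpolated as-is, so they must come from trusted
    configuration, never from user input.
    """
    child = partition_name(parent, month)
    detach = f"ALTER TABLE {parent} DETACH PARTITION {child}"
    if concurrently:
        # PostgreSQL 14+ only; cannot run inside a transaction block.
        detach += " CONCURRENTLY"
    return [detach + ";", f"DROP TABLE {child};"]
```

Run these only after the partition's contents have been confirmed in the archive. `CONCURRENTLY` reduces lock contention on the parent table at the cost of a slower detach.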

05 Did you know?

Many teams mistakenly archive data in raw JSON or CSV formats. Because cold storage is often queried using serverless engines like Athena that charge by the amount of data scanned, querying a 5-year JSON archive can cost hundreds of dollars per query. Parquet reduces this cost by up to 99% by allowing the engine to scan only the specific columns requested.
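The arithmetic behind that claim is easy to check. The sketch below assumes a scan price of $5 per TB (a commonly quoted Athena list price; check current pricing for your region), decimal terabytes, and a hypothetical 40 TB archive. The 1% figure models a columnar query that reads only a small slice of the bytes.

```python
def scan_cost_usd(total_bytes: float, fraction_read: float = 1.0,
                  usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query billed by bytes scanned.

    usd_per_tb is an assumed price, not a quoted one; 1 TB = 10**12 bytes here.
    """
    if not 0.0 <= fraction_read <= 1.0:
        raise ValueError("fraction_read must be between 0 and 1")
    return total_bytes * fraction_read / 1e12 * usd_per_tb


archive_bytes = 40e12  # hypothetical 40 TB raw JSON archive

full_scan = scan_cost_usd(archive_bytes)                          # row format: read everything
column_scan = scan_cost_usd(archive_bytes, fraction_read=0.01)    # columnar: read ~1%

print(f"full scan: ${full_scan:.2f}, column scan: ${column_scan:.2f}")
```

In practice Parquet files are also smaller than the equivalent JSON, so the real gap is usually wider than the column-pruning fraction alone.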

// 03 — the economics

When does archiving
pay off?

The decision to archive is a function of storage cost deltas, query frequency, and retrieval latency requirements. DataFlirt's lifecycle manager uses these metrics to automatically tier client datasets.

Storage Cost Delta = C_hot − C_cold

Typically ~$0.10/GB per month (hot DB) vs ~$0.004/GB per month (cold object storage). Source: AWS Pricing 2026
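A quick worked example of the delta, using the two per-GB rates quoted above and treating them as monthly storage prices. The function name and the 10,000 GB dataset size are illustrative only.

```python
def monthly_storage_delta_usd(gb: float, hot_usd_per_gb: float = 0.10,
                              cold_usd_per_gb: float = 0.004) -> float:
    """Monthly saving from holding `gb` in cold instead of hot storage."""
    return gb * (hot_usd_per_gb - cold_usd_per_gb)


dataset_gb = 10_000  # roughly 10 TB
saving = monthly_storage_delta_usd(dataset_gb)
print(f"estimated monthly saving: ${saving:,.2f}")
```

This covers storage only. Compute, replicas, and backups on the hot side, and retrieval fees on the cold side, move the real number in both directions.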

Archive Threshold = Days > 90 AND QueryFreq < 0.01/day

Standard rule for moving product pricing data to cold storage. Source: DataFlirt Lifecycle Policy
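The threshold rule translates directly into a predicate. The `DatasetStats` type and its field names are hypothetical; only the two thresholds (90 days, 0.01 queries per day) come from the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    age_days: int           # days since the data was scraped
    queries_per_day: float  # average query frequency over a recent window


def should_archive(stats: DatasetStats, min_age_days: int = 90,
                   max_queries_per_day: float = 0.01) -> bool:
    """Apply: Days > 90 AND QueryFreq < 0.01/day."""
    return (stats.age_days > min_age_days
            and stats.queries_per_day < max_queries_per_day)
```

A frequency of 0.01 per day is about one query every 100 days, so old data that still gets a weekly query stays hot under this rule.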

Retrieval Penalty = T_restore + (GB × Cost_egress)

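The time (T_restore) and cost (GB × Cost_egress) to rehydrate archived data for a backtest. The two terms carry different units, so track them as separate components rather than as a literal sum. Source: FinOps Model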
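One way to model the retrieval penalty is as a pair, keeping restore time and egress cost apart. The type, the function name, and the example rates below are all assumptions for illustration; actual restore times and egress prices depend on the storage class and provider.

```python
from typing import NamedTuple


class RetrievalPenalty(NamedTuple):
    restore_hours: float  # T_restore
    egress_usd: float     # GB x Cost_egress


def retrieval_penalty(gb: float, restore_hours: float,
                      egress_usd_per_gb: float) -> RetrievalPenalty:
    """Return both components of the penalty for rehydrating `gb` of archive."""
    if gb < 0 or restore_hours < 0 or egress_usd_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return RetrievalPenalty(restore_hours, gb * egress_usd_per_gb)


# Hypothetical backtest: 200 GB, 4-hour restore, $0.09/GB egress.
penalty = retrieval_penalty(gb=200, restore_hours=4.0, egress_usd_per_gb=0.09)
print(f"wait {penalty.restore_hours} h, pay ${penalty.egress_usd:.2f}")
```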

// 04 — lifecycle transition

Moving 400M records
to Glacier.

A scheduled cron job executes the data lifecycle policy: it identifies records older than 90 days, compresses them into Parquet, writes them to S3 Glacier, and then deletes them from the active PostgreSQL cluster.

cron · pg_dump · parquet · s3-glacier


// init archive job: job-arch-20260519
target.table: "raw_product_listings"
filter.condition: "scraped_at < NOW() - INTERVAL '90 days'"
records.matched: 412,840,119

// extract and compress
export.format: "apache_parquet"
compression: "snappy"
export.status: complete "184.2 GB written locally"

// upload to cold storage
aws.s3.bucket: "df-archive-tier-04"
aws.s3.class: "GLACIER_IR"
upload.progress: 100% "checksum verified"

// active database cleanup
db.delete: "DELETE FROM raw_product_listings WHERE..."
db.vacuum: running // reclaiming disk space
job.status: SUCCESS "412M records archived"
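The upload step in the log could look roughly like the helper below. It assumes `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`) and relies on boto3's `upload_file` accepting a `StorageClass` entry in `ExtraArgs`. The function name and the example key are hypothetical; the bucket and storage class are taken from the log.

```python
from pathlib import Path


def upload_to_cold_tier(s3_client, local_path: Path, bucket: str, key: str) -> None:
    """Upload one file using the Glacier Instant Retrieval storage class.

    `s3_client` is assumed to be a boto3 S3 client; it is passed in so the
    function can be exercised with a stub in tests.
    """
    s3_client.upload_file(
        Filename=str(local_path),
        Bucket=bucket,
        Key=key,
        ExtraArgs={"StorageClass": "GLACIER_IR"},
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   upload_to_cold_tier(boto3.client("s3"), Path("export.parquet"),
#                       "df-archive-tier-04", "raw_product_listings/export.parquet")
```

Setting the storage class at upload time avoids a second lifecycle transition later. Checksum verification, as in the log, would follow as a separate step.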

// 05 — archiving triggers

Why pipelines
move data to cold.

The primary drivers for transitioning scraped data out of active databases, ranked by frequency across DataFlirt's managed enterprise pipelines.

PIPELINES: 300+ managed
AVG RETENTION: 90 days hot
UPDATED: 2026-05-19

01 Storage cost optimization
FinOps · Hot DB storage is roughly 25x more expensive than cold object storage (~$0.10 vs ~$0.004 per GB)

02 Query performance drop
Latency · Massive tables slow down daily aggregations

03 Compliance mandates
Legal · Must keep 5 years of data, rarely queried

04 Machine learning backtests
Data Science · Periodic bulk loads for model training

05 Schema deprecation
Engineering · Old schema versions retired from active use

// 06 — our architecture

Never delete,
just change the access tier.

DataFlirt's managed pipelines treat data archiving as a continuous, automated lifecycle rather than a manual cleanup task. We partition active tables by ingestion date, which lets us drop old partitions instantly instead of running expensive DELETE queries. The historical data is transformed into columnar Parquet files, partitioned by year and month, and shipped to S3 Glacier. When a client needs to run a longitudinal pricing study across three years of data, they don't query the active database; they use Athena or DuckDB to query the archive directly. Direct queries assume a storage class with immediate reads, such as Glacier Instant Retrieval; objects in the slower Glacier classes must be restored first. The data is always there, but you only pay for performance when you actually need it.
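A partitioned archive usually encodes the partition values in the object key so that query engines can skip whole prefixes. The helper below is a sketch that assumes Hive-style `key=value` path segments and a year/month/target_domain layout; the prefix, file naming, and function name are hypothetical.

```python
from datetime import date


def archive_key(prefix: str, scraped_on: date, target_domain: str, part: int) -> str:
    """Build an object key laid out as year=/month=/target_domain=/part file."""
    return (
        f"{prefix.rstrip('/')}"
        f"/year={scraped_on.year:04d}"
        f"/month={scraped_on.month:02d}"
        f"/target_domain={target_domain}"
        f"/part-{part:05d}.parquet"
    )


print(archive_key("price-history", date(2026, 2, 17), "example.com", 0))
```

With this layout, a query filtered to one year and one domain only touches the objects under the matching prefixes.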

Lifecycle Policy: df-price-history

Active retention policy for a high-volume retail scraping pipeline.

tier.hot: PostgreSQL · 0-90 days (active)
tier.warm: S3 Standard · 91-365 days
tier.cold: S3 Glacier · 1-5 years (archived)
format.cold: Parquet + Snappy
partitioning: year/month/target_domain
auto_delete: after 5 years (compliance)
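The tier boundaries in this policy reduce to a small lookup on record age. The function below is an illustrative reading of the policy, assuming a year is counted as 365 days; the function name and return labels are not from the source.

```python
def tier_for_age(age_days: int) -> str:
    """Map a record's age in days to a tier under the df-price-history policy."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 90:
        return "hot"     # PostgreSQL
    if age_days <= 365:
        return "warm"    # S3 Standard
    if age_days <= 5 * 365:
        return "cold"    # S3 Glacier, Parquet + Snappy
    return "delete"      # past the 5-year retention window
```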


// 07 — FAQ

Common
questions.

About data lifecycles, retrieval costs, compliance, and how DataFlirt manages historical scraped data.


What is the difference between a backup and an archive?

A backup is a copy of active data used for disaster recovery — if the database crashes, you restore the backup. An archive is the primary copy of inactive data that has been intentionally removed from the active database to save space and cost. Backups are for recovery; archives are for retention.

Why not just keep everything in the active database?

Cost and performance. Storing 10 TB of historical scraped data in a managed PostgreSQL instance costs thousands of dollars a month and slows down index maintenance, backups, and daily queries. Moving that same 10 TB to S3 Glacier costs about $40 a month (at ~$0.004/GB) and keeps your active database lean and fast.

How do I query archived data if it's not in the database?

You don't load it back into the database. You store the archive in a columnar format like Parquet and use a query engine like AWS Athena, Google BigQuery, or DuckDB to query the files directly in object storage. This is the foundation of a modern data lake architecture.
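As a sketch of what querying the files in place can look like, the helper below builds a DuckDB query over a Hive-partitioned Parquet archive using DuckDB's `read_parquet` with `hive_partitioning = true`. The `price` column, the bucket path, and the function name are hypothetical, and the objects are assumed to sit in a storage class that allows immediate reads.

```python
def backtest_query(archive_glob: str, start_year: int, end_year: int) -> str:
    """Build a DuckDB SQL query averaging a hypothetical `price` column by month.

    `archive_glob` must come from trusted configuration; it is interpolated
    into the SQL text as-is.
    """
    return (
        "SELECT year, month, avg(price) AS avg_price\n"
        f"FROM read_parquet('{archive_glob}', hive_partitioning = true)\n"
        f"WHERE year BETWEEN {int(start_year)} AND {int(end_year)}\n"
        "GROUP BY year, month\n"
        "ORDER BY year, month"
    )


query = backtest_query("s3://df-archive-tier-04/price-history/**/*.parquet", 2023, 2025)
print(query)

# To execute (requires the duckdb package; s3:// paths also need the httpfs
# extension and credentials configured):
#   import duckdb
#   duckdb.sql(query).show()
```

Because `year` and `month` are partition columns derived from the path, the filter lets the engine skip files outside the requested range instead of scanning the whole archive.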

What format should I use for archived scraped data?

Apache Parquet. It is columnar, highly compressible, and natively supported by almost every modern data tool. Archiving in raw JSON or CSV wastes storage space and makes future analytical queries painfully slow and expensive because query engines have to scan the entire file.

How does DataFlirt handle data retrieval from cold storage?

For clients on our managed infrastructure, we expose the archived Parquet files via secure S3 buckets. If you need to run a massive historical backtest, you can point your own compute (like Spark or Snowflake) directly at the bucket. We handle the lifecycle transitions automatically based on your contracted retention policy.

Are there legal reasons to archive scraped data?

Yes. In many jurisdictions, retaining a historical snapshot of exactly what was publicly visible on a specific date serves as an evidentiary record. If a target site claims you scraped proprietary data, having the archived raw HTML or JSON from that exact day proves what was actually exposed.
