
What is an ELT Pipeline?

An ELT pipeline (Extract, Load, Transform) is a modern data architecture where raw scraped data is loaded directly into a target data warehouse or lakehouse before any structural transformation occurs. By decoupling the extraction layer from the transformation logic, engineering teams can iterate on schemas without re-scraping the source. For web data, it's the difference between a brittle pipeline that drops records on schema drift and a resilient one that captures everything for downstream reconciliation.

Data Engineering · Data Warehouse · Schema Evolution · Raw Data Zone · dbt

// 02 — definitions

Extract, load,
then figure it out.

Why moving transformation to the end of the pipeline is the only sane way to handle the chaos of web-scraped data.


TL;DR

An ELT pipeline dumps raw, unopinionated JSON or HTML into a data lake (like S3 or GCS) or warehouse (like Snowflake or BigQuery) first. Transformation into structured, typed tables happens entirely within the warehouse using SQL or tools like dbt. This ensures you never lose data due to a parsing error during ingestion.

01 Definition & structure

An ELT pipeline reverses the traditional data integration flow. Instead of extracting data, transforming it in a middle tier, and loading the clean records, ELT extracts the data and loads it immediately into the target system. The transformation logic is then executed using the compute power of the data warehouse itself. This architecture relies on cheap cloud storage and powerful columnar databases.

02 Why it matters for web scraping

Web scraping is inherently unstable. Target websites change DOM structures, API responses, and data types without warning. If you use ETL, a changed CSS selector causes the transformation step to fail, and the record is dropped before it ever reaches your database. With ELT, the raw HTML or JSON is saved regardless. When you notice the schema drift, you fix your SQL transformation logic and re-run it over the raw data. No re-scraping required.
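The contrast can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DataFlirt's implementation: the record shape, the `parse_v1` and `parse_v2` helpers, and the in-memory `bronze` list are hypothetical stand-ins for a scraper payload, a transformation step, and a raw storage zone.

```python
import json

# Hypothetical raw payloads: the second one reflects a site change
# where "price" moved under a nested "pricing" object.
raw_payloads = [
    '{"sku": "A1", "price": "19.99"}',
    '{"sku": "B2", "pricing": {"amount": "24.50"}}',
]


def parse_v1(raw: str) -> dict:
    """Original transform: assumes a top-level "price" key."""
    doc = json.loads(raw)
    return {"sku": doc["sku"], "price": float(doc["price"])}


def parse_v2(raw: str) -> dict:
    """Fixed transform: handles both the old and the new layout."""
    doc = json.loads(raw)
    price = doc["price"] if "price" in doc else doc["pricing"]["amount"]
    return {"sku": doc["sku"], "price": float(price)}


def etl(payloads, parse):
    """ETL: transform in flight; a record that fails to parse is gone."""
    loaded = []
    for raw in payloads:
        try:
            loaded.append(parse(raw))
        except (KeyError, ValueError):
            continue  # dropped before it reaches storage
    return loaded


def elt_transform(bronze, parse):
    """ELT: transform over stored raw data; returns (rows, failed_raw)."""
    rows, failed = [], []
    for raw in bronze:
        try:
            rows.append(parse(raw))
        except (KeyError, ValueError):
            failed.append(raw)  # the raw payload is still in bronze
    return rows, failed


etl_rows = etl(raw_payloads, parse_v1)  # one record lost for good
bronze = list(raw_payloads)             # ELT: load everything first
rows_before, failed = elt_transform(bronze, parse_v1)
rows_after, still_failed = elt_transform(bronze, parse_v2)  # re-run, no re-scrape
```

In a real pipeline the second pass would be a SQL model re-run inside the warehouse rather than a Python function, but the recovery path is the same: fix the transform, then replay it over raw data you already hold.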

03 The Medallion Architecture

ELT pipelines typically follow a medallion architecture:

Bronze (Raw): The exact JSON or HTML payload fetched by the scraper.
Silver (Cleaned): Data that has been parsed, typed, and deduplicated via SQL.
Gold (Aggregated): Business-level tables ready for BI tools and ML models.

This separation of concerns means the raw data is never mutated, which gives you a complete audit trail from any business table back to the original payload.
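The three layers can be sketched as plain Python functions. In a warehouse each layer would normally be a SQL model; the sample records and the `build_silver` and `build_gold` names below are hypothetical and exist only to show how each layer is derived from the one before it without touching the raw input.

```python
import json
from collections import defaultdict

# Bronze: raw payloads exactly as fetched (hypothetical examples).
bronze = [
    '{"sku": "A1", "price": "19.99", "currency": "USD"}',
    '{"sku": "A1", "price": "19.99", "currency": "USD"}',  # duplicate fetch
    '{"sku": "B2", "price": "5.00", "currency": "USD"}',
    '{"sku": "C3", "price": "12.00", "currency": "EUR"}',
]


def build_silver(raw_rows):
    """Parse, type, and deduplicate; the bronze input is never modified."""
    seen, silver = set(), []
    for raw in raw_rows:
        doc = json.loads(raw)
        row = (doc["sku"], float(doc["price"]), doc["currency"])
        if row not in seen:
            seen.add(row)
            silver.append(row)
    return silver


def build_gold(silver_rows):
    """Aggregate to a business-level view: average price per currency."""
    prices = defaultdict(list)
    for _sku, price, currency in silver_rows:
        prices[currency].append(price)
    return {cur: sum(p) / len(p) for cur, p in prices.items()}


silver = build_silver(bronze)
gold = build_gold(silver)
```

Because silver and gold are pure functions of bronze, either can be dropped and rebuilt at any time.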

04 How DataFlirt handles it

We build our delivery infrastructure around ELT. When we extract data from a target, we stream the raw NDJSON directly to your S3 bucket or Snowflake stage. We handle the complex extraction—bypassing anti-bot systems, managing proxy rotation, and parsing the initial payload—but we leave the final business-logic transformations to your dbt models. This ensures you maintain total control over how the data is modeled.

05 The cost trap of ELT

While storage is cheap, warehouse compute is not. The biggest mistake teams make with ELT is running full-table scans on their raw bronze layer every hour to rebuild their silver tables. For high-volume scraping pipelines, this will burn through your Snowflake credits quickly. Efficient ELT requires incremental models (for example, dbt's incremental materialization with the is_incremental() macro) that only transform the newly loaded raw records.
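The idea behind an incremental model is a high-water mark: remember the newest raw record already transformed and select only rows loaded after it. The sketch below is an assumption-laden illustration in Python, not dbt code; `IncrementalModel`, the `loaded_at` column, and the sample rows are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalModel:
    """Keeps a high-water mark so each run only transforms new raw rows."""
    silver: list = field(default_factory=list)
    high_water_mark: int = 0   # largest loaded_at already transformed
    rows_transformed: int = 0  # rough stand-in for compute spent

    def run(self, bronze):
        # In a warehouse this filter is the WHERE clause that lets the
        # engine skip old partitions; here it is a plain list filter.
        new_rows = [r for r in bronze if r["loaded_at"] > self.high_water_mark]
        for r in new_rows:
            self.silver.append({"sku": r["sku"], "price": float(r["price"])})
        self.rows_transformed += len(new_rows)
        if new_rows:
            self.high_water_mark = max(r["loaded_at"] for r in new_rows)
        return len(new_rows)


bronze = [
    {"sku": "A1", "price": "19.99", "loaded_at": 1},
    {"sku": "B2", "price": "5.00", "loaded_at": 2},
]
model = IncrementalModel()
first = model.run(bronze)   # initial build transforms both rows
bronze.append({"sku": "C3", "price": "12.00", "loaded_at": 3})
second = model.run(bronze)  # later run transforms only the new row
```

One caveat with a strict greater-than filter: rows that arrive late with a `loaded_at` equal to or below the mark are skipped, so production models usually add a lookback window or key-based merge.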

// 03 — the ELT advantage

Why transform
after loading?

In traditional ETL, a transformation failure drops the record. In ELT, the raw data is already safely stored, allowing transformations to be re-run retroactively. DataFlirt's delivery model is built entirely on ELT principles.

Data Retention: R = raw_records / extracted_records

In ELT, R stays at 1.0 as long as the load step itself succeeds, because no record is lost to mid-flight parsing errors. (Source: Data Engineering SLO)

Transformation Latency: T_latency = query_execution_time

Transformation runs on warehouse compute (e.g., Snowflake) rather than in scraper worker memory. (Source: Modern Data Stack)

DataFlirt Delivery SLA: D_time = fetch_time + network_io

Because we don't transform in-flight, delivery time is just fetch time plus network I/O, so raw data reaches your S3 bucket shortly after it is fetched. (Source: Internal Architecture)
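As a worked example of the retention ratio R defined above, the snippet below compares a run where every extracted record is stored with one where three records are dropped by an in-flight parser. The record counts are illustrative inputs, not measured results, and the `retention` helper is hypothetical.

```python
def retention(raw_records: int, extracted_records: int) -> float:
    """R = raw_records / extracted_records."""
    if extracted_records == 0:
        raise ValueError("no records were extracted")
    return raw_records / extracted_records


# Illustrative figures only.
elt_r = retention(raw_records=142_000, extracted_records=142_000)
etl_r = retention(raw_records=141_997, extracted_records=142_000)
```

Tracking R per job makes silent record loss visible: any value below 1.0 means something was discarded between extraction and storage.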

// 04 — the ELT flow

Raw JSON to
Snowflake table.

A trace of an ELT job processing scraped e-commerce data. The raw payload is loaded into a bronze layer, then transformed via dbt into a clean silver table.

S3 Bronze · Snowflake · dbt run

// 1. Extract & Load (DataFlirt)
job.id: "extract-amz-092"
payload.type: "ndjson"
target.sink: "s3://df-client-raw/bronze/2026-05-19/"
status: ok // 142,000 records loaded

// 2. Warehouse Ingestion (Snowpipe)
pipe.status: "running"
table.target: "raw_ecommerce_events"
rows.inserted: 142000

// 3. Transform (dbt)
dbt.command: "dbt build --select stg_ecommerce_prices"
model.stg_ecommerce_prices: running...
model.stg_ecommerce_prices: created view
test.not_null_price: pass
test.accepted_currency: fail // 3 records failed coercion

// 4. Outcome
pipeline.state: ok // raw data preserved, 141,997 clean rows available

// 05 — failure modes

Where ELT
pipelines break.

ELT solves the 'dropped data' problem of ETL, but introduces new challenges around storage costs, warehouse compute, and raw data governance. Here is what breaks downstream.

Pipelines monitored: 300+ active · Volume: 10B+ rows/mo · Updated: 2026-05-19

01 Warehouse compute costs: Unoptimized dbt models scanning full raw tables daily.
02 Schema drift in raw JSON: Downstream SQL fails when nested keys change unexpectedly.
03 Storage bloat: Retaining petabytes of raw HTML/JSON without lifecycle policies.
04 Data swamp degradation: Lack of documentation turns the bronze layer into an unusable mess.
05 PII leakage: Raw data loaded before masking exposes sensitive fields to analysts.

// 06 — DataFlirt's delivery

We deliver the raw,

you own the transform.

DataFlirt is designed for modern ELT architectures. We don't force you into our proprietary schemas. We extract the raw JSON, HTML, or structured payload and stream it directly into your S3, GCS, or Azure blob storage. Your data engineering team then uses Snowflake, BigQuery, and dbt to model the data exactly how your business logic requires. If your pricing logic changes tomorrow, you just re-run your dbt models against the historical raw data we've already delivered.

Delivery payload

A standard DataFlirt raw delivery event.

format: NDJSON
compression: zstd
schema.validation: loose
delivery.latency: < 2s
destination: s3://client-bronze-zone
idempotency_key: req_8f92a1b
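One way a consumer might use the idempotency key is to make the load step safe against retries and duplicate notifications. The sketch below assumes the key uniquely identifies a delivery, which the listing above does not state explicitly; `handle_delivery` and the in-memory stores are hypothetical.

```python
processed_keys = set()  # in practice, a durable table in the warehouse
loaded_objects = []     # stand-in for the actual load into bronze


def handle_delivery(event: dict) -> bool:
    """Load a delivery once; return False if it was already handled."""
    key = event["idempotency_key"]
    if key in processed_keys:
        return False  # retry or duplicate notification: skip the load
    loaded_objects.append(event["destination"])
    processed_keys.add(key)
    return True


event = {
    "format": "NDJSON",
    "compression": "zstd",
    "destination": "s3://client-bronze-zone",
    "idempotency_key": "req_8f92a1b",
}
first = handle_delivery(event)
second = handle_delivery(event)  # same key delivered again
```

With this guard in place, replaying a delivery never produces duplicate rows in the bronze layer.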


// 07 — FAQ

Common
questions.

About ELT architecture, warehouse costs, schema drift, and how DataFlirt integrates with modern data stacks.


What is the difference between ETL and ELT for web scraping?

ETL transforms data in memory before writing to the database; if parsing fails and the raw payload was not kept, the record is lost. ELT writes the raw scraped data to storage first, then uses the warehouse's compute to transform it. For scraping, where source sites change constantly, ELT is the safer default because a broken parser can be fixed and re-run over data you already hold.

Doesn't storing raw HTML and JSON get expensive?

Storage is cheap; compute and re-scraping are expensive. Storing compressed NDJSON or Parquet in S3 typically costs a few cents per gigabyte per month, and less in colder storage classes. The cost of missing a pricing anomaly because an ETL parser dropped a record is significantly higher.

How do you handle PII in an ELT pipeline?

This is the main legal risk of ELT. If you load raw scraped data containing personal information directly into a warehouse, you risk GDPR/CCPA violations. We recommend a "shield" layer: DataFlirt can apply regex-based redaction at the edge before the "Load" step, so that known sensitive patterns are masked before they ever reach the raw zone.
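A regex-based redaction pass over NDJSON lines might look like the sketch below. It is an illustration only, not DataFlirt's redaction logic: the `EMAIL` and `PHONE` patterns are deliberately simple assumptions, and the `redact` helpers are hypothetical.

```python
import json
import re

# Deliberately simple patterns for illustration; real PII detection
# needs broader coverage and human review.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(value):
    """Recursively mask emails and phone-like strings in a JSON value."""
    if isinstance(value, str):
        value = EMAIL.sub("[REDACTED_EMAIL]", value)
        return PHONE.sub("[REDACTED_PHONE]", value)
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    return value  # numbers, booleans, and null pass through unchanged


def redact_ndjson_line(line: str) -> str:
    """Redact one NDJSON record and re-serialize it."""
    return json.dumps(redact(json.loads(line)))


raw = '{"seller": "Jane", "contact": "jane@example.com, +1 415 555 0100", "price": 19.99}'
clean = json.loads(redact_ndjson_line(raw))
```

Pattern-based masking trades recall for speed: it will miss PII that does not match a pattern (names, addresses) and can over-match long digit strings such as product identifiers, so it reduces exposure rather than guaranteeing compliance.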

Why does DataFlirt prefer delivering raw data?

Because your business logic is yours. We are experts at bypassing anti-bot systems and extracting payloads at scale. You are the expert on how a "discounted price" should be calculated for your specific ML model. ELT creates a clean boundary of responsibility.

What tools are typically used in the 'Transform' step?

Once DataFlirt loads the data into your warehouse (Snowflake, BigQuery, Redshift), the most common choice is dbt (data build tool). It lets data engineers write modular SQL SELECT statements that clean, cast, and aggregate the raw JSON into production-ready tables.

How do we handle schema drift if we don't transform first?

You handle it gracefully. When a target site changes its JSON structure, the raw data still lands in your bronze layer. Your downstream dbt tests will fail, alerting you to the drift. You update your SQL model, re-run it, and the pipeline recovers with zero data loss.
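The kind of check a downstream test performs can be sketched as a comparison between the keys a model expects and the keys that actually appear in the raw records. The `EXPECTED_KEYS` contract, the `find_drift` helper, and the sample records are hypothetical.

```python
import json

EXPECTED_KEYS = {"sku", "price", "currency"}  # hypothetical silver contract


def find_drift(bronze_lines):
    """Return (missing, unexpected) key sets seen across raw records."""
    missing, unexpected = set(), set()
    for line in bronze_lines:
        keys = set(json.loads(line))
        missing |= EXPECTED_KEYS - keys
        unexpected |= keys - EXPECTED_KEYS
    return missing, unexpected


bronze = [
    '{"sku": "A1", "price": "19.99", "currency": "USD"}',
    '{"sku": "B2", "amount": "5.00", "currency": "USD"}',  # "price" renamed
]
missing, unexpected = find_drift(bronze)
```

A non-empty `missing` set is the signal to update the transformation; the affected raw records are still in bronze and can be reprocessed once the model is fixed.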


hello@dataflirt.com · Bengaluru · IST · typical reply < 4h