← Glossary / SQL Transformation

What is SQL Transformation?

SQL transformation is the process of converting raw, ingested data into structured, business-ready models using SQL. In modern ELT pipelines, data is loaded into a warehouse first, and transformations happen in-place. For scraping pipelines, this is where nested JSON blobs from API responses are flattened, currency strings are coerced into numerics, and duplicate records are merged. If your extraction layer gets the bytes, your transformation layer builds the truth.

ELTdbtData ModelingWarehouseNormalization

// 02 — definitions

Raw data,
refined.

The shift from ETL to ELT means SQL is now the dominant language for shaping scraped data into analytics-ready tables.

Ask a DataFlirt engineer →

TL;DR

SQL transformation takes raw data sitting in a warehouse (like Snowflake or BigQuery) and applies business logic to clean, join, and aggregate it. Tools like dbt have standardized this layer, turning SQL scripts into version-controlled, testable software artifacts. It's the critical bridge between a messy web scrape and a clean dashboard.

01Definition & structure

SQL transformation is the execution of SQL queries to clean, join, aggregate, and structure data that has already been loaded into a data warehouse. It typically follows a medallion architecture:

Bronze (Raw): The exact JSON or CSV payload delivered by the scraper.
Silver (Staging): SQL is used to flatten JSON, cast data types, and deduplicate records.
Gold (Mart): SQL joins the scraped data with internal company data to build final analytics tables.

02How it works in practice

Instead of writing Python scripts to manipulate data in memory, data engineers write `SELECT` statements. Tools like dbt wrap these statements, managing the boilerplate DDL (Data Definition Language). When you run the transformation job, the tool compiles your `SELECT` into a `CREATE TABLE AS` or `MERGE` statement and executes it directly on the warehouse compute cluster.

03Handling scraped data

Scraped data requires heavy transformation. A single API response might contain a product, its variants, and its reviews all nested in one JSON object. SQL transformations use functions like `FLATTEN` to explode arrays into rows, regex functions to strip currency symbols from price strings, and window functions (`ROW_NUMBER() OVER (PARTITION BY sku ORDER BY scraped_at DESC)`) to keep only the freshest record.

04How DataFlirt handles it

We believe transformation belongs in your warehouse, not our scraper. We extract the data, validate it against a strict schema, and deliver it raw. We can push directly to your Snowflake, BigQuery, or S3 buckets. To accelerate your time-to-value, we provide the baseline dbt models needed to parse our delivery schemas, letting your team focus on the business logic rather than JSON pathing.

05The cost of bad SQL

Because cloud warehouses separate storage and compute, bad SQL is expensive. Running a full table scan on a multi-terabyte table of historical scrape data every day just to append 10,000 new records will burn through compute credits rapidly. Implementing incremental materializations and clustering keys on your `scraped_at` timestamps is mandatory for cost control.

// 03 — transformation metrics

How efficient
is your DAG?

Warehouse compute is billed by time or bytes scanned. Inefficient SQL transformations on large scraped datasets will quietly drain your budget. We track these metrics to optimize downstream costs.

Incremental efficiency = E = bytes_scanned_incremental / bytes_scanned_full

Should be < 0.05 for daily scrapes. Rebuilding full tables daily is an anti-pattern. Data Engineering SLO

Transformation cost per run = C = compute_credits × rate_per_credit

The hard dollar cost of executing the SQL DAG in Snowflake or BigQuery. Cloud Vendor Billing

Data freshness delay = Δt = t_transform_complete − t_scrape_finish

The latency introduced by the transformation layer before data is queryable. Pipeline SLA

// 04 — dbt execution trace

Compiling and running
a pricing model.

A standard dbt run transforming raw scraped JSON from an e-commerce target into a clean, slowly changing dimension table.

dbt coreSnowflakeincremental

edge.dataflirt.io — live

CAPTURED

// initialize dbt run
$ dbt run --select stg_ecommerce_pricing
14:02:11 Running with dbt=1.7.3
14:02:11 Found 1 model, 2 tests, 1 source

// execute model
14:02:12 Concurrency: 4 threads (target='prod')
14:02:12 1 of 1 START incremental model silver.stg_ecommerce_pricing
14:02:13 compiling SQL...
14:02:13 executing: MERGE INTO silver.stg_ecommerce_pricing USING...

// run tests
14:02:27 1 of 1 OK created incremental model silver.stg_ecommerce_pricing [SUCCESS 1 in 14.2s]
14:02:28 START test unique_stg_ecommerce_pricing_sku
14:02:29 PASS unique_stg_ecommerce_pricing_sku [PASS in 1.1s]

// summary
14:02:29 Finished running 1 incremental model, 1 test in 18.1s.
14:02:29 Completed successfully

// 05 — failure modes

Where SQL models
break down.

Scraped data is inherently messy. When upstream schemas drift or nulls appear unexpectedly, downstream SQL transformations fail. Ranked by frequency in production ELT pipelines.

PIPELINES MONITORED · 300+ active

DAG RUNS · · · · · 10k+ daily

UPDATED · · · · · · 2026-05-19

01

Unhandled schema drift

% of failures · New fields or changed types break hardcoded SQL

02

JSON parsing errors

% of failures · Malformed arrays in variant columns

03

Full table scan timeouts

% of failures · Missing incremental logic on multi-TB tables

04

Deduplication logic flaws

% of failures · Window functions missing edge-case partitions

05

Type coercion failures

% of failures · Casting strings with hidden characters to INT

// 06 — the ELT handoff

Deliver raw,

transform locally.

We advocate for the ELT pattern. DataFlirt delivers raw, validated JSON or Parquet directly to your data lake or warehouse. We don't force our business logic onto your data. Instead, we provide the raw material and the baseline dbt models required to parse it, allowing your data engineering team to own the final SQL transformations that join our scrape with your internal dimensions.

dbt_run_results.json

Execution summary of a daily transformation job.

model.name stg_competitor_prices

materialization incremental

rows.inserted 14,205

rows.updated 2,104

tests.passed unique, not_null

tests.failed 0

execution.time 14.2s

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About SQL transformations, ELT vs ETL, handling nested JSON, and how DataFlirt integrates with your warehouse.

Ask us directly →

Why use ELT instead of ETL for scraped data? +

With ETL, transformations happen before the data hits your warehouse. If a transformation rule is flawed (e.g., a regex drops valid prices), the original data is lost. With ELT, you load the raw JSON into the warehouse first, then transform it using SQL. If the SQL is wrong, you simply fix the query and rerun it against the raw data. ELT is vastly safer for web scraping.

How do you handle nested JSON in SQL? +

Modern cloud data warehouses (Snowflake, BigQuery, Redshift) have native support for semi-structured data. You load the JSON into a VARIANT or JSON column, and use native SQL functions like FLATTEN or UNNEST to expand arrays and extract keys using dot notation.

What is an incremental model? +

An incremental model only processes new or updated records since the last run, rather than dropping and rebuilding the entire table. For a scraping pipeline collecting millions of records daily, incremental SQL transformations are mandatory to keep compute costs from spiraling out of control.

How does DataFlirt support my SQL transformations? +

We deliver raw data to your S3 bucket or directly into your Snowflake/BigQuery instance. Alongside the data, we provide versioned schemas and optional dbt packages that map 1:1 with our delivery formats. You get a head start on the staging layer, and you own the final business logic.

How do you test SQL transformations? +

Using tools like dbt test or Great Expectations. You write assertions in YAML or SQL to check that primary keys are unique, foreign keys match, and critical fields (like price or SKU) are never null. If a test fails, the pipeline halts before bad data reaches your dashboards.

What happens to my SQL when the target site changes? +

Our extraction layer catches the site change and versions the schema. Your SQL models will likely need an update to handle the new structure. Because we deliver raw data, once your data engineering team updates the SQL, you can backfill historically without needing to re-scrape the target.

$ dataflirt scope --new-project --target=sql-transformation READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h