← Glossary / dbt Model

What is dbt Model?

A dbt model is a modular SQL SELECT statement that transforms raw data into a clean, structured table or view within a data warehouse. In scraping pipelines, it is the critical translation layer that takes messy, nested JSON payloads extracted from the web and normalizes them into typed, deduplicated records ready for downstream analytics or machine learning consumption.

Data EngineeringSQLTransformationDAGAnalytics Engineering
// 02 — definitions

SQL as
code.

How modern data teams turn raw scraped payloads into reliable, tested datasets without writing brittle Python scripts.

Ask a DataFlirt engineer →

TL;DR

A dbt model is simply a .sql file containing a single SELECT statement, enhanced with Jinja templating. Instead of writing complex DDL/DML statements to create tables or manage inserts, you define the business logic, and dbt handles the materialization in your warehouse (Snowflake, BigQuery, Redshift).

01Definition & structure

A dbt model is a single .sql file containing a SELECT statement. It represents a single logical step in a data transformation pipeline. Unlike traditional stored procedures, you do not write CREATE TABLE or INSERT INTO statements. You simply write the query that produces the desired output, and dbt wraps it in the appropriate DDL/DML based on your configuration.

Models use Jinja templating to reference other models dynamically. By using the ref() function instead of hardcoding table names, dbt automatically infers the dependency graph (DAG) and runs the models in the correct order.

02Materialization strategies

How a model is physically stored in the database is called its materialization. The four primary types are:

  • View: Recomputes the query every time it is queried. Good for lightweight staging models.
  • Table: Rebuilds the entire dataset from scratch on every run. Good for small dimension tables.
  • Incremental: Only processes new or changed records. Essential for high-volume scraped data.
  • Ephemeral: Does not build in the database at all; compiles as a CTE (Common Table Expression) injected into downstream models.
03Handling raw scraped data

Scraping pipelines typically dump raw HTML or JSON into a data lake. The first layer of dbt models (often called the Bronze or Staging layer) is responsible for reading this raw payload, flattening nested arrays, casting strings to dates or floats, and renaming cryptic source fields into standardized business terminology. This isolates the messy reality of web data from the clean analytics tables downstream.

04How DataFlirt handles it

We believe transformation belongs in the warehouse, not the scraper. Our pipelines extract data and load it into your environment in its rawest, highest-fidelity form. To accelerate your time-to-value, we provide pre-written dbt models tailored to the specific schemas of the sites we scrape. You drop these models into your repository, run dbt build, and immediately get clean, deduplicated tables without writing the JSON parsing logic yourself.

05Testing and documentation

A dbt model is only as good as its tests. Because web data changes without warning, models must be paired with schema tests defined in a schema.yml file. If a target site alters their pagination and causes your scraper to ingest duplicate records, a unique test on the primary key will fail the dbt run, alerting your engineering team before the bad data reaches your business intelligence tools.

// 03 — the transformation cost

How expensive
is your DAG?

Running dbt models on raw scraped data can rack up warehouse compute costs if not optimized. We track these metrics to ensure our incremental models run efficiently without burning credits.

Incremental build time = T = scan_time + (new_rows × transform_cost)
Full refreshes scale with total rows; incremental models scale only with the delta. Warehouse compute model
DAG execution cost = C = Σ (model_runtime × credit_rate)
Optimize by materializing intermediate steps as ephemeral or views, not tables. FinOps baseline
DataFlirt model efficiency = E = scraped_rows_ingested / dbt_run_seconds
Target > 15k rows/sec on standard Snowflake X-Small warehouse. Internal SLO
// 04 — model execution

Building the
silver layer.

A dbt run executing an incremental model that flattens and deduplicates raw e-commerce product data scraped over the last 6 hours.

dbt coreincrementalSnowflake
edge.dataflirt.io — live
CAPTURED
// trigger model execution
$ dbt run --select stg_ecommerce_products
14:02:11 Found 1 model, 4 tests, 1 source
14:02:11 Concurrency: 4 threads (target='prod')

// execution trace
1 of 1 START incremental model silver.stg_ecommerce_products [RUN]
Executing: MERGE INTO silver.stg_ecommerce_products ...
Parsed: 142,051 raw JSON records
Flattened: nested variants array
Applied: deduplication window (qualify row_number() == 1)
1 of 1 OK created incremental model silver.stg_ecommerce_products [SUCCESS in 14.2s]

// run schema tests
$ dbt test --select stg_ecommerce_products
PASS unique_stg_ecommerce_products_sku
PASS not_null_stg_ecommerce_products_price
// 05 — failure modes

Where models
break down.

Scraped data is inherently unstable. When target sites change, the raw data drifts, and dbt models are the first line of defense to catch the anomaly before it poisons downstream dashboards.

MODELS MONITORED ·  ·  ·  1,200+ active
TEST COVERAGE ·  ·  ·  ·  100% primary keys
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Schema drift / missing keys

JSON path fails · Source site removed a field from their API payload
02

Type coercion errors

cast failure · String 'N/A' appears in a numeric price field
03

Duplicate primary keys

test failure · Pagination overlap in raw scrape causes duplicate rows
04

Full refresh timeouts

compute limit · Incremental logic fails, forcing a massive historical rebuild
05

DAG circular dependencies

compile error · Poor architecture referencing models in a loop
// 06 — architecture

Transform at the destination,

not in the scraper.

Historically, scraping pipelines tried to clean, cast, and normalize data in Python before writing to the database. This creates brittle scrapers and hides the raw data. The modern approach is ELT: extract and load the raw JSON exactly as the site served it, then use dbt models to transform it inside the warehouse. If a parsing rule breaks, you don't need to re-scrape the target — you just fix the dbt model and rebuild the table.

stg_ecommerce_products.sql

Metadata for a standard staging model processing raw scraped JSON.

materialized incremental
unique_key product_id
source_table raw_scrape_payloads
upstream_deps 1 source
downstream_deps 4 models
schema_tests unique, not_null
last_run_status success · 14.2s

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about using dbt models to process, clean, and test data extracted from web scraping pipelines.

Ask us directly →
What is the difference between a dbt model and a standard SQL view? +
A standard SQL view is just a database object. A dbt model is a version-controlled file that generates that view (or table). Because dbt models use Jinja templating, they can reference other models dynamically, loop through variables, and automatically build a dependency graph (DAG) that standard SQL cannot do on its own.
Why use dbt for scraped data instead of Pandas or Python? +
Python is great for fetching data, but terrible for managing database state. If a Pandas script fails halfway through an upsert, you are left with partial data. dbt pushes the compute to the data warehouse, leverages native SQL for set-based operations, and handles the DDL/DML (like MERGE statements) automatically. It separates extraction logic from transformation logic.
How do you handle nested JSON from APIs in a dbt model? +
Modern data warehouses (Snowflake, BigQuery) have native JSON parsing functions. In your dbt model, you select the raw JSON column and use path traversal (e.g., raw_data:product:price::float) to extract and cast the fields into standard columnar formats.
What is an incremental model? +
An incremental model only processes new or updated records since the last time the model ran, rather than rebuilding the entire table from scratch. For scraping pipelines that collect millions of rows daily, incremental models are mandatory to keep warehouse compute costs from exploding.
How does DataFlirt integrate with a client's dbt project? +
We deliver raw, unopinionated scraped data directly to your S3 bucket or Snowflake instance. Alongside the data, we provide starter dbt models (staging and base transformations) via a GitHub repository. Your data engineering team can pull these models into your existing dbt project and customize the downstream DAG as needed.
How do you test scraped data quality in dbt? +
You define YAML tests alongside your models. At a minimum, assert that primary keys (like a scraped URL or product ID) are unique and not_null. You can also add accepted_values tests for categorical fields, or use packages like dbt-expectations to ensure scraped prices fall within a realistic numeric range.
$ dataflirt scope --new-project --target=dbt-model READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h