← Glossary / CTE (Common Table Expression)

What is CTE (Common Table Expression)?

CTE (Common Table Expression) is a temporary, named result set created within the execution scope of a single SQL statement. In scraping pipelines, CTEs are the backbone of the transformation layer, allowing engineers to break down complex, multi-step data cleaning operations—like deduplication, currency normalization, and hierarchical category flattening—into readable, sequential blocks. Without them, downstream data activation becomes a nightmare of nested subqueries and unmaintainable views.

SQLData TransformationQuery OptimizationdbtReadability
// 02 — definitions

Unnesting
the pipeline.

How temporary result sets turn 400-line SQL monstrosities into modular, debuggable data transformations.

Ask a DataFlirt engineer →

TL;DR

A CTE acts as a virtual table that exists only for the duration of a query. It replaces nested subqueries, making complex transformations readable and allowing recursive operations for hierarchical scraped data (like nested product categories or threaded forum comments).

01Definition & structure
A CTE (Common Table Expression) is defined using the WITH clause at the beginning of a SQL query. It allows you to define a temporary result set that you can SELECT, INSERT, UPDATE, or DELETE from later in the same query. By naming these intermediate steps, you replace deeply nested, inside-out subqueries with a top-to-bottom, linear flow of logic.
02How it works in practice
In a data transformation pipeline, a single SQL file might contain a chain of CTEs. The first CTE extracts raw JSON from a scraped payload. The second CTE parses that JSON into columns. The third CTE uses a window function to drop duplicate rows. Finally, the main SELECT statement at the bottom of the file joins the third CTE with a dimension table. The database query planner evaluates the entire chain and executes it as a single, optimized operation.
03Recursive CTEs for scraped hierarchies
Web data is rarely flat. Scraping a forum yields comments that reply to other comments; scraping a retailer yields categories nested inside departments. A WITH RECURSIVE CTE solves this by executing an initial anchor query (e.g., finding top-level comments), and then repeatedly executing a recursive term that joins back to the CTE itself (finding replies to those comments) until no more rows are returned.
04How DataFlirt uses CTEs
We use dbt (Data Build Tool) to manage the transformation layer of our scraping pipelines, and our style guide mandates that all models be written as a series of CTEs culminating in a single final SELECT. This modularity allows our data engineers to highlight a specific CTE in their IDE, run it in isolation, and instantly verify if a site's schema change broke the regex parsing step or the deduplication step.
05The performance misconception
A common misconception is that CTEs are always materialized in memory, making them slower than subqueries. In modern data warehouses (Snowflake, BigQuery) and recent versions of PostgreSQL, the query optimizer will automatically "inline" a CTE—treating it exactly like a subquery—if it is only referenced once. If it is referenced multiple times, the database may choose to materialize it to save redundant computation.
// 03 — query complexity

Measuring
transformation cost.

CTEs don't just improve readability; they change how query planners optimize execution. Here is how we evaluate transformation complexity in our delivery layer.

Cognitive Load Score = C = subqueries × 1.5 + joins
Lower is better. CTEs flatten nested subqueries into linear steps. Data Engineering heuristics
Recursive Depth Limit = Dmax = min(tree_depth, 100)
Prevents infinite loops when scraping cyclic threaded comments. PostgreSQL safety limits
Materialization Threshold = T = cte_refs > 1 ? materialize : inline
Postgres 12+ optimization logic for CTE execution. Database Query Planner
// 04 — dbt model execution

Executing a 5-stage
CTE pipeline.

A trace of a dbt model using CTEs to clean scraped product data. Each CTE isolates a specific transformation step before the final materialization.

dbt runPostgreSQL 15Data Cleaning
edge.dataflirt.io — live
CAPTURED
// executing model: stg_ecommerce_catalog.sql
compiler.parse: "5 CTEs detected"

// step 1: raw extraction
cte_raw_dedup: executed 14,200 rows
logic: "ROW_NUMBER() OVER (PARTITION BY url)"

// step 2: type casting
cte_price_cast: executed 14,188 rows
dropped: 12 rows // null price strings

// step 3: hierarchy flattening
cte_category_tree: executed // recursive join
max_depth: 4 levels

// step 4: final join & output
final_select: executed
materialization: "table"
target: "analytics.stg_ecommerce_catalog"
status: SUCCESS in 1.42s
// 05 — why we use them

Where subqueries
fail at scale.

Why data engineering teams mandate CTEs over nested subqueries for scraping transformation pipelines. Ranked by impact on pipeline maintainability.

PIPELINES ·  ·  ·  ·  ·   300+ active
DBT MODELS ·  ·  ·  ·  ·  4,200+
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Readability & Maintenance

Linear flow · Top-to-bottom logic replaces inside-out subqueries
02

Reusability within query

DRY SQL · Reference the same intermediate result multiple times
03

Recursive capabilities

Hierarchies · Essential for threaded comments and category trees
04

Debugging isolation

Modular testing · Select from any intermediate CTE to isolate bugs
05

Execution plan control

Optimization fences · Forcing materialization in older database engines
// 06 — delivery architecture

Modular transformations,

built for pipeline observability.

At DataFlirt, our delivery layer relies heavily on dbt models structured entirely around CTEs. Every scraped record passes through a standard sequence: raw extraction, deduplication, type casting, and business logic application. By isolating each step in its own CTE, a failure in the price normalization step doesn't obscure the raw data. We can query any intermediate CTE during debugging, ensuring that schema drift is caught and quarantined before it corrupts the final dataset.

stg_ecommerce_catalog.sql

Standard CTE structure for a product data transformation model.

cte_source SELECT * FROM raw_catalog
cte_dedup Window function partition
cte_clean_price Regex extractiontype cast
cte_enrich JOIN dim_categories
final_select SELECT * FROM cte_enrich
materialization table

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about CTEs, query performance, recursive scraping data, and transformation best practices.

Ask us directly →
What is the difference between a CTE and a Materialized View? +
A CTE is temporary and exists only for the duration of the specific query being executed. It is recalculated every time the query runs. A Materialized View is a physical table stored on disk that holds the result of a query, requiring manual or scheduled refreshes. CTEs are for query structure; Materialized Views are for caching expensive computations.
Do CTEs improve query performance? +
Not inherently. In fact, in older versions of PostgreSQL (pre-v12), CTEs acted as "optimization fences," forcing the database to materialize the CTE in memory before continuing, which could degrade performance. Modern databases generally inline CTEs just like subqueries unless they are referenced multiple times. Their primary benefit is developer ergonomics, not execution speed.
How do recursive CTEs help with web scraping? +
Many scraped datasets are hierarchical: threaded Reddit comments, nested e-commerce categories, or organizational charts. A recursive CTE allows you to query a parent record, then repeatedly join the table to itself to fetch all children, grandchildren, and so on, flattening a tree structure into a tabular format in a single query.
Are CTEs standard across all databases? +
Yes, the WITH clause is part of the SQL:1999 standard. It is fully supported by PostgreSQL, Snowflake, BigQuery, ClickHouse, SQL Server, and MySQL (8.0+). The syntax is nearly identical across all major analytical and transactional databases used in data engineering.
How does DataFlirt handle massive CTE chains in delivery? +
We enforce a strict style guide: no more than 5-7 CTEs per model. If a transformation requires more steps, it indicates the logic is too complex for a single model. We split it into multiple intermediate dbt models (ephemeral or view materializations) to maintain readability and allow the data warehouse to optimize the DAG effectively.
Does transforming scraped data via CTEs alter its legal status? +
Transforming raw scraped facts (like prices or specs) into aggregated metrics or novel structures via CTEs can create a net-new derived dataset. While raw facts are generally uncopyrightable, a heavily transformed, curated dataset often carries stronger intellectual property and database rights protections than the raw source material.
$ dataflirt scope --new-project --target=cte-(common-table-expression) READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h