← Glossary / Window Function

What is Window Function?

Window function is a SQL feature that performs calculations across a set of table rows related to the current row, without collapsing them into a single output row like an aggregate function would. In data engineering pipelines, it's the primary tool for computing running totals, moving averages, and deduplicating scraped records by ranking them chronologically. If you're extracting time-series pricing or tracking inventory changes, window functions are what turn raw snapshots into actionable deltas.

SQLData EngineeringDeduplicationTime-SeriesAnalytics

// 02 — definitions

Look across
the partition.

How to compute aggregations and rankings over a subset of data without losing the row-level granularity of your scraped records.

Ask a DataFlirt engineer →

TL;DR

Window functions let you run aggregate-like operations (SUM, RANK, LEAD) over a "window" of rows defined by an OVER() clause. Unlike GROUP BY, they don't squash your dataset. They are essential for deduplicating scraped data, calculating price changes over time, and backfilling missing values in time-series datasets.

01Definition & structure

A window function performs a calculation across a set of table rows that are somehow related to the current row. The syntax relies on the OVER() clause, which defines the "window" or frame of rows. This clause typically contains a PARTITION BY (to group rows into buckets) and an ORDER BY (to sort rows within those buckets).

02How it works in practice

In the SQL order of operations, window functions are evaluated very late—after WHERE, GROUP BY, and HAVING, but before the final ORDER BY. Because they don't collapse rows, they are perfect for appending aggregate metrics (like a running total or a rank) directly onto the original row data.

03Deduplication in scraping pipelines

Scraping pipelines often fetch the same URL multiple times. To find the most recent valid data point for a given product, engineers use ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY scraped_at DESC) inside a CTE, and then filter for row_num = 1 in the outer query. This guarantees exactly one row per product, representing the latest state.

04How DataFlirt handles it

We rely on window functions to build our delivery views. Our raw ingestion tables are append-only ledgers. When a client requests a daily snapshot, our dbt models use window functions to dynamically resolve the latest state, calculate price deltas using LAG(), and flag anomalies based on moving averages. This keeps our raw data immutable while delivering clean, deduplicated datasets.

05Did you know?

If you omit the ORDER BY clause inside your OVER() statement, the default window frame is the entire partition. But if you include an ORDER BY, the default frame changes to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This subtle default behavior is the cause of countless bugs when calculating running totals versus grand totals.

// 03 — the syntax

How to define
the window.

Window functions operate on a frame of rows. The syntax defines how that frame is partitioned, ordered, and bounded. Here are the core patterns we use in DataFlirt's transformation layer.

Deduplication (Latest Record) = ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC)

Assigns 1 to the most recent scrape per product. Filter by rn=1 in an outer query. Standard SQL

Price Delta = price - LAG(price) OVER (PARTITION BY id ORDER BY date)

Subtracts the previous row's price from the current row's price. Time-series analysis

Moving Average = AVG(price) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

Calculates a 7-day rolling average (current + 6 previous). Standard SQL

// 04 — pipeline execution

Deduplicating 2M
scraped records.

A dbt model execution trace showing a window function being used to extract the latest valid price for each product from a raw append-only scraping table.

dbtSnowflakeSQL

edge.dataflirt.io — live

CAPTURED

-- compiling model: stg_latest_prices
WITH ranked_records AS (
SELECT product_id, price, scraped_at,
ROW_NUMBER() OVER (
PARTITION BY product_id
ORDER BY scraped_at DESC
) as rn
FROM raw_scrapes
)
SELECT * FROM ranked_records WHERE rn = 1;

-- execution plan
[node 1] TableScan: raw_scrapes (2,104,992 rows)
[node 2] WindowAgg: PARTITION BY product_id
[node 3] Filter: rn = 1

-- results
rows_processed: 2,104,992
rows_output: 412,850
status: SUCCESS (1.42s)

// 05 — common functions

The functions
we use most.

Ranked by frequency of use across DataFlirt's internal dbt models and client delivery views. Deduplication and state-change detection dominate our transformation workloads.

MODELS ANALYZED · · · 450+

DIALECT · · · · · · Snowflake / Postgres

UPDATED · · · · · · 2026-05-19

01

ROW_NUMBER() / RANK()

94% of models · Deduplication, top-N filtering

02

LAG() / LEAD()

78% of models · Time-series deltas, state changes

03

SUM() / AVG()

62% of models · Running totals, moving averages

04

FIRST_VALUE()

45% of models · Sessionization, boundary values

05

NTILE()

21% of models · Percentile bucketing, deciles

// 06 — transformation layer

Append-only ingestion,

resolved at query time.

DataFlirt's extraction pipelines never UPDATE existing records. We use an append-only architecture where every scrape is an INSERT. This preserves full historical lineage but creates massive duplication. Window functions are the bridge between this raw immutable ledger and the clean, deduplicated datasets we deliver to clients. By applying ROW_NUMBER() over the ingestion timestamp, we project the current state of the world without destroying the past.

dbt model execution

Metrics from the daily deduplication run on a retail pricing dataset.

model.name dim_product_pricing_current

input.rows 14,205,881

window.partition product_id, store_id

window.order scraped_at DESC

output.rows 845,210

compute.time 4.2s

data.freshness < 15m

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About window functions, deduplication strategies, performance implications, and how DataFlirt models time-series data.

Ask us directly →

What is the difference between GROUP BY and a window function? +

GROUP BY collapses all rows that share the same grouping key into a single output row. A window function performs an aggregate calculation over a set of rows but preserves the original row-level granularity. If you have 100 rows and use a window function, your output will still have 100 rows, just with a new computed column appended.

Why use ROW_NUMBER() instead of MAX(date) for deduplication? +

If you use GROUP BY product_id and MAX(scraped_at), you only get the ID and the date. If you want the price, stock status, and title associated with that specific date, you have to join the result back to the original table. ROW_NUMBER() lets you rank the rows and filter for rank 1 in a single pass, keeping all other columns intact.

Are window functions slow on large scraped datasets? +

They can be. Window functions require sorting data within partitions. If your partitions are massive and don't fit in memory, the database will spill the sort operation to disk, which kills performance. Proper clustering keys (e.g., clustering your Snowflake table by product_id) mitigate this significantly.

How does DataFlirt handle slowly changing dimensions (SCD)? +

We use LAG() and LEAD() window functions to detect when a scraped attribute (like price or stock status) actually changes. Instead of storing a row for every day we scraped the product, we generate valid_from and valid_to timestamps, compressing a daily feed into a compact historical ledger.

Can I use window functions in my WHERE clause? +

No. In the SQL order of execution, window functions are evaluated after the WHERE, GROUP BY, and HAVING clauses. If you want to filter based on the result of a window function (like WHERE row_num = 1), you must wrap the window function in a CTE (Common Table Expression) or a subquery first.

Do all databases support window functions? +

Today, yes. Modern analytical databases like Snowflake, BigQuery, and ClickHouse are heavily optimized for them. Postgres has excellent support. MySQL added them in version 8.0, and SQLite/DuckDB support them natively. If you are building a modern data stack, you can assume window functions are available.

$ dataflirt scope --new-project --target=window-function READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h