Mastering SQL Window Functions: The Secret to Acing Data Engineering Interviews!

Nnaemezue Obi-Eyisi
7 min readMar 23, 2024

--

Some of the most commonly used functions in SQL programming for data engineering workloads are window functions. They are also one of the most common concepts tested during data engineering SQL coding assessments. By the end of this blog post, my goal is to explain and solidify your understanding of window functions so that you will never fail any coding test related to them!

Why SQL Window Functions

In data engineering, we’re continually transforming and cleaning data, and applying business logic before presenting it to our consumers. Window functions play a crucial role in facilitating these tasks.

Below are the most common use cases of window functions in my 10 years of data engineering career

  1. Eliminating duplicate records - After ranking the records we can eliminate a record based on a criteria
  2. Calculating Running Totals: Window functions facilitate the calculation of running totals or cumulative sums over ordered partitions, useful for analyzing trends or monitoring cumulative progress over time.
  3. Performing Lead and Lag Analysis: Window functions enable comparisons between current and preceding or succeeding rows within a partition, making them ideal for analyzing trends, identifying patterns, or detecting anomalies.
  4. Calculating Moving Averages: Window functions allow for the calculation of moving averages over specified window sizes, enabling smoother trend analysis and noise reduction in time-series data.
  5. Detecting Anomalies: Window functions can be used to identify outliers or anomalies by comparing data points to a moving average or other statistical benchmarks within defined partitions.
  6. Implementing Sessionization: Window functions are essential for sessionizing user activity data, facilitating the segmentation of user sessions based on specified criteria such as session duration or inactivity periods.
  7. Performing Time Series Analysis: Window functions enable various time series analysis tasks, including calculating rolling averages, identifying seasonality patterns, or detecting trends and cycles in temporal data.
  8. Handling Data Imputation: Window functions can assist in data imputation tasks by filling missing values with calculated values based on neighboring rows within defined partitions.

To understand window functions we need to ensure we understand GROUP BY and ORDER BY, because they share the same fundamental concept.

What are SQL Window Functions

SQL window functions are a category of functions that perform calculations across a set of rows related to the current row within a defined window or partition of the result set. Unlike traditional aggregate functions, which collapse multiple rows into a single value, window functions operate on individual rows while maintaining the overall result set’s structure.

These functions allow for advanced analytics, such as ranking rows, calculating running totals, aggregating data within specified partitions, and performing lead/lag analysis. SQL window functions are commonly used for tasks such as trend analysis, data partitioning, sessionization, and time series analysis.

Key components of SQL window functions include:

  1. Partitioning: Dividing the result set into distinct groups or partitions based on specified criteria.
  2. Ordering: Determining the order of rows within each partition for calculations that depend on row order.
  3. Window Frame: Defining the range of rows considered in the calculation, typically relative to the current row, especially for Lead()/Lag()

Commonly used Window Functions in the Real world

ROW_NUMBER(), RANK()/DENSE_RANK(), LEAD()/LAG(), SUM()

How SQL Window Function works

Let’s dive into SQL window functions with some explanations and examples.

Given the below sample data that contains CustomerName, Address, and Moved in Date.

If we want to identify the records with the latest address for each customer based on their Moved In Date, the SQL code that accomplishes this is:

Row_number() over (partition by Customername order by movedindate desc) as Rownumber
  1. Understand the Partition BY: Partition by segregates data based on the partition fields. In the above example, we can use Partition by to segregate the data based on each unique Customer Name. The resulting output will be as follows:

2. Understand the Order BY: Order by simply sorts the data based on the specified field, either in ascending or descending order. In this example, I am sorting the MovedInDate field in descending order.

3. Apply the specific window function to the dataset in step 2. In this case, the row_number() function will assign consecutive integers starting from 1 to each row per partition

4. The final result is the collation of the different sets into one dataset

Quick Debugging Tip

One way to confirm the accuracy of your result set is to order the dataset using the ORDER BY clause by the partition key columns and sorting columns. This allows you to visually inspect and confirm that the window functions have produced accurate results.

About LEAD()/LAG()

The beauty of Lead/Lag lies in its ability to compare a subsequent or previous row to the current row. By default, it compares with the very next row. However, we can also specify to compare with the prior N rows or subsequent N rows. Nevertheless, you need to denote the field you want to move up or down inside the lead/lag function.

LEAD(movedindate, 1) over (partition by Customername order by movedindate asc) as Lead_movedindate

Caveats to note about Window functions

Window functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), and LAG() require an ORDER BY clause inside the function, while the PARTITION BY clause is optional.

Aggregate window functions like SUM(), AVG(), MIN(), and MAX() do not require either a PARTITION BY or ORDER BY clause, but you must have at least one to enable these aggregate functions to behave as window functions. However, it’s important to note that when we use aggregate window functions without an ORDER BY clause, they behave differently compared to when we have both PARTITION BY and ORDER BY clauses specified. Check out the section below to see the differences.

Practical Examples of Implementing Window Functions

1. Aggregating Data Within Partitions:

Window functions can perform aggregate calculations within defined partitions, allowing for more granular analysis without collapsing the result set.

Example: Calculate the total revenue per product category.

SELECT
category,
product,
revenue,
SUM(revenue) OVER (PARTITION BY category) AS total_revenue_per_category
FROM
sales_data;

In this example, the SUM() window function calculates the total revenue per product category by partitioning the data by the 'category' column. Note this will be different from the cumulative sum example.

2. Calculating Rankings:

Window functions can assign ranks to rows based on specified criteria within partitions, providing insights into data distributions.

Example: Rank products by sales volume within each product category.

SELECT
category,
product,
sales_volume,
RANK() OVER (PARTITION BY category ORDER BY sales_volume DESC) AS sales_rank
FROM
sales_data;

Here, the RANK() window function assigns ranks to products based on their sales volume within each product category.

3. Calculating Running Totals:

Window functions can compute running totals or cumulative sums over ordered partitions, useful for analyzing trends or monitoring cumulative progress.

Example: Calculate the cumulative revenue over time for each product.

SELECT
product,
date,
revenue,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative_revenue
FROM
sales_data;

In this query, the SUM() window function calculates the cumulative revenue for each product by ordering the data by date.
Notice that is different from the aggregating within the partition based on example 1

4. Performing Lead and Lag Analysis:

Window functions enable comparisons between current and preceding or succeeding rows within partitions, facilitating trend analysis and pattern detection.

Example: Calculate the percentage change in revenue compared to the previous month for each product.

SELECT
product,
date,
revenue,
(revenue - LAG(revenue, 1) OVER (PARTITION BY product ORDER BY date)) / LAG(revenue, 1) OVER (PARTITION BY product ORDER BY date) AS revenue_percentage_change
FROM
sales_data;

Here, the LAG() window function retrieves the revenue from the previous month, allowing us to compute the percentage change in revenue.

5. Handling Data Imputation:

Window functions can assist in data imputation tasks by filling missing values with calculated values based on neighboring rows within defined partitions.

Example: Fill missing values in the ‘revenue’ column with the average revenue for each product.

SELECT
product,
date,
COALESCE(revenue, AVG(revenue) OVER (PARTITION BY product)) AS imputed_revenue
FROM
sales_data;

In this query, the AVG() window function computes the average revenue for each product, and the COALESCE() function replaces missing revenue values with the calculated averages.

Conclusion

In summary, SQL window functions are indispensable tools in data engineering, offering powerful capabilities for performing advanced analytics, aggregations, and transformations on structured data sets. Their versatility makes them valuable for a wide range of data processing and analysis tasks in various industries and domains.

Follow me on: LinkedIn | All Platforms

To Learn Azure Data Engineering with Databricks, and join the waitlist: Click here

About Me

I am Nnaemezue Obi-eyisi, a Senior Azure Databricks Data Engineer at Capgemini and the founder of AfroInfoTech, an online coaching platform for Azure data engineers specializing in Databricks.

--

--

Nnaemezue Obi-Eyisi

I am passionate about empowering, educating, and encouraging individuals pursuing a career in data engineering. Currently a Senior Data Engineer at Capgemini