Photo by Christopher Burns on Unsplash

Data Engineer’s Arsenal: 4 Must-Know SQL Window Functions

Alessandro Pizzini
5 min readJan 26, 2024

--

For a Data Engineer, the proficiency in SQL isn’t just a skill; it’s an essential tool that unlocks the ability to effortlessly manipulate and analyze data. However, delving deeper and mastering SQL window functions can elevate this expertise to superhero status.

In this article, we’ll explore the power of mastering 4 key SQL window functions. Through practical examples, we will show their impact and how you too can turn your queries into interesting data revelations.

What are WINDOW functions

Window functions in SQL are “specialized tools” that operate over a window (range of rows) defined using the OVER clause, offering a powerful way to analyze and manipulate data. The following are examples of what can be calculated using winodw functions:

  • Ranking — Assigns a unique rank to each row based on specified criteria, enabling the arrangement of data in a meaningful order.
  • Running Totals — Computes the cumulative sum of a designated column, useful for tracking ongoing trends.
  • Moving Average — Calculates the average of a specified column within a moving window of rows, useful to highlighting patterns in the data.

Window function can be defined as follow:

SELECT 
window_function() OVER(
[PARTITION BY column_name1]
[ORDER BY column_name2]) AS window_column_alias
FROM table_name
Photo by Edoardo Busti on Unsplash

WINDOW functions Examples

In the following section we will explore 4 of the most commonly used window functions through some practical examples, and we will do so using the sample dataset from GCP New York City Taxi Trips.

#1. ROW_NUMBER( )

It assigns a unique sequential integer to each row within a specified partition of a result set. The ROW_NUMBER() function in SQL is commonly used in scenarios where you need to assign a unique ranking or sequence to rows within a specified group or partition.

We can use the function to rank the trip according the tip_amount.

This query aims to find and display the top 3 trips for each payment method based on the tip amount. The ROW_NUMBER() function is used to rank the trips within each payment type partition, and the final result set includes the payment method, tip amount rank, and tip amount for each qualifying trip.

WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_rank <= 3

If you want to learn more about Ranking/Numbering functions read my article here on Medium

#2. FIRST_VALUE( )

It retrieves the first value of an expression within a specified partition or the entire result set, ordered by a specified column. The FIRST_VALUE() is particularly useful when you want to identify the first occurrence or the earliest value in a series, providing insights into the earliest occurrence of a certain attribute within distinct groups or the entire dataset.

We can use the function to calculate the difference between all the trips and the most expensive one.

This query calculates the difference (‘delta’) between the total amount of each taxi trip and the total amount of the most expensive trip within the same vendor category. It uses the FIRST_VALUE() window function to identify the most expensive trip in each vendor category.

SELECT
vendor_id,
pickup_datetime,
dropoff_datetime,
total_amount
FIRST_VALUE(total_amount) OVER (PARTITION BY vendor_id ORDER BY total_amount DESC) -total_amount AS delta_most_expensive,
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
ORDER BY total_amount DESC

#3. SUM( ) OVER (PARTITION BY )

It calculates a running sum of a specified numeric column within a defined partition of the result set. The SUM()window function allows for the generation of a dynamic running sum that continuously accumulates the values of the designated column as it traverses the rows.

We can leverage the function to calculate the running total distance covered by each taxi trip within its respective vendor category.

This query computes a running total of the trip distances for each taxi trip within its corresponding vendor category. The SUM() window function is applied over a specified partition defined by vendor_id and ordered by pickup_datetime, resulting in a dynamic running sum that accumulates the trip distances as rows are processed.

SELECT
vendor_id,
pickup_datetime,
dropoff_datetime,
trip_distance,
SUM(trip_distance) OVER (PARTITION BY vendor_id ORDER BY pickup_datetime) AS running_total_distance
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
ORDER BY
vendor_id, pickup_datetime

#4. LEAD

It evaluates the value of a specified column for the preceding row within a defined partition of the result set. TheLAG() window function enables the retrieval of the value from the previous row, offering insights into the historical state of the designated column as the result set is processed.

We can utilize the function to assess the difference in trip distances.

This query calculate the difference in trip distances between each taxi trip and its immediate predecessor within the same vendor category. The LAG() window function is applied over a specified partition defined by vendor_id and ordered by pickup_datetime, allowing for the retrieval of the previous trip’s distance.

SELECT
vendor_id,
pickup_datetime,
dropoff_datetime,
trip_distance,
LAG(trip_distance) OVER (PARTITION BY vendor_id ORDER BY pickup_datetime) AS previous_trip_distance
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
ORDER BY
vendor_id, pickup_datetime

Conclusion

To sum up, knowing how to use SQL window functions is essential for data engineers since it gives them strong tools for efficient data manipulation and analysis. This article’s four main window functions — ROW_NUMBER(), FIRST_VALUE(), SUM(), and LAG() — provide useful functions for sorting, getting particular values, computing running sums, and evaluating previous states of data inside of designated partitions.

Feel free to reach to me on LinkedIn if you have any other question: Alessandro Pizzini | LinkedIn

--

--