Unveiling the Power of Ranking Functions in BigQuery

Alessandro Pizzini
6 min readJan 31, 2024

--

Photo by Martin Sanchez on Unsplash

In the dynamic landscape of SQL, ranking functions play a pivotal role in data analysis, offering a spectrum of tools to understand the distribution, order, and relative position of data within a given window.

In this article, we’ll explore all the 6 SQL Ranking/Numbering functions. Through practical examples, we will show their impact and how you too can turn your queries into interesting data revelations.

Photo by Edoardo Busti on Unsplash

RANKING functions Examples

In the following section we will explore 6 ranking/numbering functions through some practical examples, and we will do so using the sample dataset from GCP New York City Taxi Trips.

#1.RANK( ): The Foundation of Ranking

It assigns a unique rank to each row within a window, providing a straightforward method to identify the relative position of data. The RANK() operates on a 1-based system, making it a fundamental choice for scenarios where a simple ranking is required.

The objective of this query is to identify and showcase the top 3 ranked taxi trips for each payment method, taking tip amount into consideration. The RANK() function handles ties by giving identical tip amounts the same rank and leaving gaps in the ranking sequence.

WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_rank <= 3

#2.DENSE_RANK( ): Seamless Ranking without Gaps

It is similar to RANK(), but it eliminates gaps in the ranking sequence. The DENSE_RANK() assigns a unique, dense rank to each row within a window, ensuring a continuous ranking without skipping numbers.

This query is designed to identify the top 3 taxi trips for each payment method, considering the tip amount. Leveraging the DENSE_ANK() function within each payment type partition ensures a continuous ranking without gaps.

WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_dense_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_dense_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_dense_rank <= 3

#3.ROW_NUMBER(): Sequential Identification

It assigns a unique sequential integer to each row within a specified partition of a result set. The ROW_NUMBER() function in SQL is commonly used in scenarios where you need to assign a unique ranking or sequence to rows within a specified group or partition.

This query aims to find and display the top 3 trips for each payment method based on the tip amount. The ROW_NUMBER() function is used to rank the trips within each payment type partition, and the final result set includes the payment method, tip amount rank, and tip amount for each qualifying trip.

WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_rank <= 3

All the above numbering functions return the same results in this case, it implies that there are no duplicate tip_amount values for any given payment type. Therefore, the strict sequential ordering provided by ROW_NUMBER(), the identical ranks for tied values provided by RANK(), and the dense ranking without gaps provided by DENSE_RANK()all produce the same output.

#4.PERCENT_RANK( ): Quantifying Relative Position

It calculates the percentile rank of each row within a window, providing a normalized measure ranging from 0 to 1. The PERCENT_RANK()is particularly useful when assessing the relative position of data in a distribution.

This query delves into the exploration of the top 3 percent of taxi trips for each payment method, taking tip amount into account. By employing the PERCENT_RANK()function within specific payment type partitions, it provides a normalized measure ranging from 0 to 1

WITH RankedTrips AS (
SELECT *,
PERCENT_RANK() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_percent_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
WHERE
tip_amount IS NOT NULL
)
SELECT
payment_method,
tip_amount_percent_rank,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_percent_rank <= 0.05; -- Assuming you want the top 5 percent

#5.CUME_DIST( ): Cumulative Distribution Unveiled

It computes the cumulative distribution of each row within a window, revealing its relative position in the entire dataset. The CUME_DIST()delivers a value between 0 and 1, indicating the cumulative percentage of data less than or equal to the current row.

The goal of this query is to understand the cumulative distribution of tip amounts for the top 5 percent of taxi trips within each payment method. CUME_DIST()facilitates this by computing the cumulative percentage of tip amounts less than or equal to the current row within payment type partitions.

WITH RankedTrips AS (
SELECT *,
CUME_DIST() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_cume_dist
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_cume_dist,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_cume_dist <= 0.05; -- Assuming you want the top 5 percent

#6.NTILE( ): Dividing Data into Quantiles

It breaks down the dataset into quantile buckets, allowing for the classification of rows based on their relative position. This function assigns a bucket number (1-based) to each row, facilitating the segmentation of data into equal parts.

In this query, we categorize taxi trips into quantile buckets, focusing on the top quantile for Credit Card payment method, considering tip amount.

WITH RankedTrips AS (
SELECT *,
NTILE(3) OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_ntile
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
WHERE tip_amount IS NOT NULL
)
SELECT
payment_method,
tip_amount_ntile,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_ntile = 1; -- Assuming you want the first quantile

Conclusion:

Ranking functions are essential tools in the world of SQL for deciphering the complexities of data distribution and order. Each function, which ranges from straightforward rankings to cumulative distributions and percentile measurements, has a specific role and helps to extract meaningful information from a set of data.

If you want to learn more about WINDOW Functions in general, make sure to read my article here on Medium

Feel free to reach to me on LinkedIn if you have any other question: Alessandro Pizzini | LinkedIn

--

--