Unveiling the Power of Ranking Functions in BigQuery
In the dynamic landscape of SQL, ranking functions play a pivotal role in data analysis, offering a spectrum of tools to understand the distribution, order, and relative position of data within a given window.
In this article, we’ll explore all the 6 SQL Ranking/Numbering functions. Through practical examples, we will show their impact and how you too can turn your queries into interesting data revelations.
RANKING functions Examples
In the following section we will explore 6 ranking/numbering functions through some practical examples, and we will do so using the sample dataset from GCP New York City Taxi Trips.
#1.RANK( ): The Foundation of Ranking
It assigns a unique rank to each row within a window, providing a straightforward method to identify the relative position of data. The
RANK()
operates on a 1-based system, making it a fundamental choice for scenarios where a simple ranking is required.
The objective of this query is to identify and showcase the top 3 ranked taxi trips for each payment method, taking tip amount into consideration. The RANK()
function handles ties by giving identical tip amounts the same rank and leaving gaps in the ranking sequence.
WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_rank <= 3
#2.DENSE_RANK( ): Seamless Ranking without Gaps
It is similar to
RANK()
, but it eliminates gaps in the ranking sequence. TheDENSE_RANK()
assigns a unique, dense rank to each row within a window, ensuring a continuous ranking without skipping numbers.
This query is designed to identify the top 3 taxi trips for each payment method, considering the tip amount. Leveraging the DENSE_ANK()
function within each payment type partition ensures a continuous ranking without gaps.
WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_dense_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_dense_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_dense_rank <= 3
#3.ROW_NUMBER(): Sequential Identification
It assigns a unique sequential integer to each row within a specified partition of a result set. The
ROW_NUMBER()
function in SQL is commonly used in scenarios where you need to assign a unique ranking or sequence to rows within a specified group or partition.
This query aims to find and display the top 3 trips for each payment method based on the tip amount. The ROW_NUMBER()
function is used to rank the trips within each payment type partition, and the final result set includes the payment method, tip amount rank, and tip amount for each qualifying trip.
WITH RankedTrips AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_rank,
tip_amount
FROM
RankedTrips
WHERE
tip_amount_rank <= 3
All the above numbering functions return the same results in this case, it implies that there are no duplicate tip_amount values for any given payment type. Therefore, the strict sequential ordering provided by ROW_NUMBER()
, the identical ranks for tied values provided by RANK()
, and the dense ranking without gaps provided by DENSE_RANK()
all produce the same output.
#4.PERCENT_RANK( ): Quantifying Relative Position
It calculates the percentile rank of each row within a window, providing a normalized measure ranging from 0 to 1. The
PERCENT_RANK()
is particularly useful when assessing the relative position of data in a distribution.
This query delves into the exploration of the top 3 percent of taxi trips for each payment method, taking tip amount into account. By employing the PERCENT_RANK()
function within specific payment type partitions, it provides a normalized measure ranging from 0 to 1
WITH RankedTrips AS (
SELECT *,
PERCENT_RANK() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_percent_rank
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
WHERE
tip_amount IS NOT NULL
)
SELECT
payment_method,
tip_amount_percent_rank,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_percent_rank <= 0.05; -- Assuming you want the top 5 percent
#5.CUME_DIST( ): Cumulative Distribution Unveiled
It computes the cumulative distribution of each row within a window, revealing its relative position in the entire dataset. The
CUME_DIST()
delivers a value between 0 and 1, indicating the cumulative percentage of data less than or equal to the current row.
The goal of this query is to understand the cumulative distribution of tip amounts for the top 5 percent of taxi trips within each payment method. CUME_DIST()
facilitates this by computing the cumulative percentage of tip amounts less than or equal to the current row within payment type partitions.
WITH RankedTrips AS (
SELECT *,
CUME_DIST() OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_cume_dist
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
)
SELECT
payment_method,
tip_amount_cume_dist,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_cume_dist <= 0.05; -- Assuming you want the top 5 percent
#6.NTILE( ): Dividing Data into Quantiles
It breaks down the dataset into quantile buckets, allowing for the classification of rows based on their relative position. This function assigns a bucket number (1-based) to each row, facilitating the segmentation of data into equal parts.
In this query, we categorize taxi trips into quantile buckets, focusing on the top quantile for Credit Card payment method, considering tip amount.
WITH RankedTrips AS (
SELECT *,
NTILE(3) OVER (PARTITION BY payment_type ORDER BY tip_amount DESC) AS tip_amount_ntile
FROM
bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022
WHERE tip_amount IS NOT NULL
)
SELECT
payment_method,
tip_amount_ntile,
tip_amount
FROM
RankedTrips
WHERE
payment_type = '1' -- Assuming we are interested only in tips paid by CC
AND tip_amount_ntile = 1; -- Assuming you want the first quantile
Conclusion:
Ranking functions are essential tools in the world of SQL for deciphering the complexities of data distribution and order. Each function, which ranges from straightforward rankings to cumulative distributions and percentile measurements, has a specific role and helps to extract meaningful information from a set of data.
If you want to learn more about WINDOW Functions in general, make sure to read my article here on Medium
Feel free to reach to me on LinkedIn if you have any other question: Alessandro Pizzini | LinkedIn