How to Calculate Median the Right Way in PostgreSQL
Let me show you 2 different ways!
Intro
Calculating the median has never seemed straightforward in PostgreSQL. There is no built-in function that calculates it directly, the way Python’s statistics.median does. So today let me share with you how to calculate the median in PostgreSQL.
What is median
The median is the “middle” value that separates the higher half from the lower half of a data sample sorted in ascending order. Like the mean, it is usually thought of as a “typical” value, and depending on the dataset it can be the better representation of the two, because the median is less likely than the mean to be skewed by a small proportion of extremely large or small values.
How to calculate median
Unfortunately, calculating the median is not straightforward. We have to consider 2 cases depending on whether a dataset has an odd or even number of observations.
CASE1:
If dealing with an odd number of observations, the median is literally “the middle value” of the dataset. This is easy.
Ex. dataset = 2, 4, 6, 8, 10, 12, 14
Median = 8
CASE2:
If dealing with an even number of observations, there is no single value that sits in the “middle”. So the median is defined to be the arithmetic mean of the two middle values, which means we have to actually “find and calculate” them.
Ex. dataset = 2, 4, 6, 8, 10, 12, 14, 16
Median = (8 + 10) / 2 = 9
How to calculate median in PostgreSQL
OK, we understand what the median is and how it is calculated. Now let’s talk about how to get it in PostgreSQL. For now let’s use the dataset from Case 2, so we created a table named dataset whose column y holds those 8 records (an even number of observations). There are mainly 2 ways to get the median from this table.
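If you want to follow along, here is a minimal sketch of how such a table could be set up. The table name dataset and the column name y come from the queries in this post; the integer column type and the NOT NULL constraint are my assumptions, so adjust them to your own data.

```sql
-- Hypothetical setup: the post only shows the names "dataset" and "y",
-- so the integer type and NOT NULL constraint are assumptions.
CREATE TABLE dataset (
    y integer NOT NULL
);

-- The 8 observations from Case 2 (an even number of rows).
INSERT INTO dataset (y)
VALUES (2), (4), (6), (8), (10), (12), (14), (16);

-- Sanity check: the table should now contain 8 rows.
SELECT COUNT(*) AS ct FROM dataset;
```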
No.1 Use PERCENTILE_CONT(0.5)
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY y) AS median
FROM
    dataset
;

+--------+
| median |
+--------+
|      9 |
+--------+
PERCENTILE_CONT() calculates a percentile based on a continuous distribution of the column. So if you specify 0.5 as its argument, it returns the median: it interpolates between the two adjacent middle values if the number of observations is even, and returns just the middle value if it is odd.
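To see the odd case in action without creating another table, the same aggregate can be run against an inline VALUES list. This is only a quick sketch using the Case 1 numbers; the alias odd_dataset is a name I made up for the example.

```sql
-- Case 1 data (7 observations) supplied inline.
-- "odd_dataset" is a hypothetical alias used only for this example.
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY y) AS median
FROM
    (VALUES (2), (4), (6), (8), (10), (12), (14)) AS odd_dataset (y)
;
```

With an odd number of rows there is nothing to interpolate, so this should return the middle value, 8.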
No.2 Write from scratch using window functions!
WITH get_median AS (
    SELECT
        y
        ,ROW_NUMBER() OVER (ORDER BY y ASC) AS rn_asc
        ,COUNT(*) OVER () AS ct
    FROM dataset
)
SELECT
    AVG(y) AS median
FROM
    get_median
WHERE
    rn_asc BETWEEN ct/2.0 AND ct/2.0 + 1
;

+--------+
| median |
+--------+
|      9 |
+--------+
Just like PERCENTILE_CONT(0.5), we can get the same median value by using a combination of window functions. The key is the filter WHERE rn_asc BETWEEN ct/2.0 AND ct/2.0 + 1. For an even number of observations it keeps the two middle rows (with 8 rows, the range is 4.0 to 5.0, so rn_asc 4 and 5), and AVG(y) then averages them. For an odd number it keeps only the single middle row (with 7 rows, the range is 3.5 to 4.5, so only rn_asc 4).
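As a quick check of the odd case, here is the same query run against the 7 values from Case 1, supplied inline so no extra table is needed. Treat it as a sketch; the alias odd_dataset is just a name picked for this example.

```sql
-- Same window-function approach, applied to the 7 values from Case 1.
-- "odd_dataset" is a hypothetical alias used only for this example.
WITH get_median AS (
    SELECT
        y
        ,ROW_NUMBER() OVER (ORDER BY y ASC) AS rn_asc
        ,COUNT(*) OVER () AS ct
    FROM (VALUES (2), (4), (6), (8), (10), (12), (14)) AS odd_dataset (y)
)
SELECT
    AVG(y) AS median
FROM
    get_median
WHERE
    -- ct = 7, so the range is 3.5 to 4.5 and only rn_asc = 4 qualifies
    rn_asc BETWEEN ct/2.0 AND ct/2.0 + 1
;
```

Only the fourth row survives the filter, so the average is simply that row's value, 8.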
So which one should you use: PERCENTILE_CONT(0.5) or writing from scratch?
We looked at 2 examples of calculating the median, but which one should you use? PERCENTILE_CONT(0.5) is my answer, because it is easier to read with only a few lines of code and it seems to execute faster! You can check this on your own data by running both queries with EXPLAIN ANALYZE and comparing the query plans.
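One way to make that comparison is to prefix each query with EXPLAIN ANALYZE, which executes the statement and prints the plan along with actual timings. The sketch below reuses the two queries from this post against the dataset table. Timings depend on table size, hardware, and caching, so run each a few times before drawing conclusions.

```sql
-- Plan and actual run time for the PERCENTILE_CONT version.
EXPLAIN ANALYZE
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY y) AS median
FROM
    dataset
;

-- Plan and actual run time for the window-function version.
EXPLAIN ANALYZE
WITH get_median AS (
    SELECT
        y
        ,ROW_NUMBER() OVER (ORDER BY y ASC) AS rn_asc
        ,COUNT(*) OVER () AS ct
    FROM dataset
)
SELECT
    AVG(y) AS median
FROM
    get_median
WHERE
    rn_asc BETWEEN ct/2.0 AND ct/2.0 + 1
;
```

On a table with only 8 rows the difference will be negligible, so any real comparison needs a much larger dataset.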
Final thoughts
In this post, we talked about how PostgreSQL has no straightforward built-in function that directly calculates the median, the way Python’s statistics.median does. Instead, we can easily get it by leveraging PERCENTILE_CONT(0.5). I hope you learned something new!