Percentile Calculation with Linear Interpolation - The Math behind numpy’s percentile function

Learn how the percentile function from the numpy library calculates percentiles using linear interpolation.

Better Everything
5 min read · May 24, 2023
Image by catalyststuff on Freepik

Percentiles divide a set of values into parts.

For example, roughly 20% of the values in a dataset are smaller than the 20th percentile, and roughly 75% of the values are smaller than the 75th percentile.

You might know that the Python package numpy has a function percentile with which you can calculate percentiles.

But maybe you have wondered:

How does the percentile function from numpy work? Or, how are percentiles actually calculated with numpy’s percentile function?

In this guide I will try to explain how the percentile function uses Linear Interpolation to do so!

First, let’s look at an example of how the percentile function is used:

import numpy as np

data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
upper_quartile = np.percentile(data, 75)

print(upper_quartile)

The code above prints: 5.25.
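As a quick sanity check on that result, the small sketch below counts how many values in data are smaller than 5.25. The names threshold, count_below and share_below are just illustrative and are not part of numpy.

```python
data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
threshold = 5.25  # the 75th percentile printed above

# Count the values that are strictly smaller than the threshold.
count_below = sum(1 for value in data if value < threshold)
share_below = count_below / len(data)

print(f"{count_below} of {len(data)} values are below {threshold}")
print(f"That is {share_below:.0%} of the data")
```

For this dataset, 15 of the 20 values are below 5.25, which is exactly 75%. With other datasets, especially ones with many repeated values, the share will not always match the percentile this neatly.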

But how is this calculated?

Well, the default method that numpy’s percentile function uses is 'linear'. This refers to linear interpolation; more on that later.

Let’s go step-by-step!

Step 1 — Determine the index where the percentile can be found.

An index of a value in a sequence refers to that value’s position in the sequence.

In the example above we calculated the 75th percentile. To determine which index we have to look at, we can use:

  • percentile as a fraction.
    In our case: 0.75.
  • distance as the difference between the minimum and maximum indices. When there are 20 data points, like in our example data, the min index is 0, the max index is 19 and the difference is 19. The distance can be calculated by taking the length of the dataset and subtracting 1.
    In our case: len(data) - 1.

The index where the percentile can be found is calculated by multiplying percentile by distance.

data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
percentile = 0.75
distance = len(data) - 1

index = percentile * distance
print(index)

The code above prints: 14.25.

So the 75th percentile of data can be found at index 14.25.

But indices are supposed to be integer values. So what to do with 14.25?

This is where linear interpolation comes in!

Note: since a decimal index doesn’t really make sense, it may be referred to as a virtual index.
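To get a feeling for virtual indices, here is a small sketch that applies the same multiplication to a few common percentiles of our 20-value dataset. The dictionary virtual_indices is only an illustrative name.

```python
data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
distance = len(data) - 1  # 19

# Map each percentile (0-100) to its virtual index.
virtual_indices = {q: q / 100 * distance for q in (0, 25, 50, 75, 100)}

for q, virtual_index in virtual_indices.items():
    # float.is_integer() tells us whether the index is a whole number.
    kind = "whole" if virtual_index.is_integer() else "fractional"
    print(f"Percentile {q}: virtual index {virtual_index} ({kind})")
```

Only the 0th and 100th percentiles land on a whole index here (0.0 and 19.0). The others (4.75, 9.5 and 14.25) fall between two real indices.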

What is linear interpolation?

Linear interpolation is a method in which you construct a line between two data points and use it to determine unknown values.

When the data is 2-dimensional, you can draw the line on an X-Y plot.

A line between two datapoints that can be used for linear interpolation. Source: own image.

This is useful when you have a datapoint for which you only know the value of one dimension and want to determine the value of the other dimension.

A line between two datapoints that can be used for linear interpolation. Source: own image.

In our example above, we don’t know what value corresponds to the virtual index 14.25. But luckily the virtual index lies between two known datapoints in the sorted data, one with index 14 and one with index 15, so we can use linear interpolation in this case.

By checking what Y-value is on the line where X = 14.25 we can find the unknown value.
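Reading a value off the line can also be written as a tiny function. This is a minimal sketch of generic two-point linear interpolation; the function name interpolate and its parameters are my own choice, not something numpy exposes. The known points used here are index 14 with value 5 and index 15 with value 6, which are the values at those positions once the data is sorted.

```python
def interpolate(x0, y0, x1, y1, x):
    """Return the y-value at x on the straight line through (x0, y0) and (x1, y1)."""
    if x0 == x1:
        raise ValueError("the two known points need different x-values")
    slope = (y1 - y0) / (x1 - x0)  # how much y changes per step in x
    return y0 + slope * (x - x0)


# X is the index, Y is the value stored at that index.
value = interpolate(14, 5, 15, 6, 14.25)
print(value)  # 5.25
```

Because neighbouring indices are always exactly 1 apart, the slope is simply the difference between the two values. That is why the percentile formula in the next step can skip the division.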

In the next step we will determine the unknown value with linear interpolation mathematically.

Step 2 — Use Linear Interpolation to determine the percentile

The formula to find the percentile with linear interpolation is:

percentile = a + (b-a) * frac

But what to fill in for a, b and frac?

  • a is the value of the datapoint on the left of the virtual index, so the value at index 14.
  • b is the value of the datapoint on the right of the virtual index, so the value at index 15.
  • frac is the decimal part of the virtual index 14.25, so 0.25.

It is important to sort the data before looking up the values at the indices, because percentiles are based on sorted data.
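To see why sorting matters, the sketch below applies the formula to the same indices twice: once on the unsorted list and once on the sorted list. The helper interpolate_at is a hypothetical name used only for this illustration.

```python
data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]

virtual_index = 0.75 * (len(data) - 1)  # 14.25
i = int(virtual_index)                  # 14
frac = virtual_index - i                # 0.25


def interpolate_at(values, i, frac):
    """Apply a + (b - a) * frac to the neighbours at index i and i + 1."""
    a, b = values[i], values[i + 1]
    return a + (b - a) * frac


unsorted_result = interpolate_at(data, i, frac)        # uses 20 and 5
sorted_result = interpolate_at(sorted(data), i, frac)  # uses 5 and 6

print(unsorted_result)  # 16.25
print(sorted_result)    # 5.25
```

Only the sorted version gives 5.25. The unsorted version just interpolates between whatever two values happen to sit at positions 14 and 15.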

Let’s write some Python code to see if we can get the same value as the percentile function:

data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
percentile = 0.75
distance = len(data) - 1
virtual_index = percentile * distance

sorted_data = sorted(data)
print(sorted_data)

i = int(virtual_index)  # Value a can be found at this index: int(14.25) = 14
j = min(i + 1, len(sorted_data) - 1)  # Value b can be found at this index: 14 + 1 = 15 (capped at the last index)
frac = virtual_index - i  # 14.25 - 14 = 0.25
a = sorted_data[i]  # sorted_data[14] = 5
b = sorted_data[j]  # sorted_data[15] = 6
percentile = a + (b - a) * frac  # 5 + (6 - 5) * 0.25 = 5.25
print(f"Value at index {i} is: {a}")
print(f"Value at index {j} is: {b}")
print(f"The value of frac is: {frac}")
print(f"The found percentile is {percentile}")

The code above prints:

[2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 8, 8, 12, 20]
Value at index 14 is: 5
Value at index 15 is: 6
The value of frac is: 0.25
The found percentile is 5.25

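When the virtual index is a whole number, the percentile is just the value at that index. The formula gives the same result, because frac is 0 and so a + (b-a) * frac is just a.

One edge case to keep in mind: for the 100th percentile the virtual index is the last index, so there is no datapoint to its right and index i + 1 must not run past the end of the list.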
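The two steps can also be wrapped in one small function. This is a sketch of the approach described in this post, not numpy's actual source code; the name percentile_linear is made up, and capping j at the last index is my own way of handling the 100th percentile.

```python
import math


def percentile_linear(values, q):
    """Return the q-th percentile (0-100) of values using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")

    sorted_values = sorted(values)
    last_index = len(sorted_values) - 1

    virtual_index = q / 100 * last_index
    i = math.floor(virtual_index)  # index of a, left of the virtual index
    j = min(i + 1, last_index)     # index of b, capped at the last index
    frac = virtual_index - i       # decimal part of the virtual index

    a, b = sorted_values[i], sorted_values[j]
    return a + (b - a) * frac


data = [4, 2, 5, 4, 5, 6, 4, 4, 4, 5, 8, 4, 12, 5, 20, 5, 4, 8, 4, 3]
results = {q: percentile_linear(data, q) for q in (0, 25, 50, 75, 100)}

for q, value in results.items():
    print(f"Percentile {q}: {value}")
```

For our example data this gives 2.0, 4.0, 4.5, 5.25 and 20.0. If you have numpy installed, you can compare these with np.percentile(data, q); with the default 'linear' method the results should agree, apart from possible tiny floating-point differences on other datasets.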

That’s it! I hope you now understand how numpy’s percentile function uses linear interpolation, by default, to calculate percentiles.

Thank you for reading!

I hope my post was helpful to you!

To learn more about Python and programming, follow this page and check out my E-books:


📖 My E-Books: amazon.com/author/better-everything ✅Programming, Data & Business ✅Automation & Optimization ✅Knowledge & Inspiration