Bucketing technique for calculating median and k-th percentile of a large dataset

Mikko Nylén
4 min read · Oct 19, 2018


Median and k-th percentile are useful statistics for understanding data, but the traditional algorithms for calculating them require the full dataset to be present. This can be troublesome in scenarios where there’s just too much data to fit into memory.

I first came to think of alternative ways to solve this problem when I was working on a job interview question. The problem setting was that a web server kept track of request durations (with the timeout set at 30,000 ms) and needed to be able to report the median over 7 days of traffic. This particular web server was being hit pretty hard: it was averaging 30,000 requests per second. Needless to say, keeping all the request durations in memory for 7 days with this much traffic would require quite a lot of RAM:

30,000 requests/sec * 3600 secs/hour * 24 hours/day * 7 days
= 18,144,000,000 request durations to keep track of
or 34 GB using 16-bit integers

That’s a tad too much memory for just being able to monitor request durations on a web server.

Now, we can do better. There’s actually nothing in the median and k-th percentile algorithms that requires keeping the original data set around, let alone in insertion order. All we need is the data set sorted in ascending order. Let’s look at the algorithm:

fn median(values) {
  sorted_values = sort(values)
  value_count = count_values(values)
  if (value_count == 0) {
    ERROR can't calculate median of an empty dataset
  } else if (value_count % 2 == 0) { # value count is even
    middle_index = value_count / 2
    sum_adjacent = nth(sorted_values, middle_index - 1) +
                   nth(sorted_values, middle_index)
    return sum_adjacent / 2
  } else {
    middle_index = value_count / 2 # integer division
    return nth(sorted_values, middle_index)
  }
}
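
For reference, here is the same textbook algorithm as a small, runnable Python sketch (the names mirror the pseudocode; this is just an illustration, not the original implementation):

def median(values):
    # classic median: needs the full, sorted data set in memory
    sorted_values = sorted(values)
    value_count = len(sorted_values)
    if value_count == 0:
        raise ValueError("can't calculate median of an empty dataset")
    middle_index = value_count // 2
    if value_count % 2 == 0:
        # even count: average the two middle values
        return (sorted_values[middle_index - 1] + sorted_values[middle_index]) / 2
    # odd count: take the single middle value
    return sorted_values[middle_index]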

So we need a way to count the values and to access the nth value in the sorted values array.

To save some memory, we could compress the values into a frequency map instead: count the frequency of each duration (bucket) and discard the individual values. When a request is processed, we just increment the appropriate bucket.

For example:

value_frequencies = {
  0ms: 1,
  1ms: 1,
  3ms: 4,
  ...
  400ms: 1
}
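
In Python, for example, the frequency map could be nothing more than a collections.Counter keyed by the duration in milliseconds (a sketch of the idea, not the exact structure shown above):

from collections import Counter

value_frequencies = Counter()

def record_request(duration_ms):
    # one bucket per distinct millisecond value, O(1) work per request
    value_frequencies[duration_ms] += 1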

Counting the total number of values (value_count) from this structure is as easy as summing the bucket counts together:

fn count_values(value_frequencies) {
  return sum(values(value_frequencies))
}

Getting the sorted_values is a little bit trickier. We don’t really want to reconstruct the list, because that would mean creating an array that holds 34 GB of data, which is exactly the memory consumption we wanted to avoid. What we can do instead is replace the nth function with a sorted_nth function that answers the question “what is the nth item in the sorted data set?”:

fn sorted_nth(value_frequencies, n) {
  bucket_start_index = 0
  unique_values = sort(keys(value_frequencies))
  for (bucket in unique_values) {
    frequency = value_frequencies[bucket]
    bucket_end_index = bucket_start_index + frequency
    if (n < bucket_end_index) {
      return bucket
    } else {
      bucket_start_index = bucket_end_index
    }
  }
}
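
Here are both helpers as a Python sketch over a plain dict or Counter (again, an illustration under the same assumptions as the earlier snippets):

def count_values(value_frequencies):
    # total number of recorded values across all buckets
    return sum(value_frequencies.values())

def sorted_nth(value_frequencies, n):
    # walk the buckets in ascending order, tracking where each bucket
    # would start and end in the (never materialized) sorted array
    bucket_start_index = 0
    for bucket in sorted(value_frequencies):
        bucket_end_index = bucket_start_index + value_frequencies[bucket]
        if n < bucket_end_index:
            return bucket
        bucket_start_index = bucket_end_index
    raise IndexError("n is out of range")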

As my CS professor used to tell us, working through how this works is left as an exercise. Just kidding. It’s quite simple, though. In a loop over all unique buckets (ordered ascending) we calculate two indices:

  1. bucket_start_index - the first index (inclusive) at which that bucket would appear in the sorted data set
  2. bucket_end_index - the last index (exclusive)

If the n we are looking for falls between these indices, we’ve found what we’re looking for and can just return the bucket. Otherwise, we advance bucket_start_index to the bucket_end_index and move to the next bucket in order.

We can quickly apply this to some test data.

> sorted_values = [0ms, 1ms, 1ms, 2ms, 3ms, 3ms, 3ms, 3ms, 4ms, 4ms]
> value_frequencies = frequencies(sorted_values)
=> {0ms: 1, 1ms: 2, 2ms: 1, 3ms: 4, 4ms: 2}

Let’s say we’re looking for n = 4; that means we want the 5th value (indexing starts from 0). From sorted_values we can see the right answer is 3ms. If we table out the execution of each loop round, we can confirm our sorted_nth returns the same answer:
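
round  bucket  frequency  bucket_start_index  bucket_end_index  n < bucket_end_index?
1      0ms     1          0                   1                 no
2      1ms     2          1                   3                 no
3      2ms     1          3                   4                 no
4      3ms     4          4                   8                 yes, return 3ms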

As expected, the loop terminates on the fourth round, because the bucket 3ms resides at indices [4, 8), and 4 ≤ n < 8.

The k-th percentile, as promised

Now that we have implemented the memory-efficient sorted_nth, we can easily come up with an implementation for the k-th percentile as well:

fn kth_percentile(value_frequencies, k) {
  value_count = count_values(value_frequencies)
  if (value_count == 0) {
    ERROR can't calculate kth percentile of an empty data set
  } else if (k >= 100) {
    ERROR k must be less than 100%
  } else {
    index = (k / 100) * value_count
    if (index % 1 == 0) { # index is a whole number
      sum_adjacent = sorted_nth(value_frequencies, index - 1) +
                     sorted_nth(value_frequencies, index)
      return sum_adjacent / 2
    } else {
      return sorted_nth(value_frequencies, floor(index))
    }
  }
}
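
And a matching Python sketch, with a quick check against the test data from above (an illustration rather than the article’s exact code; it reuses the count_values and sorted_nth sketches from earlier):

import math

def kth_percentile(value_frequencies, k):
    value_count = count_values(value_frequencies)
    if value_count == 0:
        raise ValueError("can't calculate kth percentile of an empty data set")
    if k >= 100:
        raise ValueError("k must be less than 100")
    index = (k / 100) * value_count
    if index == int(index):
        # the cut point falls exactly on a boundary: average the two neighbors
        index = int(index)
        return (sorted_nth(value_frequencies, index - 1) +
                sorted_nth(value_frequencies, index)) / 2
    return sorted_nth(value_frequencies, math.floor(index))

# with the test data from above:
# value_frequencies = {0: 1, 1: 2, 2: 1, 3: 4, 4: 2}
# kth_percentile(value_frequencies, 50)  ->  3.0, the same as the median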

Some considerations

This technique works great when there’s a small, bounded number of unique values in the data set. In this particular case the requests have a clear minimum and maximum duration, and we track them only at millisecond precision. Thus there can be only 30,001 unique values in total, and it’s totally fine to keep 30k counters in memory.
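
For a rough sense of scale (assuming 64-bit counters, since the per-bucket counts can grow into the billions over 7 days):

30,001 buckets * 8 bytes per counter
≈ 240 KB, versus the 34 GB needed to keep the raw durations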

But in some other cases this might be a problem: for this algorithm to work well, you’d want the number of unique values to be orders of magnitude smaller than the total number of values. That can easily fail to hold if there’s no lower or upper limit, or if the values are arbitrary decimal numbers.
