Median Finding Algorithm

Allen Huang
7 min readMar 25, 2020

--

Suppose we have an array: [ a1, a2, a3, a4 …]. We want the index i that there are [n/2] numbers larger than ai.

Firstly, what about using a sort algorithm and then find the middle index? That may be a good idea with an O(nlogn) time complexity, however, today we will look at two better algorithms, not only can achieve an O(n) time complexity, but also can be applied to a wider range of the problem.

Randomized Quick Select Algorithm

Assume that items in our array are all distinct, which is for simplicity. We can firstly choose a random element ai in the array, and call it our pivot. Then we compare each item in this array with our pivot and put these items in two different subarray.

LESS = {aj such that aj < ai}

GREATER = {aj such that aj >ai}

|LESS| + |GREATER| = n-1

*|LESS| means the length of LESS

We can use this algorithm to find the k-th smallest element in our array.

If K ≤ |LESS|, that means our target must in the LESS set, so we just need to find the k-th smallest element in LESS. If K = |LESS| + 1, our pivot is the answer! Otherwise, we need to find the (K -|LESS|-1)-th smallest item in GREATER. The key of this algorithm is, we only recurse on a part of our array.

Let’s look at a specific example, suppose our array is [1, 3, 5, 4, 10, 6], and 4 is randomly select as our pivot.

LESS = {1, 3}

GREATER = {5, 10, 6}

If our target k is 2, 2 ≤ |LESS|, we find the 2nd smallest item is LESS, which is 3. If our target is 3, 3 =|LESS| + 1, our pivot 4 is the answer. If our g=target is 5, we already find our that our target is not in LESS, and it’s not our pivot, so we already have (1 + |LESS|) items smaller than our target. Finally, the 2nd smallest item in GREATER is our final answer.

The idea is very simple, especially similar to quicksort algorithm. The following code is my implementation of the quick select algorithm using Java.

Now, we are going to bound the running time of this algorithm. Here, we use the mathematical induction to prove that the expected number of comparisons for QuickSelect is at most 4n.

Firstly, we define T(n) as the following formula, T(n,k) means the expected number of comparisons to find the k-th smallest item in an array of length n, maximized over all arrays.

We can easily find out that T(n) is a non-decreasing function of n, because as our array size increase, we need to execute more comparisons.

Base case: T(1) = 0, when we have an array size of 1, we don’t need to do anything! Therefore, T(1) < 4*1

Inductive hypothesis:T(n-1) < 4(n-1)

In order to find the upper bound, we assume that we always recurse on the larger half. In order to calculate T(n), the first component is after we randomly select a pivot, we need to compare our pivot with other items in our array, which result in n-1 comparisons. Then, we recurse on LESS or GREATER part of our array.

Let’s look at our example, we have a 4 length array. |LESS| +|GREATER| = 3. Depend on our pivot, how many results we might have?

(|LESS|,|GREATER|) = (0,3) or (1,2) or (2,1) or (3,0)

We have four possible results of |LESS| and |GREATER| group. Like I said before, we are going to recurse on the larger part, which means, we recurse on 3, and then 2, then 2, and finally find our result in 3.

If we have an array with length 8, what’s the possible result of |LESS| and |GREATER|?

We have 8 possible results, the length of the new array that we recurse in has 4 possible value, which is n-1, n-2, n-3, and n/2. Therefore, we have n/2 possible value of i for T(i) and the possibility of each value is n/2.

T(n) equals n-1 (compare each item and our pivot) plus the expected T(i), which is our recursion part.

Here I am going to explain the third row:

The right-hand side is the average of i from n/2 to n-1.

Deterministic Select Algorithm

The above algorithm use randomness (randomly select pivot), now we look at how to perform O(n) comparisons without use randomness. The idea is, we want to deterministically select the pivot rather than randomly select. What if we select the median as our pivot? and then the LESS and GREATER subarray have the same length. That’s definitely perfect! However, it’s pretty hard to achieve.

Therefore, we give ourselves leeway by assuming the pivot can be somewhere that is roughly in the middle of our array. More specifically, at least 3/10 of the array below the pivot and 3/10 of the array above the pivot.

Like the above example, our pivot can be 7, 8, 10 or 15. How can we achieve this?

  1. Firstly, we group the array into n/5 group of size 5, and find the median of each group.

2. Recursively, we find the median of medians, call this p.

3. Use p as a pivot to split the array into |LESS| and |GREATER|

4. Recurse on the appropriate piece.

In a nutshell, there are two recursion in this method, one is finding the median of the median, and another is using quick select.

For example, the have an array with 15 items, we firstly group it into 3 groups, and find the median of each group, which are 8, 10 and 9. Then we find the median of these three medians, which is 9. That’s our pivot!

*red boldface numbers are the medians of each group

Why this algorithm work?

*red boldface numbers are less than our pivot, purple boldface numbers are greater than our pivot.

Suppose we have g groups. In this case, g equals 3. We have at least [g/2] groups (the group that it’s median is less than or equal to our pivot) that contain at least 3 element that is less than or equal to our pivot. In the above chart, our pivot (median of median) is in the green group. In the yellow group, there are 3 elements less than less or equal to our pivot, and in the purple group, there are 3 elements greater than or equal to our pivot.

Let’s prove this linear time algorithm!

As before, we define T(n,k) as the worse case time to find k-th smallest element in an array. In the first step, we have n/5 groups, for each group, it takes us O(1) to find the median of 5 items. In the second step, the size of the median finding is reduced, which will take us T(n/5). And the third step needs n-1 comparisons, so it’s an O(n).

Because we assume that at least 3/10 items are below our pivot, so the smallest value of |LESS| are 3n/10, and the largest value of |GREATER| is 7n/10. Therefore:

c is a constant that greater than 0. Let’s look at the stack of bricks of the recursion tree!

When we continuously expand this formula, we can find the rule:

That’s a Geometric series! Which means it is at most 10cn. If p is between 0 and 1, we can have:

The key property of this algorithm is n/5 + 7n/10 < n. And that’s why our recursion works! Therefore, we have the theorem that for constant c and a1, …, ak such that a1 + … + ak < 1, the recurrence

solves to T(n) = O(n).

Questions: What about divided our array into groups that contain 3 elements? Can this algorithm still work?

Well, let’s try. Find the median of medians takes us T(n/3), and in order to recurse on the larger side, we have:

There are at least n/3 items below our pivot, and the above part is 2n/3. Therefore, our final formula is:

because n/3 + 2n/3 equals 1, our recursion cannot work in this example.

Finally, let’s implement Deterministic Select in Java!

--

--