Started From the Bottom Now We’re Here: Problem Solving with Algorithms
Median Select: Not Your Average Algorithm
Today, I want to talk about problem solving like an algorithmist. My spell checker is yelling at me with red squiggles, but algorithmist is totally a word, and an algorithm is just a sequence of steps (a recipe, if you will) to get from a problem to a solution.
Furthermore, an algorithm should:
- solve a generalized version of the problem, and
- be efficient
This means that instead of thinking about Jack has 7 apples, Jill has 10 apples, how many apples should they give to Patience?, we want to solve Jack has x apples, Jill has y apples, and the objective of the algorithm is always to decide how much food Patience should get. I’m kidding. We’d also like to be able to solve the problem quickly, and we’d like to be able to express what “quickly” means.
Now this is looking like math, and it is math. An implementation of an algorithm is a function, and a function takes input (our problem domain expressed as some variable(s)) and gives back an output (our solution expressed as some other variable(s)).
In the rest of this post I’ll be using standard notation for algorithm analysis (i.e. big-O), and we’ll just be talking about analysis in terms of time. Basically, big-O describes the worst-case performance of an algorithm, that is, how long it takes to solve an infinitely large version of our problem. Usually that time depends on the input, so for example, if n describes the size of our input, saying that some algorithm is O(n) means that the time it takes to run it has a linear relationship with n (grows linearly with n). Here’s the mathematical definition.
Okay, you’re bored — let’s try an actual problem! In this post, I want to look at just one problem, and think about how we might solve it in different ways, each time trying to be a little bit smarter and improve our algorithm until it is as efficient as possible.
Find the median of a list of numbers
The median is the “middle” number in a sorted list of numbers (not to be confused with the mean/average). So, if we have a sorted list (either smallest to largest or largest to smallest will work), and we know the size of the list, let’s just grab the number at position size/2, and then we’re done. If the size of the list is even, we’ll grab the two middle numbers and average them. Boom.
That’s too easy, so let’s assume that the list is not sorted.
Take #1: Brute Force
You probably already have an idea of an algorithm to solve this problem and I can guarantee it’s not the one I’m about to suggest. Bear with me.
The “brute force” approach in computer science refers to generating all possible solutions, and checking them one by one until we’ve found the right one.
For the median search problem, what are all our plausible solutions?
Well, our solution is a number, and we have a list of numbers to consider. Any one of them could be the median, so let’s just take all of them and test ’em by asking each number: are you the median?
Now we have to figure out how this isMedian? function might look. One idea is, well, let’s take the candidate we want to test, and the list of numbers minus the candidate. We’ll go through the list of numbers and count how many are smaller than the candidate, and how many are bigger. If those numbers are equal, great — this number is our median. (Some minor adjustments to be made of course for the case where we have an even number of numbers and have to find two to compute the median, but the intuition is the same.)
How does this perform? Each time we check a potential median, we have to walk through the entire list of numbers — that costs n. And we have n of these candidates! So our algorithm runs in O(n²).
Take #2: Sorting
We already talked about how wonderful this problem would be if our list of numbers was sorted. So let’s sort it! And hopefully the time it takes to sort will be less than O(n²).
The most obvious way to sort a list of numbers in ascending order is to go through it, find the smallest number, stick it at the front, find the second smallest number, stick it in second place, rinse and repeat.
This algorithm is called selection sort. How long does it take?
Well, finding the smallest number could require scanning the entire list if it was hiding at the end. Similarly, the second smallest number could be at the end of the remaining n-1 elements. We have to do this for all n numbers, each time potentially costing us n, giving us an overall performance of O(n²).
So this is no better than brute force.
Well, I’m going to cheat a little bit. There happens to be a whole class of sorting algorithms, some of which perform much better than selection sort. One of these is quicksort, and yep, it’s quick — it runs in O(nlogn). (In case you were wondering, this is the best/fastest we can do: no sorting algorithm can do it more efficiently. Why not? Here’s a great explanation.)
How does quicksort work?
Basically, we want to split up our list. We’ll randomly pick a “pivot” from the list, reorder the list such that all elements smaller than the pivot are to the left of the pivot, and all elements larger are to the right. Note that the numbers in the two partitions are not yet sorted within themselves!
But now we know that the pivot number is in its final, correct spot.
And we’ve broken up our problem into two smaller sub-problems, and all we have to do is run quicksort on those smaller sub-problems. At some point we’ll have sub-problems of size 1 (the sub-list only contains one element, which is by definition sorted), and then we’re done. I’ll leave it to you to see why this is O(nlogn).
Take #3: Quickselect
Too bad, I’m still not satisfied with O(nlogn). What else can we do?
Let’s revisit the idea of partitioning the list that we were doing when employing quicksort. Quicksort is a divide and conquer algorithm, and the cool thing that happens whenever we divide the list is that the “pivot” ends up in its rightful place in the final, sorted list.
What if we chose a pivot such that we ended up with exactly half of the numbers on the left of the pivot, and the other half on the right? Since we know that the pivot is in the correct spot, and that spot happens to be the median spot, we know the pivot is the median! And we don’t have to bother doing anything else with the two partitions.
Now we said before that we “randomly” pick pivots. If we could somehow always pick the pivot that happens to be the median, we wouldn’t be trying to come up with algorithms to do it ;).
But this is still helpful because after we pick a pivot, we know which partition the median is in! The median is always in the size/2-ith position. If our pivot position is to the left of that, we only have to focus on the right partition. If our pivot position is to the right, we only have to look at the left partition.
This way we’re able to kind of throw away and just avoid doing work on the partition we know the median doesn’t belong in. Eventually, we will pick a pivot that ends up in the median position, and we’ll return that.
Isn’t this also O(nlogn) if we’re doing quicksort-y stuff?
On the first run we’ll look at n numbers. But on the second run, making some assumptions about picking reasonable pivots on average across all pivots (a pretty good assumption since we’re choosing randomly), we’ll only have to look at n/2 numbers. On the third, another half of that — n/4.
So we’ve got n + n/2 + n/4 + … + n/x with x approaching infinity. This summation approaches 2n, which is O(n). This is fantastic — intuitively, we kind of know we can’t get any better than linear time since just looking through the list takes n time, and we definitely have to see everything in the list to find the median. We’ve optimized this algorithm to the max!
Take #4: Deterministic median-of-medians
One weakness of quickselect is that it’s a randomized algorithm. Remember that we’re always picking our pivots arbitrarily, so our algorithm is sensitive to those choices. This means that even though it is fast most of the time, sometimes you really can get bad pivots that force you to recursively run quickselect many times and slows things down.
Sometimes, it’s more important to be able to say an algorithm is consistent and predictable rather than “usually fast”. This is what deterministic means in this context.
Can we come up with something that is deterministic and runs in linear time?
If we can guarantee a “reasonable pivot”, then the partitioning concept from quickselect will still work. In particular, we want to be able to throw away lots of numbers with every partitioning. In other words, a good pivot will help us quickly prune the list of numbers that we’re interested in (i.e. the sublist that the median is in).
This is the basis for the median of medians algorithm:
- Separate the list of numbers into small and equal-sized groups (groups of 5 is a nice choice).
- Find the median of each group. (The groups are small, so we can assume the work here is O(n)).
- Find the median of all those medians, let’s call this M.
- Use M as our pivot.
How do we know M is a good choice for a pivot?
Well, we know that the partition to the left of M contains the medians from roughly half the groups (since M is the median of those medians). And within those groups, about half of the numbers in each group are less than the median (still just using the definition of median here). So that seems like a fair amount of numbers we can throw away if the median is in the partition to the right of M! Since we can make the exact same claim for the partition to the right of M, no matter which partition we keep working on, we’ve reasonably reduced the size of our list.
Since we’re still basically using quickselect, our performance is still O(n). (If you’re not convinced, check this out for a great rigorous proof using recurrence relations and all that good stuff!)
We’ve done it and NASA should hire us.
That was a lot of work. Why do we even care about the median search problem?
If you don’t care about medians, I don’t blame you (I don’t have much use for them either). But what I definitely care about is this approach to coming up with better and better algorithms. I’ve presented these four algorithms in such a way that the why and how we get from #1 to #4 is somewhat recognizable, but unfortunately there isn’t always a systematic progression from brute force to something clever. Sometimes, you’ll recognize optimizations you can do from looking at the brute force solution, but often not. (Can you see much resemblance between brute force and median-of-medians? I sure can’t.)
Another thing to keep in mind is that good algorithms are meant to be applicable to lots of problems. For example, quickselect doesn’t just work on medians, it can select any k-ith element in linear time. But even more general than that, like I talked about before, we group together algorithms that do similar things into classes, such as divide and conquer, that form the foundation for lots and lots of other algorithms.
And as I’m learning in my Algorithm Design and Analysis class, if you’re able to identify a problem as being similar to another, ideally one that you have a good algorithm for, it’s just a matter of reducing from one problem to another, and then running your algorithm on it. More on that later :)
This whole post was inspired by a conversation I had with my friend Kuba. Thanks for always being willing to sit down with me and teach me cool things. :)