Sorting Things Out (Part 2)
In Part 1, we learned how to sort a collection of items in a way that mirrors how we think about sorting objects in real life.
We used an algorithm called selection sort, but when you use the sort function in your favorite language, it probably doesn’t use this algorithm. The reason is that it’s very slow.
Let’s assume that we’re running our selection sort algorithm on a computer that takes 1 second to sort a list with 1,000 items. That might not sound so bad right now, but if the list grows to 1 million items, it would take over 270 hours to sort the list!
You might be thinking that 1 million items sounds like a lot, but there are a lot of situations where you could be working with more than a million items:
- In 2015, there were about 218 million licensed drivers in the United States. That means there are about 218 million records for all those drivers in a database somewhere.
- In the United States, over 453.7 million different Social Security numbers have been issued. All of those records are probably stored in a database too.
- India currently has a population over 1.3 billion. We probably don’t want to alphabetize all those names using selection sort.
In this article, we’ll learn about a new approach to sorting that is much faster. It uses a problem solving technique called divide and conquer.
Divide and Conquer
In computer science, divide and conquer is used to describe the process of breaking up a problem into smaller sub-problems that are easier to solve. By solving the smaller problems, we end up with a solution to the larger problem.
First, let’s see if we can apply this idea to a real-world problem. Let’s imagine that we’re throwing a party, but before people come over, we need to clean the entire house. Ugh, this seems like a lot of work!
… but what if you broke up the task of cleaning the entire house into smaller tasks?
Instead of one large task, we now have the following sub-tasks:
- Clean the living room
- Clean the kitchen
- Clean the dining room
- Clean the bedroom
By completing these smaller tasks, the task of cleaning the house is also completed. Another benefit of divide and conquer is that if our friends want to help, they can each take a room and get it done even faster.
Let’s reframe this analogy. When we’re writing code, the tasks in question are really just our code… and our friends are computer processors (or CPUs)!
Most modern computers have multiple processors (or multiple cores) that let your computer run multiple things at the same time. For example, the iPhone 8 has 6 CPU cores!
Divide and conquer algorithms are a great way to take advantage of all that computing power.
Breaking Up the Sorting Problem
One of the hard parts about the divide and conquer technique is learning how to reframe the problem. When the problem that you’re facing is conceptually difficult, the breakthrough usually comes from realizing that there is a case where the sub-task is trivial to solve.
So what’s the problem? Just like in Sorting Things Out: Part 1, the problem is that we have an unsorted collection of items and the goal is to sort them (but faster).
Can we think of any cases where the task is trivial?
I often think of this step as “stating the obvious”.
We can start by making a list of things that we know. Here are a few ideas related to sorting:
- an empty list is technically sorted because there’s nothing to sort
- a list with one item in it is already sorted
Wait. I think we’re onto something.
In Part 1, we used the example of sorting index cards.
What if we split up the problem so that we hand all of our friends a single card? Since they only have one card, their work is already done!
Conquering the Sorting Problem
At this point, your friends are all holding a single index card and you’re wondering how that helped at all.
Once the work has been divided up and completed, we need to stitch the results together. This is how we conquer the problem!
How do we combine subtasks?
In the divide step, we broke down the input so that everyone only needed to sort a list with one item.
In this step, we need to combine those items. To do this, all we have to do is compare the two items and return a single larger list where they are sorted.
The key insight here is that the two smaller lists that we’re combining are already sorted. Every time we combine two lists, we create a larger list that’s also sorted.
The combine step merges the two smaller lists together by taking the smaller of the two items off of the smaller lists.
That’s the name of this algorithm: merge sort!
If you combine two sorted lists this way, you end up with a single sorted list that contains both of the smaller lists, but you didn’t have to compare all of the items! You’re only comparing the first item in each of the lists — and taking the smaller (or larger) one based on the sort order.
In Part 1, we learned that selection sort has a time complexity of O(n²), which is very slow when you’re dealing with a large number of items.
Now let’s figure out the time complexity of the mergesort algorithm.
There are two main parts to the mergesort:
- Divide the input until we have a sorted list
- Merge the sorted lists together into a single sorted list
Divide the Input
When we break up the work, it makes sense to divide the input in half.
That way, the work is distributed fairly. If we needed to alphabetize 100 books, it wouldn’t be fair to give one person a single book and give the other 99 books to someone else. It would also take longer because the person with one book would already have their book(s) sorted and they’d just have to wait around until the other person finishes sorting all their books.
We can break down the input fairly by splitting it into two equally sized groups. Let’s see how many splits we need until we have a list with one item in it (that is by definition sorted):
- Split 1: 100 items
- Split 2: 50 items
- Split 3: 25 items
- Split 4: 12 items (the other has 13 items)
- Split 5: 6 items
- Split 6: 3 items
- Split 7: 1 item (the other half has 2 items)
Let’s graph the number of splits that we need based on the size of the input:
This graph shows how many times the input needs to be split up before we have a list with 1 item in it. It has the shape of a logarithmic curve, which is written as O(log(n)) in Big-O notation.
Merge the Sorted Lists Together
When we merge two lists together, we need to iterate over every element in both of the lists.
The general algorithm is to check the first element of each list and build a new list by taking the smaller of the two items at each step.
In the “divide” step of the mergesort algorithm, we kept splitting up the input in half. If there are n items in the input list, then there would be n/2 items in the left half and n/2 items in the right half. This is important because when we merge the two lists back together, we have to look at all n items. This is expressed as O(n) in Big-O notation.
We just discovered that our divide and conquer algorithm has the following characteristics:
- Dividing the work has O(log(n)) time complexity
- Combining the sorted lists has O(n) time complexity
The number of times we have to combine lists is related to the number of times that we divided the input into smaller sub-tasks.
We can express this relationship by saying our algorithm has O(log(n)) * O(n) time complexity, but we can simplify the expression into O(nlog(n)).
The following implementation is written in Python because the language has a lot of built-in features that make it easy to read.
We’ve already described how the algorithm works, so let’s take a look at how to implement it:
The merge_sort function contains the high-level algorithm:
- If the input only has 1 item, it’s already sorted, so we can return the input.
- Otherwise, split the input in half and sort those two smaller lists.
- Since the result will be two sorted lists, use merge_lists to combine them into a single sorted list.
In the merge_lists function, we use variables to avoid the overhead of modifying the list. l_index and r_index are variables that store the current position in the left and right lists (respectively). To consume the smallest element off the front of a sublist, we increment the respective index to walk that list.
Merge sort was invented by John von Neumann in 1945 when he realized that a collating machine could take two sorted stacks and combine them into a single sorted stack.
The curious thing about merge sort is that it’s quite old, but people are still talking about it! There’s an article from 2014 about a Parallel In-Place Merge Sort variant in Dr. Dobb’s journal. Merge sort is a divide and conquer algorithm, so if you’re adventurous, you could try to write a version of merge sort that runs on multiple CPUs (or even across multiple computers) to compare the performance.
The other thing that’s interesting about merge sort is that it’s everywhere! Java’s built-in sort method is based on merge sort—and some of the biggest companies in the world use Java (including Google and Netflix), so there’s a good chance that this algorithm is being used when you’re using an Android device or you’re watching your favorite show on Netflix.
This is part of a series that I’m creating to teach the fundamentals of computer science and algorithms in a way that’s approachable to people without a computer science background.