Sorting algorithms are an awkward combination of fundamentally necessary, and deeply contentious. From new engineers looking to impress in an interview to older engineers searching out a solution to a rapidly scaling database, there are myriad factors to take into consideration. What is the speed of comparison between two objects? What’s the swap time? How large is the database? What type of objects does it contain? Is it already semi-sorted? Do the results need to be stable?
Each of these questions may draw out arguments in favor of one algorithm or another. Is the source data large and complex? Most languages default to the standard, Quick Sort with its O( n log n) time complexity. Is it smaller? Insertion Sort works wonders on those. Mostly sorted? Heck, Bubble Sort might almost work for that. If you want to read/visualize the merits of each, check out this comparison by toptal.com.
One sorting algorithm you won’t find on that site, or almost any other, is Tim Sort. This obscure sort is currently unique to Python, and is used as its default sorting algorithm. Call
array.sort in Python, and Tim Sort is what gets executed. In spite of this, it’s rare to find engineers who both know and understand Tim Sort. So: what is it?
Tim Sort was first implemented in 2002 by Tim Peters for use in Python. It allegedly came from the understanding that most sorting algorithms are born in schoolrooms, and not designed for practical use on real world data. Tim Sort takes advantage of common patterns in data, and utilizes a combination of Merge Sort and Insertion Sort along with some internal logic to optimize the manipulation of large scale data.
Why Tim Sort?
Looking at figure 2, we can immediately see something interesting. At its best, Tim Sort outperforms Merge Sort and Quick Sort. At its worst, it runs at comparable speed Merge Sort and actually outperforms Quick Sort. In other words, it’s unexpectedly fast.
In terms of space, Tim Sort is on the worse end of the spectrum, but the space consideration for most sorting algorithms is pretty sparse. O(n) isn’t too rough in most instances; it’s worth noting as a possible deficiency, and the one place where Quick Sort really outshines Tim Sort.
The final item that sorting algorithms are often judged on is stability. Stability is the concept that when sorted, objects of equal value maintain their original order. Now, you might be wondering why we care about that. The items are of equal value — why do we mind how they are ordered?
The simple answer is that stability matters for stacked sorts. That is to say, you first sort based on one criteria, then based on a second one. If you do this in an unstable algorithm, you instantly lose any reliability from your first sort when you run the second one. For reference, Quick Sort is unstable, and Merge Sort is stable.
Tim Sort is also stable, not to mention fast if slightly heavyweight(in comparison to Quick Sort only). While sorting algorithms can (and should) be judged on other considerations, these are the big three.
The Implementation In Three Steps
Tim Sort is complex, even by algorithmic standards. The implementation is best broken down into parts.
1. Binary Search
The first thing you need in order to implement a Tim Sort is a binary search method. This is just used to implement your Insertion Sort later.
For reference: Binary Search Algorithms
2. Insertion Sort & Merge Sort
Secondly, you need to code Insertion Sort and Merge Sort. These are familiar algorithms, and should be in the back pocket of most engineers, but we’ll go over the fundamentals of how they work and why they are valuable to us here.
Insertion Sort is a very basic sorting algorithm. It runs through the array, and whenever it encounters an item that is out of order (strictly less/more than the item before it), it moves it into the appropriate position in the already sorted array. Insertion Sort is notorious for working very quickly on already sorted arrays, as well as smaller arrays. In fact, we can see from Fig 2 that Insertion Sort has an impressive best case run time of O(n). Keep in mind moving forward with Tim Sort: best case for Insertion Sort is an already sorted array. It might sound silly, but that will be relevant.
Merge Sort on the other hand operates by a basic principle: it is exceedingly easy to merge already sorted arrays. So, it splits a starting array in half over and over until it is nothing but single elements. Then it slowly rebuilds the main array by merging those elements back together in sorted order. Because we started from building blocks of size one, it was very easy to build initial sorted arrays. Then, it’s easy to merge them. In the end, we spend O(n log n) time, and (importantly) we do so in a manner that is guaranteed to be stable.
For example implementations see:
Merge Sort: https://www.geeksforgeeks.org/merge-sort/
Insertion Sort: https://www.geeksforgeeks.org/insertion-sort/
3. Implement Tim Sort
The key to understanding Tim Sort’s implementation is understanding its use of runs. Tim Sort leverages naturally occurring presorted data to its advantage. By presorted we simply mean that sequential elements are all increasing or decreasing (we don’t care which).
First we set a
minrun size. What we mean by this is that we want to ensure that all our runs are at least a certain length. Please note that we are not guaranteeing that we will find runs of this size — we’ll get into this later. We’re simply saying that a run must be at least of a certain length.
When we encounter a run, we set it aside. When we find the longest run within a
minrun range. We now have a familiar data structure: a small, sorted array. If it is at least
minrun in length, then huzzah! We’re good to move on. If it isn’t, we put Insertion Sort into play.
You may remember from above that Insertion Sort is especially efficacious on two types of arrays: small ones, and already sorted ones. What we just made is a small, sorted array. If it isn’t at least
minrun in length, we reach ahead and grab enough other elements to complete the run, then use Insertion Sort to push them into our sorted array, quick and easy. Obviously, if a run encounters the end of the array you can let it be a little short.
Once you’ve created all of your runs (i.e. sorted subarrays), you utilize your Merge Sort to join them together. In a best case scenario the entire array is already sorted and Tim Sort is smart enough to know it doesn’t need to do anything else. Other times, it tends to just be extremely efficient. As an added benefit, both Insertion Sort and Merge Sort are stable, so the resulting array is stable.
For those who prefer bullets:
- Establish a
minrunsize that is a power of 2 (usually 32, never more than 64 or your Insertion Sort will lose efficiency)
- Find a run in the first
- If the run is not at least
minrunin length, use Insertion Sort to grab subsequent or prior items and insert them into the run until it is the correct minimum size.
- Repeat until the entire array is divided into sorted subsections.
- Use the latter half of Merge Sort to join the ordered arrays.
Tim Sort is powerful. It is fast and stable, but perhaps most importantly it takes advantage of real world patterns and utilizes them to build a final product. Is it for every situation? Probably not. Good luck programing it on a whiteboard during an interview, and if you just need a quick simple sorting algorithm in a pinch you probably don’t want to bother implementing something this complex. However, for data scientists crunching numbers it is more than worth a look.
For the curious, you can check out the entire Tim Sort code on github.
Thank you one and all to my readers. I appreciate your time, and sincerely hope that you found the content informative. If you have any questions or replies feel free to drop one below.