The magic behind the sort algorithm in Python

Angela Heumann
Women Data Greenhorns
5 min readJul 18, 2018

Algorithms are a method that turns a given input into the desired output. They allow us to change data structure in many different ways. While the theory behind these algorithms can get really complex, we are going to keep the examples rather simple.

In order to achieve this while working with the wonderful programming language Python, we will use the sort function. The subsequent pages will show how it works, how to use it and how to alter the results, based on special requirements or preferences.

The theory behind the sort algorithm

1. Insertion Algorithm

The insertion algorithm uses an in-place comparison-based technique. Its average and worst-case complexity is O(n²).

With the insert algorithm, we start out comparing the first two values. The lower value is added to the sorted sublist. Then each further value pair repeats the same process. The result of the new value pair comparison backtracks the sorted list values until it finds its correct place. This can reduce the size of the already sorted sublist if necessary.

We’ll use the following list of numbers for the explanation: numbers = [6, 4, 3, 10]. In the examples, the first array represents the sorted list while the second array still needs to be sorted.

Step 1

Compare 6 and 4
1.1 -> [] [4, 6, 3, 10] // 6 and 4 needed to swap places
1.2 -> [4] [6, 3, 10] // 4 is now part of the sorted list

Step 2

compare 6 and 3
2.1 -> [4], [3, 6, 10] // 6 and 3 needed to swap places
2.2 -> [] [4, 3, 6, 10] // 3 < 4 -> reevaluate the sorted list

Step 3

compare 4 and 3
3.1 -> [] [3, 4, 6, 10] // 4 and 3 needed to swap places
3.2 -> [3, 4] [6, 10] // 3 < 4 -> no reevaluation needed

Step 4

compare 6 and 10
4.1 -> [3, 4] [6, 10] // correct order already present
4.2 -> [3, 4, 6] [10] // 6 > 4 -> no reevaluation needed

Step 5

check 10
5. 1 -> [3, 4, 6, 10] // correct order already present

2. Merge Algorithm

Merge sort uses the divide and conquer technique. Basically, it divides a list into equal parts until only atomic values remain, then merges them in a sorted manner. Its worst-case complexity is O(n log n).

We’ll use the same list of numbers for the explanation: numbers = [6, 4, 3, 10]

Step 1

split list
1.1 -> [6, 4], [3, 10] // split in equal parts
1.2 -> [6], [4], [3], [10] // split in equal atomic parts

Step 2

merge and sort equal pairs
2.1 -> [4, 6], [3, 10] // 6 and 4 needed to swap places

Step 3

repeat Step 2 until done
3.1 -> [3, 4, 6, 10] // 3 needed reposition

3. Best of both worlds: Timsort Algorithm

The Python way is a hybrid sorting algorithm which is derived from merge and insertion sort. For smaller runs (up to a minimum run size of 64) Timsort internally picks insertion sort, otherwise merge sort is being used. Its worst-case and average complexity is O(n log n), but the best-case performance is O(n).

Use cases of the sort algorithm

The sorting algorithm is implemented as list.sort() or sorted(list). The difference between these two implementations is that list.sort() rearranges the original list while sorted(list) returns a new list.

1. The sort parameters

The sort methods do not require any mandatory parameters but they offer two optional parameters.

  • key
  • reverse
Syntax
sorted(list, key=…, reverse=…)
or
list.sort(key=…, reverse=…)

The key is a function like len, int, str.lower that serves as key for the comparison. The reverse parameter defines the order. The default value is False which means sorting in ascending order. By using reverse=True we reverse the list.

2. Basic examples for sorting a list

Let the following lists be our baseline:

scholarship = [“Udacity”, “Bertelsmann”, “Data”, “Science”, “Scholarship”, “Program”, “women”, “who”, “code”]
string_numbers = [ “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “10”, “20”]

Default sort:

sorted(scholarship)
[‘Bertelsmann’, ‘Data’, ‘Program’, ‘Scholarship’, ‘Science’, ‘Udacity’, ‘code’, ‘who’, ‘women’]

→ standard order/ascending, case sensitive, alphabetical order

Defining custom sort key and order:

sorted(scholarship, reverse=True, key=len)
[‘Bertelsmann’, ‘Scholarship’, ‘Udacity’, ‘Science’, ‘Program’, ‘women’, ‘Data’, ‘code’, ‘who’]

→ reverse sort by length

For all string with the same length, the reverse alphabetical order is applied: Scholarship > Bertelsmann. In case of standard order, the result would be ordered Bertelsmann > Scholarship.

3. Special Cases for sorting a list

Using str.lower as key parameter:

sorted(scholarship, key=str.lower)
[‘Bertelsmann’, ‘code’, ‘Data’, ‘Program’, ‘Scholarship’, ‘Science’, ‘Udacity’, ‘who’, ‘women’]

→ case insensitive order

Using string representations of numbers:

sorted(string_numbers)
[‘1’, ‘10’, ‘2’, ‘20’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’]

→ standard alphabetical order in which 1 is followed by 10

This is possibly an unexpected result, and one of the most commonly seen traps a beginner, or even intermediate programmer, falls into. The expected result is the correct ordering of the numbers. A compiler however does not know that we are looking for the number representation.

There are several ways to solve the issue:

  1. Changing the key to an integer comparison:
sorted(string_numbers, key=int)
[‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘10’, ‘20’]

→ numerical order, same result as the numerical sort of a list of integers

2. Changing the key to a length comparison:

sorted(string_numbers, key=len)
[‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘10’, ‘20’]

→ sort by string length & standard sort is applied for the same length, same result as above

The key=len comparison has a performance advantage since we don’t have to cast for comparison. The method also works for mixed lists, while the key=int version could raise a Value Error in the case of an ill-formatted string, or a plain oversight.

Conclusion

The Timsort algorithm used in Python takes advantage of both Insertion and Merge algorithms and therefore is a solid tool for most use cases in Data Science. Additionally, it does provide a wide range of sorting possibilities in the form of keys, which can be changed to every function and even a self-defined function. This means the user is completely free to adjust the outcome of the sorting algorithm.

Additional tidbit

In Python 3.2 the Timsort algorithm has been updated to run faster when called with a key function. The two arrays of keys and values are now sorted in parallel.

Resources

--

--