Comparison Sorting Algorithms and the Mystery of N log N Complexity
Skip to the end for the tl;dr.
“A comparison sort must have an average-case lower bound of Ω(n log n) comparison operations.” Sounds familiar? I have come across similar sentences in many articles and discussions about sorting algorithms, the most popular family of algorithms in Computer Science. It is usually accompanied by the ultra-mysterious phrase “it has been proved that…”, and I have always wondered: why?! It seems so mathemagical!
Turns out it is not that magical; all you need is some basic discrete mathematics: trees, permutations, logarithm arithmetic, and a bit of logic. This article aims to remove the mist around it. But first, an intro to how I came to find out about this simple(?) proof.
A couple of weeks ago at work, I was asked to find the most suitable way to sync two directories across two different machines. I knew rsync and how powerful it was from previous experience: it simply had a flag for every option you might think of, and it was really fast and efficient. I wondered who made it and what the motivation behind it was. Turns out Wikipedia has a nice article about it.
Andrew Tridgell and Paul Mackerras wrote the original rsync, which was first announced on 19 June 1996. Tridgell discusses the design, implementation and performance of rsync in chapters 3 through 5 of his Ph.D. thesis in 1999. It is currently maintained by Wayne Davison.
Oh, nice. If only I could find Tridgell’s Ph.D. thesis. Let’s find out who he is. Turns out (again from Wikipedia) that he is also the author of the Samba file server. Wow! While searching for who he is, I stumbled upon his personal page, and there it is: the link to his thesis!
Reading the short abstract made me more interested, and I dove into it. It turns out there is a good chapter about sorting algorithms and parallel sorting algorithms. In “1.1 How fast can it go?” he says, “It is well known that comparison-based sorting algorithms on a single CPU require logN! time”. Yeah, but why? Wait, is that “logN!”, not “NlogN”? They are similar, though: in terms of complexity, N! can be bounded by N^N, so log N! becomes log N^N, and that is N log N. Nice, but why is it “logN!” in the first place? He continues, “which is well approximated [Knuth 1981] by N logN”. Knuth? Donald Knuth? OK, this keeps getting more interesting!
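To see how close log N! and N log N really are, here is a quick numeric check of my own (a Python sketch, not something from the thesis). It computes log2(N!) exactly as a sum of logarithms and compares it against N log2 N:

```python
import math

# Compare log2(N!) with N * log2(N) for growing N.
# log2(N!) = log2(1) + log2(2) + ... + log2(N), computed term by term.
for n in (10, 100, 1000, 10000):
    log_fact = sum(math.log2(k) for k in range(1, n + 1))
    n_log_n = n * math.log2(n)
    # log2(N!) is always below N*log2(N), but the ratio climbs toward 1.
    print(n, round(log_fact), round(n_log_n), round(log_fact / n_log_n, 3))
```

The ratio in the last column creeps toward 1 as N grows, which is exactly what “well approximated by N log N” means here.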
Finally, for “why log N! ?” in the first place, he adds, “This limitation arises from the fact that the unsorted data can be in one of N! possible arrangements”. Yeah, that is right for any N elements (assuming unique elements). And then the finishing touch: “and that each individual comparison eliminates at most half of the arrangements from consideration.”
It is well known that comparison-based sorting algorithms on a single CPU require log N! time, which is well approximated [Knuth 1981] by N log N. This limitation arises from the fact that the unsorted data can be in one of N! possible arrangements and that each individual comparison eliminates at most half of the arrangements from consideration. ~A. Tridgell, “Efficient Algorithms for Sorting and Synchronization” (February 1999)
Wonderful: each comparison basically eliminates a branch of a binary tree whose leaves are the different permutations of N numbers. Each comparison is a test to see whether number A or number B goes to some position, and depending on the answer you eliminate half of the remaining possibilities: you are choosing 1 out of 2 possible outcomes with equal probability.
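The halving argument can be made concrete with a tiny experiment (again my own Python sketch, not from the thesis): start with all 3! = 6 permutations of three items, as the candidate arrangements the algorithm cannot yet tell apart, and watch how a single comparison splits that set in two.

```python
from itertools import permutations

# All possible orderings of three distinct items: 3! = 6 candidates.
# The sorting algorithm does not know which one it was handed.
candidates = list(permutations([0, 1, 2]))

def split(cands, i, j):
    """Partition the candidates by the outcome of comparing positions i and j."""
    less = [p for p in cands if p[i] < p[j]]
    greater = [p for p in cands if p[i] > p[j]]
    return less, greater

# First comparison: positions 0 and 1. Whichever answer comes back,
# only the matching half of the arrangements survives.
less, greater = split(candidates, 0, 1)
print(len(less), len(greater))  # prints "3 3": each outcome keeps half
```

Since every yes/no answer keeps at best half of the candidates, pinning down the single sorted arrangement among N! of them takes at least log2(N!) comparisons.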
It is not trivial; a layman would find it difficult to comprehend. But anyone who knows what a permutation, complexity, and a binary tree are should not find it too difficult. It is not too complex for a CS undergraduate in an algorithms course, or for a software engineer familiar with sorting algorithms, and who else would wonder about the complexity? That is what amazes me the most: why is it not talked about, and taught? The people interested in it are mostly capable of understanding it, yet in my experience it has always been given the “it is too complex for this context” treatment.
Include it in discussions of sorting algorithms as extra information. Not everyone will find it interesting; some people even have the “standard libraries already provide efficient algorithm implementations, why should I care about sorting or balanced search tree algorithms?” mindset. Although they are not completely wrong, I am sure knowing these things can help many engineers write better code, or simply make them like what they do even more because of the fancy mathematics in the background.
This was my journey of learning something I had wondered about since I was first introduced to sorting and algorithms in general. I am very excited to have finally found out about it.
Disclaimer: “comparison” and “comparison-based” are really important in this article. Some non-comparison sorting algorithms like Radix Sort or Bucket Sort have better complexities for restricted data sets (radix sort for strings built from a finite character set, bucket sort for ranges of integers). So consume with this in mind.
tldr: For a set of N unique numbers there are N! different permutations (only one of which is sorted), and each comparison eliminates at most half of the remaining permutations, so the best possible comparison-based sorting algorithm has to make at least log N! comparisons. In terms of complexity, N! can be bounded by N^N, so log N! becomes log N^N, which is equivalent to N log N (Stirling’s approximation makes this precise: log N! is Θ(N log N)).
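As a final sanity check (my own Python sketch, not from any of the sources above), we can count the comparisons a real comparison sort makes and verify that its worst case never beats ceil(log2(N!)). Here I use Python’s built-in sort with a counting comparator over every permutation of 5 elements:

```python
import math
from functools import cmp_to_key
from itertools import permutations

def count_comparisons(seq):
    """Sort seq with Python's built-in sort, counting comparator calls."""
    calls = 0
    def cmp(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)
    sorted(seq, key=cmp_to_key(cmp))
    return calls

n = 5
# Information-theoretic lower bound: ceil(log2(5!)) = ceil(log2(120)) = 7.
bound = math.ceil(math.log2(math.factorial(n)))
# Worst case over all 120 possible input arrangements.
worst = max(count_comparisons(list(p)) for p in permutations(range(n)))
print(bound, worst)
assert worst >= bound  # no comparison sort's worst case can go below the bound
```

Note that the bound is on the worst (or average) case: a lucky input, such as an already-sorted one, can be recognized in fewer comparisons, but some arrangement always forces at least ceil(log2(N!)) of them.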