Big-Oh For Algorithms

Explained Intuitively and Precisely

Tyler Neylon
26 min readJan 22, 2014


I learned almost everything I know about programming outside the classroom. Anything about coding is online and free, it seems. Big-oh notation is an exception. I have yet to see an online explanation of this topic that’s great for learning — specifically, that expresses the idea both intuitively and precisely.

This is my attempt to make it possible to learn about big-oh — and learn it well — online. I’m writing for programmers, so the focus is on applications to algorithms.

It’s not so bad if you haven’t already mastered big-oh notation. Before I learned it, I thought it was a technique that elevated you to a status of Jedi master coder. This is not true. Like many tools, it’s useful in specific situations — mainly, whenever you want to carefully measure the time or memory efficiency of a difficult algorithm.

I’ll use sorting algorithms as examples to show how the notation is useful. You probably won’t design new sorting algorithms — but you may want to design a particular function, class, or api with maximal efficiency. You may face code decisions where the trade-offs are not obvious, or difficult to describe precisely. This article explains tools that can help you write the best code in these cases.

A simple example

This Python function computes the average of an array of numbers:

# We expect `numbers` to be a non-empty list of real numbers.
def average(numbers):
sum = 0.0
for i in numbers: sum += i
return sum / len(numbers)

How fast is this function?

Its speed depends on the length of the input, so we can measure the algorithm’s time as a function of this length. Let n stand for the length of the numbers array, and let t(n) represent the number of += operations used — the += operation will act as our time unit for now. One += will occur for each number in the list. We can summarize this idea as

t(n) = n.

We could measure time other ways. For example, suppose that each = or + operation takes 1 time unit, and that division requires 4 time units. Treat a += operation as one = and + for the sake of timing. We can call this new measure u(n); we have

u(n) = 2n + 5.

Both t(n) and u(n) capture the function’s running time — it would be nice if we could express them in a way that de-emphasized the minor differences in measurement methods.

Big-oh notation can do that for us. Both functions could be summarized by writing t(n)=O(n) and u(n)=O(n), where big-oh removes the constants 2 and 5 from u(n). At a high level, an equation like t(n)=O(n) captures this intuitive idea:

The function t(n) grows at a rate proportional to or slower than n, forgetting scalars, any constant term, and small values of n.

This is a mathematically vague and imprecise statement, but by the end of the post, the details will be filled in.

The definition

I have a theory that people don’t really learn something until they use it to solve a problem.But you’re impatient, right? Just tell me the definition of big-oh, you say.

Ok. I will.

But the definition alone does not provide deep understanding. So if you want to really learn this stuff, stick around to the end! It makes much more sense after seeing it used in a few examples.

When used to describe algorithms, big-oh notation typically involves a positive integer value n that we think of as getting larger without bound — in math-speak, it is approaching infinity.


The notation f(n)=O(g(n)) means that there are numeric constants C and N such that all values n with n > N must have |f(n)| ≤ C ∙ |g(n)|.

That’s the mathematical definition. Let’s step back and see where that came from.

f=O(g) means f ≤ g, more or less

Now for the intuition — big-oh is a way to express when certain functions are nicely ordered. Looking at graphs, it’s easy to feel that f(x)=x is somehow less than f(x)=x² or that f(x)=log(x) is less than f(x)=√x. The ordering is not exact — for example 1/2 > (1/2)², so that x isn’t always < x². Yet this ordering is true most of the time, and this vague phrase most of the time is given a mathematically precise meaning using the definition above.

It’s nice to put functions in order since some functions can capture an algorithm’s running time — and we want to compare running times. We want to know when one algorithm is faster than another.

Big-oh is not identical to the ≤ sign — it’s just intuitively similar. It’s true that n=O(n²), but we also have 3n=O(n), and n+100=O(n). Even though 3n > n, they are intuitively growing at similar rates; the same is true of n+100 and n. To see how n+100=O(n) fits the definition, plug in the values N=100 and C=2: as long as n > 100, we have n+100 ≤ n + n = 2n. In this way, big-oh allows us to forget about the +100 part of n+100 — but not the squared part of n² compared to n since they grow at such drastically different rates.

Next we’ll see how this notation is useful for understanding algorithm efficiency; and before the posts ends, we’ll return to the definition to explore its personality.

A tale of two sorts

Suppose you want to choose between mergesort and a simple quicksort algorithm — which one is faster? This is a good question to learn from because the algorithms are intuitively clear, yet their analysis is not completely obvious.

Here’s a simple quicksort in Python:

def quicksorted(arr):
n = len(arr)
if n < 2: return arr
# Find left & right lists so left < pivot <= right.
pivot, left, right = arr[0], [], []
for item in arr[1:n]:
if item < pivot: left.append(item)
else: right.append(item)
return quicksorted(left) + [pivot] + quicksorted(right)

Quicksort uses a pivot element from the array to filter out two subarrays called left and right so that every element L in left, and every element R in right follows the rule:

L < pivot R.

From here, left and right are recursively sorted to provide the full sort.

Next is a Python mergesort:

def mergesorted(arr):
n = len(arr)
if n < 2: return arr
# 1. Split in half and recursively sort.
mid = int(n / 2)
left = mergesorted(arr[0:mid])
right = mergesorted(arr[mid:n])
# 2. Merge together.
sorted, i, j = [], 0, 0
while len(sorted) < n:
if left[i] < right[j]:
i += 1
j += 1
if i == len(left): sorted += right[j:]
if j == len(right): sorted += left[i:]
return sorted

This time, the array is split in half using a mid-point set at index int(n/2). These halves are recursively sorted, and a little work is spent merging them together to arrive at the fully sorted array.

Which one is faster?

This is such a common problem that you might have the answer memorized. It’s worthwhile to pretend it’s a new problem to you, and to appreciate that the answer is not obvious at first glance. After all, the main application of big-oh notation for coders is to understand the behavior of new algorithms.

Revealing graphs

Let’s compare the speed of the algorithms. As a unit of measurement, we can look at the number of comparisons used by each graph on similar inputs. Specifically, we’ll measure the number of times these lines are called:

if item < pivot:                # in quicksort


if left[i] < right[j]:          # in mergesort.

This unit of measurement is proportional to the total time each sort algorithm will take, is easy to measure experimentally, and is independent of any particular hardware it may run on.

The graph below shows the number of comparisons needed to sort every possible ordering of [1, 2, 3]. The height of each blue bar gives the number of comparisons used for a particular input. The input is represented by the smaller gray bars beneath the blue bar; for example, the left-most bar represents input [1, 2, 3] to quicksort taking 3 comparisons, while the right-most bar represents input [3, 2, 1] to mergesort. The n=3 label indicates the length of the input.

Both algorithms use either 2 or 3 comparisons in all cases — and neither one is clearly faster than the other for a random size-3 input; they both use 3 comparisons on 4 possible inputs, and 2 comparisons on the others.

Let’s move on to larger inputs. This time we’ll consider all 24 possible orderings of [1, 2, 3, 4] as inputs:

In this graph we can see that mergesort’s slowest case — the maximum number of comparisons—is better than quicksort’s slowest case.

Usually we won’t know the length of the input array ahead of time. With this in mind, we can look for a pattern by extending the above graphs for several larger values of n. For an input array of size n, there are n! different array orderings — that is, n(n-1)(n-2)…1 possibilities.

The bars in the next graph are drawn in order of height since otherwise they’d be difficult to visualize. The last row, with n=10, represents 10! = 3,628,800 different array orderings, so the graph is a compressed but accurate representation.

Something curious is happening in the quicksort column: a little spike has been forming, pushing up the maximum number of comparisons.

Let’s focus on the maximum value within each of these bar graphs. We can use this one number to summarize the performance of each algorithm at each input size n in the following graph:

The earlier bar graphs made it look as if quicksort was often about as fast as mergesort. Suddenly, this last graph is making quicksort look much worse. What’s going on?

Worst-case time complexity

Let’s dive into precise terminology for general algorithms — we’re ready to define the time complexity of an algorithm.

It’s nice to have a single function t(n) that expresses the time an algorithm takes in terms of n, the size of the algorithm’s input. It’s often true that there are many inputs of a single size — for example, many lists with the same length — so we have to decide how to represent all the running times for these inputs with a single number.

In the analysis of quicksort and mergesort so far, we focused on the maximum time used for a given list size. In general, the function

t(n) = max #time units used for an input of size n

is called the worst-case time complexity of an algorithm because it assumes the slowest possible performance based on knowing n.

An alternative to worst-case is the average-case time complexity, which instead measures the average number of time units used for a random size-n input.

Finding t(n)

The bar graphs above were made by running the algorithms on many inputs. This only works for finite n. Let’s try to find an expression for t(n) that doesn’t require any code to compute.

t(n) for quicksort

Below is an example of a quicksort run on input [4, 6, 2, 3, 5, 1]. The orange bars are pivot elements; gray bars separate different input sets in the recursive calls. The new variable nc(k) is the number of comparisons performed at each recursion depth k.

The variable nc(k) is useful since

total #comparisons = nc(0) + nc(1) + nc(2) + …

A recursive call with input s uses len(s)-1 comparisons, excluding comparisons made indirectly with deeper recursive calls. So nc(k) is the sum of len(s)-1 over all sublists s at that depth.

Let’s add two new variables that will help us precisely analyze quicksort:

  • ne(k) = #elements at depth k (e.g. the non-gray bars in the image above),
  • ns(k) = #sublists at depth k.

Here’s the above example with the variables for each step k:

Using these, we can rewrite the above nc(k) equation as:

nc(k) = ne(k) - ns(k).

This works because, at depth k, there’s one comparison per non-pivot element. And there are ns(k) pivot elements, so subtracting ns(k) from ne(k) gives us the number of non-pivot elements.

Notice that quicksort recursively calls itself on left and right where len(left) + len(right) = #comparisons performed before recursion; in other words, exactly one element is passed recursively down for each comparison performed. We can express this as ne(k) = nc(k-1). Combine this with our last expression for nc(k), and we arrive at

nc(k) = nc(k-1) - ns(k).

Our goal is to find the maximum #comparisons used by a run of quicksort — that is, the maximum sum of nc(k) over all recursion depths k. This last equation tells us that maximizing nc(k) is the same as minimizing ns(k), so let’s consider the smallest possible values of ns(k).

If ns(k)=0, then nc(k)=0 since no comparisons are done when there are no sublists. So the maximum value of nc(k) is nc(k-1)-1, which happens when ns(k)=1.

If we could make ns(k)=1 for as many k as possible, then we’d have nc(0)=n-1, nc(1)=n-2, etc, with nc(k)=n-k-1 at depth k. And we can make ns(k)=1 for as long as possible by sending in an already-sorted input such as [1, 2, 3, 4, 5, 6]:

This last example gives the maximum #comparisons for any input of length 6. So t(6) = #comparisons on [1, 2, 3, 4, 5, 6] = 5 + 4 + 3 + 2 + 1 = 15.

This argument shows that the maximum #comparisons for any n is found when we run quicksort on any already-sorted input, which would preserve ns(k)=1 for as many recursion depths as possible. This means that

This formula gives us the exact worst-case time complexity of quicksort. Not only did we find quicksort’s speed on already-sorted inputs, but we also carefully argued that no input could use more comparisons.

t(n) for mergesort

Mergesort has two stages: splitting and merging. All the comparisons happen during merging. As a reminder, here’s the merging section of the code:

def mergesorted(arr):
# 2. Merge together (left/right have been recursively sorted).
sorted, i, j = [], 0, 0
while len(sorted) < n:
if left[i] < right[j]:
i += 1
j += 1
if i == len(left): sorted += right[j:]
if j == len(right): sorted += left[i:]
return sorted

A merge ends when either left or right runs out, triggering one of the last two lines of the while loop. Until that happens, one element is appended to sorted for every comparison. At most len(arr)-1 comparisons happen during a merge, corresponding to a single element remaining in either left or right when the merge ends.

Can this theoretical maximum actually happen at every recursion depth for a single input?

The answer turns out to be yes — but we need to do some work to prove it. We’ll construct slow-to-mergesort arrays using a function called antisorted.

The following code accepts an input array arr of distinct elements and returns a new array with the same elements arranged so that every recursive depth of a mergesorted call takes as long as possible:

def antisorted(arr):
if len(arr) < 2: return arr
sortedarr = sorted(arr)
i = 0
left, right = [], []
while True:
i += 1
if i == len(arr): break
i += 1
if i == len(arr): break
return antisorted(left) + antisorted(right)

It works by choosing exactly which elements will end up in the left and right subarrays so that, during a merge, the smallest element is chosen from right, then from left, then from right, etc. The right side is filled first in the loop since, when n=len(arr) is odd, right has one more element than left in a run of mergesorted.

Intuitively, the first pass of antisorting — before any recursive calls — performs an operation like this:

The last line of antisorted makes sure that every recursive level of mergesorted also uses a maximal number of comparisons. We can mess with the ordering of elements within left and right in the last line of antisorted because, by the time the merge starts during mergesort, all recursion is done, and left and right have been restored to their sorted selves.

Antisorting examples

Here’s a picture of what antisorting arr=[1, 2, 3, …, 64] looks like:

Each step in the above image represents another level of recursion in the call to antisorted. The next image shows what mergesort does to antisorted([1, .., 8]). Each merge step has to zipper together alternating elements from the arrays being merged:

Antisorting maximizes the number of comparisons done at all recursion levels. As a result, the maximum total comparisons is exactly the sum of len(arr)-1 over all recursion levels. We can summarize this result with the recurrence relations below. The left pair of brackets around n/2 mean to round down, and the right pair mean to round up, like the floor and ceil functions. This is a nice way to handle both even and odd values of n in one equation.

Sadly, it’s not easy to turn this into a nicer, non-recursive expression. For now we’ll deviate to an approximation t’(n) based on the picture below, where each horizontal layer indicates a recursion level in a mergesort. This picture introduces the function lg(n) which is the base-2 logarithm of n.

Each blue bar represents a portion of the array being considered by a mergesort call; the values like n or n/2 give a size estimate for the input to that mergesort call. This estimate won’t be accurate whenever the previous call has an odd-sized input, and our value of h is clearly wrong when lg(n) is not an integer, but this picture can still give us some intuition about how many comparisons are occurring.

Notice that each horizontal layer has exactly n elements, and the layers have 1, 2, 4,... blue bars each. So the number of comparisons for a single layer is n-#bars. Adding up powers of 2 is nice since we can use the equation

We can use that fact to add up the number of comparisons, that is, n-#bars, across all layers:

Let’s plot actual discrete values of t(n) using the recurrence relation against the approximate function t’(n):

Looks pretty good!

Unfortunately, we have no careful proof of a connection between the approximation t’(n) and the exact function t(n). Can we regain proven information about mergesort without having to solve the above recurrence relation?

A sweet spot for big-oh

The motivation for big-oh notation has just ambushed us. It’s perfect for our mergesort analysis — it gives us a way to briefly summarize t(n) without finding a non-recursive expression for each value.

The important fact about t(n) that we want to verify is something like

which can be written more formally as

At each recursion level of mergesort, all of the n elements have been split up into sublists to be sorted. So the number of comparisons at any fixed level is always n.

The longest sublist after the first split — call this level k=1 — has length n/2 + 1/2. At level 2, the longest sublist has length n/4 + 1/4 + 1/2. In general, we cut the last longest sublist in half and add 1/2 to allow for splits of odd numbers. This means the longest sublist at recursion level k is

By the time k=lg(n), the longest sublist is < 2. So no comparisons can occur after recursion level h=lg(n).

We can summarize these ideas in the following picture:

In summary,

Since lg(n) = log(n) / log(2), we can set C = 1/log(2) to get

which satisfies the definition of big-oh; we can re-write it as


Next let’s find a big-oh expression for quicksort’s t(n).

Earlier, we found that

Let’s turn that into an equality that can be used in a big-oh expression:

We’re now able to beautifully summarize the worst-case running time of our sorting methods:

By this measure, mergesort is the clearly superior algorithm.

However, there are slight modifications to quicksort that can improve its speed in some cases, and I’ll mention one below.

Average-case complexity

Remember the distribution of running times we found earlier:

It looks like the worst-case for quicksort is isolated to a small subset of inputs. It would be nice if we could give quicksort some credit for being as good as mergesort most of the time. Average-case complexity allows us to overlook slow-but-rare inputs.

Worst-case time complexity was defined like this:

<worst-case> t(n) = max(#time units needed for any length-n input)

Average-case complexity is defined analogously:

<avg-case> t(n) = avg(#time units needed for any length-n input)

An interesting technical question pops up: What is the probability distribution on the inputs? For example, for a string algorithm, should we consider all valid UTF-8 strings as equally-likely inputs — or are natural language strings more likely? The definition of average-case complexity doesn’t specify your probability distribution. In other words, it’s up to you to choose the distribution that makes the most sense, though it’s common to assume that all inputs of the same size are equally likely.

For sorting inputs of size n, we can use permutations of 1, 2, …, n to represent each possible ordering of an input, and treat each permutation as equally likely. With that distribution, the average-case time complexity of quicksort is O(n log n), though it’s a bit of work to figure that out.

Randomized algorithms

Quicksort’s average time is O(n log n), but it’s still slow on presorted input. This can be a problem in practice. An easy improvement is to choose the pivot element randomly.

The modified quicksort would insert these lines:

pivotIndex = random.randint(0, n — 1)
arr[0], arr[pivotIndex] = arr[pivotIndex], arr[0] # Item swap.

just before the line that sets the pivot:

pivot, left, right = arr[0], [], []

Now the running time on any input — including a presorted list — is itself random. Well, pseudorandom in practice.

Since the running time is random even for a fixed input, it’s reasonable to look at the expected running time. In other words, t(n) now represents an simultaneous average over both all length-n inputs and all possible pseudorandom parameters — in this case, our pivot choice. In the case of random-pivot quicksort, the expected running time is the same as the average-case time for the non-random version — O(n log n).

Amortized analysis

We’ve talked about two probabilistic ideas so far: averaging over many inputs, and about algorithms using random parameters. A third variation of average-case complexity is amortized analysis, where averages are taken over a group of consecutive function calls.

Suppose we have an array of integers held consecutively in memory; we can add new integers to the end of the array with an add function. This function takes one fixed-size integer as input, and appends it to the end of the current list of integers. Our array pre-allocates a little extra space to anticipate some add calls. When add is called and no extra space remains, a new block of twice as much memory is allocated to make extra room for incoming elements, and the old elements are copied to the new location:

In this section, we’ll use the variable n to indicate how many add calls have already occurred. Since we didn’t bother to make a remove function, this means there are exactly n integers stored in the array.

Adding an element to a full, no-extra-space array with n elements requires n+1 memory writes, which is our measure of time here. In other words, a single add call could take arbitrarily long, even though it has only one fixed-size input!

But! We can still say something meaningful about the running time. Calling add n times in a row, starting with an empty array, results in this many memory writes:

The amounts in parentheses capture the number of writes to copy the existing elements to a new location, plus one more write for the new element. We can rewrite this expression as

Let T(n) denote the number of memory writes used in n consecutive add calls. We can summarize the last result as T(n) ≤ 3n.

Now let t(n) be the average number of memory writes per add call, after n calls to add. This is simply t(n) = T(n)/n ≤ 3. We can summarize this as

which we could phrase like so: The amortized running time of add is constant. It’s called constant time because the equation t(n) = O(1) means there must be some constant C so that t(n) is always ≤ C for all n after a certain point.

Now we’ve seen a few key examples of how big-oh can be used, as well as explored different ways of choosing a single value t(n) that represents an algorithm’s efficiency. Let’s step back and review some general properties of the notation.

What big-oh does for us

Big-oh notation cuts away the cruft of a function to identify the piece that best captures growth as n goes to infinity.

For example, we could write

The right-hand side is a great simplification of the growth of the left-hand side.

If we were feeling a little crazy we could write

which is technically true, but completely misses the point of big-oh.

There’s a spirit-of-use behind big-oh notation. When we write f(n)=O(g(n)), we also mean that g(n) is the best — smallest and simplest, intuitively— function that we can prove works. So writing n=O(n²) is true, but weird because n² is clearly not the smallest function that would work inside the big-oh.

Now is a good time to ask: Do you see the flaw in the logic below?

Let q(n) be quicksort’s worst-case time complexity, and m(n) be mergesort’s. We know that

Therefore, in worst-case terms, mergesort is faster than quicksort.

The flaw is that we could have just as easily — and correctly — written that

To drive the point home, this line of argument is like saying x≤3 and y≤4, therefore xy. Yet clearly we could have, say, x=2 and y=1.

This is a weak point in the traditional usage of big-oh notation. There are actually other letters which can be used for other function comparisons.

The notation f=O(g) is sometimes written when in fact f=θ(g) is known.

The corrected version of the above argument goes like this:

therefore, mergesort is faster in terms of worst-case complexity.

The moral is to avoid using big-oh notation to say an algorithm is at-least-as-slow-as a function. Technically, big-oh only provides at-least-as-fast-as guarantees. Theta or big-omega notation can be used for other guarantees.

Common big-oh expressions

A few common functions have specific names in the context of big-oh notation. As usual, suppose t(n) is the time complexity of an algorithm.

Some functions have surprising big-oh expressions. One of the coolest comes from Stirling’s formula:

I love that one.

Technical definitions

This section uses a bit more math than before, including some limits and a few formal logic symbols. If you want to skip this for now, please jump down to the big picture section for continued limit-free fun.

There are a couple equivalent ways to define the expression f(n)=O(g(n)).

First, I’ll restate the definition we’ve already seen using the symbol ∃ for there exists, and ∀ for the phrase for all. It’s also customary to read a colon after ∃ as the phrase such that:

Definition 1

Mathematicians may find this next definition more elegant:

Definition 2

Unlike standard limits, the limsup is guaranteed to exist, and is defined as

The sup stands for supremum, and indicates the least upper bound on a set of numbers.

The limit definition nicely supports other variations. For example, f(n)=o(g(n)) — that’s a little oh — can be defined as

which is an even more elegant expression.

Something weird

Big-oh notation has its peculiar flavor of inelegance, though. It does something horrible and drastic and freaky — it redefines the equal sign.

Using the traditional meaning of the equal sign, the argument below would be logically sound.

Obviously not, though. As soon as big-oh shows up to an equation party — or its cousins theta, omega, etc — the equal sign loses symmetry and acts more like a < sign. This is a weird thing to do that feels to me like a notational mistake. But it’s stuck with us as an established standard.

Complex expressions

This post focuses on equations of the form f(n)=O(g(n)), but in math there are other ways to use big-oh. For example, it’s nice to be able to write

even though the right-hand side isn’t of the form O(g(n)). By the way, the value γ=0.5772… in the above equation is the Euler-Mascheroni constant, defined as

It’s a fun constant.

Even more complex expressions are possible, such as this one for π(n), the number of primes ≤ n:

In the above cases, we can think of each side of the equation as the set of functions consistent with the big-oh expressions in that side, and the equal sign as a subset relation. For example, the equation

can be interpreted as

Limits other than ∞

Another assumption in analyzing algorithms is that we care about behavior when n goes to ∞. In other contexts, other limits L may be interesting, and are easy to work with using the limit definition of big-oh and little-oh notation:

Let’s see some examples around the limit point L=0.

These functions illustrate the equation

Another use case with L=0 can be found in terms of derivatives:

Big picture

Here are the major steps in producing an algorithm’s big-oh analysis:

  • Decide what unit will be measured. For example, time or memory used based on the size of an input.
  • Summarize values across all possible size-n inputs. For example, worst-case or average-case analysis.
  • The previous two decisions are independent of big-oh notation. Once we have a function like a worst-case time complexity t(n), we’re ready to look for a big-oh expression for t(n). There is no recipe to accomplish this. Sometimes it’s easy and other times you’ll have to be creative depending on the complexity of the algorithm.

We don’t ever have to find an exact expression for t(n). We didn’t when we studied t(n) for mergesort. Skipping that step, while still gaining understanding of t(n), is a big advantage of using big-oh notation.

To wrap up, let’s take a critical look at what’s good and bad about big-oh notation.


The bad

Big-oh notation hides details that may be important. If two algorithms have complexities t(n) and u(n) with

then it’s hard to choose between the two. Such a vague summary of these functions may be hiding a vast difference. For all we know, we could have

in which case it would be embarrassingly understated to note that the t-algorithm is faster than the u-algorithm.

The ugly

When applied to the analysis of algorithms, t(n)=O(g(n)) often means either t(n)=θ(g(n)) or something nearby like “t(n)=O(g(n)), and this is the best-known bound.” Sadly, this is easily confusing because it results in people using big-oh like theta, when what is written down is a statement with less information.

The good

Despite the setbacks, big-oh notation is ridiculously useful for summarizing weird expressions like

and for allowing us to understand functions like

which are otherwise hard to describe. Thanks, big-oh!

How I learned this stuff

I originally learned the fundamentals of algorithmic analysis through a textbook accompanying a college class on the subject. That book is Introduction to the Theory of Computation by Michael Sipser, and it’s a great book. It’s easy to read, doesn’t omit technical detail, and covers a good set of topics in the theoretical foundations of computer science.

However, I think the single best source for this material is Concrete Mathematics by Ronald Graham, Donald Knuth, and Oren Patashnik. I have the 2nd edition, in which chapter 9 covers big-oh notation in great detail with useful subtopics and fun exercises. This book is another personal favorite.

If, like me, you find yourself wondering purely mathematical things such as — Just how fast can functions grow? Are all functions linearly ordered by big-oh? — then I highly recommend Orders of Infinity by G.H. Hardy. This is an older book written in 1910. Despite the book’s age, it’s surprisingly readable and filled with delicious, mathy morsels.

If you enjoyed this post, please follow the Of Games and Code collection!

Have fun with big-oh!


  1. Precisely describe all arrays that require the maximum number of comparisons in our quicksort implementation.
  2. Find a big-oh expression for the running time of the antisorted function.
  3. Find an non-recursive expression for our mergesort’s worst-case time complexity t(n).
  4. Prove that the different definitions of f=O(g) given above are equivalent.

Tools Used

I used LaTeXiT to generate the math images, followed by some minor post-processing in Photoshop and Keynote. Several of the sorting algorithm graphs were generated with custom Python scripts utilizing PyCairo for image generation; this code is open source here.



Tyler Neylon

Founder of Unbox Research. Machine learning engineer. Previously at Primer, Medium, Google.