Recently, I was engaged in discussion on Facebook about the value of a formal computer science education for web developers. It got me to thinking about some of the differences I’ve noticed in the code typically written by developers who have CompSci degrees versus those that don’t. One difference that jumps out at me is that developers without formal training often don’t think about algorithmic efficiency when they write code that transforms data sets from remote service for their UIs. Often, the difference isn’t noticeable when working with test data sets, but when the application moves into acceptance testing or production these routines start to suffer from poor performance.
The toughest undergraduate course computer science course I took was a graduate-level algorithm analysis course. It was also one of the most useful in that it taught me to think about algorithmic complexity and identify patterns that would result in poor performance as the size of the data set increases. Now, I concede that your average web developer doesn’t need to understand things like how data is moved between processing cores in a hypercube! However, understanding how the performance of a function changes with the size of the data set it works on can be critical to ensuring that your application doesn’t fall over when you put it in front of real users with real data.
Function Order and Big-O Notation
I promise to keep the university lecture short. When we talk about how an algorithm performs we often talk about the “order” of a function, which we express in “big-O” notation. Big-O isn’t as complicated as it may look. It simply describes how a specific measurement of performance like execution time or memory consumption changes as the size of the data set increases. This is easier to understand by example.
Let’s say we have a data set with 1,000 string elements in an array called dataSet. If we wanted to loop through dataSet and print each element, our code might look like this:
..or if you prefer a functional approach:
In either case, these routines execute console.log() once for each item in the array, 1,000 times in total. If we add another 9,000 rows to the data set, it executes console.log() 10,000 times. The time it takes to execute this algorithm grows linearly as the data set grows. In Big-O notation we describe this function as O(n).
If we want to identify duplicates in the array, we could use an algorithm like the following:
When this algorithm runs on our 1,000 element data set, it executes the inner loop 1,000 times for each element in the array, or 1,000 * 1,000 times, or 1,000², 1,000,000 times total. If our data set grows to 10,000 rows, it executes 10,000² or 100,000,000 times. In this case, execution time grows proportionally to the square of the size of the data set, or O(n²).
Now, the algorithm above is poorly written on purpose to provide an easy example for illustrating an O(n²) algorithm. If we’re trying to determine if there’s a duplicate in the data set, we don’t necessarily have to step through the entire data set. We could short-circuit the inner loop once we find the first match. For the purposes of analyzing performance though, we want to evaluate the worst-case scenario, e.g. the element we’re looking for is at the end of the data set. So even if we rewrote the algorithm as follows:
…we still look at this algorithm as O(n²).
We can get more detailed, if we need to. If we were to compare the following two algorithms:
…the second algorithm prints each element twice. Given that printing the element consumes some amount of time, you could compare these two algorithms as O(n) for the first, and O(2n) for the second — the second algorithm takes twice as long to execute as the first. While it’s true that the second algorithm will be slower than the first algorithm, when compared to an O(n²) algorithm that amount of work becomes insignificant, so the second algorithm would still be considered an O(n) algorithm.
We now have enough of a framework to do some basic analysis and comparison. An O(n) algorithm (simple loop) will be more efficient at its task than an O(n²) algorithm (nested loop), which will be more efficient than an O(n³) algorithm, etc.
And Now, To the Point…
I’m sure you’re thinking “Hey, this was supposed to be about something called the Dictionary Pattern.” It is. More specifically, it’s about how to identify O(n²) algorithms in your code, and how you can apply this pattern to optimize them.
In the old days, before HTML5, I was an Adobe Flex developer. Flex had a data structure called a dictionary — basically this was a key/value store. Keys exist only once, and each key has an arbitrary value which can be retrieved using its key. The key feature to grasp in the context of this article is that retrieving a value by its key is an O(1) operation — no matter how many keys and values are in the dictionary, retrieving the value by key takes the same amount of time.
Remember back to the example where we were looking for duplicates in our data set:
We might write functions like this one for a variety of reasons. We might want to know which items are duplicated in the list, or to determine how many times each item exists in the list. We can approach these objectives using the concept of Java’s HashMap. For example, to count the number of times each item occurs in the data set, we can use an algorithm like this:
This routine loops through our data set, creates the key/value for the item if it doesn’t exist in the map, then increments the value. When it finishes executing, the map will contain a key for every string in the data set, the value of which is the number of times it occurs in the set. What’s important to understand is that we’ve taken what was an O(n²) algorithm and converted it to an O(n) algorithm. As the data set grows, the amount of time it takes to process grows linearly and not as a square of data set size because we can access the count for each item using an O(1) lookup instead of looping through the data set repetitively and counting the occurrences of a specific value.
Perhaps instead of counts, we just want a distinct list where we distill only the unique values in the data set. We can approach that from the perspective of Java’s Set, using a similar technique:
The result here is that we loop through the data set once in the forEach loop, and then we iterate over the keys of the set using Object.keys(). Worst case scenario is that the data set doesn’t contain any duplicates in which case this is an O(2n) algorithm.
If you wrote something like this to look up customer names for each invoice:
that’s an O(n²) algorithm. It might work fine in development where your data set only contains a small number of customers and invoices. Once your data sets expand to thousands of customers and tens of thousands of invoices, performance will start to suffer.
We can rewrite this algorithm using a HashMap-minded approach. First, we loop through the customers array and organize it in an object using customerID as the key and customerName as the value. When we loop through the invoices, we can access each customer’s name using an O(1) lookup instead of an O(n) lookup:
This algorithm is O(2n), which will perform significantly better as your data set sizes increase.
Look for nested loops as tell-tale signs of algorithms that don’t scale well. Understand that nested loops can be hiding in your functional programming constructs such as Array.find() calls inside of an Array.forEach() loop.
You can use the Dictionary Pattern, using anonymous object as dictionaries, to improve performance for various transformation and lookup tasks on your data sets.