Heat Death of the Universe and Faster Algorithms Using Python Dict and Set

There are some things we cannot compute. Or perhaps I should say we should not try to compute them using an algorithm based on simple brute force and ignorance. Let’s take a quick trip through the galaxy of of algorithms and complexity. Along the way, we’ll pick up some tips on making our software run faster.

My vague understanding of the Big Bang includes a (possibly incorrect) idea that it expands and gets colder. The fact that sticks in my head is that the Andromeda galaxy will collide with our galaxy in about five billion years, 5E+9. On average, there are about 3.2E+7 seconds in a year. Let’s call it 5E+9 * 3.2E+7 = 16E+16 seconds. More-or-less.

Big number, right?

Unless we’re implementing an O(n!) kind of algorithm. Then it’s a very realistic part of our future.

Let’s say our CPU takes 10 nanoseconds to do a Python float multiplication. A billion, 1E+9, multiplications take barely a second. That leaves us time to do 1.6E+26 multiplications before we’re wiped out by celestial disturbances.

Let’s say we’ve got an optimization problem we want to confront. We need to locate the perfect (Perfect!) permutation of just over two dozen different operations. Maybe we’re trying to densely pack 26 items into a finite volume, or optimize routing among 26 taxis, or any of a large number of similar problems that involve checking a number of alternative arrangements of things.

There are 26! permutations, 4E+26. To confirm the perfect permutation, we’ll still be computing well after a galactic collision makes life on Earth unlikely.

We use “Big-O” notation to summarize the overall complexity and the way the complexity grows with the number of data values. A O(n!) optimization for n=3 or n=4 values has 6 or 24 permutations. But larger number of values to optimize? Like the inflation theory of the universe, the computation gets really big really fast.

Are There Other Things We Can’t Compute?

Here’s a simple word problem to think about:

Assume we have a collection of numbers. We want a subset of those values with a value closest to a given target.

It turns out, for a collection of n values, there are 2**n subsets. If we have a collection of 88 values, galactic collisions will happen before we’ve found the perfect subset.

These kinds of problems are insidious. We can sometimes be confronted with “clandestine combinatorics” — problems which involve permutations or subsets — by surprise.

Here’s another example to think about. A rapidly-growing pair of web-startups have a lot of customers, some of whom are duplicates. How can they compare all of the customers and remove the duplicates? The comparison is more than a quick little multiply or add, it takes 100 microseconds to do a database fetch and complete a comparison of the various fields. 10,000 comparisons take a second.

If we’ve got ten million customers, 1E+7, this means there must be 1E+14 comparisons.

We’re talking about code with the following kind of structure:

for record in company_1:
for record in company_2:
if match:

This kind of nested loop has a complexity described as O(n²) . Each of the n customers has to be compared against the remaining n-1 customers. The “Big-O” notation rules discard additions and subtractions when discussing the general order of complexity of the computation. For n=10,000,000 we’ll do operations: 1E+7**2 = 1E+14 comparisons. At 1E+5 per second, this will take as long as 1E+14/1E+5 = 1E+9 seconds. That’s 31 years, just shy of the time it takes light to reach us from a star called 61 Ursae Majoris. This computation is not waiting until the twelfth of never, but it will take a long, long time.

If we speed the comparison up by a factor of 100, we’re still talking about four months of computation. Spread over 64 concurrent processors, it’s nearly 45 hours of work.

What Can We Do?

We can characterize all of these algorithms as kinds of “search”. One example searches through 26! Items looking for exactly one perfect fit. Another example looks through 288 subsets for an exact match. The first step of improving the performance of our algorithms is to be aware of when we’re designing software that involves a search. With a three-inch telescope, we can see about 5 million individual stars, there are a *lot* more

Next, it’s helpful to be sensitive to O(n!) and O(2ⁿ) situations. These aren’t too common, and they’re tricky to code. We often get frustrated trying to keep track of all the variables involved. When the algorithms get complex, that’s a hint that we should look for a clever approximation. There are a lot of brilliant libraries for this, and it’s important to recognize the trade-off decision: an approximate answer right now is better than the right answer a few billion years from now.

Beyond the two astronomically bad cases, we also want to avoid as many O(n²) operations as we possibly can. If we’re filling in the cells of a matrix, and the matrix has millions of rows (and columns) we’re not going to be happy with the results. Waiting months or years doesn’t seem ideal when business value is measured in seconds.

In some cases, we’re not working with a literal matrix data object. Any time we have nested for-loops in Python, there’s a distinct possibility we’re doing the same amount of work as filling in the cells of a matrix.

Looking back at customer deduplication, a better approach doesn’t try to compare every customer against each and every other customer. We need to consider a Divide and Conquer strategy. Using a bisection algorithm will lead to a complexity described as O(n log₂n); this is closer to 2.3E+8 comparisons, a considerable savings in time. At 1E+5 comparisons per second, we’re only talking about 40 minutes, the time it takes light to race from the sun to Jupiter.

In Python, the bisect module provides a way to rapidly search a sorted list. If we’re going to do repeated lookups, the cost of an initial sort may be amortized over a large number of less-expensive bisect.bisect() operations.

Consider a common Natural Language Processing (NLP) problem of removing “stop words” from a document. The list of stopwords is relatively small — often under 200 words.

A brute force filter would compare every word in a source document with every stop word. This would be O(d⨉s), where d is the size of the document, and s is the size of the list of stopwords. This seems fast for a 1,000 word document and 200 stop words: 100,000 comparisons will take much less than a second. (1,000 x 200 / 2 because we’re only really filling in half of a symmetric matrix.) But. If we’re serving several dozen API requests per second, this starts to turn into a bigger and bigger compute burden.

If we use a bisect lookup, we can turn this into O(d⨉log₂s), or about 8,000 comparisons instead of 100,000.

A closely-related approach is to replace simple loops with slightly more clever sorting. This, too, can change O(n²) to O(n log₂n). When we’ve sorted the items, they can be readily partitioned into relevant matching groups, since the next bunch of records will be more closely related than distant records.

More Ways to Do Less

The bisect module chops a search algorithm from O(n) to O(log₂ n). This can be a huge improvement. Creating the sorted list isn’t without cost, but if the cost can be amortized over many operations, it lets us shoot for the moon.

Using a set or a mapping changes bisection O(log₂ n) search into an O(1) lookup. O(1) means constant time; the search doesn’t vary with the size of the set of data. Sets and dictionaries use a hashing algorithm to transform a key object into a number that identifies the value directly. There’s some subtlety in handling space allocation and collisions, but the speedup is dramatic.

Returning to the stopwords example, checking set membership has a huge advantage over using bisect to search a list.

We’ve got three common approaches to search:

  • Brute force examination of a list. O(n). If this is done repeatedly, it becomes a black hole of compute time, tying up the CPU with results that don’t seem to get out to where we can see them and use them.
  • Sorting and using bisect to locate an item in a sorted list. O(log₂ n). This requires a little bit of care, but it’s only a kind of neutron star, pulling in a lot of resources. We can speed this up and produce useful results.
  • Hashed lookup into a set or dict. O(1). A 70 nanosecond hash computation is the time it takes light to travel about 21 meters. Doing millions of these takes less than a second. This is stellar performance,

Tools and Techniques

Big data and data science mean we have to exercise a little care before we get started. We can’t dump a lot of data into a dataframe and start up a computation and hope works. We need to spend a little time doing some computations to be sure the computation will finish in our lifetime.

There are several overall parts to this.

  1. Understand the outer loop structure, the Big-O complexity.
  2. Understand the cost of the computation inside the loop structure. When working in IPython or Jupyter Notebooks we can leverage the %%timeit magic to get timing details for a computation.
  3. Plug in the concrete values for the timing and sizing.
  4. Decide if this is worth doing this way, or if more thinking is required.

For example, let’s say we have nested for loops, with an O(n^2)complexity. We can use %%timeit to see the inner part of the loop that takes 220 “n”’s to do a computation. When “n” is 15,000, we’ve got 15,000*15,000*220*1E-9=49.5 seconds of work. Not too bad for some kind of analytical work. Maybe less than ideal for a web transaction.

This won’t give us the total cost for the computation; there are other OS and language overheads that are hard to predict.

Before starting, we need to know if we’re going to take 70 nanoseconds to go 20 meters; or will we be spending four hours traveling to Neptune and 375 years to the Pleiades; or, have we started on a two million year journey taking us to the Andromeda galaxy?

There are times when working with a simple list or database query will take far too long. Using a sorted list, and the bisect module, or replacing a list with a Python dict and set can change the journey from hours to seconds.

Related:


DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.