The year was 1998 and our VB 4.0 applications connected to databases using JDBC to handle a small amount of data. Since 2003, our web applications and today’s API-based application layers continue to follow fairly simple patterns to process low volumes of data. In 2018 we’re building microservices to handle streaming data at velocities and volumes that we couldn’t imagine 20 years ago. This is enabled by an extremely fast message bus that enables throughput that has only been previously available via batch/ETL tools. But all of this data can cause problems with our databases. First, let’s look at the wrong way of getting data in a streaming environment and then we’ll show how to make it better.
Naïve Fast Data
We have a pipeline of microservices that will filter and enhance email messages. Our input will be a never ending, extremely fat pipe of raw messages. One of the components we are adding will be a blacklist of known spam traps or bounced email addresses. For multiple reasons, we need to guarantee our system will not send those emails out.
Our blacklist is simple, it’s a list of 60 million known bad addresses (Bear with me, this isn’t a real example, a 60 million item list of bad emails is most likely not the right approach in the real world). The initial implementation of the ‘blacklist’ verification microservice was built naively — for each message to check, the service would establish a database connection (let’s roll with Go), query the blacklist table (blacklist), and check if there was an existing item. If there’s an item, then our blacklist implementation will drop the inbound message on the floor, and forget it ever happened.
I’m not even going to run this test since we know it’s the most naive implementation of our blacklist. Each message would connect to the database, query, and close the connection. Even with a fairly large sized database server, I’d imagine the best we could get out of this would be about 10–15 TPS (transactions per second). How big would the DB have to be, and how wide would we have to scale our application, in order to meet 500 TPS? 500 TPS becomes 180k records per minute. That would be a reasonable throughput!
Fast Data Access
Let’s figure out how to do this right. To make this fast enough, we want to implement a caching layer. Caching may not be the right word… Traditionally, a programmer will look in a cache for a specific piece of data, if it isn’t there, the code will then check the database as the true source of record. Once the data is pulled from the database, it’s inserted into the cache. This builds out the cache slowly and only as data is accessed for the first time. Instead of populating the cache over time, we’ll be populating it ahead of time in one operation, “select email from black_list”, to get every single name in the blacklist to insert into our lookup structure (called a cache here).
Immediately, we know that caching means we are no longer ‘truly’ a real-time pipeline, and that’s okay. In a real application, we’d figure out how to make updates to our cache, but we won’t cover it in this post.
The cache will act as the data store for our microservice. Since we’re concerned about extremely fast data lookups, we need a read-only cache. Our entire data set of 60 million records needs to fit in this cache. 60 million records may sound like a lot of data, but we’ll make it work.
A few different ways to implement this comes to mind:
- Query the DB for each record.
- Use an intermediary cache tool.
- Choose an in-memory local data structure.
You can think about each approach as a step along a path to save data. We typically deal with millions and millions of records, instead of hundreds. In order to efficiently store and process all that data, we need to shrink it down into a consumable chunk of memory.
Option #1 is what we covered above. Each transaction requires a new connection to the database, a query and processing of the results. This will result in a lot of network and database overhead. Without intentionally writing a slower process, this is the slowest way to build our lookup, so let’s not talk about it.
Option #2 is not a bad option. There are actually a ton of reasons that it could be the right option. If we look at something like Memcached, Redis or Hazelcast, we’d get the benefits of a system that multiple clients/microservices can connect to at the same time. These technologies all have the advantages of being commercially supported third party applications or popular open source systems with a large community of practitioners. For instance, If we set up a Redis system using AWS Elasticache, then we’d probably be off to the races pretty quickly. Redis’ performance screams and can easily handle the load we’re talking about.
But we’re also going to skip this option. Didn’t think I’d say that, did you? Of course not! Redis et al would work great! But not as great as Option #3.
Custom In-Memory Lookup
A trie is a tree like structure. It attempts to remove duplicate data by building paths in memory. For example, the email addresses: email@example.com and firstname.lastname@example.org (not a real address of mine) would map out like this:
For keeping a list of emails, we know the TLD (top level domain: e.g. .com, .edu, .info, etc) is a pretty finite list. The metrics for the ‘reverse trie’ flips the above. At the top, would be the TLD and the last node for these would be `c`. When dealing with two email addresses, it doesn’t improve anything, but, when dealing with 60 million, the difference becomes noticeable.
Imagine how many email addresses will have the same TLD and Domain? @gmail.com, @yahoo.com, @aol.com, @outlook.com. Now we start getting data savings.
In my initial implementation of the Trie lookup, I couldn’t load the 60 million fake emails into memory without swapping. It surprised me, but it was what it was. A simple ‘reverse’ of the email string, and I was able to load the data pretty quickly and efficiently. Or so I thought (more on that later).
The other data structure I used was a Bloom Filter. A Bloom Filter is a probabilistic data structure. I’ll talk more about that below when we talk about false positives. A Bloom Filter allows a compact form of lookup data to represent items in a set. The way I mentally visualize a bloom filter is to think of a giant coordinate plane (remember geometry?). It’s a matrix, or a two-dimensional array. Let’s make up an example hashing representation:
Take a word like chris, hash it, and you can imagine you end up with a representation that means:
c == (2,0)
h == (2,1)
r == (2,3)
i == (3,1)
s == (3,3)
In this example, a c will always hash to (2,0), an h to (2,1) etc.
Building out a hashing algorithm is the first step. In this example, we can visualize the matrix in which we’ll store data as having the same structure. Yet to begin, each box is empty. Each box represents a ‘bit’, or a Boolean (0,1 or true/false) value holder. When we add the word chris, our code will flip the corresponding bits to 1. After we add the value, we can then ‘check’ if the word chris exists by rehashing and checking all the squares. Awesome! We haven’t saved any data yet… but, let’s add the word charlie. Then we flip the corresponding bits and the c, h, and i are still on. we’ve saved 3 bits! Continue adding words, and flipping bits until your whole data set is represented.
Bonus: what would supercalifragilisticexpialidocious look like?
Hopefully, you’re quickly realizing that an array this small can’t work - ‘dog’ and ‘god’ would flip the same bits, ‘master’ and ‘tamers’ would hash out and flip the same bits. There’s no way to know which word was used to flip the bits. Fortunately, the smart data scientists who came up with this approach has a solution. We estimate how many items we will have (back to our 60 million list of emails), and our library or code can build a much larger grid, with a very different hashing algorithm. As long as the values hash consistently, we can be confident of our data.
Warning: the smaller the grid, the more likely we will have duplications, also known as False Positives. While a Bloom Filter may have false positives, it will never have a false negative. Look into the Go library for Bloom Filters, it explains how to size
Additionally, it would be negligent to leave out another important aspect of a Bloom Filter. Above, we visualized a single coordinate plane with the raw data being mapped into it.
‘dog’ == ‘god’
In the real world, with email, we’ll also hit a lot of those false positives.
Hence a thorough bloom filter implementation will have two additional characteristics. Instead of a singular plane of raw data, the bloom filter will first ‘hash’ the inbound/raw data. Let’s look at a sha1 hash. Bloom Filters won’t typically use a cryptographic hash like the SHA family, but instead they’ll use a family of fast hashes that spit out different results of the same length. This is an example.
Note: echo -n suppresses a new line character. shasum -a 1 uses the sha1 algorithm.
We have a much longer string (b6b0756bb8a354ddcecc1f89d88cb69646b0c18e) to add to our Bloom Filter. SHA1 isn’t the best algorithm, since there are known duplicates, but what if we hash our data multiple times, and then we keep a “list” of coordinate planes? Our data structure can run a fast sha1 and check the first coordinate plane. If there’s a “hit”, meaning that value is in the coordinate plane, we can move onto the next algorithm. Let’s run a sha224:
If there’s a “hit” then we can run a sha256:
If there is a “hit” on all of the filters (sha1, sha224, sha256) we can claim that we have that value (email@example.com) in our Bloom Filter. If ANY of the coordinate planes don’t hit, we can be confident our data is not in the structure. My colleague Jon Bodner suggested thinking of it as a sieve or a linked list of coordinate planes. Step from one to the next, looking for a ‘miss‘.
As I mentioned earlier, in a Bloom Filter, it’s entirely possible to get false positives. However, we can NOT get a false negative. This is an aspect of a Bloom Filter that needs to be known and purposefully decided on when determining if it’s the right data structure for your application. We can mitigate this risk by using ‘more’ algorithms of larger size.
For this data caching system, I measured three pieces of data:
- Load Time (seconds)
- Memory Used (gigabytes)
- TPS (Lookups per second, via an API)
Load time is important. Worst case, we can recover quickly if our application crashes by reloading the data from the database and getting moving on again. In the chart below, you’ll notice -1 values for the Trie. That’s because the original trie implementation failed and was unable to load all the data into memory. It was a bummer and my first thought was that I did something wrong. I decided to flip the data (reverse the string) and load it. That worked wonders! That tells me my theory on optimization worked. So don’t give up if you fail at first! Smaller numbers are better here.
Memory used is also important when we think about application deployment. If we can utilize a larger number of small instances in AWS EC2, or AWS ECS, or Fargate, then we can save money. If one approach took 30GB of RAM, that wouldn’t be good. Smaller numbers are better here as well.
Lastly, I wanted to measure the lookup speeds. My approach was to build the data structures (Trie, Bloom Filter) and throw a quick HTTP GET API on top of it. I wrote a test load running application that came up with a random string using the same algorithm that I created the 60mm test emails with, and started hammering the API with them. The number shown below is TPS, or transactions per second. Remember our goal is approximately 500TPS or more.
These results didn’t surprise me too much. I thought the Bloom Filter would win, but I didn’t expect there to be a huge gap in all the metrics. Frankly, there shouldn’t be THAT big of a gap. The raw data file isn’t anywhere near 8GB in length, so I knew that something wasn’t quite right. I came away impressed by the Bloom Filter. This provided the compact structure, the short load time, as well as the throughput that I desired.
Remember earlier when I said the Trie loaded data efficiently and what not? I was wrong. The Trie was the furthest thing from efficient that I could imagine. Taking a 1.5GB file, that explodes into over 8GB of RAM, is the opposite of efficient. Looking back at the tree structure, every time you add a new ‘letter’ you create a node. That node builds empty pointers to 40 other nodes for the ‘next’ letter. Each level down adds another layer of letters, and each node in that layer has 40 pointers. In Go, that pointer is an 8 byte data structure on a 64bit machine. The more long words we have, the more our data usage sprawls.
See all those boxes? Want to know how many words you can fit in that data structure? Two. And that’s because you can have two words that start with the same letter. Let’s say our words are a and an. Two letters requires three are three levels. I’ve only added 40 nodes below one letter on each level. 120 boxes (empty pointers) are needed for a two letter word. Can you see why this can be a bloated data structure? The longer the words, the more bloated it becomes.
Tries are excellent, but check your use case carefully.
This was an exercise to show that there are reasons why we have different data structures and algorithms for challenging problems. If you’re like me, you’ve heard the adage, “If you have a hammer, everything looks like a nail.” Or, if you go to a surgeon to diagnosis and treat a problem, you probably will end up in surgery.
As computer scientists and data scientists, we need more than one tool to solve our problems. So, try a different language, try a different data structure; be open to learning new things, and thinking back to your early education and experience. There’s a reason you learned about linked lists, maps, arrays, etc at one point.
Would love to hear some feedback on the thought! Tweet me @chrisfauerbach.
DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.