Data Structures for Analytics
If you know me, you have probably heard me go on about data structures and anime. I’m a Software Engineer who works primarily on systems problems. I get to see some really cool data structures in action, though not as often as I’d like.
I am writing about a couple that caught my attention.
Cardinality using HyperLogLog (HLL)
Cardinality refers to the uniqueness of data values contained in a particular column (attribute) of a database table. The lower the cardinality, the more duplicated elements in a column.
For people familiar with database queries, it’s essentially a SELECT COUNT(DISTINCT) query.
Some queries that can be answered by cardinality:
- Number of unique DNS requests from a given IP
- Number of unique visitors in the entire history of google.com
The naive approach of running this SQL on a non-distributed database will, of course, not scale. Say we store the data in files across multiple boxes (e.g. on HDFS); we could then run a map-reduce job to count the number of unique entries. If you have ever used Hadoop or Spark, you know that this is not going to be real-time.
Let’s make the problem simpler. Let’s say we have a stream of integers coming in. At any given instant we wish to know how many unique numbers we have seen. It’s easy to see that in random data, a sequence of k zero bits will occur once in every 2ᵏ elements, on average. All we need to do is look for these long runs of trailing zeros and record the length of the longest run to estimate the total number of unique elements.
What if the first number I see has 32 trailing zeros?
To account for this we could hash each number of the input stream multiple times, count the trailing zeros and take the mean.
This approach has one nice property: to count up to 2ᵏ unique elements, the longest run of trailing zeros you need to record is about k, which takes only log₂(k) bits to store — the log of the log of the cardinality. Hence the name HyperLogLog. This idea forms the basis for HLLs.
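To make the idea concrete, here is a minimal sketch in Python of a trailing-zeros estimator with bucketing. It follows the LogLog flavor of the idea (arithmetic mean of per-bucket maxima with a bias-correction constant) rather than the tuned HyperLogLog algorithm, and the function and helper names are my own, not from any library:

```python
import hashlib

def rho(x: int) -> int:
    """Rank of x: position of the lowest set bit (trailing zeros + 1)."""
    r = 1
    while x & 1 == 0 and r <= 32:
        x >>= 1
        r += 1
    return r

def estimate_cardinality(stream, num_buckets=64):
    """Toy LogLog-style cardinality estimator (num_buckets must be a power of two)."""
    bucket_bits = num_buckets.bit_length() - 1
    maxima = [0] * num_buckets
    for item in stream:
        # 32-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")
        bucket = h & (num_buckets - 1)        # low bits pick a bucket
        maxima[bucket] = max(maxima[bucket], rho(h >> bucket_bits))
    # estimate = alpha * m * 2^(mean of per-bucket maxima);
    # 0.39701 is the asymptotic bias-correction constant from LogLog
    mean = sum(maxima) / num_buckets
    return int(0.39701 * num_buckets * 2 ** mean)
```

Note how small the state is: 64 buckets of a few bits each, regardless of how many elements flow through. Real HyperLogLog refines this with a harmonic mean and small/large-range corrections.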
Refer to the further reading for more details.
Top K using Count-Min (CM) Sketch
“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store.
An example “sketching” data structure could be a Bloom Filter or even an HLL — you’ll have noticed how little space an HLL takes up.
Let’s talk briefly about Bloom Filters. A Bloom Filter is essentially a set that can tell you whether an element is present, with a chance of false positives but no false negatives.
Adding an element to a Bloom Filter:
- Start with a bitset initially set to all zeros.
- Compute k hashes for the element: v1 v2 v3 v4 v5 ... vk.
- Set the bits at all the indexes v1 through vk to 1.
To check for the membership of an element, you’d compute the hashes again and check whether all of those bits are set. With the right hash functions and a high enough capacity for the bitset, one can greatly minimize the false positive rate. Bloom Filters are used extensively in Cassandra.
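The add and check steps above can be sketched in a few lines of Python. This is a toy illustration, not a production filter — the salted SHA-1 hashing and the default sizes are my own choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes over a fixed-size bitset."""

    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # bitset, initially all zeros

    def _indexes(self, item):
        # derive k index values by salting the hash with 0..k-1
        for seed in range(self.num_hashes):
            h = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # all bits set -> "probably present"; any unset -> definitely absent
        return all(self.bits[i] for i in self._indexes(item))
```

The false positive rate falls as you grow the bitset or tune the number of hashes relative to the expected element count.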
A CM sketch works very much like a Bloom Filter. We have k hash maps, each with a different hash function (note that the Bloom filter can be implemented in this way as well). They map from the hash value of the element to its frequency. Initially we have all frequencies set to 0. When we encounter an element, we increment its value in all k hash maps. If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k maps.
To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This is why it’s called a Count-Min sketch.
If you want to play around with these data structures, I recommend looking at https://github.com/addthis/stream-lib; digging into its source code was also interesting.
Coming up next:
- Percentile Summarization using T-Digest
- Any other recommendations?