# HyperLogLog — Count once, Count fast

### Background

A couple of days back, I wrote my first post on Medium. I liked Medium's novel 'Stats' feature, which displays the number of views and reads along with their graphical representation. One of the thoughts that struck me was whether I could increase my views just by pressing the F5 button. I was being naive at first and knew Medium would have already solved this problem. As I contemplated, I could see more use cases on other websites, for instance YouTube displaying the number of unique viewers, Google Analytics showing unique searches for a given month, an e-commerce website presenting its number of visitors, and many more. This awakened the programmer inside me, and I came up with the solution below to find the number of unique visitors of any site.

#### Simple & Quick Solution

Off the top of my head, the most efficient data structure to count unique entries would be a Set, whose size gives the count of unique entries. The following code snippet keeps track of visitors and returns the unique count.
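The original snippet isn't reproduced here, so below is a minimal sketch of the idea in Java; the class name and the use of IP-address strings as visitor identifiers are assumptions for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch: track unique visitors in a Set and report its size.
public class UniqueVisitorCounter {
    private final Set<String> visitors = new HashSet<>();

    // Record a visit; the Set silently ignores duplicates.
    public void recordVisit(String visitorId) {
        visitors.add(visitorId);
    }

    // The Set's size is exactly the number of unique visitors seen so far.
    public int uniqueCount() {
        return visitors.size();
    }

    public static void main(String[] args) {
        UniqueVisitorCounter counter = new UniqueVisitorCounter();
        counter.recordVisit("203.0.113.7");
        counter.recordVisit("198.51.100.2");
        counter.recordVisit("203.0.113.7"); // repeat visit (an F5 refresh)
        System.out.println(counter.uniqueCount()); // prints 2
    }
}
```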

#### Mission accomplished yet ?

Initially, after writing the above code, my reaction was that of the famous 'success kid' meme. However, it can be observed that I am storing a new visitor in memory every time one visits the page (new users can be identified on the basis of IP address, sessions, etc.). What if the website works at the scale of Amazon, Facebook, Google, or Netflix, with millions of users visiting its pages? Will the same solution work?

My solution needs memory proportional to the number of users. Let's do some quick calculations assuming we have 100 million visitors.

#### Time for some math

Space per Visitor object = 100 B

Number of visitors = 100 million

Total space required = 100 million × 100 B ≈ 9.31 GB
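That arithmetic can be verified in a couple of lines (the 9.31 figure is 10^10 bytes expressed in binary gigabytes):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long visitors = 100_000_000L;   // 100 million visitors
        long bytesPerVisitor = 100L;    // 100 B per Visitor object
        // Convert total bytes to (binary) gigabytes.
        double gb = visitors * bytesPerVisitor / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("%.2f GB%n", gb); // prints 9.31 GB
    }
}
```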

This means we need approximately 9.31 GB of memory just to get the count of unique visitors. So the algorithm doesn't scale and is inefficient, as a lot of space is wasted. I did a Google search on how the big tech companies have tackled this problem, and surprisingly it's one of the most renowned problems in computer science, known as the Count-Distinct problem. That is how I was introduced to the HyperLogLog algorithm, which efficiently estimates the number of distinct elements in a set, i.e., the cardinality of the set.

### HyperLogLog

For an Internet-based company dealing with millions of users, it is perfectly alright not to show the exact count of unique visitors but to display a number with some degree of error (less than 3%). As we saw from the previous example, the cost increases with scale: if the number of visitors grew 2x (to 200 million), the memory cost would proportionately rise 2x (to roughly 18.6 GB). Of course, we don't have the option of reducing the scale, as we can't tell users not to visit the website, but if we compromise on the accuracy of the number shown, we are still good.

The HyperLogLog algorithm solves the problem of finding the cardinality of a massive dataset using less than 1.5 KB of memory, with an error estimate of less than 2%.

#### Fundamental Working Principle

Before doing a deep dive into the internal workings of HyperLogLog, imagine you are playing a game with a friend: you flip a coin many times and record the longest run of heads you get. Later, you ask your friend to estimate the number of tosses that were needed. The longer the run of consecutive heads, the larger the number of tosses is likely to be; similarly, a short run of heads suggests a small number of tosses.

The probability of getting n heads in a row is 1/(2^n); that is, there is one chance in 2^n tosses of getting n heads in a row. In other words, after observing n heads in a row, the total number of tosses is approximately 2^n.

The same principle is used in HyperLogLog, the only difference being that heads and tails are replaced by 1s and 0s. The algorithm uses a hash function (with uniformly distributed output) to hash each input and keeps track of the leading zeroes in the output. If the maximum number of leading zeroes observed is k, the estimate of the cardinality is 2^k.

Let's revisit our previous example of computing the number of unique visitors to a website. For simplicity, let's use the user's IP address as the identifier for the user. The IP address is passed to a hash function which outputs a 32-bit integer. If the maximum number of leading zeroes observed is k = 20, then the estimated number of unique visitors becomes 2²⁰.
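A sketch of this single-hash estimator, using Java's built-in `Integer.numberOfLeadingZeros`; for illustration it hashes plain integers with Murmur3's well-known 32-bit finalizer mix rather than real IP addresses (both choices are assumptions, not from the original post):

```java
public class NaiveEstimator {
    // Murmur3's 32-bit finalizer: an invertible mixing step that spreads
    // input bits roughly uniformly across the 32-bit output.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        int n = 100_000; // pretend we saw 100k distinct visitors
        int maxLeadingZeroes = 0;
        for (int i = 1; i <= n; i++) {
            int hash = mix(i);
            maxLeadingZeroes = Math.max(maxLeadingZeroes,
                    Integer.numberOfLeadingZeros(hash));
        }
        // Estimate: 2^k, where k is the longest run of leading zeroes seen.
        System.out.println("k = " + maxLeadingZeroes
                + ", estimate = " + (1L << maxLeadingZeroes));
    }
}
```

For 100,000 distinct inputs, k typically lands near log2(100,000) ≈ 17, so the estimate is in the right ballpark but can only ever be a power of two.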

Following are the two disadvantages of the above approach:

1. The count will always be a power of 2. The only possible values are {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, etc.}, with nothing in between, which results in poor accuracy.
2. The variance of this method is high. If we have only one visitor whose hash happens to have 10 leading zeroes, our prediction will be wildly wrong.

#### Improving Accuracy & Reducing variance

In the above approach, instead of relying on a single hash function, we can use multiple hash functions to reduce the variance. For every hash function, we compute the maximum number of leading zeroes, then take the arithmetic mean across functions and use this mean to estimate the cardinality. For example, if we used 10 hash functions and 1 of them observed 20 leading zeroes while the other 9 observed one leading zero each, the average would be (20×1 + 1×9)/10 = 2.9 and the cardinality estimate 2^2.9 ≈ 7.5, far more reasonable than 2²⁰.
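Checking that arithmetic:

```java
public class AveragingDemo {
    public static void main(String[] args) {
        // One hash function saw 20 leading zeroes; nine others saw 1 each.
        double mean = (20 * 1 + 1 * 9) / 10.0;  // 2.9
        double estimate = Math.pow(2, mean);     // ~7.46
        System.out.printf("mean = %.1f, estimate = %.2f%n", mean, estimate);
    }
}
```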

An alternative to computing multiple hash functions is to harness the workaround proposed by Durand and Flajolet. In this approach, the output of a single hash function is segregated into buckets based on its leading bits, and each bucket keeps track of the maximum run of trailing consecutive zeroes it has encountered. Every time a value is assigned to a bucket, if its count of trailing zeroes is greater than the count stored in the bucket, the bucket is updated. Finally, the harmonic mean over all the buckets is used to calculate the cardinality.

Let's say we use a hash function and its output is 1011000010000101000. We use the first 4 bits of this output to determine the bucket. The first 4 bits are 1011 (11 in decimal), so the bucket number is 11. Next, we count the consecutive zeroes starting from the rightmost bit, which in this example is 3. We update bucket 11 with the value 3. Subsequently, if a hash mapping to the same bucket has more trailing zeroes, we update the bucket again. Following is an illustration :-

In the above diagram, the new hash has 5 trailing zeroes, which is more than the stored count (3). So, we update bucket 11 with the new value, i.e., 5.
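In code, extracting the bucket index and the trailing-zero count from a 32-bit hash takes one shift and one library call; the example hash value below is arbitrary, chosen to mirror the bucket-11 example above:

```java
public class BucketDemo {
    public static void main(String[] args) {
        int hash = 0b1011_0000_1000_0010_0000_0000_0000_1000; // arbitrary example
        // The top 4 bits select one of 2^4 = 16 buckets.
        int bucket = hash >>> 28;                       // 0b1011 = 11
        // Count the run of zeroes starting at the rightmost bit.
        int trailing = Integer.numberOfTrailingZeros(hash); // 3
        System.out.println("bucket = " + bucket + ", trailing zeroes = " + trailing);
    }
}
```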

The formula to calculate the cardinality is :-

E = 0.7942 × m × ( m / (2^(−R1) + 2^(−R2) + … + 2^(−Rm)) )

The constant 0.7942 in the above equation corrects the bias, and m is the number of buckets. The third term is the harmonic mean of the 2^(Rj) values, where each Rj denotes the maximum count of consecutive zeroes from the rightmost bit stored in bucket j.

The error is estimated to be 1.05/√m, where m is the number of buckets. When m = 2048, the error estimate reduces to 2.32%. For a population of 1 million, our estimate will lie anywhere between 976,798 and 1,023,201.
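Those numbers follow directly from the formula:

```java
public class ErrorEstimate {
    public static void main(String[] args) {
        int m = 2048;                          // number of buckets
        double error = 1.05 / Math.sqrt(m);    // ~0.0232, i.e. about 2.32%
        double population = 1_000_000;
        System.out.printf("error = %.2f%%, range = [%.0f, %.0f]%n",
                error * 100, population * (1 - error), population * (1 + error));
    }
}
```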

#### Practical Implementation of HLL in Java

In the below code snippet, I have used the Murmur3_32 hash function provided by Google's Guava Java library. I iterate over a million integers and calculate the hash of each. Since the output is a 32-bit integer, the first 5 bits are used to allocate the hash to a bucket. We therefore have 2⁵ = 32 buckets, and each bucket holds the maximum number of consecutive zeroes from the rightmost bit.

After calculating which bucket a hash belongs to, we compute the maximum trailing zeroes for that hash and update the bucket if this count is greater than the one stored there (Line 19–23).

Finally, we use the cardinality formula from the previous section to estimate the number of distinct elements.
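The original gist isn't reproduced in this post, so here is a self-contained sketch of the approach just described. To keep it dependency-free, it substitutes Murmur3's 32-bit finalizer mix for Guava's `Hashing.murmur3_32()` (an assumption on my part); everything else, the 5-bit bucket prefix, 32 trailing-zero registers, and the 0.7942 harmonic-mean formula, follows the text:

```java
public class HyperLogLogDemo {
    static final int BUCKET_BITS = 5;
    static final int M = 1 << BUCKET_BITS; // 2^5 = 32 buckets

    // Murmur3's 32-bit finalizer mix, standing in for Guava's murmur3_32.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    static long estimate(int n) {
        int[] buckets = new int[M];
        for (int i = 1; i <= n; i++) {
            int hash = mix(i);
            int bucket = hash >>> (32 - BUCKET_BITS);        // first 5 bits
            int zeroes = Integer.numberOfTrailingZeros(hash); // from rightmost bit
            if (zeroes > buckets[bucket]) {
                buckets[bucket] = zeroes;                     // keep the maximum
            }
        }
        // E = 0.7942 * m * harmonicMean(2^Rj) = 0.7942 * m * m / sum(2^-Rj)
        double sum = 0;
        for (int r : buckets) {
            sum += Math.pow(2, -r);
        }
        return Math.round(0.7942 * M * M / sum);
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_000_000));
    }
}
```

With only 32 buckets the expected error is 1.05/√32 ≈ 19%, so the estimate will land in the right ballpark rather than spot on; production implementations use thousands of buckets to bring the error under 2%.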