HyperLogLog (or how we estimate large numbers of unique URLs)

Fabien Marty
Botify Labs
Feb 1, 2024
Header image: an AI hallucination generated with the prompt “illustration on HyperLogLog” ?!?! => a good reason to write a blog post ;-)

A real-life use case @Botify

Let’s say we have dozens of text files, each containing several million URLs. These files are generated by some batch jobs and, once generated, can be considered “read-only”.

Now, we want to build an interactive service that returns the total number of unique URLs in a subset of these files (bearing in mind that there may of course be duplicate URLs between different files).

As this is an interactive service (with a real human waiting for the result!), it has to be very fast 🚀

Let’s zoom out a little bit

Before digging into this real-life example @Botify, let’s zoom out a little bit and try to consider a more general question: how to get the number of unique values (aka cardinality) within a huge dataset?

The “exact” way

To do that exactly, we have to scan the whole dataset and, for each value:

  • check whether the value is already in an “already found” data structure with fast random access (a dict in Python, a map in Golang…)
  • if not, add it (or, better, add a hash of the value) to this “already found” data structure

In the end, your result is the length of your “already found” data structure.
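
To make this concrete, here is a minimal Go sketch of this exact approach (the toy URLs are just placeholders, and we reuse the xxhash.ChecksumString64 helper that also appears later in this post; this is not our actual code):

package main

import (
	"fmt"

	"github.com/OneOfOne/xxhash"
)

// exactCardinality scans all values, keeps a set of 64-bit hashes of the
// values already seen, and returns the size of that set.
func exactCardinality(values []string) int {
	seen := make(map[uint64]struct{})
	for _, v := range values {
		h := xxhash.ChecksumString64(v) // store a hash instead of the full string
		if _, found := seen[h]; !found {
			seen[h] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/a"}
	fmt.Println("exact cardinality:", exactCardinality(urls)) // prints 2
}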

Cons:

  • this is a little bit CPU intensive* when the dataset is very large
    (~ 20s for 100 million values on my M1 laptop in Golang)
  • you need a significant amount of memory for your “already found” data structure
    (~ 20 bytes per value in the best case: memory-efficient language like Golang, hashed value…)

[*]: For each of the n values you have to query your “already found” data structure. With a hash-based dict/map this lookup is O(1) on average (O(log n) for tree-based structures), so the total cost is roughly O(n) to O(n·log n), with significant constant factors (hashing, allocations, cache misses).

The “estimated” way

Now, imagine you can estimate the number of unique values with:

  • an error rate of ~ 1%
    (in most cases)
  • with much lower CPU usage
    (divided by ~ 5 in my basic test with 100 million values)
  • with incredibly less memory (a few KBytes only!)
    (divided by ~ 120 000!!! in my basic test with 100 million values)
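
(As a rough sanity check on those numbers, assuming the precision of 14 used in the examples later in this post: the exact approach needs about 100 million × ~20 bytes ≈ 2 GB of memory, while a precision-14 HyperLogLog keeps 2¹⁴ = 16 384 registers, i.e. on the order of 15 KB, which is indeed a ratio of roughly 120 000.)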

Interesting, isn’t it?

That’s the magic of the HyperLogLog probabilistic algorithm we’re going to talk about!

HyperLogLog: how does it work, simply put?

The algorithm is documented in the original paper by Philippe Flajolet, a French mathematician, and its practical implementation and variants were covered in depth by a 2013 paper from Google.

Let’s try to demystify it (with huge simplifications)

Say we have a hash function to transform a value (a string in our example) into an integer (for example, a 64-bit integer) and then into the binary representation of this integer (so an ordered list of 64 binary digits: 0 or 1).

The (very simplified) idea behind HyperLogLog is to count the longest series of leading zeros among all the binary representations of the values whose cardinality we want to estimate.

For a single value, if the hash function is “perfect”, you would probably agree that the probability for the binary hash to start with 20 leading zeros is 1/2²⁰.

So, let’s reverse this sentence! If the longest series (among all our values) of leading zeros is 20, then we have probably seen 2²⁰ ~ 1 million unique values!

Ok, there are plenty of flaws in this example, but this is the basic idea behind the HyperLogLog algorithm. If you want a progressive dive into it, I recommend this excellent blog post from Cheng-Wei Hu.
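
If you want to see that intuition running, here is a deliberately naive Go sketch of it (this is NOT the real algorithm: the real HyperLogLog splits the hash space into many registers and combines them with a harmonic mean to reduce the variance; the values and the xxhash choice are just illustrative):

package main

import (
	"fmt"
	"math"
	"math/bits"

	"github.com/OneOfOne/xxhash"
)

// naiveEstimate tracks the longest run of leading zeros among the 64-bit
// hashes of the values and returns 2^maxLeadingZeros as a (very rough) estimate.
func naiveEstimate(values []string) float64 {
	maxZeros := 0
	for _, v := range values {
		z := bits.LeadingZeros64(xxhash.ChecksumString64(v))
		if z > maxZeros {
			maxZeros = z
		}
	}
	return math.Pow(2, float64(maxZeros))
}

func main() {
	values := []string{"foo", "bar", "baz", "foo"}
	fmt.Println("naive estimate:", naiveEstimate(values))
}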

An example in Golang

A simplified example in Golang with the HyperLogLog library: https://github.com/clarkduvall/hyperloglog

Note: item is a simple structure which implements the Sum64() uint64 interface expected by the library:

type item struct {
	s string
}

func (i item) Sum64() uint64 {
	// we use a standard xxhash here to compute a uint64 (from the string)
	return xxhash.ChecksumString64(i.s)
}
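
And here is a minimal sketch of how such items might be fed into a HyperLogLog (precision 14, as in the snippets below; the URLs are placeholders and the hyperloglog import is the library linked above):

hll, _ := hyperloglog.NewPlus(14)
for _, url := range []string{"https://example.com/a", "https://example.com/b", "https://example.com/a"} {
	hll.Add(item{s: url})
}
fmt.Println("estimated cardinality:", hll.Count())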

Another interesting feature of HyperLogLog that we are going to use in the next part: we can easily merge several HyperLogLog structures to get an estimate of the cardinality of the union of the underlying sets!

hll1, _ := hyperloglog.NewPlus(14)
// ... we feed this first HyperLogLog
fmt.Println("estimated cardinality hll1:", hll1.Count())

hll2, _ := hyperloglog.NewPlus(14)
// ... we feed this second HyperLogLog
// ... (with some duplicates already present in hll1)
fmt.Println("estimated cardinality hll2:", hll2.Count())

// let's merge hll2 into hll1 to get the estimated cardinality of the union
// no need to iterate over both datasets another time!
hll1.Merge(hll2)
fmt.Println("global cardinality:", hll1.Count())

Back to the real-life example @Botify

If we go back to our real-life example at Botify described at the very beginning of this post:

Let’s say we have dozens of text files, each containing several million URLs. These files are generated by some batch jobs and, once generated, can be considered “read-only”.

Now, we want to build an interactive service that returns the total number of unique URLs in a subset of these files (bearing in mind that there may of course be duplicate URLs between different files).

Reading the files in full and counting exactly is not an option as:

  • these files are too big (and would take too long to read for an interactive service!)
  • it would cost a lot of memory (to exclude duplicates)

With the help of the HyperLogLog algorithm, here is the strategy we use:

  • First, we modify the batch process that builds our “read-only” URL files so that it also computes a HyperLogLog data structure and dumps it on disk, next to each file
  • Then, we introduce our low-latency API to estimate the cardinality
  • This service reads only the HyperLogLog structures (a few KB each), merges them, and returns the estimated count! (a rough sketch follows below)

=> This is “fast enough” for interactive use! 💪

Note: the reality is a little more elegant, but we prefer to keep things simple here.
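
Still, to give a rough idea, here is a sketch of the query-time path (loadHLL is a hypothetical helper standing in for “read the sketch file written by the batch job next to the URL file and deserialize it”, whatever format you chose for that; error handling is omitted for brevity):

// estimateUniqueURLs loads the pre-computed HyperLogLog structure of each
// requested file, merges them all, and returns the estimated cardinality
// of the union.
func estimateUniqueURLs(hllPaths []string) uint64 {
	// same precision as the one used by the batch jobs
	merged, _ := hyperloglog.NewPlus(14)
	for _, path := range hllPaths {
		// loadHLL is hypothetical: read + deserialize the saved sketch
		hll := loadHLL(path)
		// fold this file's sketch into the running union
		merged.Merge(hll)
	}
	return merged.Count()
}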

Conclusion

Depending on your use case (some of them can’t stand approximation!), HyperLogLog can be an incredibly useful thing in your toolbox!

HyperLogLog is particularly efficient at estimating the cardinality of large datasets (100 million URLs in our example):

  • with relatively high accuracy
    (~ 1% in our example)
  • while using a small amount of memory and CPU
    (~ 15KB of memory instead of 2 GB, CPU usage divided by 5 in our example)

It’s one of the probabilistic algorithms we regularly use at Botify as we deal with huge amounts of data.

We also use other ones, like Bloom filters. Maybe that's a perfect subject for another blog post!
