HyperLogLog Algorithm Part I : Flajolet–Martin algorithm

Raghavan
3 min readJul 8, 2018

--

HyperLogLog Algorithm (Thanks to The Morning Paper)

HyperLogLog is a efficient algorithm that approximates distinct elements in multiset.It is being used in most of big data systems , including spark to compute the count of distinct elements.

The Problem

Lets define the problem , we have a multiset (can have multiple occurrences of same element) of elements and we wanted to compute the cardinality (Count of distinct elements) of the multiset.Computing the exact cardinality on a large data is hard , costly as we need to see every element of the multiset and compare them.

Most of the times we are fine to have the an approximation of the cardinality. HyperLogLog computes the approximation of with very less memory.

HyperLogLog is an extention on the LogLog Algorithm , which in itself is a derivation of Flajolet–Martin algorithm . Lets look at each of these algorithms and understand how it operates in parts . To keep the read short , we will see each of them as a independent stories. Lets start with the Flajolet–Martin algorithm .

The Flajolet–Martin algorithm

Assume our multiset as M , with elements e1 …… en

Lets say we are given a function hash(e) which maps elements to integers in uniformly distributed over range 0 to 2^L -1. And we also define a function bit(y,k) which returns k th bit in the binary number y , such that

y = bit(y,i) 2^i + bit(y,i-1) 2^i-1 + . . . . + bit(y,1) 2¹

Then we define a function p(y) which returns the least significant 1-bit of y , defined formally as :

p(y) = min of k such that bit(y,k) ≠O if y > O

For convenience we will assume

p(y) = L if y = 0

We also hold a hash called BITMAP of size L and initiate all the values to 0. Now we loop through all the hash(e) elements and we find the least significant 1-bit as x , then we mark that BITMAP[x] as 1.

If we assume that the number distinct numbers in the multiset is n , the we can estimate the approx number of times BITMAP[0] is accessed : n / 2 times (given that the hash function maps to evenly distributed numbers , then there number of even numbers should be close to 50% , hence n/2 ), and number of time BITMAP[1] is accessed : n / 4 . Extending this argument , we can generalize this to :

if i ≫ log2 ⁡ n , then BITMAP[i] is almost certainly 0, and

if i ≪ log 2 ⁡ n , then BITMAP[i] is almost certainly 1 .

Then if we find the smallest index R in BITMAP such that BITMAP[i] = 0 , then , we can estimate that cardinality of the multiset as 2^R .

In the next story let look at the clever improvement over Flajolet–Martin algorithm : the LogLog algorithm .

--

--

Raghavan

Data scientist at Ericsson AI Accelerator, Kravmaga Trainer