The name is Log, HyperLogLog
I’ve finally started reading research papers, and I picked something way out of my league. Yes, I chose the research paper on HyperLogLog. I’ll be honest, my math skills aren’t advanced enough to understand everything in it. But, in this post, I’ll decode what I’ve understood and look at how we can implement it, maybe in JavaScript.
The Gist
HyperLogLog is an awesome algorithm that counts the distinct elements (a.k.a. the cardinality) of very large data streams. In other words, it’s a cardinality estimator.
Why an estimator?
Well, for obvious reasons, if you need an exact count, you’d need to load that big data stream (let’s call it M) into memory to count. And boy oh boy, wouldn’t that take a long time to load and to count? Plus, can we even store it all in memory in the first place?
Diving in a Little Bit…
Well, if I’ve managed to have your attention and not scare you away, why don’t we just go a bit further? I won’t deep dive (because I can’t at the moment), but let’s explore this algorithm.
The Algorithm
The following is the algorithm given in the official whitepaper:
Imagine you have a giant jar of jellybeans, and you want to know how many different colors of jellybeans are in the jar. Counting each jellybean one by one would take forever. Instead, HyperLogLog uses a clever method to give you a good estimate much more quickly. We’ll first break down the algorithm and then map the steps to the mathematical notions above.
Here’s how it works:
- Assigning Codes to Jellybeans: Each jellybean gets a unique code, like a barcode. This code helps us identify it without needing to look at every single jellybean.
- Creating Bins: We have several empty containers (bins) to sort these jelly beans into. Each bin will help us keep track of the codes we’ve seen.
- Sorting into Bins: We use the first few digits of each code to decide which bin it goes into.
- Counting special markers: Once a jellybean is in a bin, we count how far we need to go in its code to find a specific marker (like the number ‘1’). This count tells us something unique about the jellybean’s code.
- Combining Bin Information: After sorting all the jellybeans and updating the bins, we combine the information from all the bins. This combined information gives us an overall picture.
- Estimating the Total: Finally, we use a special formula to turn this combined information into an estimate of how many different colors of jellybeans are in the jar. This estimate is not exact but is usually very close to the actual number.
Okay, now that we have explained the basics of the algorithm, let’s see what the symbols and mathematical notions mean as per our Jellybeans analogy:
Well, that was too much to process! But that’s it.
What’s with the weird name?
I was also intrigued by the name HyperLogLog, but when I completed my first round of reading, the name couldn’t be more perfect. Let me tell you what it is:
LogLog
In mathematics, the logarithm (log) is a function that helps us work with very large numbers by transforming them into smaller, more manageable numbers. For example, the log to the base 10 of 100 is 2 because 10² = 100.
The “LogLog” part of the name refers to the use of the logarithm function twice. Using logs twice allows an algorithm to handle very large numbers while keeping the computations manageable. It is to be noted that the HyperLogLog algorithm is an improvement to its predecessor algorithm “LogLog”.
That brings us to -
Hyper
This prefix indicates an improvement or enhancement to a previous method and HyperLogLog is an improvement over a previous algorithm called “LogLog” introduced by Philippe Flajolet and G. Nigel Martin in 2003.
The HyperLogLog enhances LogLog by increasing the number of registers, using harmonic means, applying bias corrections, and allowing for distributed computing scenarios.
Now that I’ve mentioned how HyperLogLog helps in distributed computing, I think you might also like how Facebook utilized HyperLogLog in its Presto db. Give it a read here.
Enough talk, even I’m too exhausted. Let’s attempt to implement this in Javascript.
Step 1: Hashing Function
Most implementations out there use the Murmer Hash 3. Since we do not have a large data set, and from my experimentations it seems that using SHA256 provided a more uniform distribution. However, you are welcome to experiment with this on your own.
const crypto = require("crypto");
function hash(input) {
return crypto.createHash("sha256").update(input).digest("hex").slice(0, 16); // Ensure 128-bit hex string
}
Step 2: Initialize Registers
Next, we initialize the registers. The number of registers (m) should be a power of 2. We’ll set them all to zero initially. Using 16,384 registers (𝑚=16384) provided a more closed cardinality number for me since I have 100000 unique elements in my array.
const m = 16384; // Number of registers, should be a power of 2
const registers = new Array(m).fill(0);
Step 3: Update Registers
We update the registers based on the hashed values. For each element, determine which register it belongs to and update the register with the maximum position of the leading ‘1’.
function updateRegisters(value) {
const hashValue = hash(value);
const j = parseInt(hashValue.slice(0, Math.log2(m)), 2);
const w = hashValue.slice(Math.log2(m));
const leadingZeroes = w.search('1') + 1;
registers[j] = Math.max(registers[j], leadingZeroes);
}
Step 4: Calculate Raw Estimate (Harmonic mean)
Calculate the raw estimate using the harmonic mean of the register values.
function calculateRawEstimate() {
let sum = 0;
for (let i = 0; i < m; i++) {
sum += Math.pow(2, -registers[i]);
}
return (1 / sum) * m * m;
}
Step 5: Bias Correction
Apply a correction factor to the raw estimate. This correction factor can be empirically determined. For simplicity, we’ll use a precomputed value (Refer to AlphaM value in the HyperLogLog whitepaper, Fig 3).
const alphaM = 0.7213 / (1 + 1.079 / m);
function biasCorrection(rawEstimate) {
return alphaM * rawEstimate;
}
Step 6: Final Estimate
Combine the steps to get the final cardinality estimate.
function estimateCardinality() {
const rawEstimate = calculateRawEstimate();
return biasCorrection(rawEstimate);
}
Test Drive
Now that we have our building blocks, let’s put our code into a test drive (Here’s the final code in case you faced any difficulties).
const largeData = Array.from({ length: 100000 }, (_, i) => `item${i}`);
largeData.forEach(item => updateRegisters(item));
console.log("Estimated Cardinality for Large Dataset:", estimateCardinality());
Result
Not bad I’d say, again this is an estimation and we could even get a closer value had we had a bigger dataset and fixed the value of m properly.
Conclusion
I spent nearly 1.5 to 2 weeks just to grasp the fundamentals of HyperLogLog. This is just the beginning, and I know it’s something one would improve on with consistency and practice.
HyperLogLog is used in the detection of Denial of Service attacks, estimating the number of unique active users on platforms like Facebook, and more. It’s a powerful tool for dealing with big data, offering a balance between accuracy and memory efficiency. Whether you’re analyzing network traffic, processing log files, or managing large-scale databases, HyperLogLog is an invaluable algorithm for counting unique elements in massive datasets.