Clearing the fog on HyperLogLog.

Kyle
6 min readOct 3, 2016

--

Understanding Redis HyperLogLog with Node.js and Angular.

This is part 24 of my Node / Redis series. The previous part was On Redis Geo and Node.js.

A couple years back when Salvatore introduced the HyperLogLog (HLL) functions to Redis, I was initially confused — the whole concept of HyperLogLog was new to me. When it comes to algorithms, often what we use has been well established for years— to some degree we are still standing on the shoulders of 1970s giants. HyperLogLog is not like this — it’s relatively recent — the paper that described the algorithm is from 2007. I think this recency and consequential unfamiliarity has made it’s extremely useful Redis commands (PFADD, PFCOUNT, and PFMERGE) undeservingly obscure.

When I finally sat down and started playing with the commands, it was a revelation. Akin to when I started to understand Bloom filters, the possibilities opened up. I had this reaction:

Okay — so now that you’re excited, let’s talk about what HyperLogLog does:

A HyperLogLog can tell you an estimate of the number of unique items it’s been supplied.

That’s it — now, you may say, what’s the big deal? You can do the same thing with SCARD and it’s 100% accurate. The difference lies in space efficiency. With a set and SCARD, the size of the set is roughly proportional to the number of items in the set (assuming every item is the same size). If you store 100 items that are 100 bytes each, the set will be roughly 10,000 bytes (Redis memory representation is far more complex than this, but this works for a basic understanding). This roughly linear relationship is expected.

Now, HyperLogLog, on the other hand, internally uses hashing and the random distribution of bits to mathematically estimate the number of unique items that you’ve put into it. How it works is fascinating, but you really don’t need to know it in and out to start using it. What you do need to know is:

  • HyperLogLog is space efficient — the maximum size of the data structure is about 12kb.
  • HyperLogLog is pretty quick.
  • HyperLogLog has a low error rate — 0.81%.
  • HyperLogLog is stored, internally, as a normal string in Redis. You can call string functions on the HyperLogLog (GET, SET, STRLEN) to evaluate and manipulate.

All of this seems abstract until you start using it. For my own learning on HyperLogLog, I set to visualize the structure and compared it to set/SCARD. So, I built the “HyperLogLog / Node / Redis Playground.”

Building the playground

The playground in action

This playground is built with Node.js and Redis (natch) with Angular.js (1.x) handling the interactivity. Let’s first start with the server —the API needed to be able to:

  • Simultaneously add a single item to both a set and to a HyperLogLog.
  • Get the current size of both the set and the HyperLogLog.
  • Simultaneously reset both the HyperLogLog and the set.

For simplicity sake, I made all the API commands always return the sizes or, alternately, just get the sizes. Here is how the routes will look

  • POST /item Accept a string of any size in a JSON object (at “value”). Return the PFCOUNT and SCARD as well as the STRLEN of the HyperLogLog and a running total of number of bytes passed into the set.
  • GET /items Return the PFCOUNT and SCARD as well as the STRLEN of the HyperLogLog and a running total of number of bytes passed into the set.
  • DELETE /items Reset the HyperLogLog and set (DEL). Return the PFCOUNT and SCARD as well as the STRLEN of the HyperLogLog and a running total of number of bytes passed into the set.

As you might notice, since it’s always returning the same information, I’m using the same code — basically if Redis returns anything but an error it runs the same function after every request (storageInfo).

To serve the static assets, the server also uses the static middleware.

A couple of caveats:

  • The running total of bytes in the set is managed on the app level — so if you SADD extra items through redis-cli, those bytes won’t be counted.
  • The storage info is not atomic for code simplicity.

The client side is a short Angular.js app. If you’re not familiar with Angular, it’s great for putting together short Single Page Applications like this. It uses a template that is HTML with a few extra goodies. It consists of two static files:

  • index.html The HTML file that calls all the Angular assets and contains our only view.
  • hllplayground.js The javascript file that contains our angular bootstrap code and our controller.

Angular.js is being called from a CDN as is chart.js, a nice Javascript charting library. Chart.js is being Angular-fied by the angular-chart module.

In the controller, I’m using $http to make run-of-the-mill AJAX requests to get data from the server.

Using the playground

When you start the playground you’ll see two graphs. On the left is the history and the right is the current state of the set and HyperLogLog. The history graph will be empty as it managed for just the browser session.

Under the graphs is a tabular representation of the data. Below the table is a form where you can enter any value you’d like to store both in the set and in the HyperLogLog as well as a few buttons. Finally below the form is a history for the session.

To start out, just start throwing strings into the text field and click “submit.” After submitting you’ll see the graphs instantly change. On an empty HLL and set, add a single character. You’ll see the size of the HLL jump — this is the overhead of the structure as well as the bits being stored. The set (represented by the lighter bar), at this point will be smaller, but the count will be the same (no estimation errors yet).

Entering 1, 2, 3, 4 into the playground

Now, let’s click the “Random” button a few times. This is taking a javascript random number (Math.random) and submitting it to the server.

Clicking the “Random” button 6 times.

Hitting the random button six time (which inserted six, ~18-bytes strings) the set has really grown and surpassed the storage taken up by the HyperLogLog. Looking at the other graph you can see the linear nature of the set/SCARD method overtaking the HLL.

After 4 single byte items and 6 18-byte items

While the HyperLogLog is growing, it isn’t totally relational to the size of the items being inserted, and, in fact, will eventually cap out at around 12kb (technically it is 12kb now, but automatically run-length encoded).

Let’s enter some bigger text. I went and got some Lorem Ipsum — 424 bytes worth. Look how everything grew with the set:

Holy moly!

Recall earlier that HyperLogLogs can only estimate the number of unique items, right? Well, so far we haven’t seen anything — the counts are still the same. On the playground table, I’ve made the HyperLogLog count cell text turn red when the count is not the same as the SCARD count. This can take a while, so I suggest clicking on the random button like crazy. I got a mis-match after 75 random items, but it really depends on your data (and luck if it’s random).

HyperLogLog missing a count.

I want to play!

You can go and check out the playground source on github. You’ll need to grab the repo and run npm install. If you don’t have Redis installed on your machine, it would be a great time to check out RedisLabs’ free 30mb Redis Cloud instance. Put your connection details into a NodeRedis connection options object and save it as a JSON file (outside of the project’s directory!). Launch the server by running:

node index.js --connection path-to-your-connection.json

You should shortly see “server ready” in your terminal — go ahead and point your browser at http://localhost:8012/ and have fun!

--

--

Kyle

Developer of things. Node.js + all the frontend jazz. Also, not from Stockholm, don’t do UX. Long story.