The Wonderful World of HyperLogLog

Kyle
8 min readOct 28, 2016

--

This is part 25 of my Node / Redis series. The previous part was Clearing the fog on HyperLogLog.

Last time I introduced HyperLogLog and we were playing around with the command set in Redis. We built a playground application that could be used to visualize how HyperLogLog works. If you haven’t read that article, go back and read it now — it’ll help. I’ll wait…

💤

Ok, let’s talk about a practical application of HyperLogLog! Let’s say you’re building an app where a users can “check-in” to a location. Think Swarm or Facebook places. You have a location and users can indicate that they’ve been to the location. For each user, you’ll be recording the places they’ve checked into in Redis using a set. It could look something like this:

user-locations:userid location-a-Id

Adding a visit to a user would be a single command:

SADD user-locations:userid current-location-Id

It’s quick and easy to see if a user has been to a location before:

SISMEMBER user-locations:userid current-location-Id

And if you want to know what location’s they’ve been to, you can issue a SMEMBERS command:

SMEMBERS user-locations:userid

This is all pretty manageable — each user is going to have a relatively small number of locations they’ve visited. I’m guessing that even your top users will only have in the thousands, but most will have a few dozen.

Now, on the other side, how do you deal with the users that have visited a location? Let’s assume we use a similar setup for the locations as we do for the users — a Redis set:

location:a-location-Id { a-user-id, another-user-id, … }

We could then find the number of unique users by doing a SCARD command

SCARD location:a-location-Id

You could check to see if you’ve checked into the location by running SISMEMBER:

SISMEMBER location:a-location-Id my-user-id

For many locations, the number of unique visitors will be small— the local pizza joint might rack up a thousand unique visitors. But what happens if you have a popular location?

Looking at Walt Disney World on Facebook, I’m currently seeing over 5.4 million unique visitors that have checked in. If we store all the visitors in a set for this location, we could be in trouble.

Let’s do the math to see how much memory Walt Disney World’s set would take up. To get a good number, let’s flesh out our data. Just for an example, let’s say we’re storing our user IDs as 32-character hexadecimal (a la MD5). We’ll also make the assumption that each character takes up a byte of memory when stored in a set (a little naïve, and may be on the small side, but stay with me).

5,400,000 * 32 = 172,800,000 (~173 mb)

That’s a lot of in-memory storage for a single location. What happens if you have thousands of locations? Millions? Not to mention that as your service grows, you’re going to increase the storage as 1) your user base grow, 2) number of locations grow and 3) your number of check-ins increase. Scaling will become unmanageable.

As you might have guessed, we can ease this problem with HyperLogLog. Taking stock of the situation when it comes to the data about checking in, you really only need to know the following things:

  • Have I checked into this location before?
  • How many unique users have checked into this location previously?

The first question can be answered with just the user’s own data. The second question we can answer with PFCOUNT. Now, it comes with caveats — first, you won’t be able to answer “Who has been to this location?” but you’ll likely want to this information somewhere (maybe not in Redis, given the size of the set at scale), but it’s certainly not needed to render the page. Secondly, you have to be aware that the answer that PFCOUNT supplies is only an estimate. It’s going to be close, but it may not be right on the money. Since this figure is likely not going to be used for anything but a casual “look how many people have checked in here!”

Despite the above mentioned caveats we’re gaining some huge benefits. The maximum size that the HyperLogLog can take up is 12kb. All of your check-in information will grow at a nice, linear rate to the number of locations. If you have 1,000 locations, the sum of your location data will be 12 mb. 10,000? 120 mb. It doesn’t matter if every one of those 10,000 locations are a Walt Disney World-scale number of check-ins, it’ll be no more than 120mb for 10,000 locations.

Now, let’s say that you have a few locations — Walt Disney World, Epcot Center, Magic Kingdom, Hollywood Studios, and Animal Kingdom. In this example, the latter four all make up Walt Disney World. With sets, this could become a massive computational undertaking to use SUNION. PFCOUNT allows you to supply multiple arguments to get an on-the-fly unique count across multiple HyperLogLogs. The performance isn’t as good as a single key, but it’s better than SUNION, especially at scale. In the same vein if you wanted to remove the comprising parks and merge everything into one park, you can call PFMERGE to combine all the locations together without sacrificing or double counting anyone. Handy!

Check-dis

Check-in +redis… get it?

Now, let’s take that idea and make it into the start of a web app. We’re going to build it using the same basic stack as the HyperLogLog playground: Node.js, Express, Angular.js and Redis (the NEAR stack). We’re not going to get into the weeds of authentication, so we can simulate part of that by having each session login- as long as you don’t reload the page it will hold on to your “user.” The basic functions here are:

  • List the locations
  • Check-in to a location
  • View the count of a single location

Just for fun, let’s add in some other features that will help illustrate what you can do with HyperLogLog.

  • Send random “check-ins” with fake email addresses in bulk
  • Check out the combined total of several locations

On the server-side, I’ll try and keep it as minimal as possible.

  • PUT /location/[location-key] check-in to a location, with the user’s email in the body of the request, return the current number of checkins
  • GET /location/[location-key] return the current number of checkins
  • GET /locations return a list of all the locations available
  • GET /multi-location/[location-key+location-key2+…] get the total of the locations listed in the parameter.

The user’s email is hashed into MD5 — this will be our emulation of a unique user ID field. We’re also only allowing users to check-in to certain locations by checking to see if a location is a member of a set. You’ll need to add some locations (thematically, I’m using Disney parks):

SADD locations animal-kingdom magic-kingdom walt-disney-world epcot

Grab the script from github and run:

npm install

I’m using the middleware pattern pretty extensively in this script. If you’re not familiar, it’s basically an intermediary step between a route and sending a HTTP response — this step can modify or add to data about the request or, alternately, it can reject a route and send an error. This pattern is especially useful in Redis as it’s friendly to asynchronous calls and Redis is so fast that you can easily do even a hand full of round-trips without getting outside of the acceptable range for response time.

To get the script started, run:

node index.js --credentials ../your-credentials-file.json

The your-credentials-file.json is a node_redis connection object stored as a JSON file. Make sure you store this file in a safe place (outside of your project directory — no one wants credentials being posted to a repo accidentally!). Of course, aside from node you’ll need Redis installed on your dev machine. Alternately, you can head on over to RedisLabs to get a free 30mb Redis Cloud instance.

Once running, you’ll see the “app started” message in your terminal. You can point your browser at http://localhost:8013/ . You should see something like this:

If you click the “Fake Email” button you can generate fake but valid emails (and sometimes entertaining). We’re using the faker library to do this magic. Put your own email or generate a fake one and click “Login.” You should see the dashboard:

Call me Mr. Kiehn

Click on one of the locations on the right hand part of the screen — this brings up where you can check-in to that location.

Click the “Check-in” button — notice that the check-ins will increment. You’ll also see a status below that indicates you’ve checked-in. If you click it multiple times, you still get the notice that you’ve checked in but the number won’t increment (as expected).

Ooooh — 6518 to 6520! I captured a the 0.81% chance of an count error!

Now, you might have noticed that the number of check-ins is 6520. No, I haven’t spent hours checking-in random users. I used the random check-in buttons— these send a random email addresses to the server in bulk. They are sent all at one time. Let’s see how this works.

Now, you might notice these check-ins were sent all at once (100 Requests) but they are coming back one at a time. This actually is illustrating a few different things:

  • Browsers only have a limited number of ports available
  • Node is single thread
  • Redis is single thread.

You can see how the requests go out and responses flow in with Chrome’s DevTools:

Checkout clicking the “Random Check-ins (100)” 5–6x in a row and watch how the data flows in neatly. It’s an interesting way to visually see how load occurs and the server deals with the strain (although maybe not the most scientific).

You can also play with getting the totals from multiple locations (:ahem: HyperLogLogs).

It’s even pretty quick though PFCOUNT with multiple keys is considered a slow HyperLogLog operation.

While this demonstration code is far from a complete app, you can see a few techniques:

  • Redis HyperLogLog to count big, unique values
  • Calling Redis in Express middlewares
  • Redis’ PFCOUNT command to combine the unique counts over multiple hyperloglogs.

--

--

Kyle

Developer of things. Node.js + all the frontend jazz. Also, not from Stockholm, don’t do UX. Long story.