Privacy Friendly Histogram: the pros and cons of not knowing the user

Afonso Delgado
Published in Incognia Tech Blog
Feb 4, 2020 · 9 min read

Gathering intelligence from data is one of the main byproducts of our technological era. However, it may come at the price of dealing with sensitive data from the user.

Some simple and handy examples of such intelligence are frequency histograms. For example, suppose you are a business owner and want to know the level of recurrence of your customers, i.e., how many times a customer comes to your business in a week. An effective way to display such information is to create a histogram of customers' visits, as shown in the figure below.

The process of computing a frequency histogram. The histogram represents how many customers went to the place a given number of times.
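For concreteness, here is a minimal sketch of the exact computation, assuming a hypothetical visit log where each entry is a customer identifier:

```python
from collections import Counter

# Hypothetical visit log for one week: one entry per visit, keyed by customer ID.
visits = ["ana", "bruno", "ana", "carla", "bruno", "ana", "dani"]

# First pass: how many times each customer visited during the week.
visits_per_customer = Counter(visits)              # {"ana": 3, "bruno": 2, ...}

# Second pass: how many customers visited exactly k times.
histogram = Counter(visits_per_customer.values())  # {3: 1, 2: 1, 1: 2}
print(histogram)
```

Note that this exact approach needs an identifier for every customer, which is precisely the privacy concern discussed next.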

Computing these histograms from large amounts of data requires a lot of computational power and memory: we need to keep track of many entries to compare and aggregate. From the point of view of privacy, we may also run into a problem, since we need some kind of identifier for the entries.

We’ve come up with a simple but effective idea to perform this task in a way that is both fast and privacy-friendly. But first, let’s talk about the keystone of this idea: HyperLogLog.

The HyperLogLog

HyperLogLog (or HLL for close friends) is a probabilistic data structure introduced by the French computer scientist Philippe Flajolet. The idea of such a structure is to efficiently gather information about a set (a collection of things). With it, we can compute the cardinality (the number of things without repetition) of the set and perform unions with other HyperLogLogs.

For example, suppose that you plan to go to a friend’s house and play games. You have a pretty large set of games (on the order of millions), your friend also has a lot of them, and you two will play together. Since, of course, the plan is to play every game you both own, you want to know beforehand how many unique games you have together, so you can allocate the right amount of time to arrive at your friend’s house and still play everything before you have to leave (since you need to be home in time for dinner). HyperLogLog can save you some time in this process: it is just a matter of building a structure for each of your sets, and the cardinality estimation of the union of your game collections is almost instantaneous!

Because of these properties, this structure can be (and is) used for tracking large sets of data. However, another useful and perhaps accidental property is that it makes data pseudo-anonymous. In the process of creating the structure, elements are reduced to the number of leading zeros in the binary representation of their hashes, and only the maximum of those values is saved. Therefore, it is not easy (and often not even possible) to backtrack to those elements.
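To make that reduction concrete, here is a tiny sketch of how an element boils down to a single leading-zeros count (SHA-256 truncated to 64 bits is used here only as a stand-in hash; any well-mixed hash would do):

```python
import hashlib

def hash64(element: str) -> int:
    # Stand-in 64-bit hash for illustration; real HLLs typically use faster hashes.
    return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")

def leading_zeros(value: int, bits: int = 64) -> int:
    # Number of leading zeros in the `bits`-bit binary representation.
    return bits - value.bit_length()

# The element is reduced to one small number; the original value is not stored.
print(leading_zeros(hash64("user-42")))
```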

However, the intuition behind this is not simple, so an example helps a lot. The following example is entirely derived from Alessandro Nadalin’s blog post (and the figures are totally inspired by xkcd, check it out!).

LogLogParty

Imagine that you were invited to the great LogLogParty. There is one rule: if you meet someone new, you must ask for their phone number. So, trying to keep track of how many new people you met, you decide to think of strategies to do so. You come up with a list:

First Strategy: just write down the names

The first strategy is pretty straightforward: you just keep a list of all the people (or phone numbers of people) you met, remove duplicates (assuming you met so many people that you actually “met” some of them twice by mistake) and bam: you have your new-friends count.

The problem with such an approach is that you are going to have a big headache if you meet too many people. Removing duplicates and counting becomes a pain in the ass.

Second Strategy: count the number of leading zeros

The second strategy is a little more intelligent when it comes to making a lot of new friends. Instead of keeping track of the names of all the new people you have met, you decide to count the number of leading zeros in each person’s phone number and only save the largest count you’ve seen! At first glance, this strategy seems terrible. However, let’s think about it.

If you met someone who has five leading zeros in their phone number, this event is so unlikely that you must have met, on average, 100,000 people (assuming the erroneous but useful hypothesis that phone numbers are uniformly distributed and do not depend on the person who owns them). So that’s the new way to count: just keep track of the maximum number of leading zeros you’ve seen, and, at the end of the party, compute 10 to the power of that number.

Of course, the problems with this method arise immediately: you can only count in powers of ten, and, worse, you can be (un)lucky: your very first new friend might have, let’s say, six leading zeros, and you would automatically count one million new friends instead of one.

I mean, if you were an assiduous partygoer, on average you would count the right number of friends!
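In the party’s base-10 world, the whole second strategy is a couple of lines. A toy sketch, with made-up phone numbers:

```python
def leading_zeros(number: str) -> int:
    return len(number) - len(number.lstrip("0"))

def estimate_friends(phone_numbers):
    # Remember only the largest leading-zeros count ever seen...
    max_zeros = max(leading_zeros(n) for n in phone_numbers)
    # ...and turn it into an estimate at the end of the party.
    return 10 ** max_zeros

print(estimate_friends(["912345678", "004321987", "087654321"]))  # -> 100
```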

Third Strategy: making the second strategy better

The problems with the second strategy were poor precision and the high variance of the result. Well, there is a simple way to solve that (and it was already spoiled in the second strategy itself): do it a lot of times!

But keep calm! You do not need to attend a lot of parties. Just change a small part of your strategy: instead of keeping track of a single number of zeros, keep track of the number of zeros in each bucket! A bucket is a container that holds phone numbers sharing some kind of similarity. For example, we could have buckets representing the leftmost digit of the phone number, so numbers starting with 9 belong to bucket 9, and so on. Thinking this way, when you meet someone new, the first thing to do is go to their bucket. There, you keep track of the maximum number of leading zeros in the rest of the number (excluding the bucket part). You now have a lot of buckets, each one containing the maximum number of leading zeros among the numbers you’ve seen that belong to that bucket.

The good side of this strategy is that it takes away a great deal of the luck factor, since you would have to be very lucky to be lucky in every one of the buckets. It also provides a much richer set of possible cardinality values, since we are going to take the average of the counts across buckets.
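A hedged sketch of the third strategy, still in the toy base-10 setting: bucket by the leftmost digit, remember the maximum leading-zeros count per bucket, and average across buckets. (The plain average below is only for illustration; real HyperLogLog uses a harmonic mean plus a bias correction.)

```python
def leading_zeros(digits: str) -> int:
    return len(digits) - len(digits.lstrip("0"))

def estimate_friends_bucketed(phone_numbers):
    buckets = {}                                  # leftmost digit -> max leading zeros
    for number in phone_numbers:
        bucket, rest = number[0], number[1:]
        buckets[bucket] = max(buckets.get(bucket, 0), leading_zeros(rest))

    # Each bucket only sees roughly 1/m of the people, so average the exponents
    # across buckets and scale back up by the number of buckets.
    m = len(buckets)
    mean_zeros = sum(buckets.values()) / m
    return m * 10 ** mean_zeros
```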

Your new friend: HyperLogLog

You were just introduced to HyperLogLog. Swap people for elements and phone numbers for hashes, apply some fancy mathematical bias correction, and you can count cardinalities of large sets with low memory and processing, at the cost of a little bit of imprecision. Of course, this example does not cover all the components of HyperLogLog; there is a lot of work, from both engineering and academic perspectives, to make it better and more precise.
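To ground the translation from party to data structure, here is a minimal, hedged sketch of an HLL-like structure. `TinyHLL` is an illustrative name of my own, the small-range and large-range corrections of the real algorithm are omitted, and SHA-256 truncated to 64 bits again stands in for a proper hash:

```python
import hashlib

class TinyHLL:
    """Minimal HyperLogLog-style sketch (illustrative only; no range corrections)."""

    def __init__(self, precision: int = 10):
        self.p = precision                  # first p bits of the hash pick a bucket
        self.m = 1 << precision             # number of buckets
        self.registers = [0] * self.m       # max "leading zeros + 1" per bucket

    @staticmethod
    def _hash(element: str) -> int:
        return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")

    def add(self, element: str) -> None:
        h = self._hash(element)
        bucket = h >> (64 - self.p)                    # leading p bits
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros of `rest`, plus one
        self.registers[bucket] = max(self.registers[bucket], rank)

    def cardinality(self) -> float:
        # Raw harmonic-mean estimate; production implementations add corrections.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        indicator = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / indicator

    def union(self, other: "TinyHLL") -> "TinyHLL":
        # The union is just a register-wise maximum, which is why merging is so cheap.
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

# Back to the games example: two overlapping collections, one cheap union estimate.
mine, yours = TinyHLL(), TinyHLL()
for i in range(100_000):
    mine.add(f"game-{i}")
for i in range(50_000, 150_000):
    yours.add(f"game-{i}")
print(mine.union(yours).cardinality())   # roughly 150,000
```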

What about the histograms?

As I said, we developed a way of computing the histograms we talked about earlier. The idea is to use HyperLogLogs (kind of) to do so.

The idea behind the strategy is that the number of leading zeros in a hash, the value we track in an HLL, can be very representative of a single element. Say an element’s hash has ten leading zeros in its binary representation; given a reasonable number of users, this is such a rare event that, whenever we see ten leading zeros, it almost certainly is that element.

So, the number of leading zeros is a quasi-ID of a user, and the buckets act like a subsampling of the whole population. Why not build the histogram of the leading zeros? That’s precisely it.

Imagine that we have computed HLLs for each day of the week. If we look at a specific bucket on day one and see six leading zeros saved, and look on day two and see the same count of six leading zeros, it’s probably the same person (it could have been two different people by sheer luck, but that is unlikely). So we know that this person came to this place on two separate days. Therefore, we count how many times we see each number of leading zeros in each bucket across the entire week and voilà: a histogram without knowing the user!
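A hedged sketch of that counting step, reusing the illustrative `TinyHLL` from above (one sketch per day; the register value of a bucket plays the role of the quasi-ID):

```python
from collections import Counter

def leading_zero_histogram(daily_hlls):
    """For each bucket, count on how many days each leading-zeros value shows up,
    and treat that count as the visit frequency of one quasi-user."""
    num_buckets = len(daily_hlls[0].registers)
    histogram = Counter()                          # frequency -> number of quasi-users
    for bucket in range(num_buckets):
        days_per_value = Counter()
        for day in daily_hlls:
            value = day.registers[bucket]
            if value > 0:                          # the bucket was touched that day
                days_per_value[value] += 1
        for frequency in days_per_value.values():
            histogram[frequency] += 1
    return histogram
```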

An important factor that we need to take into consideration is that the histograms output by the algorithm are normalized, since looking only at the maximum number of leading zeros is equivalent to subsampling the population. However, going back to the real histogram is just a matter of multiplying each of its entries by the population size, which can be estimated with the very same HyperLogLogs.
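One way to read that step, still using the illustrative sketches above: estimate the population from the union of the daily HLLs and scale the subsampled histogram back up.

```python
def denormalize(subsampled_histogram, daily_hlls):
    # Population size estimated from the union of the daily sketches.
    union = daily_hlls[0]
    for day in daily_hlls[1:]:
        union = union.union(day)
    population = union.cardinality()

    # Scale the subsampled counts so they sum to the estimated population.
    total = sum(subsampled_histogram.values())
    return {frequency: count / total * population
            for frequency, count in subsampled_histogram.items()}
```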

Actually not that simple

In theory, the strategy is excellent; in practice, not so much (mostly because I lied in the theoretical part). Assuming that the number of leading zeros in a bucket uniquely identifies a user is valid in the majority of cases, but not all, and those cases are not negligible. It is precisely the cases where the assumption fails that make the histogram look skewed. The figure below shows it.

These are two histograms based on the same data (50,000 elements, with frequencies uniformly distributed over 10 days in the span of a month). The blue one is computed with the exact method, and the orange one is computed as described with HLL (with a precision of 10). The orange histogram is skewed to the right.

The solution: MinCodeSample

Since we could not compute the histograms associated with a HyperLogLog with precision, we took a new approach: save part of the hash of the user with the maximum number of leading zeros. This way, it’s possible to avoid collisions with (much) higher probability. Being able to avoid collisions implies that computing the histogram is simply a matter of randomly subsampling the population, and that, statistics guarantees!

MinCodeSample does exactly that. Its structure is just like HyperLogLog, except that it saves part of the hash in each bucket (in fact, there is a very similar structure called HyperMinHash, although it is not used for histograms). In this way, we can achieve much more precision, at the cost of saving part of the information that identifies a single user, that is, at the cost of less privacy. This trade-off comes from the fact that the longer the part of the hash that we save, the easier it becomes to identify the user that generated that hash; at the same time, the better and more precise the histogram calculation becomes, since we become even more certain of the uniqueness of this sample.
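A hedged sketch of the idea as described (the class and field names below are hypothetical, guessed from the description above, and the real structure may differ): each bucket keeps not just the maximum leading-zeros count but also a slice of the hash that achieved it, so matches across days are far less likely to be collisions.

```python
import hashlib

class MinCodeSampleSketch:
    """Illustrative sketch: like an HLL, but each bucket also keeps (part of) the
    hash that achieved the maximum number of leading zeros."""

    def __init__(self, precision: int = 10, code_bits: int = 16):
        self.p = precision
        self.code_bits = code_bits      # how much of the hash we keep: more bits,
                                        # better histograms, less privacy
        self.buckets = {}               # bucket -> (rank, code)

    @staticmethod
    def _hash(element: str) -> int:
        return int.from_bytes(hashlib.sha256(element.encode()).digest()[:8], "big")

    def add(self, element: str) -> None:
        h = self._hash(element)
        bucket = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        code = rest & ((1 << self.code_bits) - 1)      # low bits of the hash
        if rank > self.buckets.get(bucket, (0, None))[0]:
            self.buckets[bucket] = (rank, code)

# Matching across days now compares (rank, code) pairs instead of the rank alone.
```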

The same data as in the previous figure. Blue is the exact computation, and orange is the estimation based on MinCodeSample (saving the full hash of the elements). The results are much more similar.

Conclusion

Although we are aware that cardinality estimators do not preserve privacy, we understand that MinCodeSample represents a step towards keeping our users’ data safer. This safety comes from the fact that:

  • Users’ IDs are hashed with a salt.
  • Only the users whose hashes have the maximum number of leading zeros in each bucket are (possibly) traceable.
  • Raw data is not used to perform computations.

Therefore, we look favorably on this structure, and we are eager to improve it. Maybe you can help us.

Are you interested?

If you are interested in building context-aware products using powerful location technology that genuinely cares about users’ privacy, then check out our opportunities.
