DataSketches for Fast Computation

Noah Stebbins
Xandr-Tech
Published in
8 min readMay 12, 2020

Joint work with Lei Hu, Chinmay Nerurkar, Rebecca Conneely, and Moussa Taifi

Photo by Dhiva Krishna on Unsplash

Your name is Magnus and you sell cool cars, specifically the coolest cars. Your business is booming locally as swathes of budding drivers in your town purchase your cars. Happy but ultimately unsatisfied at your early success, you decide to expand. Convinced that there are infinite prospective customers just out of reach, you begin to advertise.

You.

You create some swanky car ads and enter the world of real-time bidding (RTB). But you are quickly overwhelmed by millions of publishers on which you can serve your ads. Paralyzed by so much choice, you call me, Noah, the ad guru to help you figure out which publishers you should bid on to display your car ads.

Your woes.

Each publisher presents you with opportunities in the form of impressions. Each impression roughly corresponds to a user viewing a publisher’s content on the web and it is up to us to figure out which of these impressions to bid on. So you ask me,

Which impressions should I bid on?

which is really the same thing as asking,

Which people browsing the web right now want my dope cars?

As a motivating example, take two potential customers: Lil Tike and Henry F.

Adtech in one image.

Lil Tike is a kid visiting kidz.com while Henry F. is an adult browsing carloverz.com. Each is represented in the form of impressions and you, Magnus the Courageous, have to decide whether you want to bid on Lil Tike, Henry F., or both.

Lil Tike is not old enough to drive so it is a safe bet he won’t buy your car. Henry F., on the contrary, is an adult who has expressed a clear interest in “loving carz”. Thus, you should probably bid on Henry F. and not Lil Tike.

So, great! We’ve answered the question, “Which impressions should I bid on?” for the simple two impression cases above. But how do we answer this question for millions of impressions?

Impressions Intensify

You have cracked the case for painstakingly judging the potential of incoming impressions by hand, but is there a faster way? A… [wait for it]… scalable way?

Photo by i yunmai on Unsplash

For simplicity, let’s call the act of someone buying one of your cars after viewing one of your ads a conversion. Recall that every incoming impression has a corresponding website to which it belongs. In the motivating example above, the two websites were kidz.com and carloverz.com.

In general, conversions are sparse — intuitively, most people that see a car ad won’t buy a car. While this varies from industry to industry, display ads generally yield conversion and click-through rates below 1%. But a small subset of them will and these conversions will amount to considerable piles of cheddar for your business.

So, here’s one way we can frame a solution to the question, “Which impressions should I bid on?”

Conversions are rare, so I have to leverage the conversions that already exist to the best of my ability! One way to do that for an incoming impression is to make a list of websites that are similar to the impression’s website that have already had conversions. I can then take some kind of average of those conversion rates in order to estimate what the conversion rate of the incoming impression will be.

Baked within this hypothesis is the following assumption…

Websites with high user overlap will probably have similar conversion rates.

… and the concept of website similarity to which I now turn.

Similarity à la Jaccard

For quantifying user similarity, is there a measure you could leverage that would output a higher value if two websites have similar users and a lower one if two websites are dissimilar?

Yes! One such measure is the Jaccard Index. Let’s say that on June 1st you would like to determine user similarity between kidz.com and carloverz.com. You represent each site with a set.

User sets.

The Jaccard Index J(A, B) of any two sets A and B is defined as

Jaccard Index, yo.

Since Joe is the only user who visited both sites on June 1st, the Jaccard Index between the sets above is

Jaccard Index for kidz.com and carloverz.com.

Intuitively, if two websites have a large proportion of users in common, the corresponding Jaccard Index will be high. Great, we’re done, right? We can plug the Jaccard Index into our hypothesis and just bid when we think the conversion rate of an incoming impression is going to be high enough, right? Not so fast…

Omnipresent Resource Scarcity

While the Jaccard Index intuitively foots the bill as a similarity metric, it is costly to compute. Why? Let’s look again at our preceding example.

User sets, revisited.

When we computed J(A, B), we needed to look at all users across both sets. For example, in order to compute the union, we needed to count all users present in both sets, while subtracting appropriately for the double-counted users. In software, this implies that in order to compute J(A, B), we need all distinct users of A and all distinct users of B in memory.

But websites can have millions of unique users on any given day! Oof. Is there any way we can significantly reduce the computational cost of the Jaccard Index at the expense of a little bit of precision in our final result? Affirmative.

Sketching Without a Pencil

There is a nifty software library called Apache DataSketches that enables you to get approximate answers to set queries quickly. For example, we can calculate approximations to |AB| and |AB| wicked fast.

DataSketches represents a set A with a small, approximate data structure called a “Sketch”. The word “Sketch” itself hints at something imprecise as it alludes to an artist’s sketch.

Here’s how it works. A Sketch needs only to retain a subset [of fixed cardinality] of the original data set in order to provide approximate answers to questions about the original data set. For example, let’s say you have a data set of users and you want to estimate the number of distinct users. You can approximate the distinct count by feeding users into a special type of Sketch called a “K Minimum Value Sketch” or KMV Sketch. Let’s visualize the process of feeding a user into a KMV Sketch.

The internals of a KMV Sketch.

A KMV Sketch relies on a uniform random hash function and an ordered list to keep track of the k smallest users. The KMV Sketch approximates the distinct count of users, n, with a proportion.

Approximating the number of distinct users.

For more details on exactly how a KMV Sketch works, check out this tutorial. While a KMV Sketch is merely one type of Sketch, it exhibits some characteristics common to all Sketches:

  1. It stores a representation of a subset of the original data set
  2. It has tuning parameters (e.g. k) that we can tweak to change our approximation

Given these characteristics, I will represent a general Sketch with the following graphic.

A Sketch.

Let’s represent A (our set of users visiting kidz.com on June 1st) as a Sketch.

Classic love story of set meets Sketch.

Our Sketch above takes a parameter k, which represents the number of users that should be kept in the Sketch. Thus, it doesn’t matter how big the original user set is — it could be 10 users, 100 users, or 1 googolplex users; the sketch will always have at most k users. For didactic purposes here, I chose k=3. This solves our ‘high memory’ conundrum by bounding the size of our internal representation.

Once we have our data represented this way, we can apply DataSketches approximation algorithms. How about an intersection of two sketches?

Hey look, it’s distributive!

Be advised that Sketches will cause you to lose some information! You cannot ‘un-sketch’ a Sketch in order to reproduce the exact set of users you started out with.

Because the Jaccard Index is merely computed using a set intersection and a set union, its approximate computation is directly supported by the DataSketches library.

Jaccardin’.

Wrappin’ Up

At Xandr, when we leveraged DataSketches for bidding on incoming impressions, we saw astounding results. We process about 25 GB of raw (URL, user) data from winning bids per day. Assuming fixed computational resources, we achieved a 4x speedup when switching our ETL pipeline to use DataSketches instead of the raw user data.

In addition, by randomly sampling websites with a sufficient number of users we found that over 99.9% of them had a true Jaccard Index within the upper and lower bounds dictated by the DataSketches library. Specifically, for a true nonzero Jaccard Index over 96% of websites contained the true Jaccard Index within bounds.

Estimated vs true Jaccard Index

So, there ya go! You can judge the relevance of an incoming impression by comparing it to historical impressions across similar websites. And you can determine this relevance very quickly and accurately using DataSketches.

On your quest to become the G.O.A.T. of the car industry, you have learned one way to judiciously advertise your whips. As your ad consigliere, all I ask is that you let me drive one.

Acknowledgements

I would like to thank Lei Hu, Chinmay Nerurkar, Rebecca Conneely, and Moussa Taifi for making this project a success!

We are hiring! If any of these challenges interest you, check out our open roles.

--

--