Fast Approximation on Massive Datasets

shrimats
inspiringbrilliance
10 min read · Apr 28, 2020
Source — https://www.pngitem.com/middle/iRomxwh_transparent-grasp-clipart-approximation-symbol-hd-png-download/

It is better to use a crude approximation and know the truth, plus or minus 10 percent, than demand an exact solution and know nothing at all!

The Complete Murphy’s Law: A Definitive Collection (source)

For large datasets with multiple dimensions, summaries computed for each dimension can be quickly combined to obtain an accurate summary of various combinations of the dimensions (union, intersection, etc.). We have recently built a service to estimate the number of people reached by new audience segments in real time, for queries with any combination of dimensions. Here’s how we did it.

Audience Segmentation

We have been involved in building a data product for one of our customers, and one of the core problems was to leverage the data to target audiences based on user behavior.

Target the right audience — Credit — https://undraw.co/

We have the ability to define audience “segments”. A segment represents a group of users and is defined by a set of attributes that help target a certain set, e.g. postal code (users living in Chennai) or interests (users interested in “Automobiles”, itself a topic for another post).

Audience funnel (Source — https://wikinggruppen.se/google-shopping-remarketing/)

When a user is looking to create a segment to target a certain group of users, it is important to know the approximate size of that group so that the impact of the segment is clear: whether to proceed, or to widen or narrow the funnel, i.e. target a bigger audience or a more niche set.

In our case, the audience could be queried across around 12 dimensions, and segments were built using Apache Spark with S3 as the data lake. Our pipeline ingested TBs of data every week, and we had to build data pipelines to ingest, enrich, run models, and build the aggregates against which segment queries run. Building a segment itself took around 10 to 20 minutes depending on the complexity of the filters, with the Spark job running on a cluster of ten 4-core, 16GB machines. In the real world, creating a segment that is appropriate to target (especially a niche one) can take multiple iterations, and that is where approximation comes to the rescue.

Approximating segment sizes

Computing a segment size can be very time consuming because of two major problems:

  • The number of elements in a segment may be very large
  • A segment can combine the underlying sets in arbitrary ways (unions, intersections, etc.)
Segment built from filters like location=Chennai, interest=Music, Spending=10k-50k

The problem of approximating the size of an audience segment is nothing but the count-distinct problem (aka cardinality estimation): efficiently determining the number of distinct elements within a dimension of a large-scale data set. This has been a much-researched topic, and there are probabilistic data structures that help answer such questions in a rapid and memory-efficient manner. One example is the Bloom filter, which checks whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is probabilistic: it tells us that an element either is definitely not in the set or may be in the set. Let us talk about some of the probabilistic data structures that solve the count-distinct problem.
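As a point of contrast, here is what the exact approach looks like in Scala: keep every identifier you have seen. Memory grows with the number of distinct users, which is precisely what the sketches below let us avoid (the ids are made up for illustration).

```scala
import scala.collection.mutable

// Exact count-distinct: a set of every id seen so far.
val userIds = Iterator("u1", "u2", "u1", "u3", "u2")   // imagine billions of events here
val seen = mutable.HashSet.empty[String]
userIds.foreach(seen += _)
println(s"Exact distinct users: ${seen.size}")          // 3
```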

Sketches

Sketches implement algorithms that can extract information from a stream of data in a single pass, which is also known as “one-touch” processing. Some sketches can be deterministic, although most sketches are probabilistic in their behavior and take advantage of various randomization techniques.

There are many different versions of these sketches, but they all build on the following observation: if I store some information about specific patterns in the incoming data, I can estimate how many distinct items I have observed so far.

HLL Sketch

The most popular approach to solve the count-distinct problem is to use the HyperLogLog (HLL) algorithm, which allows us to estimate the cardinality with a single iteration over the set of users, using constant memory.

The popular databases Redis and Redshift as well as the Spark computing framework already have a built-in implementation of this algorithm.
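For instance, Spark exposes this through approx_count_distinct, which is backed by a HyperLogLog++ sketch. A minimal, self-contained example (the events data here is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().appName("hll-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq("u1", "u2", "u1", "u3", "u2").toDF("user_id")   // toy events table

// The second argument is the maximum allowed relative standard deviation of the estimate.
events
  .select(approx_count_distinct($"user_id", rsd = 0.02).as("approx_distinct_users"))
  .show()
```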

Algorithm

Proposed by Flajolet et al. (2007) in the original article.

The algorithm is described in terms of:

p — the number of bits used to select a bucket for averaging

h — a hash function that produces uniformly distributed hash values

rank — the number of leading zeroes in the remaining hash bits

Let’s use p = 3 to define the bucket (then m = 2³ = 8 buckets)

M = 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

  1. Let’s say we insert an element x where h(x) = 01100111
  2. Use the first p = 3 bits for the bucket and the remaining L - p = 5 bits for the rank, i.e. bucket(x) = 011 = 3, value(x) = rank(00111) = 2

3. Store value(x) at index bucket(x) = 3 of M, keeping the maximum value seen for that bucket

M = 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0

4. Estimate the cardinality using the HLL formula (see https://en.wikipedia.org/wiki/HyperLogLog):

E = α_m · m² · ( Σ_{j=0..m-1} 2^(-M[j]) )^(-1)

where α_m is a bias-correction constant (approximately 0.7213 / (1 + 1.079/m) for large m).

NOTE: For small cardinalities, HLL has a strong bias!
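To make the steps above concrete, here is a toy Scala illustration that mirrors the 8-bit example (p = 3 bucket bits, rank = number of leading zeroes of the remaining bits). It is only a sketch of the idea; a real implementation uses 64-bit hashes, stores the position of the leftmost 1-bit, and applies small/large-range corrections.

```scala
import scala.util.hashing.MurmurHash3

object ToyHll {
  val p = 3
  val m = 1 << p                         // 2^3 = 8 buckets
  val registers = Array.fill(m)(0)       // the array M from the example

  def insert(x: String): Unit = {
    val h = MurmurHash3.stringHash(x) & 0xFF    // pretend the hash is 8 bits, as above
    val bucket = h >>> (8 - p)                  // first p bits pick the bucket
    val rest = h & ((1 << (8 - p)) - 1)         // remaining L - p = 5 bits
    val width = 8 - p
    val rank =                                  // number of leading zeroes in those 5 bits
      if (rest == 0) width
      else width - (32 - Integer.numberOfLeadingZeros(rest))
    registers(bucket) = math.max(registers(bucket), rank)
  }

  def estimate: Double = {
    val alpha = 0.7213 / (1 + 1.079 / m)        // asymptotic bias-correction constant
    val z = registers.map(r => math.pow(2.0, -r)).sum
    alpha * m * m / z                           // alpha_m * m^2 * (sum 2^-M[j])^-1
  }
}
```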

In the Venn diagram above depicting the segments, we want to take unions/intersections across multiple criteria/sets to get the distinct counts. To calculate a union, we take the two register arrays M1 and M2 and build a new array M by applying, position by position, a rule similar to the one in step 3: M[i] = max(M1[i], M2[i]). This gives us a new base array on which we can evaluate the cardinality of the union. For intersections, there is no straightforward way to compute them with HLL (more info here).
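Continuing the toy illustration above, merging two register arrays built with the same p is just an element-wise max, and the merged array estimates the cardinality of the union:

```scala
// Element-wise max of two register arrays of the same length (same p)
def mergeRegisters(m1: Array[Int], m2: Array[Int]): Array[Int] =
  m1.zip(m2).map { case (a, b) => math.max(a, b) }
```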

Kth Minimal Value Sketch

The KMV Sketch is built on an algorithm that has the same speed and precision as HLL but supports intersections by design.

KMV objectives:

  • Estimate the number of unique identifiers in the entire dataset in a single pass through all the data.
  • Retain no more than k values in the sketch at any one time

Estimated count (cardinality) = (k - 1) / V(kth), where V(kth) is the value of the kth smallest hash

Steps:

  1. Maintain an ordered list of hash values
  2. Choose minimum k values
  3. Estimate = (k - 1) / V(kth)
  4. Reject hash > V(kth)
  5. Reject duplicate hashes
  6. Otherwise, insert in order, toss the top value, track V(kth)
KMV Sketch k=3 for a list with 10 elements (Source: https://datasketches.apache.org/docs/Theta/KMVbetterEst.html)

In the above image, k = 3, which means that we will keep the 3 smallest hash values that the cache has seen. The fractional distance that these k values consume is simply the value of the kth hash value, or V(kth), which in this example is 0.195. This is also known as the kth Minimum Value or KMV.
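Here is a minimal KMV sketch in Scala that follows the steps above, mapping hashes into (0, 1) as in the picture. The class name and its shape are my own for illustration; a production implementation (such as the theta sketch discussed later) works on 64-bit hashes and handles resizing and serialization far more carefully.

```scala
import scala.collection.immutable.TreeSet
import scala.util.hashing.MurmurHash3

final class KmvSketch(k: Int) {
  private var minHashes = TreeSet.empty[Double]   // ordered list of the k smallest hashes

  private def hash01(x: String): Double =         // map a hash to (0, 1]
    (MurmurHash3.stringHash(x).toLong - Int.MinValue.toLong + 1) / math.pow(2, 32)

  def update(x: String): Unit = {
    val h = hash01(x)
    if (!minHashes.contains(h) &&                       // reject duplicate hashes
        (minHashes.size < k || h < minHashes.last)) {   // reject hash > V(kth)
      minHashes += h
      if (minHashes.size > k) minHashes -= minHashes.last  // toss the top value
    }
  }

  def retainedHashes: Set[Double] = minHashes
  def kthMinimum: Double = minHashes.last         // V(kth)

  def estimate: Double =
    if (minHashes.size < k) minHashes.size.toDouble // fewer than k distinct items: exact count
    else (k - 1) / kthMinimum                       // (k - 1) / V(kth)
}

// With k = 3 and V(kth) = 0.195 as in the picture, the estimate is 2 / 0.195 ≈ 10.3
```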

How many hash values do we have to retain (i.e. what should k be) to compute the estimate of n accurately? As you might expect, the more samples we retain, the more accurate our estimate will be. To figure out what value of k is required, you must first determine what level of accuracy your application needs. The graph below can serve as a guide.

Source — https://datasketches.apache.org/

The RSE corresponds to +/- one Standard Deviation of the Gaussian distribution, which is equivalent to a confidence of 68%. To obtain the Relative Error (RE, no longer “Standard”) at 95% confidence and the same value of k you multiply the RSE value by two.

Choosing k = 4096 corresponds to an RSE of +/- 1.6% with 68% confidence. That same size sketch will have a Relative Error of +/- 3.2% with 95% confidence. For k = 4096, the hash table takes around 32KB of storage (8 bytes per entry). Once the sketch is built, the hash table is no longer required to compute estimates; only a compact sketch is needed. The size of this compact form is a simple function of the number of retained hash values (8 bytes each) plus a small preamble that varies from 8 to 24 bytes depending on the internal state of the sketch.
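As a back-of-the-envelope check, these numbers follow from the commonly quoted approximation RSE ≈ 1/√(k - 2) for this family of estimators (worth verifying against the library documentation for your exact sketch):

```scala
val k = 4096
val rse  = 1.0 / math.sqrt(k - 2)   // ≈ 0.0156, i.e. ±1.6% at 68% confidence
val re95 = 2 * rse                  // ≈ 0.031, roughly the ±3.2% quoted above
val hashTableBytes = k * 8          // 8 bytes per retained hash ≈ 32 KB
```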

For those interested in the mathematical proofs, Giroire has a straightforward and easy-to-follow development.

Set Operations

Even though KMV is just a distinct value estimator that estimates a count, there are some interesting probabilistic set operations that you can do with it as well.

Union

If you have the information for two KMV sketches, you can get an estimate to the number of unique items even without knowing the actual items in the sets.

To do a union, you just combine the two minimum-value lists and keep only the K smallest values, so that you are left with the K minimum values across both sets.

You can then use the resulting list to compute the estimate as before. This is the same as if you had tracked the items from both lists in a single KMV sketch: you would end up with the same K minima as if you had sketched the combined stream.

KMV(A union B) = KMV(A) union KMV(B)
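In code, using the hypothetical KmvSketch from the earlier sketch (with its retainedHashes accessor), a union estimate could look like this:

```scala
// Merge the two k-minimum lists, keep only the k smallest overall,
// then apply the same (k - 1) / V(kth) estimator.
def unionEstimate(a: KmvSketch, b: KmvSketch, k: Int): Double = {
  val merged = (a.retainedHashes ++ b.retainedHashes).toSeq.sorted.take(k)
  if (merged.size < k) merged.size.toDouble
  else (k - 1) / merged.last
}
```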

Intersection

If you have the information for two KMV sketches, you can also estimate the number of items they have in common. For this, we can leverage the Jaccard index:

Jaccard Index = count(intersection(A,B)) / count(union(A,B))

Since the union of A and B is the combined list of all items in those sets, and the intersection of A and B is the items that they have in common, you can see that if the sets have all items in common, the index will be 1 and if the sets have no items in common, the index will be 0. If you have some items in common it will be somewhere between 0 and 1. So, the index is just a measurement of how similar two sets are.

Basically, given the sketches of both lists, the K smallest hash values of their union form a uniform random sample of the combined set. Calculate the Jaccard index on that sample (by dividing the size of its intersection by the size of its union), and then use it, together with the union count estimate, to estimate an intersection count for the entire set.

count(intersection(A,B)) = index * count(union(A,B))
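Again with the hypothetical KmvSketch and unionEstimate from the earlier sketches, an intersection estimate could be written as:

```scala
// The k smallest hashes of the union form a uniform random sample of A ∪ B;
// the fraction of that sample present in both sketches estimates the Jaccard index.
def intersectionEstimate(a: KmvSketch, b: KmvSketch, k: Int): Double = {
  val unionSample = (a.retainedHashes ++ b.retainedHashes).toSeq.sorted.take(k)
  val inBoth = unionSample.count(h => a.retainedHashes.contains(h) && b.retainedHashes.contains(h))
  val jaccard = if (unionSample.isEmpty) 0.0 else inBoth.toDouble / unionSample.size
  jaccard * unionEstimate(a, b, k)
}
```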

Implementation

We used the org.apache.datasketches library to solve the problem. The data structure we needed exists in the DataSketches framework and is called a theta sketch. It was developed at Yahoo and open sourced as Apache DataSketches. The core of this library is based on the KMV algorithm discussed above.
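A rough sketch of how the library is used for set operations (the user ids are made up, and exact method names such as union/intersect vs. update differ slightly between library versions):

```scala
import org.apache.datasketches.theta.{SetOperation, UpdateSketch}

// One sketch per dimension value, e.g. location=Chennai and interest=Music
val chennai = UpdateSketch.builder().setNominalEntries(4096).build()
val music   = UpdateSketch.builder().setNominalEntries(4096).build()

Seq("u1", "u2", "u3").foreach(id => chennai.update(id))
Seq("u2", "u3", "u4").foreach(id => music.update(id))

// Union: users in Chennai OR interested in Music
val union = SetOperation.builder().buildUnion()
union.union(chennai)              // union.update(...) in older library versions
union.union(music)
println(union.getResult.getEstimate)

// Intersection: users in Chennai AND interested in Music
val intersection = SetOperation.builder().buildIntersection()
intersection.intersect(chennai)   // intersection.update(...) in older library versions
intersection.intersect(music)
println(intersection.getResult.getEstimate)
```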

For estimating millions of audience data points per category, for one dimension with around 1,000 unique values (e.g. the location dimension could have 100s of city values), it took around 30MB to store the sketches for all of that dimension’s values, about 24KB per sketch.

Distributed computation of sketches using Spark

Summarization of data can be done in a fully distributed manner using Apache Spark, by partitioning the data arbitrarily across many nodes, summarizing each partition, and combining the results.

The key idea with respect to performance here is to arrange a two-phase process. In the first phase all input is partitioned by Spark and sent to executors. One sketch is created per partition (or per dimensional combination in that partition) and updated with all the input without serializing the sketch until the end of the phase. In the second phase the sketches from the first phase are merged.
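A minimal Spark sketch of this two-phase pattern with theta sketches, assuming an S3 path of user ids for a single dimension value; the path, app name and deserialization helper (Sketch.wrap) are illustrative and may need adjusting to your library version:

```scala
import org.apache.datasketches.memory.Memory
import org.apache.datasketches.theta.{SetOperation, Sketch, UpdateSketch}
import org.apache.spark.sql.SparkSession

object BuildSegmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("segment-sketch").getOrCreate()
    val userIds = spark.sparkContext.textFile("s3://bucket/segments/chennai/") // hypothetical path

    // Phase 1: one sketch per partition, serialized only once at the end of the partition
    val merged = userIds
      .mapPartitions { ids =>
        val sk = UpdateSketch.builder().setNominalEntries(4096).build()
        ids.foreach(id => sk.update(id))
        Iterator(sk.compact().toByteArray)
      }
      // Phase 2: merge the per-partition sketches with a theta Union
      .reduce { (a, b) =>
        val union = SetOperation.builder().setNominalEntries(4096).buildUnion()
        union.union(Sketch.wrap(Memory.wrap(a)))
        union.union(Sketch.wrap(Memory.wrap(b)))
        union.getResult.toByteArray
      }

    val estimate = Sketch.wrap(Memory.wrap(merged)).getEstimate
    println(f"Estimated distinct users: $estimate%.0f")
    spark.stop()
  }
}
```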

Exhaustive sketches vs Long Tail

There are a few dimensions whose values number in the 100,000s, where it wouldn’t make sense to precompute and store a sketch for every dimension value. In statistics and retail, there is a concept of the long tail, referring to a distribution in which a large number of products each sell in small quantities, in contrast with a small number of best-selling products.

In the context of audience segmentation, a large share of the audience falls into urban areas and common interest categories, in contrast with niche categories or sparsely populated cities. How can we estimate all types of queries without exhaustively precomputing sketches for every filter value? I will talk about this in Part 2 of this series. (Coming soon!)

Source — https://www.business2community.com/marketing/the-long-tail-economy-what-you-need-to-know-02218650

If you made it till here, let’s wrap it up with a Dilbert:

Source— https://dilbert.com/strip/2019-3-10

References:

  1. Apache DataSketches — https://datasketches.apache.org/
  2. Comparing three solutions for estimating population sizes — https://schibsted.com/blog/1732486-2/
  3. Mergeable summaries and the DataSketches library — https://simons.berkeley.edu/talks/edo-liberty-5-3-18
  4. HyperLogLog original paper by Flajolet et al. — http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
  5. Mergeable Summaries, Agarwal et al., PODS 2012 — https://www.cs.utah.edu/~jeffp/papers/merge-summ.pdf
