Tiered Reservoir Sampling: a Solution for Large Quantities of Streaming Data

Miles Carlsten
Trimble Maps Engineering Blog
9 min read · Jul 24, 2019

TAMI (Trimble Automated Map Improvements) is a recent project we have been developing at Trimble MAPS that takes anonymized customer GPS data (in the form of latitude/longitude points with speed, heading, and other data) and uses it to automatically detect and report potential map data issues. TAMI is constantly producing lists of new issues, and at the time of writing this post, over 40,000 map data changes have already been made due to suspected issues reported by TAMI.

One of the largest challenges TAMI faces is dealing with a colossal amount of customer data. We have approximately 10 billion individual pings from various customer sources, stored in a Mongo database. Scanning through this vast database, simply inspecting each element, takes days to weeks. This makes simple questions like “which county has the most data?” very difficult to answer, even while leveraging indexing in the database to accelerate queries¹. An important observation is that these aggregate queries used for decision making (e.g., which counties are top candidates to review for map improvements) don’t need to be exactly correct. If we had a random sample of the data, we could query the sample and extrapolate the results to the entire database. As long as we can quantify the accuracy of this extrapolation, we can confidently answer arbitrary aggregate questions about the data much more quickly. Although creating the random sample would be expensive, it could be cached and reused.

This brings us to another complexity of the streaming data: our customers are constantly driving vehicles around, and to capitalize on that, we are constantly adding the new anonymized data they generate to our database (tens of millions of new points are added each day). Because the database is constantly changing, a random sample of it would expire quickly, and regenerating the sample would be prohibitively expensive (even on a week-to-week basis). Another observation, though, is that although the data is constantly changing, a random sample from one day is almost a random sample for the next day. This raises the question of whether a simple operation could update the random sample to reflect the new state of the database, avoiding the need to recreate it from scratch. There is indeed a way to perform this kind of update, and using it to maintain a random sample of the data in tandem with a constantly changing larger database is known as reservoir sampling.

Interestingly, the operation for maintaining the random sample as new data streams in is actually pretty simple. Consider a large database of size N and a random sample (the “reservoir”) of size M. There are only two rules to follow while streaming new data into the database (a short code sketch of the update follows the list):

  • For each new element (the (N+1)th to arrive), also place it in the reservoir with probability M/(N+1).
  • If the new element is selected for the reservoir, evict one of the current reservoir elements, chosen uniformly at random, to make room.
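
Here is that sketch: a minimal Python version of the per-element update. The function and variable names are illustrative rather than TAMI’s production code, and the whole reservoir is assumed to fit in memory as a plain list.

    import random

    def reservoir_update(reservoir, new_element, n, m):
        """Maintain the reservoir as a uniform random sample of size m of the stream.

        reservoir   -- the current sample (a list holding at most m elements)
        new_element -- the (N+1)th element arriving from the stream
        n           -- the number of elements already in the large database (N)
        m           -- the target reservoir size (M)
        """
        if len(reservoir) < m:
            # While the database is smaller than the reservoir, keep everything.
            reservoir.append(new_element)
        elif random.random() < m / (n + 1):
            # Rule 1: keep the new element with probability M/(N+1).
            # Rule 2: evict a uniformly random current member to make room.
            reservoir[random.randrange(m)] = new_element
        return reservoir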

The desired property of a random sample is that each element in the large database has an equal chance of being in the sample (the size of the sample divided by the size of the large database, explicitly M/N in this example). It can be proved by induction that this property is maintained if we follow the above two rules.

Proof by induction

Base case:
While N < M, the reservoir is ‘unfilled’, and as data streams in, all data is put both into the reservoir and into the larger database. Since the reservoir and the larger database are identical during this process, the reservoir is trivially a random sample of the large database. At the point where N = M, the sample still contains all the data in the larger database. Mathematically speaking, each element has a chance M/N = 1 of being in the sample, satisfying the required property of a random sample given above.

Inductive case:
Assuming that we have a random sample of size M and a full database of size N, then after we add one more element from the data stream, each element should have a chance M/(N+1) of being in the sample.

  • Chance of the (N+1)th element being in the reservoir:
    By rule 1, the (N+1)th element has a chance of M/(N+1) of being included in the reservoir.
  • Chance of any other element being in the reservoir:
    The chance that any one of the original N elements is in the reservoir is given by the chance that the element was already in the reservoir before the (N+1)th element arrived, multiplied by the chance that it stays in the reservoir. The chance of staying in the reservoir is the chance that either 1) no reservoir replacement occurs, or 2) a replacement occurs, but some other element in the reservoir is replaced.

Combining these expressions gives the overall chance that one of the original N elements is both 1) in the reservoir, and 2) still in the reservoir after the (N+1)th addition:

(M/N) × [(1 − M/(N+1)) + (M/(N+1)) × ((M − 1)/M)] = (M/N) × (N/(N+1))

This simplifies to M/(N+1), thus proving the induction.

For a more intuitive argument, consider an element early in the stream and an element late in the stream. The early element has a high chance of being included in the reservoir, but also has a high chance of later being evicted. The late element, on the other hand, has a much lower chance of making it into the reservoir, but once it’s in, it also has a much lower chance of being evicted.
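
This uniform-inclusion property is also easy to check empirically. The following rough simulation (mirroring the update sketched earlier, with the stream being just the integers 0 through N−1) streams a sequence through a small reservoir many times and verifies that every element survives at close to the expected rate M/N:

    import random
    from collections import Counter

    def simulate(n=1000, m=50, trials=5000):
        counts = Counter()
        for _ in range(trials):
            reservoir = []
            for i in range(n):  # element i is the (N+1)th arrival, with N = i
                if len(reservoir) < m:
                    reservoir.append(i)
                elif random.random() < m / (i + 1):
                    reservoir[random.randrange(m)] = i
            counts.update(reservoir)
        expected = trials * m / n  # each element should survive in ~M/N of the runs
        worst = max(abs(counts[i] - expected) for i in range(n))
        print(f"expected about {expected:.0f} appearances per element; largest deviation {worst:.0f}")

    simulate()

With M/N = 5%, every element ends up appearing in roughly 5% of the runs, early arrivals and late arrivals alike.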

The next challenge is quantifying how accurately the results of queries on the reservoir can be extrapolated to the large database. There are two primary types of queries one might run on the reservoir. The first uses the sample to estimate an aggregate value (e.g., “what is the average speed of points in Mercer County?”). The second uses the count of results in the reservoir to predict the count of results from the same query on the large database (e.g., “how many points are in Mercer County?”). For the first type, we can fairly easily calculate a mean and a standard deviation from the sample (which has many results) and use these to describe the true statistic in the large database. For the second, it’s a little more challenging to quantify the accuracy, as it’s not immediately obvious how to calculate a standard deviation from the single count value returned.
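
In code, that first calculation is just a mean and a standard error over the sampled values. A minimal sketch, assuming the matching reservoir values (e.g., the speeds of points in Mercer County) have already been pulled into a plain Python list; the names here are hypothetical:

    import statistics

    def estimate_mean(sample_values):
        """Estimate a population mean from reservoir results, with its standard error."""
        mean = statistics.fmean(sample_values)
        stdev = statistics.stdev(sample_values)
        # The standard error of the mean shrinks with the square root of the sample size.
        return mean, stdev / len(sample_values) ** 0.5

    # e.g. mean_speed, err = estimate_mean(mercer_county_speeds)

The count case needs a bit more care, as worked through below.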

Let the true count of results for the query in the large database be Λ (this is the value the extrapolation is trying to recover). For a reservoir of size M, the expected count (call it λ) is given by λ = Λ(M/N). However, since the reservoir is a random sample of the N points, the count in the reservoir will likely not be exactly λ. Let the count actually returned by our reservoir be λ′.

To determine the accuracy of λ′ (i.e. how close it is to λ), we have to consider what would happen if we had many different reservoirs and could run this query on each: every reservoir would return its own count, and that count would likely differ from reservoir to reservoir. The distribution of count values returned by all these theoretical reservoirs would follow a Poisson distribution, with the expected value of the Poisson being λ.

Conveniently, it is a known property of Poisson distributions that the variance is equal to the expected value, and the variance is the standard deviation (σ) squared. Thus, the standard deviation on our count result λ′ is equal to the square root of λ. We don’t know the exact value of λ, but λ′ is a close approximation and can be used instead. The standard deviation on the count returned by the reservoir is therefore equal to the square root of the count itself; explicitly, σ² = λ′. This gives a large relative error when the count is small and a small relative error when the count is large.
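
Putting this together, extrapolating a reservoir count back to the full database looks roughly like the sketch below. The reservoir size of 10 million and the example count are made-up numbers for illustration; the total of 10 billion matches the database size mentioned earlier.

    def extrapolate_count(reservoir_count, n_total, m_reservoir):
        """Scale a reservoir count up to the full database, with a 1-sigma error bar."""
        scale = n_total / m_reservoir
        estimate = reservoir_count * scale           # Λ ≈ λ′ × (N/M)
        std_dev = (reservoir_count ** 0.5) * scale   # σ(λ′) = sqrt(λ′), scaled the same way
        return estimate, std_dev

    # e.g. 2,500 matching points in a 10-million-point reservoir drawn from 10 billion points:
    est, sigma = extrapolate_count(2_500, 10_000_000_000, 10_000_000)
    print(f"estimated {est:,.0f} ± {sigma:,.0f} points in the full database")

Here the relative error is sqrt(2500)/2500 = 2%; with only 25 matching points it would be 20%.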

This leads to a final observation that can be used to improve the process even further. The accuracy of results returned from the reservoir is proportional to the number of results the query returns. The time to run queries with a geometric component (the primary use case) is also proportional to the number of results. This creates an interesting situation: results from the reservoir are accurate when a query is large (exactly when the query would be slow to run on the larger database), and inaccurate when the query is small (exactly when the query would run quickly on the large database). This means that in certain situations it’s ideal to query the reservoir and in others it’s ideal to query the larger database directly. What separates these two situations is simply the size of the response to the query (which itself scales with the total size of the reservoir).

Taking this one step further, we can add multiple reservoirs of varying sizes (each differing by an order of magnitude), each of which has a different required “response size” to ensure accuracy. This allows us to guarantee a certain level of both accuracy and speed in all situations. If a reservoir doesn’t have enough data to accurately answer a query, it will know that very quickly, and we can query a larger reservoir. If no reservoir can give an accurate estimate, the query can run on the main database itself, where the answer is exact and, being a small query, guaranteed to return quickly. This allows us to answer arbitrary questions ranging from “how many points are in the Trimble MAPS parking lot?” to “how many points are in Texas?” quickly and with guaranteed accuracy. The following diagram illustrates this concept with two reservoirs, but the idea generalizes to any number of reservoirs (for TAMI, we’ve found three tiers to be ideal).
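
In code, the tier selection might look roughly like the following sketch. The reservoir objects, their count(query) interface, and the 1,000-match threshold are all hypothetical stand-ins, not TAMI’s actual implementation.

    def tiered_count(query, reservoirs, full_database, n_total, min_matches=1_000):
        """Answer a count query from the smallest tier that is accurate enough.

        reservoirs    -- list of (reservoir, size M) pairs, ordered smallest to largest
        full_database -- fallback source of exact counts
        n_total       -- total number of points in the full database (N)
        min_matches   -- matches required in a tier for an acceptable error bar
        """
        for reservoir, m in reservoirs:
            matches = reservoir.count(query)   # fast, because the tier is small
            if matches >= min_matches:
                scale = n_total / m
                return matches * scale, (matches ** 0.5) * scale   # estimate, 1-sigma error
        # Too few matches in every tier: the exact query is small and therefore fast to run.
        return full_database.count(query), 0.0

Requiring at least 1,000 matches keeps the relative 1-sigma error at or below 1/sqrt(1000), roughly 3%, no matter which tier ends up answering.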

TAMI implements this tiered reservoir sampling solution to answer the data-management questions that the sheer quantity of our customer data would otherwise make impractical². The final cherry on top is that the tiered reservoirs are also what allow us to visualize the anonymized GPS points! The GIS team that investigates the suspected issues found by TAMI needs to be able to visualize the data, and there is often too much of it to display in their QGIS software at any reasonable zoom level. The tiered reservoirs work perfectly for quickly drawing a representative subset of the points. Tiered reservoir sampling has been a great fit for managing such a large amount of streaming data and is a cornerstone of TAMI’s success.

Interested in joining our team at Trimble Maps? Click here!

Footnotes:

¹ Particularly relevant for our needs: we do a lot of geometry-based querying. Simpler indexes can cache ‘count’ values and don’t need to inspect each element in the query, but that’s not as easy with locations. A question such as “how many points are in NJ?” is as difficult as “give me all the points in NJ”. Although most points are ignored (i.e. all points not in NJ), with such a large amount of data, even just the points matching the query can be a vast set.

² Although out of scope for this blog post, there is some additional complexity to TAMI’s tiered reservoirs because TAMI only wants to use current data and archives anything older than two years. The reservoirs must have their data archived as well so they don’t become stale, which means the reservoirs must be altered not only as data is added to the system, but also as data leaves it.
