Sampling — the good, the bad, and the ugly

Aniruddha (i0exception) · Oct 16, 2019 · 7 min read

To paraphrase Benjamin Franklin: “Those who give up essential accuracy for temporary speed deserve neither speed nor accuracy.” In the real world, however, you frequently have to trade one off for the other. This tradeoff becomes increasingly appealing as your data volume grows. In this post we’ll discuss some common sampling techniques and ways to get the most out of your data, especially as it relates to product or behavioral user analytics.

Before we look at how to sample, it’s important to understand what the data being sampled looks like. In most cases you’re going to collect events: immutable records of user interactions. Each event typically has a timestamp and a user identifier associated with it, and optionally some metadata. Once you add instrumentation to your apps or websites (or use a tool that automatically collects everything), these events are generated in response to every user interaction. So, if you’re already generating these events — why sample?

There are two main reasons why you might want to consider looking at a smaller subset of your data for insights — speed and cost. In some cases, you can get faster results by spending more on computation. However, not all computations are infinitely parallelizable; beyond a point, the only way to get results faster is to look at less data.

How to sample

Whether you decide to sample by dropping data during collection or at query time, how you choose to ignore data matters. It’s important to have the sampling be random — otherwise you’ll run into sampling bias, which makes analysis hard. There are a few ways to do this.

Sample every event

This is the most naive way to sample data, but it works well if you only care about aggregates. Here, the decision to sample is independent of the event or user being tracked. When you run aggregates, you can multiply by the inverse of the sampling factor to get an approximate value. The biggest drawback of this approach is that any kind of analysis that depends on a sequence of events (like a funnel report) is usually incorrect.
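
To make the math concrete, here’s a minimal sketch of per-event random sampling and the corresponding extrapolation (the 10% rate and the event shape are made up for illustration):

    import random

    SAMPLE_RATE = 0.1  # keep roughly 1 in 10 events (illustrative value)

    def should_keep(event):
        # The decision is independent of the user and the event type.
        return random.random() < SAMPLE_RATE

    def estimated_total(sampled_count, sample_rate=SAMPLE_RATE):
        # Multiply by the inverse of the sampling rate to approximate the true count.
        return sampled_count / sample_rate

    events = [{"user_id": i % 50, "name": "page_view"} for i in range(10_000)]
    kept = [e for e in events if should_keep(e)]
    print(f"kept {len(kept)} of {len(events)}; estimated total: {estimated_total(len(kept)):.0f}")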

Sample high volume events

Not all events are created equal. You are likely to have a few outliers that contribute the most to event volume. Randomly sampling these outliers and collecting all other events unsampled works well in practice. This has the same drawbacks as the previous approach, but the impact is restricted to analysis that spans the outliers.
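
A sketch of what per-event-type rates might look like; the event names and rates here are hypothetical, and anything without an explicit rate is kept in full:

    import random

    # Hypothetical per-event-type rates; anything not listed is kept in full.
    SAMPLE_RATES = {
        "impression": 0.05,  # very high volume, keep ~5%
        "scroll": 0.10,      # high volume, keep ~10%
    }

    def should_keep(event):
        return random.random() < SAMPLE_RATES.get(event["name"], 1.0)

    def weight(event):
        # Weight to apply when aggregating, so sampled event types still count fully.
        return 1.0 / SAMPLE_RATES.get(event["name"], 1.0)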

Sample all users

For user analytics, every event is likely to have an associated user identifier. This represents the individual you are tracking information about. The goal of this approach is to keep all activity for a small sample of users and discard all activity for the others. If the user identifiers you keep data for are selected at random, you can extrapolate the results to get accurate numbers. The good part about this approach is that it works for aggregates as well as for any analysis that depends on a sequence of events, so long as it is per-user.
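
One common way to pick the user sample is to hash the user identifier into a bucket, which keeps the decision deterministic and stable. A minimal sketch, with an arbitrary 10% rate:

    import hashlib

    SAMPLE_PERCENT = 10  # keep all activity for ~10% of users (illustrative value)

    def user_in_sample(user_id: str) -> bool:
        # Hashing makes the decision deterministic: the same user is always in
        # (or out of) the sample, across events, sessions, and time.
        bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
        return bucket < SAMPLE_PERCENT

    def should_keep(event) -> bool:
        # Keep every event for sampled users, so per-user sequences stay intact.
        return user_in_sample(event["user_id"])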

Sample high volume events by user

Here, we take the good parts of the previous two approaches and combine them: we sample by user, but restrict the sampling to high-volume events. Any analysis you do on events that don’t involve the outliers gets full fidelity, whereas anything done across the outliers still has accurate numbers as long as the analysis is per-user. Most major analytics providers let you do this.
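
Combining the two decisions might look something like this sketch (the outlier event names and the rate are assumptions):

    import hashlib

    HIGH_VOLUME_EVENTS = {"impression", "scroll"}  # hypothetical outlier event names
    SAMPLE_PERCENT = 10

    def user_in_sample(user_id: str) -> bool:
        return int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100 < SAMPLE_PERCENT

    def should_keep(event) -> bool:
        # Low-volume events are always kept; high-volume events are kept only for
        # users in the sample, so per-user analysis across them stays consistent.
        if event["name"] not in HIGH_VOLUME_EVENTS:
            return True
        return user_in_sample(event["user_id"])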

Sample by users but always track certain populations

This is basically the same as the previous approach, except you have some way to always track a specific set of users based on some pre-defined criteria. Say, you want to track everything about users who pay you more than $1000 per month — you can do that by always including every event for any of these users. This works well if this sample is relatively small compared to the rest of your user base and you are careful about addressing the edge cases of tracking when someone enters or exits this special population.
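
Layered on top of whatever sampling rule you already use, the override can be simple; in this sketch the ids and the $1000/month criterion are hypothetical:

    ALWAYS_TRACK = {"acct_8021", "acct_9377"}  # hypothetical ids, e.g. accounts paying > $1000/month

    def should_keep(event, base_decision: bool) -> bool:
        # base_decision: whatever your normal sampling rule decided for this event.
        # Members of the always-track population override it and are always kept.
        if event["user_id"] in ALWAYS_TRACK:
            return True
        return base_decision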

User Identity Management

Before we look at where to sample, it’s important to look at one of the biggest challenges with user-based sampling. These days, people have multiple devices on which they might interact with your application or website. In addition, users might do the bulk of their activity while logged out and only identify themselves when they have to. This flexibility makes it hard to decide which user identifiers should be in the sample and which shouldn’t — mainly because it’s a chicken-and-egg problem: the sampling decision is keyed on a user identifier, but you often don’t know which person an event actually belongs to until much later.

Anonymous and Logged In activity

For most applications and websites, some kind of user identification is required to interact with the product. There are exceptions (as we’ll see later), but for a majority of products, login acts as the great filter. Once a user logs in, the decision of whether to include them in the sample can mostly be made on the basis of whatever unique identifier your database has for that user. It’s also possible to tie any of their anonymous activity back to the identified user, based either on heuristics or actual knowledge (if someone logs in on a device where you previously tracked anonymous activity, there is a good chance that the anonymous activity belongs to the logged-in user).
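
A toy illustration of that heuristic, assuming events carry a user_id when logged in and an anonymous_id otherwise (both field names are assumptions):

    # Toy alias map: once a login is seen on a device, the device's anonymous id is
    # linked to the logged-in user id, so earlier anonymous activity (and the
    # sampling decision) can be attributed to that user after the fact.
    aliases = {}  # anonymous_id -> user_id

    def record_login(anonymous_id, user_id):
        aliases[anonymous_id] = user_id

    def effective_id(event):
        # Prefer the logged-in id; otherwise fall back through the alias map.
        if event.get("user_id"):
            return event["user_id"]
        return aliases.get(event["anonymous_id"], event["anonymous_id"])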

However, there are entire verticals where the bulk of your users are going to be anonymous. Travel, e-commerce, search, video, etc. see a lot of anonymous activity, and users don’t necessarily identify themselves during an interaction with your product.

Users with multiple accounts

The other challenge that some products run into is users having different identifiers on different platforms. You might use a phone number to identify users in the app and an email address to identify them on the website. Additionally, you might also allow your users to sign in with social media accounts. Tying all of this activity back to the same user typically requires some kind of heuristic or best-effort matching, and it usually happens long after you’ve been tracking information under these different identifiers.

Where to sample

User identity management plays a big role in determining the utility of your approach to sampling, because once you decide how you want to sample your data, the other important decision you’ll have to make is where to do this.

Broadly speaking, you have 2 choices —

  1. collect everything and sample when you run queries.
  2. drop data during collection and run queries on the sampled data.

If you’re considering sampling as a way to reduce costs, it’s helpful to understand the 3 types of costs associated with data —

Collection costs are those associated with tracking and processing the data all the way up to the point where you can decide whether to include the event in the sample or not.

Storage costs are what you pay for keeping the data around at rest. These costs compound over time as the data footprint increases.

Query costs are what you pay for processing the sampled data to get meaningful insights.

Here’s what the costs look like based on the approach you take:

+----------------------+------------+---------+---------+
| Option               | Collection | Storage | Query   |
+----------------------+------------+---------+---------+
| Sample at query      | Full       | Full    | Sampled |
| Sample at collection | Full       | Sampled | Sampled |
+----------------------+------------+---------+---------+

Sample at query

This is a little more expensive than sampling at collection because you pay full storage costs, and your collection costs might be a little higher depending on how early in your collection pipeline you can determine whether a user falls in the sample. However, it has very few drawbacks: because you don’t drop any data, you can merge user activity as and when you discover connections between anonymous activity, logged-in activity, and multiple accounts belonging to the same person. If you can afford to, collect everything.
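
Sampling at query time usually just means applying the same hash-bucket predicate while scanning the full, stored dataset and scaling the result back up. A rough Python sketch of the idea (in a warehouse you would express the predicate in SQL; the rate is arbitrary):

    import hashlib

    SAMPLE_PERCENT = 10  # scan ~10% of users per query (illustrative value)

    def in_query_sample(user_id: str) -> bool:
        return int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100 < SAMPLE_PERCENT

    def approximate_count(events) -> float:
        # The full dataset is stored; the sampling predicate is applied while
        # scanning, and the result is scaled back up to the full population.
        kept = sum(1 for e in events if in_query_sample(e["user_id"]))
        return kept * (100 / SAMPLE_PERCENT)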

Sample at collection

If you absolutely must drop data, there are a few ways to try to minimize the impact —

Sampling Technique

Sample only the high-volume events, keyed by user identifier, and keep everything else at full fidelity. This limits the impact of sampling to analyses that involve the high-volume outliers.

Anonymous vs. Logged In users

If your product has low anonymous activity and most users identify themselves before any interaction, you might be able to get by with keeping a copy of all the anonymous data and only sampling data for users who have identified themselves. This gives you full visibility into anonymous activity and, at the same time, any analysis that spans anonymous and logged-in usage remains correct.

If your product has high anonymous activity and low logged-in activity, flip the two — sample all the anonymous data and keep all the logged-in activity around for analysis. The drawback of doing this is that analysis spanning anonymous and logged-in usage will be incorrect.
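
A sketch covering both variants, again assuming a user_id field for logged-in events and an anonymous_id for the rest (field names and the rate are assumptions):

    import hashlib

    SAMPLE_PERCENT = 10
    LOW_ANONYMOUS_ACTIVITY = True  # flip for products where most activity is anonymous

    def in_sample(identifier: str) -> bool:
        return int(hashlib.sha256(identifier.encode("utf-8")).hexdigest(), 16) % 100 < SAMPLE_PERCENT

    def should_keep(event) -> bool:
        is_anonymous = event.get("user_id") is None
        if LOW_ANONYMOUS_ACTIVITY:
            # Keep every anonymous event; sample the logged-in users.
            return True if is_anonymous else in_sample(event["user_id"])
        # Keep every logged-in event; sample the anonymous devices instead.
        return in_sample(event["anonymous_id"]) if is_anonymous else True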

Unfortunately, for most other cases, sampling at collection results in either incomplete or inaccurate data and there’s no real way to counter that.

Conclusion

Sampling plays an important role in improving the speed of analysis and, in some cases, reducing costs. If you can afford to pay the additional processing and storage costs, always sample while running queries. The costs for running queries are typically much higher than the other two, especially as you scan the data multiple times for different kinds of analysis. Keeping a full copy of all the data also lets you use that data for any kind of analysis that involves machine learning or statistical modeling. It also lets you run exploratory analysis on a sample while still retaining the ability to run more important queries on the full dataset.

If you absolutely can’t afford to keep a full copy, try to minimize the impact of sampling by limiting how much the user identity management challenges above affect your choice of sampling technique, and make sure you factor in all the corner cases when interpreting the results of your queries.


Aniruddha (i0exception). Currently eng @mixpanel. Previously @twitter, @google.