Representative Sampling: A Statistical Method to Derive Data Insights from Noisy Data Distributions

Neil Palleti
Walmart Global Tech Blog
Mar 25, 2022

Introduction

Extracting customer insights from existing data is vital to a business’s ability to understand purchase trends, construct advertising campaigns, and personalize user experiences. Understanding customers’ past purchase patterns, motivations, store visit counts, and so on can help answer more nuanced questions about complex purchase trends, channel-specific spending behavior, and future product interests.

In the past, the primary way of doing this was to meticulously design and carry out surveys on large numbers of customers. Surveying customers on simple statistics (such as how many times they shop every month) is relatively easy, but extracting more complex metrics, such as short- and long-term change-rate trends or purchase breakdowns by specific category, is nearly impossible via surveys.

With the advent of big data and omni-view metrics tracking customer behavior, however, it’s possible to approach this problem in a more technical manner.

The first step to understanding customer insights is building an individual profile for each customer. In order to construct these individual profiles, we collect and run all customer data through our ingestion and deep learning pipelines. Each of these profiles encapsulates a customer’s past transactions, various metadata, and all their interactions with Walmart. For instance, consider a customer that has purchased products at Walmart stores using two different credit cards, and that same customer has an online account through which they made more transactions. Our deep learning pipeline would aggregate that customer’s in-store and online personas and ultimately create one overall profile that contains all their purchases, interactions with Walmart, and an array of other user metadata.

An example of a profile we create for a customer that used two different credit cards to purchase items in Walmart stores and also has an online Walmart account.

Ideally, we’d be able to derive complex user insights straight from these customer profiles that we create. However, customer data is often quite messy; information from dozens of first-party and third-party data streams, inaccurate data from the customer’s side, and an array of other real-world issues make working with this data challenging. Having a single, accurate view of the customer is an open problem that becomes increasingly difficult as the scale of the business increases. At Walmart, which sees hundreds of millions of customer visits every week, even a small percentage of noisy customer data can make it that much more challenging to extract accurate insights.

Due to noisy and frequently missing data, some of our individual profiles for customers can be merged or fragmented. For instance, due to the nature of some transactions, certain purchases can sometimes have limited associated metadata, which results in the deep learning pipeline having access to sparse feature sets for some of the data. This can lead to fragmented user identities that include only a subset of a customer’s overall information.

One common case of this occurs when a customer uses a credit card to complete an in-store transaction. Let’s say this user also has an online Walmart account, but with a different credit card on their online account. If our data streams fail to fetch a rich set of metadata during the in-store transaction (since by default we don’t receive much information from a credit card transaction), the lack of data would make it difficult to identify the link between that customer’s in-store purchase and their online Walmart account. We would treat the customer’s in-store purchases and online activity as belonging to two different people. When calculating enterprise metrics such as the average number of user interactions with Walmart, treating the same customer as two different personas would negatively skew results. We are frequently onboarding new data streams and making modifications to our deep learning pipeline to reduce fragmentation, but sparsity in real-world customer data makes this process challenging.

Fragmentation: The image on the left represents the true profile that should be constructed for a customer. But due to sparsity in customer data, sometimes it’s impossible for our pipeline to connect all the in-store, digital, and other interactions a customer makes with Walmart. This results in our pipeline creating multiple profiles for the same customer.

Performing analysis to extract customer insights from this set of fragmented individual customer profiles would inevitably skew results, negatively affecting business decisions. We need a way to select a clean subset from this larger, noisy set of profiles that we can trust and use to derive complex insights.

Method Overview

The simplest way to tackle this problem would be to go through each individual profile and manually demarcate the pure profiles from the noisy or fragmented ones. Going through each profile in this fashion is impossible; not only is the scale of profiles immense but it is also difficult even for a human labeler to decide whether a profile is completely pure. A labeler would never know if information were missing or fragmented from a profile without manually going through every other profile created, which is infeasible.

Since manual labeling is not an option, we need another way to validate the purity of profiles. Thanks to surveys, customer reports, and data analysis on clean data streams, we have detailed information regarding simpler metrics such as average visits per customer, channel breakdown (digital vs. brick-and-mortar transactions), etc. These numbers have been verified repeatedly across multiple internal business teams and third-party data sources. For instance, we know that the “average number of visits per customer” statistic closely follows an exponential distribution (and we know its mean). Similarly, we know the distributions for some other simple, customer-level statistics with high confidence.

Evaluating the purity of a single profile is difficult, but if we can utilize these statistics as ground truth, we can verify how pure and trustworthy a subset of customer profiles is. A subset whose aggregate statistics closely follow the above global distributions (which have already been thoroughly verified by various sources) is likely to be extremely pure. Here, a global distribution refers to a statistic calculated and verified on all Walmart customers. We could then use this subset to generate more complex customer insights.

In order to get this pure subset, we can extract subsets of customer profiles from the overall set until we find one whose statistics match the true global distributions that we have access to. The easiest way of doing this is to take random samples of profiles of some arbitrary size until we arrive at a sample that matches the global distributions. However, given the scale and impurity of the profiles, and the fact that we want the subset’s distributions to be close to identical to the global distributions, random sampling is an infeasible approach (taking large random samples from an impure set will just result in impure samples, regardless of how many times you sample).

Rather, we need to construct a more efficient random sampling procedure that can ensure our final subset of customer profiles meets the global distributions. We can call these verified global distributions “constraints,” since our subset needs to meet these statistical constraints. For example, take a look at the “average number of visits” constraint. At a global level, let’s assume the average shopper shops in-store every other week (approximately 25 times annually), and that the distribution follows an exponential curve as mentioned above. Since these findings have been verified at a global level by various sources, we want our subset to closely follow that same distribution. If our subset meets these distributions, we can confidently use it to extrapolate other consumer behaviors, predict future trends, etc.

The key aspect to note here is that individual customer profiles in the final subset we choose can vary greatly, as long as the aggregate distributions of the subset match the global ones. For instance, our final subset can contain a customer profile that has shopped only once in the past year alongside another that has shopped 200 times. Since we don’t place any limitations on individual customer profiles (in fact, this variance is necessary, since we want the subset to mirror the true spread and behavior of the entire set of Walmart customers), we need a method that allows any customer profile to be included in the final subset while simultaneously maintaining the aggregate distributions.

Since each constraint requires the subset of chosen profiles to meet some distribution (e.g., exponential, normal, or often an arbitrary simple discrete probability mass function), we need a way to figure out how well each individual customer profile matches the given constraint. Deriving a score for each constraint for each individual profile would tell us how closely that profile meets a given constraint.

Then, for each profile, we could aggregate each constraint’s “sub-score” to derive a comprehensive score that measures how well a given profile meets all the necessary constraints. Once we have these scores, we can selectively choose profiles with high scores for our final subset. However, exclusively selecting the highest-scoring profiles induces a bias, as there may be hidden correlations between certain constraints and the true distribution of Walmart customers. By taking a weighted random sample using these scores, however, we largely eliminate this bias (discussed further below).

To summarize the above method:

  1. Formulate global constraints as discrete or continuous probability distributions.
  2. For each individual customer profile, generate a sub-score for each constraint.
  3. Aggregate the sub-scores to get one overall score for each profile.
  4. Use these scores to perform weighted random sampling and extract a sample of some arbitrary size.
  5. Evaluate the sample to ensure it meets the given constraints, and repeat as necessary.

We call this final sample a “representative sample” — a smaller set of profiles that represents the true global distributions of various attributes, and we can utilize it to extract more complex customer insights.
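To make the flow concrete, here is a minimal Python sketch of steps 1–4 (the profile structure, scorer functions, and numbers are purely illustrative and not our production pipeline; the validation in step 5 is omitted):

```python
import numpy as np

def representative_sample(profiles, constraint_scorers, sample_size, seed=0):
    """Score each profile against every constraint, multiply the sub-scores,
    normalize into probabilities, and take a weighted sample without replacement."""
    rng = np.random.default_rng(seed)
    scores = np.array([
        np.prod([score(profile) for score in constraint_scorers])
        for profile in profiles
    ])
    probs = scores / scores.sum()  # selection probability for each profile
    chosen = rng.choice(len(profiles), size=sample_size, replace=False, p=probs)
    return [profiles[i] for i in chosen]

# Toy usage: profiles as dicts, one hypothetical constraint scorer
# (exponential density with mean 25 for the "annual store visits" constraint).
profiles = [{"visits": v} for v in (1, 5, 20, 60, 200)]
scorers = [lambda p: np.exp(-p["visits"] / 25) / 25]
print(representative_sample(profiles, scorers, sample_size=2))
```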

Implementation

The first step in the weighted random sampling approach is to design a way to generate sub-scores for each constraint (for each customer profile). To do this, we formulate each constraint as some type of distribution that we can then use to score customer profiles. Most of our constraints can be summarized in one of two ways: discrete probability mass functions or continuous probability density functions. The easiest way to explain this is to walk through an example of each.

Taking the previously mentioned “average number of annual store visits” constraint, we know that this value follows an exponential distribution with a verified mean value (for privacy purposes, let’s assume this mean is 25 visits per year). We can then use this information to score each customer profile based on how many visits each profile made.

Average Number of Annual Store Visits Constraint

To score a customer profile, calculate how many times that profile has visited a store in the past year (using transactions and events data) and plug that value into this distribution to get a probabilistic score. For example, a customer profile that has shopped 5 times in the past year would get a score much higher than a profile that has shopped 100 times. (Note: since this is a continuous distribution, the actual probability at any point is 0. So, we use the probability density at particular values as the relative scores).
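As a quick sketch (assuming the illustrative mean of 25 visits per year and using scipy), the sub-score is just the exponential density evaluated at a profile’s visit count:

```python
from scipy.stats import expon

MEAN_VISITS = 25  # illustrative verified global mean from the example above

def visits_sub_score(annual_visits):
    # Probability density of an exponential distribution with mean 25,
    # evaluated at this profile's annual visit count.
    return expon.pdf(annual_visits, scale=MEAN_VISITS)

print(visits_sub_score(5))    # ~0.033  -> relatively high sub-score
print(visits_sub_score(100))  # ~0.0007 -> much lower sub-score
```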

The “number of store visits” constraint has a large number of discrete buckets — one for every possible number of visits. Other constraints, however, can be quite different. For instance, another important global constraint to analyze is the overall channel breakdown of customers, i.e. what percentage of customers shop online vs. in physical stores vs. both online and in-person. This constraint only has three buckets, so fitting a continuous probability distribution is not appropriate. Rather, given the ease of stratification here, using the true ratios to score each customer profile is simpler and more representative. Let’s say the ratio of online to physical to both online and physical is 20%:40%:40%. Each customer profile would then be scored according to the bucket they fall into (a score of 0.2 to a profile if it is an online profile, 0.4 if it is an in-store profile, and 0.4 if it is both a digital and physical-store profile).
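The corresponding scorer for a bucketed constraint like this one could be as simple as a lookup of the target share (using the illustrative 20%/40%/40% split above):

```python
# Illustrative target shares for the "channel breakdown" constraint.
CHANNEL_TARGETS = {"online": 0.2, "in_store": 0.4, "both": 0.4}

def channel_sub_score(channel_bucket):
    # The sub-score is simply the target share of the bucket the profile falls into.
    return CHANNEL_TARGETS[channel_bucket]

print(channel_sub_score("online"))  # 0.2
print(channel_sub_score("both"))    # 0.4
```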

Similarly, we generate a sub-score for all other constraints for each customer profile.

Sub-Score Normalization

Using this method, we can generate probabilistic sub-scores for all customer profiles based on their metadata. However, impurities present in our original global set of customer profiles pose an additional challenge. Specifically, given the noisy and sparse nature of customer data, instead of building one unique customer profile for each true consumer, sometimes there are multiple profiles created for a single shopper. Scoring and randomly sampling from this fragmented set of profiles would lead to a final representative sample that fails to meet our global truths.

We use a deep learning pipeline to prevent the fragmentation of profiles, but despite our best efforts, a minority of profiles still end up fragmented due to the lack of customer data. Even this minority of fragmented profiles can skew statistics. For instance, consider a customer that uses three different credit cards over the course of a year and shops 10 times at Walmart stores with each credit card. Sometimes, due to the lack of data, we treat this customer as three different shoppers, each of whom visits Walmart stores 10 times, even though in reality there is only one customer who shops 30 times. Due to fragmentation, the “number of store trips” distribution gets skewed toward lower values (fragmented profiles show fewer store trips than the customer actually made), which pulls the observed average down.

We need to remove the effects of these fragmentation-related impurities during the sampling process. The best way to do this is to cancel out the effects of the current global distribution by normalizing by the size of each bucket in the global set of customer profiles. The easiest way to understand this is with a simple “balls in a bag” example. Imagine you have a bag of a hundred balls, 10 red and 90 blue, and you want to select 10 total balls, 5 red and 5 blue, using weighted random sampling. Similar to the “channel breakdown” percentage constraint, the target breakdown here is 50%:50%, but assigning a weight of 0.5 to every red and blue ball would lead to a sample that is just as skewed as the original bag (1 red and 9 blue in expectation). To remove the initial skewness, we need to divide the sub-score by the relative frequencies of the red and blue balls: each red ball gets a score of 0.5 / 0.1 = 5, and each blue ball gets a score of 0.5 / 0.9 ≈ 0.56. Then, by using these scores to do weighted random sampling, the final sample should, in expectation, contain roughly 5 red and 5 blue balls.
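A small numpy check of the balls-in-a-bag example (illustrative only) shows how dividing by the bag’s own frequencies undoes the initial skew:

```python
import numpy as np

bag = np.array(["red"] * 10 + ["blue"] * 90)

# Normalized weight = target share / current share in the bag.
weights = np.where(bag == "red", 0.5 / 0.1, 0.5 / 0.9)
probs = weights / weights.sum()

# After normalization, the two colors carry equal total probability mass,
# so a weighted draw picks red or blue with probability ~0.5 each.
print(probs[bag == "red"].sum(), probs[bag == "blue"].sum())  # ~0.5 ~0.5

# Sanity check with 10,000 weighted draws (with replacement).
rng = np.random.default_rng(0)
print((rng.choice(bag, size=10_000, p=probs) == "red").mean())  # ~0.5
```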

In the “channel breakdown” constraint, we would divide the initial sub-scores of 0.2, 0.4, and 0.4 for online, in-store, and both online and in-store respectively, by their corresponding percentages in the initial global population. For example, if 8% of the global set of customer profiles contains customers that shopped exclusively online, then we would give a score of 0.2 / 0.08 = 2.5 to all those customer profiles.

Similarly, for the “number of store visits” constraint, consider all customer profiles that visited a store 3 times: the initial sub-score is the probability density of the exponential distribution at the value 3. We then divide this sub-score by the total number of customer profiles in the global set that also visited a store 3 times, in order to remove the negative effects that data impurities have on the global set of profiles. Essentially, the final sub-score we give to each profile is the target score divided by the global frequency.
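A sketch of this normalization for both constraints, using a tiny hypothetical global set of profiles (buckets, counts, and targets are all made up):

```python
from collections import Counter
from scipy.stats import expon

# Hypothetical global set of (possibly fragmented) customer profiles.
profiles = [
    {"channel": "online",   "visits": 1},
    {"channel": "in_store", "visits": 3},
    {"channel": "both",     "visits": 3},
    {"channel": "both",     "visits": 40},
    {"channel": "in_store", "visits": 12},
]

CHANNEL_TARGETS = {"online": 0.2, "in_store": 0.4, "both": 0.4}
MEAN_VISITS = 25

channel_freq = Counter(p["channel"] for p in profiles)  # bucket counts in the global set
visits_freq = Counter(p["visits"] for p in profiles)
n = len(profiles)

def normalized_sub_scores(profile):
    # Channel: target share divided by the bucket's share of the global set.
    channel = CHANNEL_TARGETS[profile["channel"]] / (channel_freq[profile["channel"]] / n)
    # Visits: exponential density at this visit count, divided by how many
    # global profiles share that same visit count.
    visits = expon.pdf(profile["visits"], scale=MEAN_VISITS) / visits_freq[profile["visits"]]
    return channel, visits

print(normalized_sub_scores(profiles[1]))  # normalized (channel, visits) sub-scores
```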

Sub-Score Aggregation and Weighted Random Sampling

Next, we need to aggregate each of these sub-scores to get one overall score for each profile. Since the sub-scores encapsulate probabilities associated with various constraints, we can aggregate them in a similar fashion to how probabilities are typically aggregated. When dealing with independent variables, a joint probability can be calculated by multiplying the individual probability values. In our first iteration of generating the representative sample, we treat our constraints as independent (discussed further below). Therefore, to derive one aggregate score for a customer profile, we multiply the profile’s sub-scores. So if a profile got a sub-score of 0.7 for the “number of visits” constraint, and a sub-score of 0.2 for the “channel breakdown” constraint, the profile’s final aggregate score would be 0.14.
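In code, the aggregation is a simple product over a profile’s (normalized) sub-scores; the numbers here are the illustrative ones from above:

```python
import math

sub_scores = {"visits": 0.7, "channel": 0.2}    # illustrative normalized sub-scores
overall_score = math.prod(sub_scores.values())  # constraints treated as independent
print(round(overall_score, 2))  # 0.14
```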

The next step is to use these scores to perform weighted random sampling. To do that, we first normalize the scores of all customer profiles such that they sum to 1. The normalized scores represent the true probability that each customer profile is picked in the final sample. Using these probabilities, we can then generate a random sample (without replacement, since we don’t want the same customer profile to be present multiple times in the final sample).

Here is a simple example of aggregating the scores and converting them into probabilities; we would then use these normalized scores to select a random sample.
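A minimal sketch of that step with made-up overall scores for five profiles:

```python
import numpy as np

# Illustrative aggregate scores for five customer profiles.
scores = np.array([0.14, 0.05, 0.30, 0.02, 0.09])

# Normalize so the scores sum to 1; these become selection probabilities.
probs = scores / scores.sum()
print(probs.round(3))  # [0.233 0.083 0.5   0.033 0.15 ]

# Weighted random sample of 3 profile indices, without replacement.
rng = np.random.default_rng(42)
print(rng.choice(len(scores), size=3, replace=False, p=probs))
```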

An alternative to performing weighted random sampling would be to just take the top N highest-scoring profiles as our final representative set. This method, however, has several drawbacks. First, taking just the top-scoring profiles would induce unwanted biases into our final set. For example, it could be the case that customers in the state of California meet the “number of store visits” constraint much better than customers in other states and therefore have higher aggregate scores. Choosing just the top N scoring profiles would then result in a final set that only contains customers from California, which is far from representative of the true nature of the global set of customers. Second, each constraint can be viewed as a set of buckets. For instance, the “channel breakdown” constraint has three buckets (in-store shoppers, online shoppers, and shoppers who buy both digitally and in-store). Customer profiles receive different sub-scores based on the bucket they fall into (sticking with the sample numbers used above, a score of 0.2 for an online profile, 0.4 for an in-store profile, and 0.4 for a profile that is both digital and physical-store). If we choose the top N scoring profiles, customers that only have an online presence would never make it into our final sample, since they receive a lower score. Since we want to meet the true ratios/distributions for each constraint, and since we want to maintain the true global data distribution, we need to take a weighted random sample rather than choosing the top N scoring profiles.

The last step is to verify how well the final random sample meets the aggregate statistics. For constraints with fewer buckets, such as the channel breakdown one, we can easily measure the differences in ratios for each bucket. For constraints that are more continuous in nature, such as the number of store visits, we can calculate the difference between the global distribution and that of the representative sample using various methods to decide whether the final sample is acceptable. The simplest way would be to analyze the means of the distributions and verify that they are identical. For a more accurate sample, we would need to analyze differences in the entire shapes of the distributions (either visually if a rough similarity suffices or mathematically using measures like KL divergence if a more precise match is required).
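A rough illustration of the kind of check we might run for the visits constraint, comparing bucketed histograms by their means and by KL divergence (toy numbers, not our actual test):

```python
import numpy as np
from scipy.stats import entropy

def distribution_gap(global_counts, sample_counts):
    """Compare a sample's bucketed histogram against the verified global one."""
    g = np.asarray(global_counts, dtype=float)
    s = np.asarray(sample_counts, dtype=float)
    g, s = g / g.sum(), s / s.sum()
    buckets = np.arange(len(g))
    mean_gap = abs(np.average(buckets, weights=g) - np.average(buckets, weights=s))
    kl = entropy(s, g)  # KL(sample || global); 0 means identical shapes
    return mean_gap, kl

# Toy histograms of annual store visits (bucket i = i visits).
print(distribution_gap([50, 30, 12, 5, 3], [48, 31, 13, 5, 3]))
```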

The feasible size of the representative sample primarily depends on two factors: the number of constraints and the acceptable margin of error for each constraint. The more constraints that are required to be met, or the smaller the acceptable margin of error, the smaller the entire representative sample will be. On the other hand, if the requirements are more relaxed, it would be possible to generate a larger sample.

Results

For our use case, we had a set of half a dozen constraints but a small acceptable margin of error. Using this method, our team was able to generate a representative sample of several million customer profiles that matched the global ground truths to within a 2% error for each constraint. With our representative sample, we’re now able to extract complex enterprise metrics for this smaller set of customer profiles and extrapolate the results to a global level for all Walmart customers. Analyzing demographics, transaction behavior, purchase trends, etc. on this smaller, more trustworthy representative sample gives us an accurate picture of how these metrics look at a global level, where profiles are messier due to the nature of real-world data. We simply scale up the metrics from the representative sample to the global population. For instance, if the representative sample consists of 10 million customer profiles, we would multiply any metrics calculated on this set by a scale-up factor to get the metric value at the global level. This scale-up factor is simply the total number of customers divided by the size of the representative sample (in our case, 250 million / 10 million = 25). So, taking a hypothetical example, if we know that 100 thousand households in the sample purchased jeans in February, at a global level we could surmise that 2.5 million households bought jeans that month.
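The extrapolation itself is just the scale-up arithmetic from the hypothetical example above:

```python
# Hypothetical numbers from the example above.
total_customers = 250_000_000
sample_size = 10_000_000
scale_up = total_customers / sample_size      # 25.0

jeans_households_in_sample = 100_000
print(jeans_households_in_sample * scale_up)  # 2,500,000 households globally
```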

Conclusion

Generating a representative sample allows for easier, more accurate calculations of complex enterprise metrics that are difficult to attain when working with real-world data. With this method, we now have a better way of tracking trends, behavior, and aggregate statistics at the global level!

One challenge we have yet to address is our assumption that the constraints are independent. In most cases, constraints are almost certainly dependent on each other. For instance, the “average number of store visits per customer” statistic depends on whether the customer profile is classified as online, in-store, or both. Coming up with a joint probability distribution that involves both continuous and discrete constraints, each of which follows its own separate distribution (normal, exponential, etc.), is challenging. This is why, in our first iteration, we generated a sample by treating each constraint as independent and were still able to create a sample that met our constraints within a small margin of error. Moving forward, as we incorporate more constraints into the sampling process, we aim to generate more accurate scores by avoiding this independence assumption.

Another important point to note is that constraints don’t necessarily have to be accurate or encapsulate the ground truth. In our case, the constraints are carefully calculated ground-truth statistics that we aim to meet in an effort to create a sample that is as close to the truth as possible. (As we improve the purity and reduce fragmentation in our customer profiles, we’ll be able to generate more accurate ground-truth statistics from these profiles directly, without relying on surveys and customer reports from business teams.) But they can be any desired distributions you want the sample to meet based on your business needs, which opens up many more possible use cases for the representative sampling method.

Acknowledgements

Many thanks to Mridul Jain, Saigopal Thota, Rijul Magu, Antriksh Shah, Vijendra Kulhade, Parima Jain, Puja Maniktala, and Achyutha Kalal for helping with the algorithm design and engineering efforts.

References

1. Byrum, J.D. “The Grocery List: Why 140 Million Americans Choose Walmart.” Walmart Corporate, 3 Oct. 2016.

2. Efraimidis, Pavlos, et al. “Weighted Random Sampling.” Research Academic Computer Technology Institute, 2005.

3. Thota, Saigopal, et al. “Building Identity Graphs over Heterogeneous Data.” Conference Cast, Spark + AI Summit 2020, 25 June 2020.
