SnackNation Tasting Panel Performance: Upsampling and Hypothesis Testing
At its core, SnackNation’s business relies on providing boxes of healthy, high-quality snacks to customers. Our boxes can include up to 15 different snacks every month, and with a variety of box options available, a fundamental question we constantly have to ask ourselves is: “What snacks should we be including in this month’s boxes?”
To answer this, our Brands team runs an internal tasting panel that taste tests and vets all potential snacks to check for quality and whether or not they’re snacks our customers would enjoy. The panel participants rate each snack on a variety of weighted factors (such as taste and texture) which then get aggregated into an overall snack score. A snack is accepted into or rejected from our snack boxes depending on whether or not its aggregated snack score passes a certain threshold. This tasting panel is critical to our ability to effectively curate boxes and ensures that we are sending out the best product we can each and every month. As such, we have an intrinsic interest in making sure that our tasting panel represents the tastes and preferences of our customers as closely as possible.
This is admittedly a difficult problem. Our tasting panel is small compared to the size of our customer base. Since we review dozens of up-and-coming snacks each month and partner with hundreds of brands a year, there is a risk that our tasting panel might miss out on a product our customers would love, or fall head over heels for a product our customers think is just okay. As a Business Intelligence team, part of our job is to help the Brands tasting panel minimize this risk and devise ways to intelligently adjust how the tasting panel evaluates products so it accurately represents our customers.
One of the ways we do this is by conducting hypothesis tests on customer feedback data and tasting panel ratings. Our aim is to compare the mean score our tasting panel gave a product (x̅_tp) to the mean score our customers gave the product as feedback after they received it in their box (x̅_cf). More specifically, we are interested in seeing if there is a statistically significant difference between these means. If a difference does exist, it would suggest our tasting panel was incorrect in its rating.
More formally, we can state the null and alternative hypotheses (H₀, H₁) for testing:
H₀: μ_tp - μ_cf = 0 → μ_tp = μ_cf
H₁: μ_tp - μ_cf ≠ 0 → μ_tp ≠ μ_cf
H₀ interpretation: The mean scores for a product from the tasting panel and customers are the same
H₁ interpretation: The mean scores for a product from the tasting panel and customers are different
In this case, if we fail to reject the null hypothesis, it would imply our tasting panel correctly predicted how much our customers liked the product. Rejecting the null hypothesis would imply that the panel was incorrect.
In order to conduct this testing, we opt to use Welch’s t-test, which assumes the variances of our tasting panel and customer ratings are unequal. The t-statistic (t) and degrees of freedom (ν) are defined as:

t = (x̅_tp - x̅_cf) / √(s²_tp/N_tp + s²_cf/N_cf)

ν = (s²_tp/N_tp + s²_cf/N_cf)² / [(s²_tp/N_tp)²/(N_tp - 1) + (s²_cf/N_cf)²/(N_cf - 1)]

Where x̅ is the sample mean, s² is the sample variance, and N is the sample size. Using the t-statistic and degrees of freedom, a p-value can then be calculated to make a decision about the null hypothesis. (More on this calculation can be found here)
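As a minimal sketch of the test itself, the snippet below runs Welch’s t-test with scipy and recomputes the t-statistic and Welch–Satterthwaite degrees of freedom by hand from the formulas above. The scores are hypothetical, made up purely for illustration; the actual panel and customer data are not shown in this post.

```python
import numpy as np
from scipy import stats

# Hypothetical scores (illustrative only, not real SnackNation data)
panel_scores = np.array([7.5, 8.0, 6.5, 7.0, 8.5, 7.5])           # tasting panel
customer_scores = np.array([6.0, 6.5, 7.0, 5.5, 6.0, 7.5, 6.5,
                            5.0, 6.0, 7.0, 6.5, 5.5])             # customer feedback

# equal_var=False selects Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(panel_scores, customer_scores, equal_var=False)

# Recompute the same statistic manually from the formulas
m1, m2 = panel_scores.mean(), customer_scores.mean()
v1, v2 = panel_scores.var(ddof=1), customer_scores.var(ddof=1)   # sample variances s²
n1, n2 = len(panel_scores), len(customer_scores)

t_manual = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# Welch–Satterthwaite degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)
```

If p_value falls below the chosen significance level, we reject H₀ and conclude the panel and customers scored the product differently.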
Running this test on each of our products and evaluating the p-values to come to a decision about the null hypothesis would be our next step, but we face a problem. Because our tasting panel sample size is much smaller than our customer data, our tests have low statistical power.
More can be read about power here, but to briefly summarize, power is the probability that a test does not commit a type II error (failing to reject the null hypothesis when the alternative hypothesis is in fact true).
From a probability theory perspective, power (π) is defined as:
π = 1 - Pr(fail to reject H₀ | H₁ is true)
For our application, we want a high power level: the higher the power, the lower the chance of our test indicating our tasting panel made a correct decision when in actuality it did not. Generally, three factors contribute to a test’s power:
- The set level of significance
- The effect size (for our tests, the magnitude of the difference between the two mean scores for a product)
- The sample size of the tasting panel and customer data for a product
In our case, the level of significance is fixed, but the effect size and sample size vary on a product-by-product basis. While we can’t control our effect size, as it’s intrinsic to the given data, we can artificially increase our tasting panel sample size (N_tp) and thus increase the power of our testing.
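To see how power grows with N_tp, here is a small Monte Carlo sketch: under an assumed true difference between the panel and customer means, we repeatedly draw samples, run Welch’s test, and count how often H₀ is rejected. The baseline mean (7.0), true difference (0.5), standard deviation (1.5), and customer sample size (200) are all hypothetical values chosen for illustration, not figures from our data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(n_tp, n_cf=200, true_diff=0.5, sd=1.5,
                    alpha=0.05, n_sims=2000):
    """Monte Carlo estimate of the Welch test's power for a given
    tasting-panel sample size, assuming a true mean difference."""
    rejections = 0
    for _ in range(n_sims):
        tp = rng.normal(7.0 + true_diff, sd, n_tp)   # panel scores under H1
        cf = rng.normal(7.0, sd, n_cf)               # customer scores
        _, p = stats.ttest_ind(tp, cf, equal_var=False)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# Power rises steeply with N_tp at first, then flattens out
powers = {n: estimated_power(n) for n in (10, 50, 200)}
```

Running this shows power climbing as N_tp grows, which is exactly the lever the upsampling approach below exploits.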
These next few steps rely on the assumption that if we were to increase the sample size of our tasting panel, the distribution would remain the same.
While there are a variety of ways to simulate our tasting panel, we ultimately settled on simulating data based on the distribution given by the original tasting panel scores. More precisely, we simulate a truncated normal distribution given the mean and standard deviation of the initial tasting panel respondents for a product. From here, we append a random sample of values from the simulated distribution to our tasting panel data and then conduct hypothesis testing on the original customer feedback and the now upsampled tasting panel data.
While one could usually use a standard normal distribution, our tasting panel data comes from a scoring system with a finite range of possible values. As a result, there is a minimum and maximum value associated with each product in our tasting panel. If we were to use a normal distribution without limiting the possible values, our simulated data could include values outside the range our tasters could actually assign, which would introduce inaccuracy. More can be read about truncated normal distributions here.
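The upsampling step can be sketched with scipy’s truncnorm, which takes its truncation bounds in standard-deviation units relative to the mean. The panel scores and the 1–10 scoring range below are assumptions for illustration; the post doesn’t state the actual scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical panel scores on an assumed 1-10 scale (illustrative only)
panel_scores = np.array([7.5, 8.0, 6.5, 7.0, 8.5, 7.5])

mu = panel_scores.mean()
sigma = panel_scores.std(ddof=1)
lo, hi = 1.0, 10.0   # the finite range of the scoring system (assumed)

# truncnorm expects bounds standardized around the fitted mean and std dev
a, b = (lo - mu) / sigma, (hi - mu) / sigma
simulated = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma,
                                size=100, random_state=rng)

# Append the simulated draws to the original panel data (upsampling)
upsampled_panel = np.concatenate([panel_scores, simulated])
```

Every simulated draw is guaranteed to land inside [1, 10], so the upsampled panel never contains scores a real taster couldn’t have given.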
I experimented with exactly how many randomly drawn samples we should append to our tasting panel data. To do this, I calculated the power, t-statistic, and p-value of each product’s test, comparing N_cf with each possible value of N_tp. From these calculations, I plotted each value as N_tp increases in order to see at what N_tp the power and p-values begin to converge. Here we are looking for a value of N_tp where an additional unit of N_tp provides very little marginal benefit to the outcome of our testing.
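A convergence sweep of this kind might look like the following sketch: upsample the panel to a range of candidate N_tp values, run Welch’s test against the fixed customer data at each one, and inspect where the curve flattens. All data here is synthetic and illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: small panel vs. larger customer sample (assumed 1-10 scale)
panel_scores = np.array([7.5, 8.0, 6.5, 7.0, 8.5, 7.5])
customer_scores = rng.normal(6.5, 1.2, 300).clip(1, 10)

mu, sigma = panel_scores.mean(), panel_scores.std(ddof=1)
a, b = (1 - mu) / sigma, (10 - mu) / sigma   # standardized truncation bounds

p_values = {}
for n_tp in (25, 50, 100, 200, 300, 400):
    # Draw enough simulated scores to bring the panel up to n_tp
    sims = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma,
                               size=n_tp - len(panel_scores),
                               random_state=rng)
    upsampled = np.concatenate([panel_scores, sims])
    _, p = stats.ttest_ind(upsampled, customer_scores, equal_var=False)
    p_values[n_tp] = p
# Plot p_values (and analogous power estimates) against n_tp and look
# for the point where additional samples stop moving the curve.
```

In practice we produced one such curve per product, which is what motivates looking for a general rule rather than inspecting each plot by hand.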
What I found is that there is no hard and fast rule for the size of N_tp that will always yield the best result. Since we are simulating a distribution to then test on, intrinsic error is introduced by the assumption that our tasting panel results remain consistent as we increase the dataset’s N size. It is therefore not advantageous to set N_tp to an arbitrarily large number: as the convergence in our test results shows, there is little marginal benefit to power or p-value past a certain point, while the simulation error continues to accumulate. Ultimately, the appropriate N_tp depends on the given tasting panel and customer feedback distributions for a product.
Due to each test needing a different N_tp, it’s best practice to manually examine each product’s two distributions and look at how increasing the tasting panel N size affects the power and p-values of each product’s respective test. However, due to the quantity of products we deal with, this method is prohibitively expensive in terms of analyst time. As such, there needs to be a generalized rule in our practice for determining the N size we should upsample to that gives the most acceptable results. Based on research, experience, and testing, we’ve found we get the best results when N_tp = N_cf.
There are a few reasons we arrived at the N_tp = N_cf conclusion. The first is that the power and p-values of our tests tend to start converging around this value, or are at least at acceptable levels. This implies that the marginal benefit of increasing N_tp begins to decrease significantly at this point, while the risk of error from adding to our sample size keeps growing. As stated previously, we want to avoid a very large N_tp in order to limit possible error. Secondly, it’s a natural stopping point: growing our simulated data beyond the size of our actual customer data doesn’t seem intuitive.
The point of this process is to help SnackNation gain insight into the accuracy of our tasting panel. While the methodology described relies on certain assumptions that can potentially skew results, when it’s considered alongside additional customer feedback metrics, our years of experience in the industry, and other qualitative data, it helps confirm trends we suspect or points us in a direction for further investigation. Ultimately, this problem represents an application of statistics and hypothesis testing where the data is irregular and doesn’t have a textbook solution. We’ve found a way that generally works for us, using statistical intuition and experimentation to make sense of the problem. The Business Intelligence team is dedicated to tackling these kinds of analytics challenges, as we’ve found that a tremendous amount of value can be gained for our stakeholders through these efforts.