# At least two reasons why you probably shouldn’t use the Net Promoter Score

Update: For those who wish to know how to find confidence intervals for the Net Promoter Score, I’ve written a guide complete with a downloadable spreadsheet or R code for doing just that. Hopefully someone out there can use that information to improve their NPS reporting with some measures of statistical confidence.

People love the net promoter score. According to the Wikipedia article on NPS, two thirds of the Fortune 1000 use it in some form. If you don’t know about it, NPS is a survey-based management tool meant to quantify customer satisfaction and loyalty. The folks who sell the NPS over at netpromoter.com claim to have proven that it is a predictive indicator of growth, so that if your NPS is high, your company will grow soon.

The NPS has essentially two parts: a questionnaire, and a summary statistic. The questionnaire is simple and elegant and I love it. It contains only a single question, which is officially worded as follows: How likely is it that you would recommend [brand] to a friend or colleague? Responses are given on a scale from 0 to 10, with 0 being not likely at all, and 10 being very likely.

If you read my guide to survey design, you’ll see why I love the questionnaire so much. We are trying to quantify customer satisfaction, so we find an action tendency which is reflective of customer satisfaction — recommending the brand to a friend or colleague — and then ask about that. The resulting distribution of responses ought to give us a well calibrated reflection of how satisfied the population is with our company.

The second part of the NPS is the summary statistic, and this part is a little bit more problematic. To compute the NPS, we label respondents who give a 9 or a 10 as “promoters”, and respondents who give a 6 or below as “detractors”. Then the NPS statistic is computed as the proportion of promoters minus the proportion of detractors. Officially, this is multiplied by 100 in order to express the result as a percent. The NPS statistic falls between -100 and 100, with a score of -100 signifying that everyone in your sample is a detractor, and a score of 100 signifying that everyone in your sample is a promoter. A score greater than 0 signifies that there are more promoters than detractors, and vice versa.
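The computation described above can be sketched in a few lines of Python. The function name `nps` and the sample responses are made up for illustration; the cutoffs (9 and 10 for promoters, 6 and below for detractors) are the ones given in the text.

```python
def nps(scores):
    """Net Promoter Score on the -100..100 scale from 0-10 responses."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)   # 9s and 10s
    detractors = sum(1 for s in scores if s <= 6)  # 0 through 6
    # 7s and 8s count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(scores)


# Illustrative responses: 5 promoters, 2 passives, 3 detractors.
sample = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(nps(sample))
```

Note that respondents who answer 7 or 8 affect the score only through the denominator.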

A lot of aspects of this scheme seem a little arbitrary — why 11 points? why 6 as a cutoff? why 9 as a cutoff? why a difference in proportions and not a ratio which would seem more natural for proportions? — but I’m going to take those as given here. My main concern is with some undesirable statistical properties of the NPS statistic. Because of the way that it is defined, the NPS statistic has (at least) two main statistical shortcomings that make it a poor choice of KPI for many applications. The first shortcoming that I’ll discuss is at the population level. I’ll show why if you could survey every single member of your customer base with zero error and compute the NPS statistic of that, the result could be misleading, especially if your plan is to use this result to compare across groups of customers or across time. The second shortcoming is at the sample level. It turns out that the NPS has significantly higher sampling variability than alternatives, and this has some unfortunate consequences for its usefulness as a KPI.

# NPS doesn’t capture meaningful population-level differences

Since NPS is a commonly used indicator for customer satisfaction, it is frequently used to measure the success of customer satisfaction initiatives. You may wish to determine whether intervention X increases customer satisfaction, so you partition some portion of your customer base into a treatment group and a control group, and measure the NPS statistics of each group after the intervention. Unfortunately, no matter how scientifically this procedure is carried out, you may be unable to detect changes in customer satisfaction due to X because the NPS statistic itself is quite ineffective at detecting differences between distributions.

For example, since the net promoter score treats all respondents who give a response between 0 and 6 the same way (as detractors), any change in the distribution of these responses is undetectable by NPS. Similarly, changes within the group that responds 7 or 8, as well as within the group that responds 9 or 10, are also undetectable.
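A small sketch makes this concrete. The two groups below are invented for illustration (they are not the data behind any figure in this post): every detractor in `after` gives a much higher score than in `before`, yet nobody crosses a cutoff.

```python
from statistics import mean


def nps(scores):
    """Net Promoter Score on the -100..100 scale from 0-10 responses."""
    n = len(scores)
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / n


# Hypothetical responses: only the five detractor scores differ.
before = [0, 1, 2, 2, 3, 7, 8, 9, 9, 10]
after = [5, 6, 6, 6, 6, 7, 8, 9, 9, 10]

print("NPS before/after: ", nps(before), nps(after))
print("mean before/after:", mean(before), mean(after))
```

The NPS is identical for both groups, while the mean score moves by about two points on the 0 to 10 scale.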

This is intuitively obvious to most folks who really think about how NPS is computed, but the severity of this shortcoming as a KPI is often overlooked. Figure 1 shows three distributions, each with equal NPS statistics. By almost any other measure that is commonly used to characterize distributions, customer satisfaction as a whole is worst in group 1, a little better in group 2, and much better in group 3.

Now imagine that you were trying out some new initiative to improve customer satisfaction, and using the NPS as your main KPI to track the effectiveness of this measure. Imagine further that when you begin the initiative, your customer base looks a lot like Group 1, but it moves to something a lot like Group 3. This ought to be a resounding success! In Group 1, for instance, about 65% of your customers give a score of 4 or less; this number is reduced to only about 35% in Group 3. But if you are using the NPS as your main KPI, you will remain tragically unaware of this success.

And it is not just movements within the detractor or promoter groups that NPS cannot detect. Figure 2 gives another Anscombe’s trio of customer satisfaction scores, each with very different shapes. A business which is trying to maximize customer satisfaction should find different things to worry about in each of these graphs, but the NPS scores are again all equal — this time, all zero.

Any summary statistic necessarily throws out a lot of information about the distribution that it summarizes, but the NPS throws out a lot, much of which would not be lost using other similar summary statistics. It does have some desirable properties as a very high level KPI — in particular, if it moves up then that’s probably good and if it moves down that’s probably bad — but these properties are not unique to the NPS statistic, and other summary statistics have these properties without throwing out so much useful information about the data. Moreover, the converse does not hold: if the NPS statistic goes down then that’s probably bad, but if something bad happens, that does not necessarily mean that NPS will go down. Again, referring to figure 1, if something takes us from Group 3 to Group 1, that change will not be reflected by NPS, even though it is a fairly unequivocally bad move.

Most importantly, NPS is a horrible tool for measuring the difference between two or more groups (or a single group across time). Two distributions can be drastically different in ways that have statistical and practical importance, but have essentially equal NPS statistics.

# NPS requires much larger sample sizes than alternatives

The last point deals with NPS at the population level, but in practice, we can’t observe NPS at the population level. The NPS statistic is computed at a sample level, from survey responses. In general, we cannot include everyone in the population in a survey, so the real point of the NPS statistic is to make an inference about the population by looking at a sample.

Everyone knows that when you compute a sample statistic, you should also report some measure of your confidence in how closely your estimate reflects the underlying population statistic that you’re trying to find. Unfortunately, this seems to go out the window a little bit when we talk about NPS. For instance, at the time of this writing, the official Net Promoter Score website doesn’t seem to have any mention of “confidence intervals” or the like anywhere.

This is not to say that we can’t derive a confidence interval for the NPS statistic. In fact, deriving standard errors for the NPS statistic is surprisingly straightforward, and anyone using anything at least as powerful as Excel should have no problem producing standard errors and confidence intervals and p-values and whatever else we normally report along with sample-based estimates of population statistics.
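As a rough illustration, here is one way such an interval could be computed. This is a sketch under stated assumptions, not necessarily the approach taken in the guide mentioned at the top of this post: it assumes a simple random sample, codes each response as +1 (promoter), 0 (passive) or -1 (detractor), and uses a normal approximation. The function name `nps_with_interval` and the sample data are hypothetical.

```python
import math


def nps_with_interval(scores, z=1.96):
    """Return (nps, lower, upper) on the -100..100 scale.

    Normal-approximation interval; z=1.96 corresponds to roughly 95%.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one response")
    p = sum(s >= 9 for s in scores) / n  # promoter share
    d = sum(s <= 6 for s in scores) / n  # detractor share
    estimate = p - d
    # With responses coded +1 / 0 / -1, the per-response variance is
    # E[X^2] - E[X]^2 = (p + d) - (p - d)^2.
    variance = max(0.0, p + d - estimate ** 2) / n
    half_width = z * math.sqrt(variance)
    return 100 * estimate, 100 * (estimate - half_width), 100 * (estimate + half_width)


# Made-up sample: 100 promoters, 40 passives, 60 detractors.
scores = [10] * 100 + [8] * 40 + [4] * 60
score, low, high = nps_with_interval(scores)
print(f"NPS = {score:.1f}, approx. 95% CI = ({low:.1f}, {high:.1f})")
```

The normal approximation can be poor for small samples or for scores near -100 or 100, so treat the interval as a first approximation.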

So how come nobody does it? Well, I’m not sure, but I think that a contributing factor may be that confidence intervals for NPS estimates are quite large compared to those of alternative methods. There are countless methods in the literature for comparing differences between distributions of ordinal categorical data (in fact, here is an entire 400+ page book about nothing but that if you’re interested), and almost all of them have better statistical properties than the NPS statistic.

It is somewhat difficult to quantify how much better alternative methods perform compared to the NPS statistic without making a whole bunch of assumptions about the underlying data, but there are a few things that we can figure out. For one, the standard error on the NPS statistic must be quite a bit larger than the standard error of either an estimate of the proportion of promoters, or the proportion of detractors. Why? It takes just a little bit of very simple statistics to show, but if you don’t feel like reading that, feel free to skip down to the next figure.

If we let P be the proportion of promoters in our sample, and D be the proportion of detractors, then the sample NPS is equal to P - D, simply by the definition of NPS. Both P and D vary from sample to sample, so they are random variables. Now, the standard error of a statistic is the square root of its variance, and since NPS = P - D, we must have that Var(NPS) = Var(P - D). Remembering back to college statistics, the variance of a difference of random variables is equal to the sum of their variances, minus twice their covariance. Applying this formula to the NPS, we have that

Var(NPS) = Var(P) + Var(D) - 2Cov(P,D).

Finally, note that Cov(P,D) < 0, since a promoter is necessarily not a detractor, and vice versa. So we end up with

Var(NPS) = Var(P) + Var(D) + some positive number

In other words, the variance of the NPS statistic is greater than the sum of the variances of its parts. So, forget about all the fancy ordinal regression methods from the book: you can do a lot better than NPS by simply focusing on either the promoter proportion or the detractor proportion on its own.
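To put a value on that "positive number", here is a sketch under one additional assumption that the text does not state: the n respondents are a simple random sample, so the counts of promoters, passives and detractors are multinomial. The symbols are introduced for this sketch only: p and d are the true population shares of promoters and detractors, and the hatted quantities are the sample proportions (the P and D whose variances appear above).

$$
\operatorname{Var}(\hat{P}) = \frac{p(1-p)}{n}, \qquad
\operatorname{Var}(\hat{D}) = \frac{d(1-d)}{n}, \qquad
\operatorname{Cov}(\hat{P}, \hat{D}) = -\frac{pd}{n}
$$

$$
\operatorname{Var}(\widehat{\mathrm{NPS}})
= \frac{p(1-p) + d(1-d) + 2pd}{n}
= \frac{p + d - (p-d)^2}{n}
$$

For example, with made-up shares p = 0.40 and d = 0.25, the numerators are 0.24 for promoters, 0.1875 for detractors and 0.6275 for NPS, so the standard error of the NPS is roughly 1.6 times that of the promoter proportion and roughly 1.8 times that of the detractor proportion.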

Figure 3 presents the results of a simulation to investigate this empirically. At each sample size, I generated scores from a modified normal distribution and computed confidence interval widths for each statistic. As expected, the confidence intervals for the NPS statistic are consistently around twice as wide as the confidence intervals for the detractor or promoter proportions. Flipping that around, the sample sizes required to compute statistics to a fixed precision are enormously larger for NPS than for the others. For instance, if we want our estimates to be within 10 percentage points (so that the width of the confidence interval is 20), we only need somewhere between 300 and 400 participants to estimate detractors or promoters, but we need about 1000 participants to estimate NPS. Note that these exact numbers will vary depending on the real distribution (this is just a simulation), but the relationships between these summary statistics generally won’t change much. You’ll almost always need more than twice as many people to estimate NPS as to estimate the promoter or detractor percentage.
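The sample-size relationship can also be sketched analytically rather than by simulation. The snippet below assumes simple random sampling and a normal-approximation interval, and uses made-up population shares (40% promoters, 25% detractors). The helper `required_n` is hypothetical. The absolute numbers depend on the assumed shares and confidence level and are not meant to reproduce Figure 3; the point is the ratio between the statistics, which does not depend on the margin chosen.

```python
import math


def required_n(per_response_variance, margin, z=1.96):
    """Smallest n for which z * sqrt(variance / n) <= margin."""
    return math.ceil(z ** 2 * per_response_variance / margin ** 2)


# Hypothetical population shares (proportions, not percentages).
P, D = 0.40, 0.25

var_promoters = P * (1 - P)      # variance of a promoter indicator
var_detractors = D * (1 - D)     # variance of a detractor indicator
var_nps = P + D - (P - D) ** 2   # variance of a +1 / 0 / -1 coded response

for margin in (0.10, 0.05):
    print(
        f"margin ±{margin:.2f}:",
        "promoters", required_n(var_promoters, margin),
        "| detractors", required_n(var_detractors, margin),
        "| NPS", required_n(var_nps, margin),
    )

print("NPS / promoters sample-size ratio:", round(var_nps / var_promoters, 2))
```

Because the required sample size is proportional to the per-response variance, the ratio printed at the end is the factor by which the NPS needs more respondents than the promoter proportion under these assumed shares.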

## A corollary: you should expect your NPS to swing wildly, even when nothing is changing at the population level

Since the variance in the NPS statistic is larger than that of other measures, it follows that repeated samplings from the same population will yield more variable NPS statistics compared with other statistics. Figure 4 gives the result of another simulation to illustrate this. This time, we sample a group of 250 from the same population 12 times, to simulate a monthly survey with 250 respondents. For each month, we compute the NPS, as well as the Promoters %, Detractors %, mean ranking, and median ranking of the sample. The results are transformed to percent misses — that is, to the percent by which the estimate misses the true value.

The headline result is this: the NPS estimate regularly misses its true value by upwards of 30%. On the other hand, simply reporting the mean ranking never misses the true value by even 5%, and the median ranking is actually bang on every single month.

Again, like the last simulation, your mileage will vary depending on the characteristics of the distribution that you’re drawing from, but the relationships between the statistics will stay about the same. Your NPS might not regularly swing by 40% — or, it might swing by way more than 40% — but it will almost always have more violent swings than any of these other statistics, and the mean and median rankings will almost always have much more gentle swings from month to month than the NPS-related statistics.
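The monthly-survey setup described above can be sketched as follows. This is not the code behind Figure 4: the population weights, the seed and the function names are all assumptions chosen for illustration, so the numbers it prints will differ from the figure.

```python
import random
from statistics import mean, median, pstdev

SCORES = list(range(11))
# Made-up population shares (in %) for scores 0..10:
# 25% detractors (0-6), 35% passives (7-8), 40% promoters (9-10).
WEIGHTS = [1, 1, 2, 3, 4, 6, 8, 15, 20, 22, 18]

TRUE_NPS = 100 * (sum(WEIGHTS[9:]) - sum(WEIGHTS[:7])) / sum(WEIGHTS)
TRUE_MEAN = sum(s * w for s, w in zip(SCORES, WEIGHTS)) / sum(WEIGHTS)


def summarize(sample):
    """Summary statistics for one survey wave."""
    n = len(sample)
    promoters = 100 * sum(s >= 9 for s in sample) / n
    detractors = 100 * sum(s <= 6 for s in sample) / n
    return {
        "nps": promoters - detractors,
        "promoters_pct": promoters,
        "detractors_pct": detractors,
        "mean": mean(sample),
        "median": median(sample),
    }


def simulate(months=12, respondents=250, seed=0):
    """Draw `months` independent samples from the same fixed population."""
    rng = random.Random(seed)
    return [
        summarize(rng.choices(SCORES, weights=WEIGHTS, k=respondents))
        for _ in range(months)
    ]


def pct_miss(estimate, truth):
    """Percent by which an estimate misses the true value."""
    return 100 * (estimate - truth) / abs(truth)


for month, stats in enumerate(simulate(), start=1):
    print(
        f"month {month:2d}: "
        f"NPS miss {pct_miss(stats['nps'], TRUE_NPS):+6.1f}%  "
        f"mean miss {pct_miss(stats['mean'], TRUE_MEAN):+5.1f}%"
    )
```

Nothing changes in the population between months, so every swing in the output is sampling noise. Calling `simulate` with a larger `months` value and comparing `pstdev` across the columns gives a steadier picture of how much each statistic moves.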

# The questionnaire is still good, but we can do a lot better than the NPS statistic

As I mentioned in the introduction, I think that the NPS question is still a very useful tool for measuring customer satisfaction and loyalty. But the NPS statistic has very undesirable statistical properties that make it a poor choice for analyzing the results of that survey. This is especially true when the survey is being used to compare two groups, or to compare the same group across time. Meaningful differences can develop between groups that are not detectable by the NPS statistic, and even when differences are detectable in principle, the sample sizes required to detect them using the NPS statistic are often infeasible — or if they aren’t infeasible, they’re still much larger than the sample sizes required by alternative measures.

There is an entire world of options for analyzing the results of an NPS-type questionnaire: ridit scoring, ordinal logistic regressions, rank-sum tests, the list goes on. It is also OK to use simple means and medians of the scores to analyze the data. A lot of people think that since the results of an NPS-type questionnaire come back as discrete data as opposed to continuous data, looking at the mean score is “not allowed”. But it turns out that under some sensible assumptions about the underlying distributions, using mean scores (and related methods such as linear regressions) to compare across groups or across time is perfectly fine — and the assumptions required for mean scores are a whole lot easier to take than the assumptions required for the NPS statistic (e.g. that anyone who gives a 6 or lower is a “detractor” etc.).

The most important thing to remember when analyzing the results of an NPS-type survey is that you are trying to characterize a distribution of responses. Any attempt to summarize the distribution by a single number — an NPS statistic, a mean score, whatever — will necessarily be reductive and incomplete. While the NPS statistic is somewhat more incomplete than others, any analysis should at least include a more holistic look at the distribution of responses, such as a histogram, in order to get a picture of the entire distribution.

NPS is basically a good idea. The question is a good way to get data on customer satisfaction, and thinking about the distributions of promoters and detractors is probably a useful abstraction for understanding the overall distribution of responses. Unfortunately, the NPS statistic itself has poor statistical properties for a KPI, and other methods are a lot better for comparing and tracking across time.