CSAT: An emperor with no clothes?

Ron Sielinski
Data Science at Microsoft
12 min read · Apr 22, 2021

CSAT is the king of customer satisfaction surveys. It’s the go-to metric that companies use to measure customer satisfaction for a variety of experiences: online searches, product purchases, customer support, and more.

Because of CSAT’s popularity, numerous how-to guides describe how the survey works and how the results are calculated. But the guides often stop there. They fail to point out that, because CSAT is a survey, companies need to account for uncertainty in their results. Statistics gives us the necessary tools to deal with that uncertainty — to put results into perspective — but the how-to guides often overlook (or avoid) that step in the analytical process, making it seem optional.

It’s not.

Unless companies account for uncertainty in their CSAT scores, they risk wasting time chasing meaningless period-over-period changes, investing in ineffectual programs, and missing real opportunities to improve customer satisfaction.

How CSAT works

There are two ways of calculating CSAT: as a score or as a percentage. Both start with a simple survey asking how satisfied customers are with an experience. Possible answers range from least to most satisfied, creating a familiar five-point scale:

1. Highly dissatisfied
2. Dissatisfied
3. Neither satisfied nor dissatisfied
4. Satisfied
5. Highly satisfied

The more satisfied the customer, the more points.

At Microsoft, we sometimes present the survey as a set of stars, smiley faces, or other easily interpretable signs of satisfaction. I recently sent my wife’s Surface Laptop in for service. After the laptop was repaired, the last email that I received from Microsoft included this survey:

Surveys like this reduce the cognitive load of reading the five options and making nuanced distinctions among them. Rather than having to decide, “What’s the difference between satisfied and highly satisfied?”, customers can simply rate their level of satisfaction on a scale of 0 to 5. In this case, the more satisfied the customer, the more stars.

CSAT score

The first of the two ways to calculate CSAT is simply to take the average of all survey responses. The following example shows how CSAT is calculated for a survey that received 10 responses: 1, 5, 3, 4, 5, 2, 1, 4, 3, 4.

CSAT = (sum of CSAT scores) / (number of responses)
CSAT = (1 + 5 + 3 + 4 + 5 + 2 + 1 + 4 + 3 + 4) / 10
CSAT = 32 / 10
CSAT = 3.20
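The score calculation takes only a few lines of Python (variable names here are illustrative):

```python
# CSAT score: the mean of all survey responses on the 1-5 Likert scale.
responses = [1, 5, 3, 4, 5, 2, 1, 4, 3, 4]

csat_score = sum(responses) / len(responses)
print(csat_score)  # 3.2
```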

CSAT percentage

The second way to calculate CSAT is to group scores together as high (4 – 5) or low (1 – 3). In fact, some companies reduce the survey itself to a simple choice between thumbs-up or thumbs-down:

High scores (thumbs-up): 4, 5
Low scores (thumbs-down): 1, 2, 3

Here’s the survey that appears on docs.microsoft.com:

The CSAT percentage is simply the number of surveys that received a high score (or a thumbs-up), divided by the total number of responses:

CSAT = (number of high scores) / (number of responses)

Using the same example from above, we first transform our 10 responses from scores to 1s or 0s:

P: {1, 5, 3, 4, 5, 2, 1, 4, 3, 4}
P: {0, 1, 0, 1, 1, 0, 0, 1, 0, 1}

Then calculate the percentage:

CSAT = 5 / 10
CSAT = 0.50
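Continuing the Python sketch, the percentage version simply recodes each response as a 1 (thumbs-up) or 0 (thumbs-down) before averaging:

```python
# CSAT percentage: the share of responses scoring 4 or 5.
responses = [1, 5, 3, 4, 5, 2, 1, 4, 3, 4]

thumbs = [1 if r >= 4 else 0 for r in responses]  # recode high vs. low scores
csat_pct = sum(thumbs) / len(thumbs)
print(csat_pct)  # 0.5
```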

In either case — score or percentage — the results of these calculations aren’t as exact as they might seem. More formally, they’re only point estimates of customer satisfaction.

Unfortunately, this is as far as most companies take their CSAT calculations.

Confidence intervals

It’s important to remember that a CSAT score is based on a subset of a population — just those customers who’ve responded to the survey — but is intended to represent the satisfaction of the entire population. Because not every customer responds to the survey, the results need to account for a degree of uncertainty.

Consider an extreme example: If a company had 100 customers complete a purchase, but only a single dissatisfied customer responded to the survey, the resulting CSAT would be low, regardless of the satisfaction of the other 99 customers.

To overcome that challenge, CSAT should always be reported with confidence intervals, representing the range in scores that will likely contain the population’s true CSAT based upon the number of responses and the specified probability.

If you’re a bit apprehensive about statistics but would like to calculate confidence intervals yourself, try an online calculator like this one on Bing. (Because CSAT is the mean score of all survey results, select “Mean” as the type of confidence interval that you want to calculate.)

Otherwise, this formula is used to calculate the confidence interval (CI) for CSAT scores at a specified confidence level (1 – α):

CI_(1 – α) = x̄ ± t_(α/2, df) * σ_x̄

where

x̄ = mean of the sample
t_(α/2, df) = critical value of t at the desired confidence level and relevant degrees of freedom
α = 1 – confidence level
df = degrees of freedom
σ_x̄ = standard error of the mean

Some of these variables might look daunting, particularly α and t, but α is simply the complement of a confidence level that you get to choose, and t is a look-up value based upon α and the degrees of freedom (in this case, n – 1): See “Critical Values of the Student’s t Distribution” at NIST.gov.

For the sake of simplicity, this article focuses on CSAT scores, but the same approach is used for CSAT percentages. The only difference is how to calculate the standard error for a continuous value (σ_x̄) versus the percentage of a population (σ_p̂):

σ_x̄ = s / √n
σ_p̂ = √[p̂(1 – p̂) / n]

where

s = standard deviation of the sample
n = size of the sample
p̂ = proportion of high scores in the sample

The following shows how the confidence interval is calculated for the same 10-response example from above at a 95% confidence level (i.e., α = 0.05):

CI_0.95 = x̄ ± t_(α/2, df) * σ_x̄
CI_0.95 = x̄ ± t_(0.05/2, 9) * (s / √n)
CI_0.95 = 3.20 ± t_(0.025, 9) * (1.48 / √10)
CI_0.95 = 3.20 ± 2.26 * 0.47
CI_0.95 = 3.20 ± 1.06
CI_0.95 = [2.14, 4.26]
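The same calculation can be reproduced in Python; this sketch assumes SciPy is available for the critical value of t:

```python
import math
import statistics

from scipy import stats

responses = [1, 5, 3, 4, 5, 2, 1, 4, 3, 4]
n = len(responses)

mean = sum(responses) / n                      # point estimate: 3.20
s = statistics.stdev(responses)                # sample standard deviation: ~1.48
se = s / math.sqrt(n)                          # standard error of the mean
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)   # critical t at 95% confidence
low, high = mean - t_crit * se, mean + t_crit * se
print(f"[{low:.2f}, {high:.2f}]")              # [2.14, 4.26]
```

Using `stats.t.ppf` instead of a look-up table gives the exact critical value for any α and degrees of freedom.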

The result provides a very different perspective on CSAT. Instead of a single score taken to two decimal places (3.20), we get a range: The confidence interval spans 2.12 (|2.14 – 4.26|), more than half the width of the Likert scale (|1 – 5| = 4), which calls that apparent precision into question.

Figure 1: The raw CSAT score is 3.20, the confidence interval spans 2.14 – 4.26, and the Likert scale spans 1 – 5

In this case, the primary reason that the interval is so wide is that it’s based on such a small sample, only 10 responses. The law of large numbers tells us that the more responses we get, the closer the point estimate will be to the real CSAT of the overall population and the narrower the confidence interval will be. If our example survey had received 1,000 responses, but the results had the same mean (3.20) and standard deviation (1.48), the CI would be ± 0.09 (instead of ± 1.06 for just 10 responses).

Don’t presume that a survey with just 10 responses is merely an academic example, however. Results based on a small number of responses are common, especially when companies analyze their CSAT scores. Quite often, they’ll slice aggregate scores by various factors, such as support topic or product category, trying to isolate areas where customer satisfaction is higher or lower. With each slice, the sample sizes get smaller and confidence intervals get wider.

Regardless of their size, confidence intervals are important: They remind people that CSAT is an estimate, not an exact score.

Finite population correction

CSAT is designed to measure customer satisfaction with a specific transaction, and companies often know how many customers completed that transaction. If the response rate for a survey is 5% or higher, companies should apply the finite population correction (FPC), which accounts for the increase in accuracy when a relatively large proportion of a population is sampled:

FPC = √[(N – n) / (N – 1)]

where

N = size of the population
n = size of the sample

The formula for the standard error of individual CSAT scores becomes:

σ_x̄ = (s / √n) * FPC
σ_x̄ = (s / √n) * √[(N – n) / (N – 1)]

And the formula for confidence intervals becomes:

CI_(1 – α) = x̄ ± t_(α/2, df) * σ_x̄
CI_(1 – α) = x̄ ± t_(α/2, df) * (s / √n) * √[(N – n) / (N – 1)]
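As a sketch, suppose the 10 responses above came from a population of just 40 customers (a hypothetical 25% response rate); applying the FPC narrows the interval:

```python
import math
import statistics

from scipy import stats

responses = [1, 5, 3, 4, 5, 2, 1, 4, 3, 4]
n, N = len(responses), 40                 # N = 40 is a hypothetical population size

mean = sum(responses) / n
fpc = math.sqrt((N - n) / (N - 1))        # finite population correction
se = statistics.stdev(responses) / math.sqrt(n) * fpc
margin = stats.t.ppf(0.975, df=n - 1) * se
print(f"{mean:.2f} ± {margin:.2f}")       # margin shrinks from ±1.06 to ~±0.93
```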

Period-over-period comparisons

Oftentimes, companies compare CSAT scores on a period-over-period basis to see how customer satisfaction is changing over time. Period-over-period perspectives are especially useful when companies want to track salient changes in trends or they’re actively trying to increase their customers’ satisfaction.

The challenge, however, is knowing when there’s a meaningful difference in CSAT scores from one period to the next. Just because one CSAT score is larger (or smaller) than another doesn’t mean that there’s a real difference in the satisfaction of the two populations.

Consider the case where a company is trying to improve its CSAT score. The company tries multiple programs and initiatives before finally seeing their CSAT increase from 2.87 to 3.80. The difference of 0.93 seems significant, but unless the company evaluates the difference in scores from a statistical perspective, they cannot know. If the improvement is real, they can justifiably celebrate their success and invest more heavily in the new initiative. If not, any further investment in that initiative is probably wasted, and continuing down that path would result in missed opportunities to further improve customer satisfaction.

Here are the survey results from the two consecutive periods, P₁ and P₂:

P₁: {1, 3, 4, 5, 4, 4, 2, 3, 2, 3, 1, 2, 4, 3, 2}
P₂: {4, 5, 4, 5, 2, 5, 3, 4, 4, 5, 2, 3, 5, 3, 3}

The resulting CSAT scores with confidence intervals are:

P₁: 2.87 ± 0.66
P₂: 3.80 ± 0.60

While the 0.93 increase from P₁ to P₂ seems significant, the confidence intervals for both periods overlap, making it difficult to determine whether the difference is statistically significant: If both scores can lie in the 3.20 – 3.53 range, perhaps there’s no real difference between periods.

Figure 2: Overlapping confidence intervals around P₁ and P₂

However, we can’t rely on confidence intervals around the individual scores to compare them. We need to account for the additive nature of errors (i.e., the combination of the error around P₁ and the error around P₂). The pooled standard error is the quadrature sum of the individual surveys’ standard errors:

σ_pool = √[σ_P₁² + σ_P₂²]

Much like the difference between Euclidean and Manhattan distance, the pooled error is smaller than the sum of individual errors:

σ_pool = √[0.31² + 0.28²]
σ_pool = 0.41

As a result, the confidence interval for the difference in CSAT scores is likewise smaller.

We start by calculating the difference itself:

x̄_diff = |x̄_P₁ – x̄_P₂|
x̄_diff = |2.87 – 3.80|
x̄_diff = 0.93

The formula for confidence interval remains the same, but t needs to account for the fact that the two CSAT scores are based on two surveys, so the number of degrees of freedom increases to (n_p₁ – 1) + (n_p₂ – 1) = 28:

CI_0.95 = x̄_diff ± t_(α/2, df) * σ_pool
CI_0.95 = 0.93 ± 2.05 * 0.41
CI_0.95 = 0.93 ± 0.85
CI_0.95 = [0.08, 1.78]

If the resulting confidence interval contains 0, the difference in CSAT scores is not statistically significant (i.e., if the interval could be 0, it’s possible that there’s no real difference). Conversely, if the interval doesn’t contain 0, the difference is statistically significant (i.e., if the interval can’t be 0, there must be a real difference).

In this example, the confidence interval (0.08 – 1.78) does not contain 0, so the difference in the CSAT scores is significant.
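The whole comparison can be scripted end to end (again assuming SciPy for the critical value of t):

```python
import math
import statistics

from scipy import stats

p1 = [1, 3, 4, 5, 4, 4, 2, 3, 2, 3, 1, 2, 4, 3, 2]
p2 = [4, 5, 4, 5, 2, 5, 3, 4, 4, 5, 2, 3, 5, 3, 3]

se1 = statistics.stdev(p1) / math.sqrt(len(p1))
se2 = statistics.stdev(p2) / math.sqrt(len(p2))
se_pool = math.sqrt(se1**2 + se2**2)          # errors add in quadrature
diff = abs(statistics.mean(p1) - statistics.mean(p2))
df = (len(p1) - 1) + (len(p2) - 1)            # 28 degrees of freedom
margin = stats.t.ppf(0.975, df=df) * se_pool
print(f"[{diff - margin:.2f}, {diff + margin:.2f}]")  # [0.08, 1.78]: excludes 0
```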

Multiple comparisons

The technique for period-over-period analysis is really only valid when comparing one survey to another. To compare three (or more) surveys, we’d have to compare each of the surveys with the others: S₁ to S₂, S₁ to S₃, and S₂ to S₃. Because each pairwise comparison has a 95% confidence level, the overall confidence of the analysis shrinks rapidly: 95% * 95% * 95% ≈ 86%. As a consequence, the probability of an error (the complement of the confidence level) nearly triples, going from 5% to 14%!

Nonetheless, the need to make multiple pairwise comparisons is real. For example, a company might want to evaluate its all-up CSAT score across 12 product categories, hoping to identify the categories where customer satisfaction is especially high or low. But k product categories require k(k – 1)/2 comparisons, so an analysis that subsets CSAT into 12 product categories would require 66 pairwise comparisons, and the compounded probability of error would not be tenable.

A Tukey HSD (Honestly Significant Difference) test offers one solution. The test comes from the same family of inferential statistics as the preceding techniques, so the concepts are fundamentally similar, but it controls for the errors that arise when multiple comparisons are made. When the number of surveys in each category is different (nᵢ ≠ nⱼ), which is typical, the test uses the Tukey-Kramer method to calculate confidence intervals:

CI = (x̄ᵢ – x̄ⱼ) ± q_(α, k, df) * √[(MS𝓌 / 2) * (1/nᵢ + 1/nⱼ)]

where

q_(α, k, df) = critical value of q, the studentized range distribution
α = 1 – confidence level
k = number of groups
df = degrees of freedom (N – k)
N = total sample size
MS𝓌 = mean within-group sum of squares (∑ᵢ∑ⱼ(xᵢⱼ – x̄ᵢ)² / df)
nᵢ, nⱼ = number of surveys in groups i and j
i, j = 1, … , k (i ≠ j)

The test calculates the difference between every pair of scores as well as a confidence interval. As before, if any of the resulting confidence intervals contains 0, the difference in that pair of scores is not statistically significant.
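SciPy (version 1.8 or later) ships this test as `scipy.stats.tukey_hsd`. The sketch below uses the two survey periods from above plus a third, made-up group of scores:

```python
from scipy.stats import tukey_hsd

# Two survey periods from above, plus a hypothetical third group.
s1 = [1, 3, 4, 5, 4, 4, 2, 3, 2, 3, 1, 2, 4, 3, 2]
s2 = [4, 5, 4, 5, 2, 5, 3, 4, 4, 5, 2, 3, 5, 3, 3]
s3 = [1, 2, 1, 2, 3, 1, 2, 2, 1, 3]

result = tukey_hsd(s1, s2, s3)               # all pairwise comparisons at once
ci = result.confidence_interval(confidence_level=0.95)
for i in range(3):
    for j in range(i + 1, 3):
        # A pair differs significantly when its interval excludes 0
        excludes_zero = not (ci.low[i, j] <= 0 <= ci.high[i, j])
        print(f"group {i + 1} vs {j + 1}: significant = {excludes_zero}")
```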

Ordinal data and truncated intervals

Finally, this article has focused on the techniques most commonly used for surveys precisely because they are the most familiar, which should help with the initial adoption of confidence intervals for the analysis and reporting of CSAT results.

There are multiple alternative ways of calculating confidence intervals, however, including some that address shortcomings of these more familiar approaches. The most obvious of these shortcomings occurs near the upper and lower limits of the Likert scale. Because CSAT surveys are based on fixed ordinal scales (e.g., 1–5, 0–5, and so on), scores cannot exist outside of those ranges. The familiar approaches don’t enforce that constraint, so their CI formulas can return intervals that extend slightly above or below the allowable range. From a practical perspective, though, it’s sufficient to simply truncate the CI at the upper and lower limits of the scale.
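Truncation is a one-liner; this sketch assumes a 1 – 5 scale:

```python
def truncate_ci(low, high, scale_min=1.0, scale_max=5.0):
    """Clamp a confidence interval to the limits of the ordinal scale."""
    return max(low, scale_min), min(high, scale_max)

# An interval that spills past the top of a 1-5 scale gets clipped:
print(truncate_ci(4.1, 5.3))  # (4.1, 5.0)
```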

The foundation of good analysis

The simplicity of CSAT is deceptive. The survey itself is relatively easy to implement, and raw results are easy to calculate, but interpreting those results requires a much more careful approach.

With any survey, there are multiple issues that companies might need to consider: participation bias, geographic bias, survey fatigue, and more. Most of these vary depending on the context of the survey, but one thing that’s always true is the need to account for uncertainty in the results. Calculating confidence intervals is essential to putting survey results in the right perspective.

Ultimately, the emperor does have clothes: confidence intervals. Without them, though, CSAT is just as naked as the proverbial king and, frankly, not something you should really look at!

Ron Sielinski is on LinkedIn.
