ML and customer support (Part 1): Using Machine Learning to enable world-class customer support

By Alexei Robsky, Joshua Chang, Yael Brumer, and Mark Filteau

Customers are at the center of every successful business, large or small. We want to make sure customers are happy with the service or product they are receiving and that they feel they have somewhere to turn when things don’t go as expected. One of the ways we achieve the latter is through the customer support we provide.

It comes as no surprise that a poor support experience might lead to churn, a downgrade, or some other loss of business. For example, think back to when you last called your internet service provider about an issue you faced. You might already have been experiencing something negative with the service (which is what led you to call), and if the support agent didn’t help you feel good about the situation, regardless of whether the problem was solved, you might have been tempted to leave. We don’t want our customers to have a similar experience when they turn to us for support.

Support representative working on customer’s case.

In this first part of a two-part article series on using Machine Learning to enable world-class customer support, we introduce some of the most important metric-based elements involved in delivering great support, including factors involved in prioritizing customer support calls and in measuring customer support success. We discuss some of the challenges involved in relying on customer surveys to gauge customer support success, and we introduce some thinking behind an ML model that can help enable customer support. In the second article, our colleagues Yael Brumer and Sally Mitrofanov describe how we apply natural language processing techniques to prioritize engineering efforts to overcome problems identified by customers. But first, in this article, let’s set the stage with an overview of the problem statement.

The problem of considering attributes

Many factors can contribute to a poor support experience, starting with how support cases are prioritized for response in the first place. Customer support teams prioritize backlogged support cases in a variety of ways. Some use FIFO (first in, first out), while others take a more systematic approach, such as looking at certain case attributes, including the age of a case, its severity, or the issue’s complexity. When done manually, these methods all attempt to “guess” which customers are having a poor support experience, based largely on the team’s or reviewers’ insights.
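To make the contrast concrete, here is a minimal Python sketch of the two approaches; the case fields, the severity convention, and the hand-tuned weights are illustrative only, not the rules any particular team uses.

```python
from datetime import datetime, timezone

# Hypothetical backlog of support cases; field names and values are illustrative.
cases = [
    {"id": "A", "opened": datetime(2021, 6, 1, tzinfo=timezone.utc), "severity": 2},
    {"id": "B", "opened": datetime(2021, 6, 10, tzinfo=timezone.utc), "severity": 1},
    {"id": "C", "opened": datetime(2021, 6, 5, tzinfo=timezone.utc), "severity": 3},
]

# FIFO: simply work the oldest case first.
fifo_order = sorted(cases, key=lambda case: case["opened"])

def priority_score(case, now=datetime(2021, 6, 15, tzinfo=timezone.utc)):
    """Hand-tuned score mixing case age and severity (lower severity number = more urgent)."""
    age_days = (now - case["opened"]).total_seconds() / 86400
    return age_days + 5 * (4 - case["severity"])

# Attribute-based: work the highest-scoring case first.
heuristic_order = sorted(cases, key=priority_score, reverse=True)

print("FIFO order:     ", [case["id"] for case in fifo_order])
print("Heuristic order:", [case["id"] for case in heuristic_order])
```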

Relying only on human insight, however, tends to mean considering fewer attributes, none of which might actually be the relevant ones. What if we are focused on the wrong ones? How do we really know which support case should be addressed first without looking at the dozens of attributes that could be relevant and comparing them all to one another? Consider, for example, the following three customer support cases, each featuring just a few attributes (type of issue, recency, severity, and customer type).

Which of the cases above will end up with low customer satisfaction?

There is no easy answer. With minimal information of this type, we simply don’t know enough. That is where a scalable ML model can help, as we discuss shortly.

Measuring customer support success

Developing a way to rank support cases can help us solve this problem. Ranking involves measurement, and the output of measurement is data. Understanding how we measure support success also allows us to create labels — in other words, informative tags about our data — that we can use in our ML model.

One way to measure the success of the support experience is to look at customer feedback provided via the survey that customers receive at the end of a support engagement. This is one of the few direct customer touchpoints we can get that is consistent across a variety of customer segments (such as enterprise, students, startups, and so on).

On the support survey, we ask customers to rank the support they experienced on a scale from 1 to 5 (with 1 being “very poor” and 5 being “very good”). As with any survey, we are not able to achieve a 100 percent response rate, but when looking at survey responses over a 12-month period, we have enough samples to analyze and understand the support experience.

Problems with relying on survey responses and how to mitigate them

Even with enough survey responses to analyze, there are still some challenges to account for, including the following:

  1. Responses may reflect customers’ experience with the product and not with the support experience. Some customers might feel frustrated due to a product issue, and even though they might have had a great support experience, they might still provide a lower rating on the survey.
  2. There is a tendency among respondents to answer with a single score regardless of the support experience. For example, some people answer with 3 for every question about support.
  3. Customers who submit survey responses tend to be the ones who had extremely positive or negative support experiences (a biased selection problem).
  4. Responses might be heavily skewed toward a score of 5, creating a non-normal distribution that makes drawing conclusions based on changes in average customer satisfaction more difficult.
  5. The customer’s geography and related cultural differences have an impact on how they rate the service they receive. For example, Americans tend to rate an average experience as a 5, whereas those in France tend to rate the same experience as a 3.
  6. Some customers might become confused about whether 1 is the lowest or the highest score as they take the survey. Sometimes the user interface can be designed to mitigate this, but it can still occur.

There are many ways to mitigate these challenges, weighing the ROI of each. For some of the cases above, we assume that the error is distributed evenly across surveys, so if we look at the average satisfaction score with confidence intervals and compare it month over month, we can focus on the overall trend. In other words, if the average score is 4.5, we don’t really know whether that reflects the true population. If we scored 4.2 a few months ago, however, and score 4.8 three months from now, the three scores form a positive trend that reassures us we are moving in the right direction.
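As a rough illustration of this idea (not our production reporting code), the Python sketch below computes a monthly average satisfaction score with a t-based 95 percent confidence interval so that month-over-month trends can be compared; the monthly scores are made up.

```python
import numpy as np
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    """Return the mean satisfaction score and a t-based confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mean, mean - half_width, mean + half_width

# Illustrative 1-5 survey responses per month; real data would come from the survey system.
monthly_scores = {
    "2021-01": [5, 4, 3, 5, 5, 4, 2, 5],
    "2021-02": [5, 5, 4, 5, 3, 5, 4, 5],
    "2021-03": [5, 5, 5, 4, 5, 5, 4, 5],
}

for month, scores in monthly_scores.items():
    mean, low, high = mean_with_ci(scores)
    print(f"{month}: average {mean:.2f} (95% CI {low:.2f}-{high:.2f})")
```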

A Machine Learning solution

To provide a more consistent, data-driven prioritization process, we developed an ML model that combines several inputs about the case characteristics we’ve discussed and estimates which support tickets are at risk of lower customer satisfaction.

As with most supervised ML approaches, we must define the label and the input data. The survey response data is a sufficient indicator of the customer experience (with the caveats we discussed in the previous section). We can supplement this with additional input data about what we know at the time of the support engagement: for example, customer information, the support representative’s workload (i.e., their number of assigned tickets), ticket health (e.g., severity, duration, whether it was escalated, and so on), and product information (i.e., the product the customer is calling about).
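As a sketch of what assembling these labels and features could look like (the column names, values, and the “low score” threshold of 3 or below are illustrative assumptions, not our actual schema):

```python
import pandas as pd

# Illustrative case attributes known at the time of the support engagement.
cases = pd.DataFrame({
    "case_id": [1, 2, 3],
    "severity": [1, 3, 2],
    "duration_hours": [52.0, 4.5, 18.0],
    "was_escalated": [True, False, False],
    "agent_open_tickets": [14, 6, 9],
    "product": ["storage", "compute", "identity"],
})

# Post-case survey responses (1-5) used to derive the label.
surveys = pd.DataFrame({
    "case_id": [1, 2, 3],
    "survey_score": [2, 5, 4],
})

data = cases.merge(surveys, on="case_id")
data["label"] = (data["survey_score"] <= 3).astype(int)  # 1 = at risk of dissatisfaction

# One-hot encode categorical inputs so a standard classifier can consume them.
features = pd.get_dummies(
    data.drop(columns=["case_id", "survey_score", "label"]),
    columns=["product"],
)
labels = data["label"]
print(features.head())
```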

Combining all these inputs, we can train an ML model with high accuracy and then make predictions on incoming cases. This technique enables support teams to focus on cases with a higher likelihood of customer dissatisfaction so that they can take corrective action to improve those experiences. The actions taken are then fed back to the model (via a feedback loop) to reduce biases introduced by those interventions.
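A minimal sketch of this train-and-score loop, using scikit-learn on synthetic data in place of our real features, might look like the following; the gradient-boosted classifier is an illustrative stand-in rather than a statement about the exact model we run.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the historical feature matrix and labels.
X_train = rng.normal(size=(500, 6))             # e.g., severity, duration, workload, ...
y_train = (rng.random(500) < 0.2).astype(int)   # 1 = case ended with a low survey score

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score incoming (unlabeled) cases and surface the riskiest ones for proactive review.
X_incoming = rng.normal(size=(5, 6))
risk = model.predict_proba(X_incoming)[:, 1]    # estimated probability of dissatisfaction
for case_index in np.argsort(risk)[::-1]:
    print(f"Incoming case {case_index}: dissatisfaction risk {risk[case_index]:.2f}")
```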

ML model architecture

Results

This ML-based approach has provided us with the following results:

Model results from back-testing

As is common when building ML models, we split the data into training, test, and validation sets. We used the training and test splits for cross-validation and hyperparameter tuning, and then validated the best-performing model on the held-out validation set.
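A simplified version of this workflow, again with scikit-learn and synthetic data (the estimator, parameter grid, and split sizes are assumptions for illustration), could look like this:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (rng.random(1000) < 0.2).astype(int)

# Hold out a validation set, then cross-validate and tune hyperparameters on the rest.
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_dev, y_dev)

# Validate the best model on the untouched holdout.
print("Best parameters:", search.best_params_)
print("Validation AUC:", search.score(X_val, y_val))
```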

We optimize for reducing false positives (i.e., promoting a case for review that doesn’t need to be reviewed) because they are costly, while balancing this against a high Area Under the Curve (AUC), a common performance measure for a binary classifier. We achieved a high AUC (approximately 80 percent) and a low false positive rate (about 15 percent).
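For reference, the sketch below shows one way to compute these two quantities on a held-out set; the labels, scores, and decision threshold are made up for illustration and do not reproduce the figures above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative held-out labels and predicted risk scores.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.15, 0.9, 0.05, 0.7])

auc = roc_auc_score(y_true, y_score)

# Pick a decision threshold and inspect the false positive rate it implies.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)

print(f"AUC = {auc:.2f}, false positive rate = {false_positive_rate:.2f} at threshold {threshold}")
```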

Model results in production

As with all production models, the journey doesn’t end once the model is up and running. We strive to continuously improve our models over time, and we have a variety of feedback loops and processes in place to assist us. Stakeholder input, such as whether reviewers agree with the predictions or cases they manually flag that the model should have detected, has already resulted in numerous enhancements to the model, usually in the form of additional features. We also periodically re-examine feature importance to see whether the prediction results pass the “sniff test.” For example, if we see a particularly high percentage of cases with a high probability of dissatisfaction in a certain customer segment, we examine some of the top features, such as case age, to ensure that the inputs tell a consistent story about the increased risk of that outcome.
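A quick sketch of this kind of feature importance check, once more with synthetic data and illustrative feature names, might look like the following:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["case_age_days", "severity", "agent_open_tickets",
                 "was_escalated", "duration_hours", "prior_cases"]

# Synthetic training data standing in for historical cases.
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = (rng.random(500) < 0.2).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by importance and eyeball whether the top drivers make intuitive sense.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```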

Support success results — our ultimate goal

After running this model for some time, we can measure our success by comparing tickets where action was taken based on the model against tickets where no action was taken, and testing whether the difference between their survey score distributions is statistically significant. For cases with actions taken based on the model, we see an improvement of about 10 percent in customer satisfaction survey results.
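One way to run that comparison is sketched below with made-up survey scores; the Mann-Whitney U test is our illustrative choice for ordinal, skewed 1-5 ratings, not necessarily the exact test used.

```python
import numpy as np
from scipy import stats

# Illustrative 1-5 survey scores for cases the model flagged (and we acted on) vs. the rest.
acted_on = np.array([5, 4, 5, 5, 3, 4, 5, 5, 4, 5])
no_action = np.array([4, 3, 5, 4, 2, 4, 3, 5, 4, 3])

# Mann-Whitney U compares the two score distributions without assuming normality.
statistic, p_value = stats.mannwhitneyu(acted_on, no_action, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the survey distributions is statistically significant.")
```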

Next up in Part 2

One way we improve our products and services is by understanding common customer support topics identified through support cases. In Part 2 of this two-part article series, our colleagues Yael Brumer, Sally Mitrofanov, Alexei Robsky, and Ivan Barrientos describe how we apply natural language processing (NLP) techniques to prioritize engineering efforts to overcome problems identified by customers.

To read the next article in this series, check out the following:
