ML and customer support (Part 2): Leveraging topic modeling to identify the top investment areas in support cases

Yael Brumer
Data Science at Microsoft
Jul 13, 2021

By Yael Brumer and Sally Mitrofanov

Support is a critical part of a successful business model because it helps keep customers happy with the service or product they are receiving. It also helps customers feel they have somewhere to turn when things don’t go as expected. In the first article of this two-part series on using Machine Learning to enable world-class customer support, Alexei Robsky and colleagues introduced some of the most important metric-based elements involved in delivering great support, including factors involved in prioritizing customer support calls and in measuring customer support success. The article discussed some of the challenges involved in relying on customer surveys to gauge customer support success and introduced some thinking behind creating an ML model that can be used to help enable customer support. In this article, we describe how we run natural language processing techniques to prioritize engineering efforts to overcome problems identified by customers.

Support is often the only place where an organization can hear the voices of its customers. As a result, gaining useful insights from customer support data can be especially valuable to an organization. Beyond using this data merely to reduce the COGS (cost of goods sold) of delivering support, also using it to understand customers’ pain points on products or offers can lead directly to finding ways to close gaps that cause customer-facing issues.

Finding the data that drives key performance indicators (KPIs) related to support through a manual process is challenging, with natural built-in limits. The infeasibility of manually analyzing a large text dataset to understand and gain meaningful insights points to the need for a scalable topic modeling solution. Our aim in introducing one is to empower feature teams to better align their resources and efforts around developing and prioritizing initiatives with impact, by reducing the need for support in the first place.

The primary goal of our solution is to group or cluster cases by their semantic similarity and then let the product group review only the top N cases from each cluster. In addition, by providing topic trends month over month, our model helps to measure product investments and detect emerging topics for further attention.

We have designed the solution described in this article to help product owners do the following:

  1. Take a more proactive approach to improving the customer experience.
  2. Prioritize and better align feature team resources and efforts to solve issues before they generate support calls.
  3. Track progress and validate whether actions taken by feature teams have helped or not.

There are many possible solutions to the problem described; we discuss two of them below.

Traditional term frequency–inverse document frequency (TF-IDF) with LDA clustering

In our first solution, we concentrate on one large area of specific support cases using Latent Dirichlet Allocation (LDA) for topic extraction. Combined with additional support ticket details, this model is live and already helping some teams prioritize their supportability efforts.

Preprocessing

In the text preprocessing stage, we distill topic information from support tickets by removing metadata while retaining only the core issue. Next, we use a custom Azure-specific text processing package to tag Azure terms. Finally, we stem and lemmatize text, remove stop words, and tokenize the remaining text to be used as input to the LDA model.
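As a minimal sketch of this stage (the real pipeline uses a full stop-word list, stemming/lemmatization, and the Azure-specific tagger mentioned above; the `Metadata:` marker and word list here are illustrative assumptions):

```python
import re

# Tiny illustrative stop-word list; the production pipeline uses a full list
# plus stemming and lemmatization (e.g., via NLTK) and Azure-term tagging.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "on", "my", "i"}

def preprocess(ticket_text: str) -> list[str]:
    # Drop metadata lines (hypothetical "Metadata:" marker) to retain only the core issue.
    lines = [ln for ln in ticket_text.splitlines() if not ln.startswith("Metadata:")]
    # Lowercase and tokenize, keeping alphanumeric terms of two or more characters.
    tokens = re.findall(r"[a-z][a-z0-9_-]+", " ".join(lines).lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Metadata: case=123\nThe VM failed to restart")
# tokens == ["vm", "failed", "restart"]
```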

Topic extraction

LDA assumes that all documents within the corpus belong to a predetermined number of latent topics. It uses probabilities to estimate the topic distribution for each document based on keyword distributions within each topic and document. The topic with the highest probability becomes that document’s dominant topic. Topics themselves are the most representative keywords associated with each cluster.

The quality of LDA topics depends heavily on how thoroughly the text is cleaned and on the number of topics specified (k). Two measures are used to evaluate an LDA model: perplexity and coherence. Perplexity measures the log-likelihood of observing new data given the existing model; we can use it to ensure our results are reproducible. Coherence measures how well the keywords within each topic support one another and belong to the same semantic context. When choosing the number of topics k, we focused on optimizing these two measures using the elbow method. In addition, we asked business stakeholders to evaluate the quality of the output. Figure 1 helps visualize topic output from the LDA model.

Figure 1: Visualizations of output from the LDA model

Outputs of the LDA model include topic/cluster keywords, dominant topic for each document, most representative document (support ticket) per topic, word clouds associated with each topic, and monthly trends of tickets created within each cluster. In addition, our stakeholders wanted to see other ticket metadata to help them narrow down the issue within each topic.
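The choice of k described earlier can be sketched as a sweep over candidate topic counts. Here we use scikit-learn's perplexity for brevity (our pipeline also evaluates coherence, e.g., via Gensim's `CoherenceModel`); the documents are toy examples:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "vm restart failed", "vm node restart error", "billing invoice charge",
    "invoice charge dispute", "login password reset", "password reset failed",
]
X = CountVectorizer().fit_transform(docs)

# Fit one model per candidate k and record perplexity;
# plot k versus perplexity and pick the "elbow".
scores = {}
for k in range(2, 6):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)
```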

Pros of using LDA

Using LDA allowed us to iterate quickly on producing a workable model. Because it’s very easy to implement, we could focus on drilling into a specific business domain to produce the most actionable results for our stakeholders. Additionally, available LDA packages like Gensim allowed us to preserve the model and use it on unseen cases to make tracking improvements and new issues easier.

Cons of using LDA

The downsides of the traditional methods embodied in this approach relate to preprocessing and hyperparameter tuning. It takes time to develop enough domain knowledge to understand how to clean the text properly, and what the appropriate number of topics (k) should be.

Transformer-based models: BERT

During our analysis, we noticed that users may express similar issues using different words. For example, one user might say, “I want to get more familiar with Azure,” while another might say, “I want to learn Azure.” These two users are after the same goal, but more traditional approaches might miss the nuance. To overcome this, we proposed and tried an alternative solution based on transformer models.

Transformer-based models, and more specifically models based on BERT, have shown noteworthy results on various natural-language processing (NLP) tasks over the last few years. They are typically applied via transfer learning, a technique whereby a deep learning model pre-trained on a large dataset is reused to perform similar tasks on another dataset.

Figure 2 shows the high-level end-to-end system:

Figure 2: End-to-end workflow

Text preprocessing

First, we lowercase the support ticket description text and clean it to remove metadata while retaining only the core issue. Next, we remove domain-specific words and feed the cleaned text into the pre-trained BERT model to obtain a feature vector. For this work, we used DistilBERT because it is a lighter and faster version of BERT that roughly matches the larger model’s performance, as explained in Jay Alammar’s post.
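The feature vector can be obtained from the model's last hidden states in more than one way (Alammar's post uses the [CLS] token; mask-aware mean pooling is another common choice). Here is a numpy sketch of the pooling step only, with dummy arrays standing in for DistilBERT output:

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the real (non-padding) token vectors into one document embedding."""
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

# Dummy (seq_len=3, hidden=2) last-layer states; the final token is padding.
states = np.array([[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]])
vector = mean_pool(states, np.array([1, 1, 0]))       # -> [2.0, 2.0]
```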

Clustering

At this point, we have transformed the cleaned text into high-dimensional numeric vectors using a pre-trained embedding model. However, it is recommended to reduce the dimensionality of the embeddings, as many clustering algorithms handle high dimensionality poorly. Dimensionality reduction allows dense clusters of documents to be found more efficiently and accurately in the reduced space.

Among the available dimensionality reduction algorithms, Uniform Manifold Approximation and Projection (UMAP) stands out. UMAP is similar to t-SNE but faster and more general purpose: researchers have found that t-SNE neither preserves global structure as well as UMAP nor scales well to large datasets, whereas UMAP preserves a significant portion of the high-dimensional local structure in lower dimensionality.

The most important parameter to tune is the number of nearest neighbors, which controls the balance between preserving global versus local structure in the low-dimensional embedding: larger values put more emphasis on global structure. Because our goal is to find dense groups of documents that lie close to each other in the high-dimensional space, local structure matters more in this application. We experimented with different values and found that setting this parameter to 15 gives the best results in terms of silhouette score. A related parameter is the distance metric used between points in the high-dimensional space; the most common choice for document vectors is cosine similarity, because it measures the likeness of documents regardless of their magnitudes. Finally, the embedding dimension must be chosen; we found that five dimensions gives the best results for the downstream task of density-based clustering.

Having reduced the support case embeddings to five dimensions, we can cluster the documents with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). This density-based algorithm pairs well with UMAP, because UMAP maintains much of the local structure even in the lower-dimensional space; it is also fast and robust to noise. Unlike k-means, where every data point must belong to a cluster and the number of clusters k must be specified up front (and it is frequently not possible to choose a good value for k), HDBSCAN does not force data points into clusters: it allows outliers, which is meaningful when, as in our case, the data contains actual noise. The main hyperparameter to choose for HDBSCAN is the minimum cluster size, the smallest group of points the algorithm will consider a cluster. We found that a minimum cluster size of 15 provides the best results in our experiments, as larger values have a higher chance of merging unrelated document clusters.

Topic generation

We leverage an existing method called class-based TF-IDF (c-TF-IDF) to generate the topic representations. Traditional term frequency–inverse document frequency (TF-IDF) generates features from textual documents by multiplying two measures: term frequency (TF) and inverse document frequency (IDF). c-TF-IDF is very similar, except that we first join all the documents belonging to the same class into one very long document, which lets us look at TF-IDF from a class-based perspective. Because the documents are merged, c-TF-IDF operates on the number of classes instead of the number of documents. Let tᵢ be the frequency of a word t in class i; we divide this frequency by wᵢ, the total number of words in that class. We then multiply by the logarithm of the total number of unjoined documents m divided by the total frequency of word t across all n classes, yielding the following equation:

Equation 1: c-TF-IDFᵢ = (tᵢ / wᵢ) × log(m / Σⱼ₌₁ⁿ tⱼ)
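The c-TF-IDF score can be computed directly from per-class word counts; here is a minimal pure-Python sketch (function and variable names are ours):

```python
import math
from collections import Counter

def c_tf_idf(docs_by_class):
    """Score each word per class as (t_i / w_i) * log(m / sum_j t_j)."""
    m = sum(len(docs) for docs in docs_by_class.values())    # total unjoined documents
    # Join each class's documents into one long document and count its words.
    counts = {c: Counter(" ".join(docs).split()) for c, docs in docs_by_class.items()}
    totals = Counter()                                       # frequency of each word across all classes
    for cnt in counts.values():
        totals.update(cnt)
    scores = {}
    for c, cnt in counts.items():
        w_c = sum(cnt.values())                              # total words in class c
        scores[c] = {t: (f / w_c) * math.log(m / totals[t]) for t, f in cnt.items()}
    return scores

scores = c_tf_idf({"compute": ["vm restart", "vm failure"], "billing": ["invoice charge"]})
# "vm" appears twice among the 4 words of class "compute" and nowhere else,
# so its score there is (2/4) * log(3/2).
```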

The model in production

Various product and service teams among our stakeholders are reviewing the recommendations for the clusters and enabling engineering teams to act upon the recommendations. As a result of our models, the product teams are able to prioritize bugs and issue fixes that map to customer priorities. For example, one of the topics that came up involved improving documentation around properly deallocating a service to avoid future charges. Additionally, our approach identified several other issues to improve the customer experience, and we are now working with various service and marketing teams on those issues. We will then validate that the changes are solving customer problems.

Below is a screenshot of the cluster prioritization dashboard we’ve built to productize the model. Clusters are prioritized according to monthly support request volume and trends within each cluster. Another page provides a more detailed look, along with metrics relating to issues and customers.

Pros and cons: LDA versus transformer-based models with HDBSCAN

In this section, we compare the two approaches outlined above. Here is a table of how we have found each approach to perform in relation to the criteria specified.

LDA uses bag-of-words (BOW) representations of documents as input, which ignore word semantics. LDA proved to be a good starting point, but it takes considerable hyperparameter tuning effort to create meaningful topics.

Also, in terms of the clustering algorithm, HDBSCAN does not force data points into clusters; it treats points that do not fit any cluster as outliers, which is important in our scenario because the data contains noise. The HDBSCAN library additionally supports the GLOSH outlier detection algorithm, which can flag points that are noticeably different from those in their local regions. In k-means, by contrast, every data point must belong to a cluster, and the number of clusters must be specified up front, even though it is frequently not possible to choose a good value for k. The main hyperparameter for HDBSCAN is the minimum cluster size, the smallest group of points the algorithm will consider a cluster. As discussed earlier, we found that a minimum cluster size of 15 provides the best results in our experiments, as larger values have a higher chance of merging unrelated document clusters.

Summary

In this article, we have demonstrated how we generate meaningful insights from a business problem in the support area. Product groups can use these insights to improve the customer experience and fix issues that arise in the product. Part of this project’s success stems from us working closely with our product teams. Bringing product teams into the process from the first stages is essential, as they provide ongoing feedback that helps us consistently improve model performance.

From the standpoint of the model, we have found it helpful to start with more traditional approaches, such as LDA, which allowed us to iterate quickly with stakeholders and produce a workable solution with direct business impact. When we tried to scale out our approach and adapt the model to other product teams, however, we ran into limitations, and transformer-based models such as BERT proved helpful there.

LDA has been a good starting point for us, but it takes quite some effort in hyperparameter tuning to create meaningful topics. We have found that removing stop words, lemmatizing, stemming, and knowing the number of topics a priori are not required when we leverage a transformer-based model (e.g., BERT) with the HDBSCAN clustering algorithm, which saved us significant time.

The authors would like to thank Alexei Robsky, who also contributed to this article, and Ivan Barrientos, who reviewed it and provided helpful feedback.

To read the first article in this series, check out the following:
