AI-Powered Jira Automation: Categorize Customer Support Tickets with Cluster Analysis

Tina Chenska
5 min read · May 3, 2024


Jira offers numerous options for ticket automation, most of them relying on straightforward JQL searches for categorization. However, once we faced a large volume of tickets coming in from various external sources, I realized that this traditional approach to categorization left our ticket collector a mess.

Usually, if we ask customers to categorize tickets themselves by filling out detailed forms, it leads to many labeling mistakes; in our experience, support agents had to re-label about 80% of tickets manually.

This article explores a more effective way to streamline this process, using a very simple solution that doesn’t require installing any sophisticated Jira extension or other third-party tool.

Who would find this article interesting: anyone involved in managing user feedback, especially those working with Jira projects, or those interested in the practical application of machine learning to enhance customer support efficiency.

What you need: Jira, a Python IDE or Jupyter Notebook (which can be hosted on Google Colaboratory).

First, you’ll need to prepare a set of tickets. You can export them in various formats from the top menu of your Jira instance. Personally, I prefer CSV, but choose whichever format suits you best.

Then start your Jupyter IDE and set up the imports. We’ll need a few libraries: NLTK for language processing, pandas for data manipulation, and scikit-learn for the machine learning magic.

from google.colab import files
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenizer models and the stopword list used during preprocessing
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')
nltk.download('stopwords')

# Opens a file picker in Colab; select the CSV you exported from Jira
uploaded = files.upload()

Convert the uploaded treasure map (CSV data) into a readable format (DataFrame):

csv_filename = list(uploaded.keys())[0]
data = pd.read_csv(csv_filename)
data.head()

Now, you will have access to all the fields from your Jira projects, though you may only need a few of them. In our case, the ‘Summary’ and ‘Description’ fields are the most critical. Let’s prepare our data for future processing:

  • Utilize only the ‘Summary’ and ‘Description’ fields.
  • Normalize the text by removing common English stopwords (such as “the”, “is”, “in”, etc.) because they carry minimal meaningful information for analysis.
  • Apply stemming to reduce words to their base or root form (for instance, converting “running” to “run”).
  • Convert all text to lowercase.

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    if isinstance(text, str):
        words = nltk.word_tokenize(text)
        # Cut through the underbrush (tokenize, stem, and remove stop words)
        words = [stemmer.stem(word.lower()) for word in words if word.isalnum() and word.lower() not in stop_words]
        return ' '.join(words)
    else:
        return ''  # In the absence of whispers, remain silent ('')

data['preprocessed_text1'] = data['Summary'].apply(preprocess_text)
data['preprocessed_text2'] = data['Description'].apply(preprocess_text)
data['combined_text'] = data['preprocessed_text1'] + ' ' + data['preprocessed_text2']

Now that we have the combined, preprocessed text, we can prepare for clustering.

TF-IDF transforms text documents into a matrix of TF-IDF features. In this matrix, each row corresponds to a document, and each column represents a specific word from our entire collection of text (corpus). For our purposes, we’ll cap the vocabulary at the 1,000 most frequent terms across the corpus (this is what scikit-learn’s max_features parameter does). This effectively reduces the dimensionality of our feature space, streamlining the computation and making it more feasible, particularly when handling extensive text datasets.

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['combined_text'])
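As a quick sanity check, you can peek at what the vectorizer kept. A small sketch (get_feature_names_out is the scikit-learn 1.0+ name; older versions use get_feature_names):

print(tfidf_matrix.shape)  # (number of tickets, up to 1000 features)
print(tfidf_vectorizer.get_feature_names_out()[:20])  # a sample of the stemmed vocabulary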

In our test, we chose to organize the data into 4 clusters. You can experiment with a different number; the best choice usually depends on your data. We assumed that all user problems can be categorized into 4 big groups, a decision grounded in domain expertise.
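If you would rather let the data suggest a number, a common sanity check is the silhouette score. A minimal sketch (the candidate range 2–8 is just an assumption; adjust it for your dataset):

from sklearn.metrics import silhouette_score

# Higher silhouette means tighter, better-separated clusters
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(tfidf_matrix)
    print(k, round(silhouette_score(tfidf_matrix, labels), 3))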

The outcome is an array of cluster labels, one per document. To integrate these findings into our dataset, we add a ‘cluster’ column to the DataFrame, tagging each document with the label assigned by the K-Means algorithm. This step segments the documents into 4 groups, each defined by its own TF-IDF profile, bringing structure and insight to our dataset.

num_clusters = 4 # Decide on the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
clusters = kmeans.fit_predict(tfidf_matrix)
data['cluster'] = clusters
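Before reading individual tickets, a quick look at the cluster sizes helps catch a degenerate split (for example, one giant cluster plus three tiny ones):

print(data['cluster'].value_counts())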

Once the tickets are grouped into clusters, it’s time to delve deeper into each category and decipher the narratives they hold. As a first step, we can manually review the content of the documents within each cluster. This helps us grasp the essence of each cluster, shedding light on the shared themes, topics, or patterns that bind the documents in a group, and gives a clearer understanding of the dataset’s dynamics.

for cluster_num in range(num_clusters):
    print(f"Cluster {cluster_num}:")
    cluster_docs = data[data['cluster'] == cluster_num]['combined_text']
    for doc in cluster_docs:
        print(doc)
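Reading raw tickets works for a small dataset, but a faster way to name each cluster is to look at the terms closest to each centroid. A short sketch using the vectorizer and model fitted above:

terms = tfidf_vectorizer.get_feature_names_out()
# Sort each centroid's feature weights from highest to lowest
order = kmeans.cluster_centers_.argsort()[:, ::-1]
for cluster_num in range(num_clusters):
    top_terms = [terms[i] for i in order[cluster_num, :10]]
    print(f"Cluster {cluster_num}: {', '.join(top_terms)}")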

You can also visualize the clustering results to better understand the distribution and characteristics of each cluster. A popular approach is to use dimensionality reduction techniques to project the data onto a 2D or 3D space, which makes it possible to plot the data and see how the clusters form. For example, you can use PCA or t-SNE:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
# t-SNE needs a dense array, so convert the sparse TF-IDF matrix
reduced_features_tsne = tsne.fit_transform(tfidf_matrix.toarray())

plt.figure(figsize=(10, 7))
plt.scatter(reduced_features_tsne[:, 0], reduced_features_tsne[:, 1], c=clusters, cmap='viridis', s=50, alpha=0.6)
plt.title('K-Means Clustering Visualization with t-SNE')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
# No need to plot cluster centers for t-SNE due to the nature of the algorithm
plt.show()
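For comparison, here is the same plot with PCA; it is deterministic and much faster than t-SNE, though it usually separates text clusters less crisply:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_features_pca = pca.fit_transform(tfidf_matrix.toarray())  # PCA also needs a dense array

plt.figure(figsize=(10, 7))
plt.scatter(reduced_features_pca[:, 0], reduced_features_pca[:, 1], c=clusters, cmap='viridis', s=50, alpha=0.6)
plt.title('K-Means Clustering Visualization with PCA')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()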

Of course, K-Means is a great starting point. I recommend trying other clustering methods to explore how they categorize the tickets in your Jira projects; accuracy may differ, and you may find other interesting clusters to use as labels.
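As one example, agglomerative (hierarchical) clustering often yields more balanced groups on short texts. A minimal sketch; note that it requires a dense matrix, so it suits datasets up to a few thousand tickets:

from sklearn.cluster import AgglomerativeClustering

agglo = AgglomerativeClustering(n_clusters=num_clusters)
data['cluster_agglo'] = agglo.fit_predict(tfidf_matrix.toarray())
print(data['cluster_agglo'].value_counts())  # compare against the K-Means split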


Tina Chenska

Bug hunter by day. QA Engineer passionate about clear communication and building high-quality products.