Natural Language Processing

Automatic Ticket Tagging with NLP Text Classification

Let’s do some linear NLP on our company’s IT Support tickets.

Julián Gutiérrez Ostrovsky
Hexacta Engineering

--

Disclaimer: If you’re thinking we want to replace IT Support people with an algorithm, then no. We really don’t. We mean it.

Introduction

The IT Support area in our company receives dozens of tickets every day, covering all kinds of problems employees run into. Tickets are filled in by employees, mainly with a title plus an issue description, and there’s no control over those texts. IT people then manually categorize each ticket based on its text before proceeding to solve it. There are two different categories to predict:

  • Ticket Type (binary classification: Requirement or Incident)
  • Ticket Component (multiclass: 8 different values)

Our goal, then, is to use NLP techniques to perform text transformations and turn this task into a regular ML classification problem, so we can predict these categories automatically.

We will train two specialized models to achieve this.

Digging into tickets (EDA to know where we are)

Our dataset is in Spanish. Well, mainly in Spanish: there are a lot of technical words from this company’s business, which is software development, so we have a lot of words in English too.

We can also find some naughty tickets like this one, which includes HTML tags:

Title: Habilitar MSDTC

Description:

<div id=lingualy-logged-in style="display:none;"></div>
<p class=MsoNormal><span lang=ES-TRAD>Estoy haciend<span style="color:#1F497D;">o</span> un proceso que labura con dosBDs y una Transaccion. Tengo un error porque necesita el servicio DefaultTransaction Coordinator (MSDTC).<span style="color:#1F497D;"> </span></span></p>
<p class=MsoNormal><span lang=ES-TRAD style="color:#1F497D;"> </span></p>
<p class=MsoNormal><span lang=ES-TRAD>Estoytrabado con esto, gracias de antemano!</span></p>
<p class=MsoNormal><span lang=ES-TRAD><br></span></p>
<p class=MsoNormal><span lang=ES-TRAD>Error: </span></p>
<p class=MsoNormal><span lang=ES-TRAD><br></span></p>
<p class=MsoNormal><span lang=ES-TRAD>Network access for Distributed Transaction Manager (MSDTC) has been disabled. Please enable DTC for network access in the security configuration for MSDTC using the Component Services Administrative tool.&quot;}<br></span></p>
<div id=lingualy-logged-in style="display:none;"></div>

Ok, what is this??

We’ll need to remove these tags, and also HTML entities such as &nbsp;, &quot;, etc.
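
Something along these lines does the trick for the cleaning; a minimal sketch (the actual pipeline step shown later may be more thorough):

import re

def clean_html(text):
    # Drop tags like <div ...> and </span>.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Drop common HTML entities.
    text = re.sub(r'&(nbsp|quot|amp|lt|gt);', ' ', text)
    # Collapse leftover whitespace.
    return re.sub(r'\s+', ' ', text).strip()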

Targets Balance

Balance for classes in both targets

We have another thing to deal with: the data is highly imbalanced, so no matter how we process the text, we will probably have trouble training this model. The dataset is not big either: we have nearly 1,800 rows in total, and, for example, well below 1% of the cases belong to the target “No Conformidad Compra / Garantia”. We will evaluate some over/under-sampling techniques to try to balance this a little.
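
A quick way to see this imbalance; a sketch, assuming the raw tickets live in a pandas DataFrame with hypothetical 'type' and 'component' columns:

import pandas as pd

df = pd.read_csv('tickets.csv')  # hypothetical file name
print(df['type'].value_counts(normalize=True))       # binary target
print(df['component'].value_counts(normalize=True))  # 8-class target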

Getting to the root

The first step towards training a classifier with machine learning is to transform each text into a numerical representation in the form of a vector. To do that, we must first keep only the words relevant to our model. Then, it would be nice to keep just the root of each word, so that “I broke my PC” and “My PC is broken” both reduce to “break PC”. Here “break” is the root word; we want both cases to look the same.

from sklearn.pipeline import Pipeline

# Preprocessing for Type: clean HTML, lowercase, drop stopwords, lemmatize.
type_pipeline_preprocess = Pipeline([
    ('clean', htmlCleanerPipeStep()),
    ('lower', lowerCasesPipeStep()),
    ('stopwords', stopWordsPipeStep('utils/stopwords.txt')),
    ('lemmatize', lemmatizePipeStep())
])

# Preprocessing for Component: same steps, but stemming instead of lemmatizing.
component_pipeline_preprocess = Pipeline([
    ('clean', htmlCleanerPipeStep()),
    ('lower', lowerCasesPipeStep()),
    ('stopwords', stopWordsPipeStep('utils/stopwords.txt')),
    ('stemmize', stemmizePipeStep())
])

These are our preprocessing pipelines. First, we created some straightforward regexes to remove HTML tags and other special characters. Then we convert every word to lowercase and filter out words that are not relevant to our model, such as connectors, prepositions, and business-specific words that don’t help determine a ticket’s category.

In the last step, for token normalization, we chose lemmatization in one case and stemming in the other. Why is that? Basically, with lemmatization you get the actual root of a word, i.e. its dictionary base form, even for inflected forms. For example, the lemma of both “studies” and “studying” is “study”.

With stemming you just get a root prefix for every word, and those prefixes are not necessarily real words in the language. For example, the stem of “studies” is “studi”, while the stem of “studying” is “study”. For other words the stem may coincide with the actual root.

In our case it just worked better this way: since Component is a little harder to predict, stemming gave us more usable information about each word. We recommend you try both.
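
To see the difference on Spanish text, here is a minimal sketch comparing both normalizations (it assumes NLTK and spaCy with the es_core_news_sm model; not necessarily the exact libraries behind our pipe steps):

from nltk.stem import SnowballStemmer
import spacy

stemmer = SnowballStemmer('spanish')
nlp = spacy.load('es_core_news_sm')

words = ['estudiando', 'estudios', 'impresora', 'rota']
print([stemmer.stem(w) for w in words])              # root prefixes, not always real words
print([tok.lemma_ for tok in nlp(' '.join(words))])  # dictionary base forms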

Unprocessed words on the left and the different pipes applied on the right

So far we have applied these transformations; now we have to vectorize our words to obtain a numerical matrix to train on.

Vectorization

By doing this, we transform all of our processed sentences into numerical vectors representing the original words. We used two vectorization techniques, chosen based on our domain: Word2vec for Type prediction and Tf-Idf for Component, and we added each one to its pipeline. With this configuration we got the best results. Nevertheless, we could have done a bit more fine-tuning of word2vec to achieve better vector similarity; we note that as future work. We advise you to try both and choose the one that suits your model better.
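
For reference, here is a rough sketch of the word2vec side: embedding a sentence by averaging its word vectors with gensim. Parameter values are illustrative, and processed_docs is a hypothetical list of cleaned ticket texts; a step like word2vecPipeStep could wrap logic of this kind.

import numpy as np
from gensim.models import Word2Vec

# processed_docs: hypothetical list of already-preprocessed ticket strings.
tokenized = [doc.split() for doc in processed_docs]
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=2, seed=42)

def embed(tokens, model):
    # Average the vectors of in-vocabulary tokens; zeros if none match.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([embed(toks, w2v) for toks in tokenized])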

Training Time

Right now we have a “simple” classification problem: a matrix with a target column to train on and predict. We follow a straightforward approach in which we choose some candidate classifiers from the scikit-learn ecosystem that tend to work well for NLP, such as LinearSVC, SGDClassifier, LogisticRegression, RandomForestClassifier, AdaBoostClassifier, LGBMClassifier, and MLPClassifier. We give all these classifiers a quick training run, get their scores, and choose the best. Then we performed a GridSearch to help us find better parameters, and fine-tuned by hand until we got the best results.
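
A condensed sketch of that screening step (the candidate subset and parameter grid here are illustrative, and X_train_transform stands for the vectorized training matrix produced by the pipelines below):

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier

candidates = [LinearSVC(), SGDClassifier(), LogisticRegression(max_iter=1000),
              RandomForestClassifier()]
for clf in candidates:
    # Quick 5-fold screening with a score that respects class imbalance.
    scores = cross_val_score(clf, X_train_transform, y_train, cv=5,
                             scoring='f1_weighted')
    print(type(clf).__name__, round(scores.mean(), 3))

# Then grid-search the most promising candidate, for example:
params = {'alpha': [1e-4, 5e-4, 1e-3], 'loss': ['hinge', 'modified_huber']}
grid = GridSearchCV(SGDClassifier(random_state=42), params, cv=5)
grid.fit(X_train_transform, y_train)
print(grid.best_params_)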

If you need more details on how to do this, you should read Choosing between ML models using pipes for code reuse.

This is our final model for Type prediction:

type_pipeline_fe = Pipeline([
    ('preprocess', type_pipeline_preprocess),
    ('word2vec', word2vecPipeStep())
])
type_pipeline_fe.fit(X_train, y_train)
X_train_transform = type_pipeline_fe.transform(X_train)
y_train_transform = y_train

from sklearn.neural_network import MLPClassifier

clf_t_nn = MLPClassifier(
    hidden_layer_sizes=(100, 75),
    activation='relu',
    random_state=42,
    tol=0.001,
    alpha=1.3,               # strong L2 regularization against overfitting
    early_stopping=True,     # stop when the validation score stops improving
    n_iter_no_change=20,
    validation_fraction=0.1,
    verbose=False,
    warm_start=False
)
_ = clf_t_nn.fit(X_train_transform, y_train_transform)

We got the best scores using this network for Type prediction. The runner-up was Logistic Regression, which is faster, but it overfit our training data.

For Component Prediction:

from sklearn import preprocessing
from sklearn.linear_model import SGDClassifier

component_pipeline_fe = Pipeline([
    ('preprocess', component_pipeline_preprocess),
    ('tfidf', tfIdfVectorizerPipeStep({
        'stop_words': utils.get_stopwords(),
        'strip_accents': 'unicode',
        'use_idf': True,
        'ngram_range': (1, 3)   # unigrams, bigrams and trigrams
    }))
])
component_pipeline_fe.fit(X_train, y_train)
X_train_transform = component_pipeline_fe.transform(X_train)

# Encode the string labels as integers.
le = preprocessing.LabelEncoder()
y_train_transform = le.fit_transform(y_train)

clf_c = SGDClassifier()
clf_c.set_params(alpha=0.0005, learning_rate='optimal', penalty='l2',
                 random_state=42, tol=0.0005, loss='modified_huber')
_ = clf_c.fit(X_train_transform, y_train_transform)  # train on the encoded labels

Here are the results:

Component Model is Overfitted

This closes the full training cycle, and we are ready to start iterating over and over, trying to improve our model. From here on we did several things that aren’t worth detailing, because they are tightly coupled to our problem. Nevertheless, we leave some ideas below:

  • If possible, get more real data, at least for the minority classes, to balance the proportions a little.
  • Try oversampling the minority classes on X_train (see the sketch after this list). Don’t push it, or your model will quickly overfit (as ours already does for Component) and won’t generalize. Also, prefer random resampling of existing rows over trying to create new synthetic cases with, for example, SVMSMOTE; that’s generally not a good idea for NLP, since the generated cases have no semantic meaning at all.
  • Try downsampling the majority classes, with the same considerations as above.
  • For multiclass prediction, we tried partial classification, i.e. merging the minority classes and then predicting among those minority classes separately.
  • Go back to the preprocess pipeline and do it better. This was really helpful for us!
  • Take several minutes to look at your results and plot them in a confusion matrix, so you can visualize exactly where your model is making mistakes.
  • DON’T OVERFIT!
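
A minimal sketch of two of these ideas together, assuming the imbalanced-learn package is available and that X_test_transform / y_test_transform are hypothetical held-out transforms built with the same fitted pipeline and label encoder:

import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import ConfusionMatrixDisplay

# Randomly duplicate minority-class rows, on the training set only.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train_transform, y_train_transform)
clf_c.fit(X_res, y_res)

# Plot exactly where the model confuses one component with another.
ConfusionMatrixDisplay.from_estimator(clf_c, X_test_transform, y_test_transform)
plt.show()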

It was always about the data

After trying all of these, and maybe more, in every combination you could imagine, we reached the following results:

Type Prediction target score and confusion matrix
Component prediction target score and confusion matrix

Great! The score is really better after trying these ideas. But we are still not quite convinced. Look at the yellow squares in the second picture. First, there isn’t much support for those classes, but we knew that. Then the following question arises: why is the model confusing Recambio de PC (PC change) with Compras (acquisitions) so badly? And the same goes for Administracion de Servidores y Herramientas (server management) and Alta y Baja de Usuarios (user management).

“Data preparation accounts for about 80% of the work of data scientists”

Searching through those cases, we can see some inconsistency: tickets whose title and description look pretty much the same, even to human eyes, carry different tags. The picture above reflects that too, in a way. Many wrongly predicted cases (nearly 15%) come with over 0.8 confidence, and the mean confidence for wrong predictions is 0.609. That’s huge.
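
Those confidence figures come straight from the predicted probabilities; a sketch, with the same hypothetical test transforms as above (loss='modified_huber' is what makes SGDClassifier expose predict_proba):

import numpy as np

proba = clf_c.predict_proba(X_test_transform)
conf = proba.max(axis=1)                     # confidence of each prediction
pred = clf_c.classes_[proba.argmax(axis=1)]
wrong = pred != y_test_transform

print('mean confidence when wrong:', conf[wrong].mean())
print('share of wrong cases above 0.8:', (conf[wrong] > 0.8).mean())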

At this point, every decision about changing the data should be made together with an IT manager, because newly arriving tickets will probably have similar issues.

In conclusion, we surveyed several NLP and machine learning techniques to achieve our goal. We performed many iterations, cleaning the data and evaluating tools to better process our text, and did the same for the classifier model. We got some really good results for the Type target, and for some of the Component classes, despite the imbalance problems.

Our model will never perform better than the quality of our dataset.

Feel free to surf through the whole code. Here’s the link.

Thanks to Jonathan Loscalzo, who investigated these tools with me, and to Nico Gallinal for helping us with this post’s revision and with project coordination.

Different Approaches

This is of course not the only road to NLP. There are other paths to take; here are some of them:

  • AutoML with H2O; perhaps it would get better results. For instance, check this tutorial.
  • Try a contextual model such as BERT or ELMo, to compare its advantages and disadvantages against this non-contextual approach.
  • Build a cloud-based NLP module using Azure.


Julián Gutiérrez Ostrovsky
Hexacta Engineering

Developer. Computer Science Student. Passion for knowledge. Love for Music.