How I Used Machine Learning to Do the Most Boring Data-Tagging Job

The tale of completing a 22-hour job in 9 hours

Yogesh Kothiya
9 min read · Apr 11, 2019

A few days ago I got the opportunity to work on a task which might be considered “boring” by most software engineers and yet very challenging by most NLP engineers: Data labeling.

The dataset in question consisted of real-world chat conversations between retail consumers and a consumer-facing bot. Real-world chat conversations are noisy, unstructured, and often misclassified. This had to change, because an ML model is only as good as the data it is trained on. The task was to clean this raw, unstructured data, separate signal from noise, and re-tag the misclassified intents appropriately. Every consumer-facing enterprise that uses chatbots has to deal with this problem. Businesses can either hire a domain expert or crowdsource the work on Mechanical Turk, but sharing data outside the organization can violate data-privacy laws, and the whole process can be quite expensive. After cleaning the data to an acceptable standard, the goal was to retrain the bot's underlying ML model to improve its accuracy and make it more "intelligent".

One way to achieve this was to have a domain expert manually sift through the dataset and correct it. Given the time and effort involved, however, it is clear that this is not the ideal way to approach the problem.


Typically, this process comprises the following steps.

Pipeline

Note: You should be familiar with NLP data preprocessing, word embeddings, topic modeling, and clustering. For those unfamiliar with these, I have linked a reference for a quick refresher.

A quick look at the data:

First, we start with basic data analysis, looking at the size and other high-level features of the dataset. You can find the code here.

Note: You can skip this step and jump directly to message clustering, but it's crucial to explore the data first because it provides the context needed to develop an appropriate model and to interpret the results correctly.

  • The dataset is small: just 2 columns and 141 training examples.
  • The two columns are Message and Intent.

The Message column contains the chat utterances, which serve as the ‘features’ for an ML model.
The Intent column holds the class labels, which serve as the ‘target variable’ for tagging.
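This first look can be sketched with pandas. The tiny inline frame below is a stand-in for the real 141-row dataset (the actual loading code is in the linked notebook); in practice you would read the chat-log export with pd.read_csv.

```python
import pandas as pd

# Stand-in for the real dataset; in practice, load the chat-log export
# with pd.read_csv and inspect it the same way.
df = pd.DataFrame({
    "Message": ["where is my order", "cancel my order please", "i need a refund"],
    "Intent":  ["order_status", "cancel_order", "refund"],
})

print(df.shape)                     # number of rows and columns
print(df["Intent"].value_counts())  # class distribution of the target
```

`value_counts()` on the target column is a quick way to spot class imbalance before any modeling.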

Note: Here is what the data looks like once you cluster it using the approach discussed in this article, along with some manual tagging. The intent column is provided so that you can try a different approach and compare your results against the expected ones.

Data Cleaning:

Data cleaning is usually required for text because real-world chat conversations are full of slang, shortcuts, emojis, etc. The text is cleaned by removing stop words, expanding shortened words, and so on. Further preprocessing techniques can be applied to turn raw data into a clean dataset, depending on the type of data.
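A minimal cleaning sketch along these lines. The stop-word set and contraction map below are tiny illustrative stand-ins; in practice you would use a full list, e.g. from NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word set and contraction map; substitute a full
# list (NLTK, spaCy) for real data.
STOP_WORDS = {"the", "a", "an", "is", "to", "my", "i", "am", "please"}
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "won't": "will not"}

def clean(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, emojis
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(clean("I'm trying to cancel my order!!! 😤"))  # → trying cancel order
```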

Data Exploration:

After data cleaning, let's look at what users are saying to the bot. This gives more information about the types of topics, i.e., the intents.

We will use TF-IDF, an information-retrieval technique widely used by search engines such as Google and Bing to score how relevant a keyword is across a set of documents. Check this for more information.

You can sort words by their IDF score or create a word cloud for better data visualization.

Word cloud

We see that users talk most about order, delivery, item, return, change, refund, cancel, etc., but we are not sure in what context topics like order, delivery, and item are being discussed, since TF-IDF does not account for word position or context when scoring words.

We can visualize the word corpus using word embeddings, which are pre-trained with a neural network on nearby context words. Pre-trained GloVe embeddings are used for this experiment.

For more information about word embeddings, you can check here.

Note: You can use any embedding with the highest word coverage for your data, or train your own if you have enough data. Note that you will get different word-embedding representations depending on the hyperparameters and the pre-trained word vectors you choose.

For simplicity, I have considered only the important words from the TF-IDF list and grouped them into clusters based on their proximity. I have jotted down my observations below; feel free to use your own imagination.
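The mechanics of such a 2-D plot are sketched below. Random 100-d vectors stand in for the pre-trained GloVe vectors (which would normally be loaded, e.g. via gensim's glove-wiki-gigaword-100 download) so the example runs offline; PCA then projects them to two dimensions for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 100-d vectors stand in for pre-trained GloVe vectors so this
# sketch runs without downloading an embedding file.
rng = np.random.default_rng(0)
words = ["order", "find", "cancel", "refund", "email", "address"]
vectors = np.stack([rng.normal(size=100) for _ in words])

# Project the high-dimensional vectors down to 2-D for plotting.
coords = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f"{word:8s} ({x:+.2f}, {y:+.2f})")
```

With real GloVe vectors, semantically related words (e.g. email and address) would land near each other in the projection; with random stand-ins the positions are meaningless.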

2-D visualization of a corpus

Cluster 1:
You can see that order lies near find; order & find might imply finding or looking up a specific order.

Similarly,
order & add & item: Adding item to an existing order
order & update/change: Updating or changing specific order
order & return: Returning order
order & due: Order is due

Cluster 2:
cancel & refund & receipt: User wants to cancel the order and get a refund

Cluster 3:
email & address, telephone & number, service: email address and telephone number to contact customer service.

Let me know what you think about clusters 4–7 in the comments.

You can play with the TF-IDF parameters (max_df, min_df, ngram_range) and try different word embeddings that suit your data.

Do you think you could have got similar information by manually going through each and every user message? (How frustrating 😤)

In the data exploration phase, we cleaned and explored the data using techniques like TF-IDF and word embeddings, which helped trim very high- and low-frequency words and visualize the word corpus with pre-trained vectors. We now have a good sense of the conversation topics. Next, we will convert the text into vectors and cluster them using a popular ML algorithm.

Feature engineering:

The next step is to convert the text into vectors, as required by clustering algorithms.

You can convert text into vectors using word frequency, TF-IDF, or Doc2Vec and feed the result to the algorithm, but here the topic-per-document matrix from LDA gave me the best results, for the following reasons. Consider the text corpus below:

  1. TF-IDF usefully penalizes very frequent and very rare words, but it creates a very sparse matrix, as shown in image (a), which does not work well for clustering. See the Stack Overflow discussions on clustering sparse data here and here.
  2. Because word embeddings are trained with a neural network, they depend on hyperparameters such as window size, number of iterations, and the dimensionality of the feature vectors, which may not produce the right input features given the probabilistic nature of training. They also do not consider the other messages in the corpus while embedding, though they improve on TF-IDF by producing dense vectors. In our dataset, sentences 2, 3, and 7 are similar and their word vectors lie in the same region of the vector space, but the word brown falls far from the highlighted circle in image (c); words like king's and today also lie near dog, which might affect clustering in the next step.
  3. Topic modeling, by contrast, extracts topics from the documents, and these work very well as input features: it not only penalizes very frequent and rare words but also considers the other documents in the corpus. Topic 0 is built from the same set of messages (2, 3, and 7), as shown in image (b), making it a better feature for the clustering algorithm than word vectors or TF-IDF.
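A sketch of extracting the topic-per-document matrix with scikit-learn's LatentDirichletAllocation; the corpus and the choice of two topics are illustrative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus with two rough themes: cancellations/refunds vs. delivery tracking.
messages = [
    "cancel order refund money",
    "cancel order refund",
    "track order delivery status",
    "track delivery status",
]

# LDA works on raw word counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(messages)

# n_components is the number of topics, a hyperparameter tuned to your data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # dense topic-per-document matrix

print(doc_topics.shape)  # one row per message, one column per topic
```

Each row of doc_topics is a dense topic distribution that sums to 1, which is exactly the kind of low-dimensional, dense feature the clustering step wants.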

I used a small hack to remove noise from the data. The messages contain unique numbers and alphanumeric strings that will not be covered by the word embeddings; worse, each such token is treated as a separate feature that is irrelevant to the algorithm.

Let's identify them and replace each pattern with a unique token.

The list containing numbers and alphanumeric characters.

We identified 6-digit numbers, 6-digit numbers followed by a period, dates, times, etc.

We replace them with unique tokens: 6-digit numbers become number_1, numbers with a period become number_period, and the rest are removed.
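A possible implementation of this replacement. It processes the text token by token so the inserted placeholder names are not re-matched later; the exact patterns should be adapted to the numeric noise in your own logs.

```python
import re

def normalize_tokens(text):
    # Placeholder names follow the article's scheme; the patterns are
    # illustrative and should be adapted to your data.
    out = []
    for tok in text.split():
        if re.fullmatch(r"\d{6}", tok):
            out.append("number_1")        # 6-digit numbers (e.g. order ids)
        elif re.fullmatch(r"\d{6}\.\d+", tok):
            out.append("number_period")   # 6-digit number with a period
        elif re.search(r"\d", tok):
            continue                      # drop other numeric/alphanumeric noise
        else:
            out.append(tok)
    return " ".join(out)

print(normalize_tokens("order 123456 charged 123456.99 ref ab12cd"))
# → order number_1 charged number_period ref
```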

Message clustering:

Now let’s get to what we are here for: data tagging.

K-means is used for clustering because it is probably the best-known algorithm and is very easy to implement in Python. The optimal number of clusters is chosen via silhouette analysis. There are plenty of other clustering algorithms out there; check this amazing article and try a few. Feel free to share your results in the comments.
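A sketch of choosing k by silhouette analysis. Synthetic data with three loose groups stands in for the LDA topic-per-document matrix so the example is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the LDA topic-per-document matrix: three loose
# groups centred on different topic axes.
rng = np.random.default_rng(42)
doc_topics = np.vstack([rng.normal(0.0, 0.1, (20, 5)) + centre
                        for centre in np.eye(5)[:3]])

# The silhouette score is higher when clusters are tight and well
# separated; pick the k that maximises it.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_topics)
    score = silhouette_score(doc_topics, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```

On this synthetic data the search recovers the three planted groups; on real topic features the silhouette curve is usually flatter, so it pays to inspect a few candidate k values by eye as well.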

BOOM, we have the result, and it has already eliminated most of the manual work by clustering similar utterances together. We can quickly skim each cluster and tag it with its respective intent, since we already know the topics from the data exploration step.

Let’s check out a few clusters,

ML-based tagged result

It took me around 4 days (~22 hours) to manually tag the entire original dataset, whereas with this approach I completed it in 1.5 days (~9 hours), excluding development of the pipeline. A few clusters do not make sense because they are merged or their topics overlap, likely because the topics that got blended together did not have enough documents to stand out. You can improve the topic modeling with semi-supervised GuidedLDA, which lets you go back and debug where the decision-making went wrong: set some seed words for the struggling topics, then guide the model to converge around those terms.

Alternatively, thematic analysis can be applied instead of topic modeling as described in this article. Check this kernel on thematic text analysis using spaCy.

Conclusion:

This is how I used various NLP techniques to handle messy text data and map it to the right intent with a bit of manual effort. It helped me prepare data for training the ML model much faster, without banging my head. Let me know if you have ever come across such a problem and how you tackled it. In the next article, I will discuss how I handled imbalanced data and my experience with different text-classification algorithms. I encourage you to pull this code, try different approaches to improve the clustering, and let me know if I missed anything that could improve the results. If you liked what you just read, please help others find it. Thanks a lot!

Have you always wanted to have more hands-on NLP learning? How about a playground where you can practically approach these problems?
Check out this community contributed GitHub repo.
