Classifying Items with NLP

Rafael Alencar
Published in Neuronio
4 min read · May 14, 2019

Note: a Portuguese version of this article is available at “Catalogando Itens com PLN”

Although text is one of the most important kinds of data nowadays, it is also one of the hardest to work with. Unlike images and audio, text data must be transformed into numeric values, and its meaning can change depending on place, context, and even the order of the words. Luckily for us, Natural Language Processing (NLP) has been developing lots of new tools to help computers decipher, understand, and even generate text just like human beings. Just look at how far Google Translate has come since it was created.

Many companies have been using these tools to understand their clients and automate their processes. Sentiment analysis is being used to capture client opinions and to create marketing campaigns and product recommendations. In this article, we will use these tools to classify an e-commerce set of products.

Analysing your Data

The first thing to do in a machine learning project is to get to know your data. In this case, we will check the sample distribution, the corpus length statistics, the most frequent words, and so on. The dataset used for this project was an Amazon catalog, obtained from data.world, containing around 10 thousand samples. Some statistics about the data are shown below:

Dataset Statistics

We can notice that this dataset is unbalanced. This could be a problem: our model may start learning the label distribution instead of the text information. One way to deal with this is to filter out the labels with few samples, or to use data augmentation to increase the number of samples.
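As a minimal sketch of the filtering option, the snippet below counts samples per label with the standard library and drops the rare classes; the label names and counts are illustrative, not taken from the actual catalog:

```python
from collections import Counter

# Hypothetical labels mimicking an unbalanced catalog (counts are illustrative)
labels = ["Electronics"] * 50 + ["Toys"] * 30 + ["Garden"] * 3

counts = Counter(labels)

# Keep only the classes that have at least `min_samples` examples
min_samples = 10
kept = {label for label, n in counts.items() if n >= min_samples}
filtered = [label for label in labels if label in kept]
```

After filtering, `Garden` is gone and the remaining classes are far less skewed; the same idea applies to a pandas column with `value_counts()`.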

Another good check is the content of your data, using a word cloud of the most frequent words in the dataset. The image below shows the dataset's word cloud, where we can see many of the words used to describe these products.

Dataset’s most frequent word cloud

Preprocessing

Before creating our model, let's clean up our dataset. When working with text, we should start by converting all characters to lowercase and removing punctuation marks and special characters. It is also a good idea to remove stopwords: common words of a language, such as articles and prepositions, that bring little information to the text.
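These cleanup steps can be sketched with the standard library alone; the stopword list below is a tiny illustrative subset (in practice you would use a full list such as NLTK's):

```python
import re

# A small illustrative stopword list; real projects use NLTK's or spaCy's
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for"}

def preprocess(text):
    text = text.lower()                        # lowercase everything
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special chars
    return [t for t in text.split() if t not in STOPWORDS]

preprocess("The BEST Coffee-Maker, for $25!")
```

The call above returns `["best", "coffee", "maker", "25"]`: lowercased, stripped of punctuation, and with the stopwords removed.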

Now it is time to turn our texts into numeric values, a process we call embedding. The first step is to tokenize the texts and build a vocabulary, here limited to 20 thousand tokens. Then we must vectorize the texts. One way to do it is the bag-of-words approach, which treats each word of the text as a piece of information without its context. Each text becomes a vector with the same size as the vocabulary, and each token gets a value: ones and zeros marking whether the token is present, a count of how many times it appears, or a score comparing its frequency in the text to its frequency in the entire dataset (tf-idf).
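A minimal sketch of both bag-of-words variants with scikit-learn, on a toy corpus (the documents are illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "stainless steel coffee maker",
    "plastic coffee mug",
    "steel garden shovel",
]

# Count-based bag of words; max_features mirrors the 20 thousand token cap
bow = CountVectorizer(max_features=20000)
X_counts = bow.fit_transform(docs)

# tf-idf: frequency in the text weighted against frequency across the dataset
tfidf = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf.fit_transform(docs)
```

Each row of `X_counts` and `X_tfidf` is one document, and each column is one vocabulary token, so both matrices have one row per text and one column per token.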

Another way is to analyse each word's context. Here, we turn each word of the text into a vector in which each element relates the word to another token; normally we keep the 100 most relevant ones, so each word becomes a 100-dimensional vector of scores. Nowadays there are many algorithms that can analyse large amounts of text and create these vectors for us; the most famous is called Word2Vec.

Example of Word2Vec embedding, tokens in blue, vectorised words in green

Classification Models

This project is an example of how we can automate a process using machine learning and text. For this task, we can use product names, descriptions, or technical information to train our model, and based on the text size we can decide how sophisticated the model should be.

Before using a neural network to solve the problem, we can create a benchmark with a simpler model. We will use Multinomial Naive Bayes, a probabilistic model commonly used for this kind of task. After establishing our 'goal', we try to beat it with more complex models: CNNs and RNNs, whose architectures are well suited to data with context. For more details about the models' architectures and implementations, see the link to the project on GitHub at the end of this article.
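A Naive Bayes baseline like the one described takes only a few lines with scikit-learn; the product names and categories below are toy examples, not the Amazon catalog:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy product names and categories (illustrative)
texts = ["steel coffee maker", "espresso machine", "garden shovel", "lawn rake"]
labels = ["Kitchen", "Kitchen", "Garden", "Garden"]

# Bag-of-words features feeding a Multinomial Naive Bayes classifier
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(texts, labels)

baseline.predict(["coffee machine"])  # → ["Kitchen"]
```

The pipeline vectorizes each name and classifies it in one call, which makes it a quick, honest benchmark to measure the neural networks against.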

We got very nice results. Although there is not much difference between the first model and the more complex ones, we can tune the hyperparameters of the neural networks and train them for more epochs to achieve even better results, something that is not possible with the simpler baseline.
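To make the "more complex models" concrete, here is a minimal Keras 1-D CNN text classifier; the vocabulary size matches the 20 thousand token cap and the embedding dimension matches the 100-dimensional vectors above, while the sequence length, filter sizes, and number of classes are illustrative assumptions:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # vocabulary limit from the preprocessing step
MAX_LEN = 50         # assumed padded sequence length
NUM_CLASSES = 10     # assumed number of product categories

# A minimal 1-D CNN text classifier; layer sizes are illustrative
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 100),          # learn 100-dim word vectors
    layers.Conv1D(128, 5, activation="relu"),   # convolve over word windows
    layers.GlobalMaxPooling1D(),                # keep the strongest features
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Unlike the Naive Bayes baseline, every number here (embedding size, filter count, kernel width, epochs) is a knob we can tune to push accuracy further.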

Summary

We have just shown that even though text data is not the easiest to work with, it can be extremely helpful for building models that speed up processes. As mentioned earlier, NLP has lots of applications, and there are many new tools that help us use it in our projects. We just need to be creative and find out where we can use it in our business.

