5 simple steps to better NLP

Better Natural Language Processing is just a few steps away. Learn from Southpigalle Data Scientist Jerome Hagege the best practices for improved accuracy and response quality.

Whether you are doing sentiment analysis or text classification, there is a good chance that NLP (Natural Language Processing) will be a main component of your analysis. Let me remind you that NLP is not a new field: the first chatbot was developed in the ’60s. What has changed over time is that people have realized they can use machine learning-based techniques to solve such problems. To give you an idea, traditional named entity recognition systems used to be rule-based, meaning that the key challenge was anticipating what the user was about to say. Nowadays, named entity recognition can be done through machine learning techniques.
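
To make that concrete, here is a minimal sketch using the open-source spaCy library (my choice for illustration, not the only option), whose pretrained pipelines include a statistical NER component:

```python
# Minimal sketch of machine-learning-based NER with the open-source
# spaCy library. Assumes: pip install spacy
# and: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained statistical pipeline
doc = nlp("Apple opened a new office in Paris last March.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Paris GPE, last March DATE
```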

Doing NLP is not a task to take lightly, since any mistake in your pipeline can lead to inaccurate results for your analysis. In this article, I will give you a quick overview of the five rules to follow if you want to do NLP well.

Rule #1: Check Your Data

The first and most crucial step before starting any processing is to check the quality of your data. Whether you’re working with messages, articles or audio signals, you need to make sure that your dataset is well structured. For instance, if most of the messages in your dataset are very short and only a small part of them are long (more than one sentence), the imbalance will end up adding bias to your results. Additionally, if you want to perform text classification, you need to make sure that your dataset is homogeneous and that every class is equally represented.
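
As a quick illustration, here is a minimal sketch of both checks using pandas; the file name and the "text"/"label" column names are hypothetical placeholders for your own dataset:

```python
# Minimal sketch of both checks with pandas. The file name and the
# "text"/"label" columns are hypothetical; adapt them to your dataset.
import pandas as pd

df = pd.read_csv("messages.csv")

# Length check: is the corpus dominated by very short messages?
lengths = df["text"].str.split().str.len()
print(lengths.describe())

# Balance check: is every class roughly equally represented?
print(df["label"].value_counts(normalize=True))
```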

Rule #2: Spend Time Cleaning Your Data

When doing machine learning tasks, it is absolutely necessary to clean and pre-process your data. In NLP, the cleaning part consists of normalizing your messages. This can be done by lowercasing letters, removing punctuation or replacing some characters.

Once normalized, the new dataset will be used to fit your model. Your model’s parameters will end up being a direct consequence of your cleaning process, which is why you need to make sure that your preprocessing step is in harmony with the task you are trying to complete. For example, if you want to do sentiment analysis, the significance you give to “!” will not be the same as if you do topic modeling (for which you could simply remove the punctuation).
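
Here is a minimal sketch of such a task-aware normalization step; the function name and the keep_sentiment_punct flag are illustrative choices, not a standard API:

```python
# Minimal sketch of a normalization step. The keep_sentiment_punct flag
# is an illustrative choice: keep "!" and "?" for sentiment analysis,
# drop all punctuation for tasks like topic modeling.
import re
import string

def normalize(text: str, keep_sentiment_punct: bool = False) -> str:
    text = text.lower()                       # lowercase letters
    punct = string.punctuation
    if keep_sentiment_punct:
        punct = punct.replace("!", "").replace("?", "")
    text = text.translate(str.maketrans("", "", punct))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

print(normalize("GREAT movie!!!"))                             # great movie
print(normalize("GREAT movie!!!", keep_sentiment_punct=True))  # great movie!!!
```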

Rule #3: Carefully Choose The Right Embedding

Of course, you will not feed your algorithm the raw message itself but a vector (a column of numbers) that represents your message. This is where the embedding comes in. The embedding is the way you vectorize your message. It is key to the process since a well-chosen embedding will preserve as much information as possible from the original message. For example, the most standard embedding is the bag-of-words, which consists of counting, for each word, the number of times it appears in the sentence. However, depending on what you are trying to achieve, the embedding you pick will not be the same. For instance, if you need to capture the context surrounding each word, the bag-of-words representation will not be powerful enough (since it is a simple counter, word order is not captured).
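
As an illustration, here is a minimal bag-of-words sketch using scikit-learn’s CountVectorizer, which also shows the word-order limitation mentioned above:

```python
# Minimal sketch: bag-of-words with scikit-learn's CountVectorizer.
# Note how "dog bites man" and "man bites dog" get the same vector:
# a pure counter cannot capture word order.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["dog bites man", "man bites dog", "man walks dog"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'man' 'walks']
print(X.toarray())
# [[1 1 1 0]
#  [1 1 1 0]   <- identical to the first row despite opposite meaning
#  [0 1 1 1]]
```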

Rule #4: Define The Model That Best Fits Your Problem

As I mentioned earlier, depending on the goal you are trying to achieve, the embedding you pick for your text will not be the same. The same is true for the model you choose, which is why you need to be very deliberate in your choice. For instance, each row in your vector will be assigned to a feature and, depending on the model, the weight given to each of those features will not be the same. On top of that, it is a common belief that the most complex models are the most efficient ones. That’s a misconception, since sometimes basic, old-school machine learning techniques can provide great results (more than 90% precision/recall/F1-score).
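
As an example, here is a minimal sketch of such an old-school baseline (TF-IDF features plus logistic regression, a combination I chose for illustration) scored with precision/recall/F1; the tiny toy corpus is just a placeholder for real labelled data:

```python
# Minimal sketch: an "old-school" baseline, TF-IDF features + logistic
# regression, scored with precision/recall/F1 via classification_report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Toy labelled corpus; replace with your own data.
texts = ["love this product", "awful service", "really great quality",
         "worst purchase ever", "highly recommend it", "total waste of money"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Scored on the training data for brevity; in practice, always
# evaluate on a held-out test set.
print(classification_report(labels, model.predict(texts)))
```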

Rule #5: Don’t Be Afraid To Calibrate (and even recalibrate) Your Model

After the many steps presented above, it’s important not to forget the calibration step. Once you are done choosing your embedding and your model, you need to find the parameters that will optimize your results. Indeed, sometimes a few changes can lead to much higher performance. The best way to complete this process is to first define metrics that reflect how well your model performs on the chosen task, and then calibrate your model to improve those metrics.
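
A minimal sketch of that workflow, assuming scikit-learn and the same kind of TF-IDF plus logistic regression pipeline as above (the parameter grid and the macro-averaged F1 metric are illustrative choices):

```python
# Minimal sketch: pick a metric first (here macro-averaged F1), then
# search the hyperparameters that maximize it via cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labelled corpus; replace with your own data.
texts = ["love this product", "awful service", "really great quality",
         "worst purchase ever", "highly recommend it", "total waste of money"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression(max_iter=1000))])

# The grid itself is an illustrative choice, not a recommendation.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```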

Doing NLP can be a lot of fun. However, if you are not careful in the way you set up your pipeline, the results can be disappointing. Whether it is the preprocessing, the choice of model or its calibration, each of these steps needs to be carefully completed, with concrete metrics optimized for the task you are trying to achieve.

Southpigalle is an augmented intelligence company focused on creative, innovative solutions to today’s toughest business problems.