This is the second part of a series of Natural Language Processing tutorials for beginners. For an introduction to NLP, please check out the first part.
Let me tell you a story.
Tim is a little guy who likes to build things with his assorted Lego© pieces. However, Tim is a bit messy, so he keeps all the components of his different puzzles in one big drawer. He now wants to build a castle but, since he is very meticulous about his artworks, he first goes through the whole drawer and picks out only the Lego© pieces. Then, he removes the damaged pieces and the ones that are not useful for his purpose, and joins some small ones that fit together into a block. Later, he groups the pieces by colour, shape or purpose. Shortly after, Tim brings all the pieces together in a specific order and builds his mighty castle. Finally, he can play with it.
In this post we will see that, when we approach an NLP task (and most Machine Learning tasks too), we should behave like Tim.
Just as Tim has his drawer, we have different data sources with different types of data. We as humans are drawers in a way, with all of our biometrics, phone calls, contact agenda, YouTube views, films, series, songs, text messages, tweets, books, activities, commutes, and so on.
That said, we can refer to two main types of data:
- Structured data: information that has a pre-defined structure, which is typically represented numerically but can also include text (to denote classes, for example). A good example is an SQL table. SQL is a standard language for storing, manipulating and retrieving data in databases. This means that data is structured and related through different columns. In figure 1 we can see two columns that take numerical values (EmployeeId and DepartmentId) and two more that hold textual information (LastName and Country).
- Unstructured data: this is a large chunk of the total amount of data that we consume and produce every day: information that does not have a well-defined, systematic structure. When we deal with natural language, we are dealing with unstructured data: we can’t specify a universal structure or a fixed range of values that a sentence can have (in contrast with the example above).
I like to think of this type of information as information that we don’t currently know how we process and manipulate (cognitively). We can conceive of NLP as the set of different tools that can be applied to structure natural language for different purposes. As mentioned in the first part of the series, our dataset is called a corpus (the plural is corpora), since it is composed of a set of textual information. We can think of this corpus as the set of Lego© pieces that Tim initially selects by discarding pieces from other puzzles.
When Tim removes the damaged or useless pieces from his set, he is carrying out the preprocessing step, in which he keeps only the pieces that are useful for the building process. In NLP, preprocessing the data is called text normalization or data preparation, since we are trying to ‘normalize’ the elements of our corpus in some way.
The following posts will cover the different tools that can be applied in each step. For now, as an illustration: if we have both cat and cats in our corpus, we may want to normalize them by unifying both terms into cat (this, as we will see, is called stemming). Other examples are splitting the corpus into sentences, or removing URLs and other elements that are present in our corpus but are not of interest for the task.
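To make these three examples a bit more tangible, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the function names (remove_urls, split_sentences, toy_stem, normalize) are hypothetical, the sentence splitter is deliberately naive, and toy_stem only strips a trailing plural "s", which is far cruder than a real stemming algorithm.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text):
    """Drop URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def toy_stem(word):
    """Toy stand-in for stemming: lowercase and strip one trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Return one list of normalized tokens per sentence."""
    sentences = split_sentences(remove_urls(text))
    return [[toy_stem(tok) for tok in re.findall(r"[A-Za-z]+", s)]
            for s in sentences]

corpus = "My cat sleeps a lot. Cats are great! More at https://example.com/cats"
print(normalize(corpus))
# [['my', 'cat', 'sleep', 'a', 'lot'], ['cat', 'are', 'great'], ['more', 'at']]
```

Notice that cat and Cats end up as the same token. A rule this simple will also mangle words such as "this", which is one reason real projects rely on dedicated stemmers.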
Just as Tim groups pieces in different ways, depending on our objective we may want to organise our corpus in different ways. This is the structuring step, in which we choose the elements that we want to detect. Depending on the objective, we could be interested in extracting several things, including, but not limited to:
- The entities that are present in the text. For example, suppose that our corpus includes several countries, names of people, trademarks or animals that we want to detect as different elements.
- The intents or actions written by users. Imagine that our corpus is made up of online flight bookings from different users; the intents would then be the actions or options available to the user. In this context, bookRequest (with sentences such as ‘I want to book a flight, please’), setBookingDate (e.g. ‘Is there any booking available for tomorrow?’) or saveBookingProcess (e.g. ‘I want to continue later. Please, save it’) would be example intents.
- The grammatical category of each word in the corpus (noun, verb, adjective, ...), which is detected through Part-of-Speech (POS) tagging. We may want to extract only the verbs and nouns from the corpus to analyse them.
- The syntactic role of each element in the corpus, obtained through dependency parsing, for instance to extract only the subjects or other dependencies.
- The different topics that are present in the corpus, grouped by terms through topic modeling techniques.
- The feelings that are present in the text, quantified through sentiment analysis. If our corpus is made up of tweets about a contentious matter in today’s society, we may want to measure how people feel about it: angry, happy, sad, ...
As already mentioned, the structuring tools that we apply will depend on our objective.
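As a rough illustration of what structuring can look like, the sketch below tags each sentence of the flight booking example with one of the three intents mentioned in the list. It is only a toy: the keyword sets in INTENT_KEYWORDS and the detect_intent function are hypothetical and hand-written for this example, whereas a real system would normally learn them from labelled data.

```python
import re

# Hypothetical keyword sets, chosen by hand for illustration only.
INTENT_KEYWORDS = {
    "bookRequest": {"book", "flight"},
    "setBookingDate": {"tomorrow", "today", "date", "available"},
    "saveBookingProcess": {"save", "later", "continue"},
}

def detect_intent(sentence):
    """Return the intent whose keywords overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {intent: len(tokens & keywords)
              for intent, keywords in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # If no keyword matched at all, we do not guess an intent.
    return best if scores[best] > 0 else "unknown"

corpus = [
    "I want to book a flight, please",
    "Is there any booking available for tomorrow?",
    "I want to continue later. Please, save it",
]

# Each raw sentence becomes a small structured record.
structured = [{"text": s, "intent": detect_intent(s)} for s in corpus]
for row in structured:
    print(row["intent"], "<-", row["text"])
```

The interesting part is the output format rather than the rules themselves: each free-text sentence is turned into a record with a known field (intent) that takes values from a known set, which is exactly what the later steps need.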
Once we have applied the structuring techniques we can proceed to look at the structured information. This is the analysis step, in which we extract different features in order to perform the task that we had in mind. We now want to bring all the pieces together in a specific order like Tim did to build his castle.
Sometimes this step just consists of counting the different elements (e.g. counting the number of instances of each sentiment class to gain insight into the overall feeling), but in other cases we may want to feed the features into a more complex model, such as a machine learning algorithm (e.g. training a neural network that transforms entities into vectors or translates from one language to another).
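The simplest version of this step, counting, fits in a few lines. In the sketch below, the labels list is made-up data standing in for the output of a sentiment structuring step (one label per tweet), and count_classes is a hypothetical helper name.

```python
from collections import Counter

def count_classes(labels):
    """Count how many instances fall into each class."""
    return Counter(labels)

# Made-up labels, as if produced by the structuring step (one per tweet).
labels = ["angry", "happy", "angry", "sad", "angry", "happy"]

counts = count_classes(labels)
print(counts.most_common())
# [('angry', 3), ('happy', 2), ('sad', 1)]
```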
A good way to identify what you need to do in this step would be to ask yourself: what do I need to do with the structured data to achieve my goal?
Now that we have our NLP castle, we can use it for our particular purpose just as Tim did with his construction. This is the transformation step, the last one in the workflow. In this step the purpose is to translate the obtained information into an interpretable form, so that we can make a decision based on it, observe it or analyse it.
As with the analysis, this can vary from problem to problem. Sometimes this step simply involves presenting the conclusions visually (e.g. representing the obtained vectors as points in 2 or 3 dimensions), as a proportion (e.g. the percentage for each sentiment class) or as a predominant detected class (e.g. the sentiment class with the highest count). It can also involve more complex tasks such as language generation (e.g. generating a novel or personalised response for a user request) or speech synthesis (e.g. producing a response with a human voice).
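For the two simplest outputs mentioned above, the percentage per class and the predominant class, a minimal sketch could look like this. The counts are made-up values and summarize is a hypothetical helper, so take it as an illustration of the idea rather than a recommended design.

```python
from collections import Counter

def summarize(counts):
    """Turn raw class counts into percentages plus the predominant class."""
    total = sum(counts.values())
    if total == 0:
        return {}, None
    percentages = {label: round(100 * n / total, 1)
                   for label, n in counts.items()}
    predominant = max(counts, key=counts.get)
    return percentages, predominant

# Made-up counts, as if obtained in the analysis step.
counts = Counter({"angry": 3, "happy": 2, "sad": 1})

percentages, predominant = summarize(counts)
print(percentages)   # {'angry': 50.0, 'happy': 33.3, 'sad': 16.7}
print(predominant)   # angry
```

From here, the same numbers could be handed to a plotting library or turned into a sentence for a report, depending on who needs to interpret the result.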
I hope that this post gives a good initial overview of the complexity and structure of NLP projects. It is a good exercise to ask yourself questions like:
- What would I do with this corpus?
- And what about my own textual data?
- How would I design an NLP workflow for a system like Amazon© Alexa?