Get started with NLP (Part I)


This is the first part of a series of Natural Language Processing tutorials for beginners. The next post gives an introduction to NLP workflows.


Language is receiving a lot of attention nowadays due to the recent boom of so-called chatbots, or conversational agents, in several industries. But, as happens in other fields of human knowledge, the study of natural language has a long history, and these agents are not its first important application. A deep understanding of language matters because we use it every day, in different scenarios and in different ways.

This series of tutorials aims to serve as an introduction to the amazing field of Natural Language Processing.

What is NLP / ML / DL?

Apart from being the name of an approach to communication, personal development, and psychotherapy (Neuro-Linguistic Programming), NLP stands for Natural Language Processing, which is defined as the application of computational techniques to the analysis and synthesis of natural language and speech. In other words: the use of different techniques from computer science (algorithms) to understand and manipulate human language and speech.

Sometimes NLP is confused with Machine Learning (ML), but this is only because some ML tools are applied to NLP and have advanced the field. We can think of ML as a set of tools, some of which are useful for solving NLP tasks.

Also, to make this clear, Deep Learning (DL) is a branch of ML that uses a specific family of models, Neural Networks, to solve learning tasks. We can think of DL models as a subset of tools within the set of ML models. It is not the only existing branch (there are others, such as Genetic Algorithms), but it is gaining a lot of importance in the ML community for two main reasons:

  1. These models perform noticeably well when a large amount of data is available, and nowadays we have such data in several scenarios.
  2. Recent advances in computing capability, with the release of better hardware (Graphics Processing Units, GPUs) and sometimes dedicated hardware (Tensor Processing Units, TPUs), are making these models more feasible.

This is important, since most of the current state of the art of NLP is being obtained through applying DL models. We will explain key concepts of DL for NLP in a later post, but now it is better to focus on the NLP basics.

Finally, in Machine Learning the source data is known as the dataset, but in NLP, and in general whenever our dataset is a large collection of texts, we usually talk about the corpus.

NLP vs. Computational Linguistics:

There is an area that is closely related to NLP and sometimes confused with it: Computational Linguistics. As Jason Eisner points out, the difference is the following:

  • Computational Linguistics is a more theoretical field that develops computational methods to answer the scientific questions from the point of view of linguists.
  • Natural Language Processing is dedicated to solving engineering problems related to natural language, focusing on people.

Both fields make use of Computer Science, Linguistics, and Machine Learning.

Main NLP challenges:

NLP owes its development to some specific problems and phenomena that arise when we study natural language. Most of the time, these problems are unique in comparison to those that emerge in other fields of computer science or engineering, and that is partly what makes NLP such an interesting and different area.

  • Ambiguity: the main challenge of NLP is understanding and modelling elements within a variable context. In language, words are unique but can have different meanings depending on the context in which they are evaluated. This results in the most important and best-known linguistic phenomenon: ambiguity. A word (or even a whole sentence) can have different meanings depending on how we interpret it. This happens because of the difference between the signifier (the way we represent the information, the word) and the signified (the meaning of that information, the concept).
  • Synonymy: another key phenomenon of natural language is that we can express the same idea with different terms. This occurs because of synonymy, which also depends on the specific context: fine is a synonym of correct in the context of performance, a synonym of thin in the context of thickness, and a synonym of penalty fee in the context of punishments.
  • Syntax: another peculiarity of natural language is its structure, which follows several rules but also has irregularities. We can reorder a sentence in different ways, but not every ordering of the terms is valid.
  • Coreference: another challenge, one that we face every day although we cope with it perfectly in our conversations, is the reference to specific concepts that were mentioned in an earlier sentence or omitted altogether because they can be deduced from the context. This is called coreference.
  • Normalization vs. Information: when we process natural language, we need to normalize it in order to handle it in a more general way. Depending on the task, we may want all the words to be lowercased, or we may want to convert plural terms into singular ones if we don’t want to consider dog and dogs as two different entities. Other times we might encounter different forms of the same verb in a document and want to consider just that verb instead of distinguishing between each form. All these processes normalize natural language in some way, and we will learn the techniques that are used to achieve it, but the key idea here is that when we normalize we lose part of the information in exchange for being able to generalize better. This normalization/information trade-off is common in the study of data in general, and it is also very important in the study of natural language.
  • Representation: language is composed of characters, which are discrete values because they can only take certain values: ‘a’, ‘b’, ‘c’, etc. Continuous values, in contrast, can take any value within a range (like 0, 0.1, 0.025 or 0.5 in the range between 0 and 1), whereas there is no possible value between ‘a’ and ‘b’ in the range between ‘a’ and ‘z’. It is easier to process data when it has continuous features since, in terms of learning, we can obtain a number that is close to the value that we want with a certain error (say, for example, 0.99 instead of 1, with an error of 0.01), but we can’t approximate the word ‘tree’ with a certain error. We will learn that one way of coping with this problem and relating different terms is to transform words into vectors (called word embeddings).
  • Personality, intention and style: there are also different styles of expressing the same idea depending on personality or on the intention in a specific scenario. Some of them (such as irony or sarcasm) may convey the opposite of the idea that one would initially infer from the context. For example, we can say “Oh, great” to express joy, but also to express the completely opposite feeling if we are being sarcastic.
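
The normalization/information trade-off from the list above can be made concrete with a few lines of Python. This is only a minimal sketch: the sample text and the plural rule (strip a trailing "s") are assumptions made for illustration, and a real pipeline would use the proper techniques introduced later in this post.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen only for this illustration.
text = "The dog chased the dogs. Dogs chase THE dog!"

def tokenize(s):
    # Keep runs of letters only, dropping punctuation.
    return re.findall(r"[A-Za-z]+", s)

def normalize(token):
    # Lowercase, then apply a naive plural rule (an assumption:
    # it would be wrong for words such as "glass" or "focus").
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

raw_tokens = tokenize(text)
norm_tokens = [normalize(t) for t in raw_tokens]

raw_vocab = set(raw_tokens)
norm_vocab = set(norm_tokens)
norm_counts = Counter(norm_tokens)

print(len(raw_vocab), sorted(raw_vocab))    # 8 distinct surface forms
print(len(norm_vocab), sorted(norm_vocab))  # 4 distinct normalized forms
print(norm_counts["dog"])                   # 4
```

After normalizing, four different surface forms ("dog", "dogs", "Dogs" and the repeated "dog") are counted as one term, which helps us generalize, but we can no longer tell whether the text talked about one dog or several.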

Some basic NLP techniques:

Now that we know what NLP is and is not, and what problems it faces, we can start to learn the most basic NLP tools. In the next post we will apply these techniques using a Python NLP library called SpaCy. This post focuses on the concepts.

a. Stemming and Lemmatizing: these tasks consist of reducing different forms of a word to a common base form. For example:

  • In the sentence “I am a student” the process would result in “I be a student”.
  • In the sentence “My dog’s fur is dark” the process would result in “My dog fur be dark”.

Stemming usually refers to a crude process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational units (the obtained element is known as the stem).

On the other hand, lemmatization does things properly, using a vocabulary and morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.

If we stem the sentence “I saw an amazing thing”, a crude stemmer might return just ‘s’ for ‘saw’, whereas if we lemmatize it we would obtain ‘see’, which is the lemma.

As already mentioned, both techniques may remove important information, but they also help us to normalize our corpus (although lemmatization is the one that is usually applied).
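
To make the difference more tangible, here is a minimal sketch that contrasts the two ideas. Both functions are toy assumptions written for this post: `crude_stem` chops off a handful of hard-coded suffixes (a real stemmer uses a much larger rule set), and `toy_lemmatize` looks words up in a tiny hand-written dictionary instead of performing real morphological analysis.

```python
# Hypothetical suffix list, far smaller than what a real stemmer uses.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-vocabulary that maps inflected forms to their lemma.
LEMMAS = {
    "am": "be", "is": "be", "are": "be",
    "saw": "see", "dogs": "dog", "studies": "study",
}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

sentence = "I saw an amazing thing"
stems = [crude_stem(w) for w in sentence.split()]
lemmas = [toy_lemmatize(w) for w in sentence.split()]

print(stems)   # ['i', 'saw', 'an', 'amaz', 'thing']
print(lemmas)  # ['i', 'see', 'an', 'amazing', 'thing']
```

Note how the stemmer produces 'amaz', which is not a word, while the lemmatizer only returns forms that exist in its vocabulary. The price is that the lemmatizer needs that vocabulary in the first place.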

b. Coreference resolution: this consists of resolving the coreferences that are present in our corpus. It can also be thought of as a normalizing or preprocessing task.

c. Part-of-speech (POS) Tagging: a POS tagger marks each word in a corpus by assigning a syntactic category such as:

  • Open class types (those that readily accept new members): noun, verb, adjective, adverb.
  • Closed class types (those with relatively fixed membership): preposition, determiner, pronoun, conjunction, auxiliary verb, particle, numeral.

For example, given the sentence “I want to play the piano” a POS tagger should return:

I (Pronoun)
want (Verb)
to (Particle)
play (Verb)
the (Determiner)
piano (Noun)
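
The simplest possible tagger is a plain dictionary lookup, sketched below. The `LEXICON` and the `toy_pos_tag` function are hypothetical and written by hand for this one sentence. Real taggers (such as the one we will use in the next post) are statistical models that also look at the surrounding words.

```python
# Hypothetical hand-written lexicon: one fixed category per word.
LEXICON = {
    "i": "Pronoun",
    "want": "Verb",
    "to": "Particle",
    "play": "Verb",
    "the": "Determiner",
    "piano": "Noun",
}

def toy_pos_tag(sentence):
    """Tag each whitespace-separated word by looking it up in LEXICON."""
    return [(word, LEXICON.get(word.lower(), "Unknown"))
            for word in sentence.split()]

for word, tag in toy_pos_tag("I want to play the piano"):
    print(f"{word} ({tag})")
```

A lookup like this cannot deal with ambiguity: in “I watched a play”, the word ‘play’ is a noun, but the lexicon above would still tag it as a verb. This is one reason why practical taggers need context.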

d. Dependency Parsing: sometimes, instead of the category (POS tag) of a word, we want to know the role of that word in a specific sentence of our corpus. This is the task of dependency parsers. The objective is to obtain the dependencies, or relations, between words in the form of a dependency tree.

In general terms, the dependencies considered are subject, object, complement and modifier relations.

As an example, given the sentence “I want to play the piano” a dependency parser would produce the following tree:

SpaCy’s Dependency Tree visualized with displaCy

Here we can see that the dependency parser that I use (SpaCy’s dependency parser) also outputs the POS tags. If you think about it, it makes sense because we first need to know the category of each word to extract dependencies.

We will see in detail the types of dependencies, but in this case we have:

want — I: nominal subject.
want — play: open clausal complement.
play — to: auxiliary verb.
play — the piano: direct object

where a — b: R means “b is R of a”. For example, “the piano is direct object of play” (which is play — the piano: direct object from above).
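
One simple way to hold such a tree in code is a list of (head, dependent, relation) triples. The sketch below copies the four relations listed above by hand; it is not the output of a parser, and the names `Dependency`, `describe`, `children` and `find_root` are hypothetical helpers introduced only for this illustration.

```python
from collections import namedtuple

# One edge of the tree: `dependent` is `relation` of `head`.
Dependency = namedtuple("Dependency", ["head", "dependent", "relation"])

# Relations copied by hand from the example sentence.
DEPENDENCIES = [
    Dependency("want", "I", "nominal subject"),
    Dependency("want", "play", "open clausal complement"),
    Dependency("play", "to", "auxiliary verb"),
    Dependency("play", "the piano", "direct object"),
]

def describe(dep):
    """Render an edge in the 'b is R of a' reading used in the text."""
    return f"{dep.dependent} is {dep.relation} of {dep.head}"

def children(head):
    """All elements that depend directly on `head`."""
    return [d.dependent for d in DEPENDENCIES if d.head == head]

def find_root():
    """The root is the only head that never appears as a dependent."""
    dependents = {d.dependent for d in DEPENDENCIES}
    roots = {d.head for d in DEPENDENCIES} - dependents
    return roots.pop()

print(describe(DEPENDENCIES[3]))  # the piano is direct object of play
print(children("want"))           # ['I', 'play']
print(find_root())                # want
```

Walking from the root through `children` visits every element of the sentence exactly once, which is what makes this structure a tree.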

e. Named Entity Recognition (NER): in the real world, in our daily conversations, we don’t work directly with the categories of words. Instead, if we want to build, for example, a Netflix chatbot, we want it to recognize both ‘Batman’ and ‘Avatar’ as instances of the same group, which we call ‘films’, but ‘Steven Spielberg’ as a ‘director’. This concept of a semantic field that depends on a context is what we define as an entity. The role of a named entity recognizer is to detect relevant entities in our corpus.

For example, if our NER knows the entities ‘film’, ‘location’ and ‘director’, given the sentence “James Cameron filmed part of Avatar in New Zealand”, it will output:

James Cameron: DIRECTOR
Avatar: FILM
New Zealand: LOCATION

Note that in this example an entity can be a single word (‘Avatar’) or several words (‘New Zealand’ or ‘James Cameron’).
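
A very basic way to get this behaviour is to match the text against a list of known names (often called a gazetteer), always trying the longest candidate first so that multi-word entities are found. The sketch below assumes a tiny hand-written `GAZETTEER` and a hypothetical `toy_ner` function. Real recognizers are usually learned models that can also detect names they have never seen.

```python
# Hypothetical gazetteer: lowercased token tuples mapped to an entity label.
GAZETTEER = {
    ("james", "cameron"): "DIRECTOR",
    ("steven", "spielberg"): "DIRECTOR",
    ("avatar",): "FILM",
    ("batman",): "FILM",
    ("new", "zealand"): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def toy_ner(sentence):
    """Greedy longest-match lookup of known entities in a sentence."""
    tokens = sentence.split()
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so 'New Zealand' beats a shorter match.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities

sentence = "James Cameron filmed part of Avatar in New Zealand"
for text, label in toy_ner(sentence):
    print(f"{text}: {label}")
# James Cameron: DIRECTOR
# Avatar: FILM
# New Zealand: LOCATION
```

The obvious limitation is that anything missing from the gazetteer is invisible, and that a word such as ‘Avatar’ is always labelled a film regardless of the context.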

What comes next:

Congratulations! Now you know the very basics of NLP, so we can move to action and start learning to use these tools. The next post gives an introduction to NLP workflows. This was my first Medium article, so I hope that it was as useful and clear as possible. Thank you very much for reading it. I’m open to, and grateful for, any (respectful) suggestions and questions.