NLP Glossary for Beginners
Demystify Natural Language Processing with these terms.
Natural language processing sounds a lot more daunting than it actually is. In reality, the basic requirements for dabbling in this area are rudimentary programming skills, familiarity with data science principles and a conceptual understanding of how machines process languages.
This glossary will focus on natural language processing for written languages. It will be structured similarly to my other Glossary Article featuring general Data Science terms.
For more reading on NLP:
- Introduction to Natural Language Processing
- Spam or Ham
- Neural Networks for Word Embeddings
- Recurrent Neural Networks for Language Understanding
- Predicting Wine Quality using Text Reviews
Natural Language Processing (NLP)
NLP is the area of machine learning tasks focused on human languages. This includes both written and spoken language.
Vocabulary
The entire set of terms used in a body of text.
Out of Vocabulary
In NLP, the data used to train our model consists of a finite number of vocabulary terms. Very often, we will encounter out-of-vocabulary terms when using our model for inference. Typically, a common placeholder is assigned to these terms.
Corpus (Plural: Corpora)
A corpus is a collection of text. A corpus can be a collection of movie reviews, internet comments or conversations between two people.
Document
A document refers to a single body of text. A collection of documents makes up a corpus. For instance, a movie review or an email is a document.
Preprocessing
The first step in any NLP task is to preprocess the text. The goal of preprocessing is to “clean” the text by removing as much noise as possible. Common preprocessing steps are described in the techniques section.
Tokenization
The process of breaking a large chunk of text into smaller pieces. This is usually done so that each small piece, or token, can be mapped to a meaningful unit of information. If we choose to break our text on the word level, each word becomes its own token.
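As a minimal sketch, here is word-level tokenization with nltk (one of the libraries covered later); the punkt resource is assumed to be downloadable in your environment:

```python
import nltk

nltk.download("punkt")  # tokenizer models, only needed once
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The day is young.")
print(tokens)  # ['The', 'day', 'is', 'young', '.']
```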
Embedding
Each token is embedded as a vector before it can be passed to a machine learning model. While generally referred to as word embeddings, embeddings can be created at the character or phrase level as well. Following the techniques section is an entire section on different types of embeddings.
n-gram
A contiguous sequence of n tokens in a given text. In the phrase the day is young, we have the bi-grams (the, day), (day, is), (is, young).
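A quick sketch of extracting those bi-grams from a tokenized phrase, here with nltk's ngrams helper:

```python
from nltk import ngrams

tokens = ["the", "day", "is", "young"]
bigrams = list(ngrams(tokens, 2))  # contiguous sequences of 2 tokens
print(bigrams)  # [('the', 'day'), ('day', 'is'), ('is', 'young')]
```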
Transformer
A deep learning architecture introduced in 2017 which surpassed several prior benchmarks for NLP tasks. Transformers address two shortcomings of recurrent neural networks: they are easier to parallelize and better suited to learning long-term dependencies between words.
Parts of Speech (POS)
The syntactic function of a word. We are all probably familiar with the different parts of speech in English: noun, verb, adjective, adverb and so on.
Parts of Speech Tagging
The process of assigning a part of speech tag to each token in the text.
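A minimal sketch with nltk's default tagger (the tagger resource name can vary slightly between nltk versions):

```python
import nltk

nltk.download("averaged_perceptron_tagger")  # tagger model, only needed once
from nltk import pos_tag

tags = pos_tag(["The", "day", "is", "young", "."])
print(tags)  # roughly [('The', 'DT'), ('day', 'NN'), ('is', 'VBZ'), ('young', 'JJ'), ('.', '.')]
```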
Normalization
The process of reducing similar tokens to a canonical form. For instance, if we believe hello and Hello are for all intents and purposes the same, we can normalize our text by mapping both terms to hello.
Stop Words
Words that are removed from the text before modelling because of their insignificance to the NLP task at hand. For instance, the nltk list of English stop words flags common words such as a, to and can for exclusion.
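A sketch of stop word removal using that nltk list (the stopwords resource is assumed to be downloadable):

```python
import nltk

nltk.download("stopwords")  # stop word lists, only needed once
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["a", "movie", "to", "remember"]
print([t for t in tokens if t not in stop_words])  # ['movie', 'remember']
```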
Lemmatizing
A normalization technique that groups inflected terms into their base form, conditioned on the part of speech of the text. For example, walking and walked would both be mapped to walk.
Stemming
Similar to lemmatizing, stemming also reduces inflected terms to their base forms. The only difference is that the part of speech tag is not used to determine the base form.
Note on lemmatizing vs stemming: you might think, why would I ever need to use stemming? Wouldn’t this reduction step be more accurate if we knew the parts of speech? One clear advantage of stemming is that it is much faster. Another is that it eliminates the margin of error created by automatic parts of speech taggers.
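A small sketch contrasting the two with nltk (the wordnet resource is assumed to be available; the lemmatizer is told the part of speech, while the stemmer is not):

```python
import nltk

nltk.download("wordnet")  # lemmatizer data, only needed once
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["walking", "walked"]:
    # The lemmatizer uses the part of speech ("v" for verb); the stemmer does not
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
# both terms reduce to "walk" either way in this example
```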
Common NLP Tasks
Sentiment Analysis
The automated process of detecting sentiment from text. A common application of sentiment analysis is to determine whether a review is positive or negative, or whether the text supports a certain sentiment.
Machine Translation
This one is self-explanatory: all of your favourite automated translation tools are NLP applications.
Machine (Reading) Comprehension/Question Answering
Machine Comprehension, usually carried out through Question Answering, is the task of automatically “understanding” the text. It’s usually tested through reading comprehension questions, where the input is a contextual document and a set of questions that can be answered using the document. The AI infers the answers based on these inputs.
Named Entity Recognition (NER)
The automatic extraction of relevant entities (such as names, addresses and phone numbers) from an unstructured document.
Information Retrieval/Latent Semantic Indexing
The automatic retrieval of information from a large system (think web search engines). The problem is defined as returning the right document(s) when provided with a specific query.
Bag of Words
This is the simplest method of embedding words into numerical vectors. It’s not often used in practice due to its oversimplification of language, but it is commonly found in examples and tutorials.
Consider these documents:
Document 1: high five
Document 2: I am old.
Document 3: She is five.
This gives us the vocabulary: high, five, I, am, old, she, is. For simplicity, we will ignore punctuation and normalize by converting to lower case. We can construct a matrix which represents the number of times each vocabulary term occurs in a document.
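|      | Document 1 | Document 2 | Document 3 |
|------|------------|------------|------------|
| high | 1          | 0          | 0          |
| five | 1          | 0          | 1          |
| I    | 0          | 1          | 0          |
| am   | 0          | 1          | 0          |
| old  | 0          | 1          | 0          |
| she  | 0          | 0          | 1          |
| is   | 0          | 0          | 1          |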
This gives us the Bag of Words representation of each word and document. Move horizontally to get the word representation: high is [1, 0, 0]. Move vertically to get the document representation: document 1 is [1, 1, 0, 0, 0, 0, 0].
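As a sketch, the same representation can be produced with scikit-learn's CountVectorizer (assuming a recent scikit-learn; the token_pattern below is widened so that one-letter words like I are kept, since the default drops them):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["high five", "I am old.", "She is five."]

# Lowercasing and punctuation stripping happen by default
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the vocabulary terms
print(counts.toarray())                    # one row of counts per document
```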
TF-IDF (term frequency — inverse document frequency)
Unlike Bag of Words, TF-IDF considers the relative importance of each term to each document. The vector representation of each term and document can be extracted in a similar fashion as Bag of Words.
The TF-IDF statistic for term i in document j is typically calculated as:
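tf-idf(i, j) = tf(i, j) × log(N / df(i))

where tf(i, j) is the number of times term i appears in document j, df(i) is the number of documents containing term i, and N is the total number of documents. (This is one common formulation; the exact weighting and normalization vary between implementations.)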
The document vectors can be used as features for a variety of machine learning models (SVM, Naive Bayes, Logistic Regression, etc.).
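A minimal sketch with scikit-learn's TfidfVectorizer, reusing the documents from the Bag of Words example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["high five", "I am old.", "She is five."]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(documents)

# Each row is a document vector that can be fed to a downstream classifier
print(tfidf.toarray().round(2))
```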
word2vec
Trained over large corpora, word2vec uses a shallow neural network to determine semantic and syntactic meaning from word co-occurrence. I will refer you to my previous article for a fuller explanation of word2vec.
word2vec creates a high dimensional feature space. More complicated embeddings such as these are well suited to recurrent neural networks, where word ordering is also taken into consideration.
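As a sketch, gensim (version 4.x assumed) can train a toy word2vec model; real models need far larger corpora than this:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens. In practice, word2vec is
# trained over millions of sentences to learn useful co-occurrence statistics.
sentences = [
    ["high", "five"],
    ["i", "am", "old"],
    ["she", "is", "five"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["five"])               # the 50-dimensional vector for "five"
print(model.wv.most_similar("five"))  # nearest neighbours in the embedding space
```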
Context Dependent Embeddings
word2vec embeddings are independent of context: each word will be mapped to the same vector regardless of its surrounding context. Models such as BERT and ELMo create word embeddings that vary based on the context of the phrase.
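As an illustration (not something covered by this glossary's examples), here is a minimal sketch with the Hugging Face transformers library, showing that the vector for bank changes with its context:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sentence in ["I sat by the river bank.", "I deposited cash at the bank."]:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # The embedding for "bank" differs between the two sentences
    print(sentence, hidden[tokens.index("bank")][:5])
```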
Notable Python Libraries
scikit-learn provides a suite of useful embedding options, including Bag of Words and TF-IDF. All of these functions come with the option to customize the vectorizers, including building terms using n-grams and excluding uncommon terms.
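A sketch of the kind of customization described above (the parameter values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # build features from unigrams and bi-grams
    min_df=2,              # exclude terms that appear in fewer than 2 documents
    stop_words="english",  # drop the built-in English stop word list
)
```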
There is a sequential component to language modeling: the ordering of words matters a lot. As such, deep learning models such as recurrent neural networks are incredibly popular for NLP tasks. In Predicting Wine Quality using Text Reviews, I use Keras to train a bi-directional recurrent neural network with GRU units for classifying wine reviews.
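A minimal sketch of that kind of model in Keras; the layer sizes and the binary output below are illustrative, not the exact architecture used in that article:

```python
from tensorflow.keras import layers, models

vocab_size = 10_000   # illustrative vocabulary size
embedding_dim = 128   # illustrative embedding width

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_dim),   # token ids -> vectors
    layers.Bidirectional(layers.GRU(64)),          # bi-directional GRU layer
    layers.Dense(1, activation="sigmoid"),         # e.g. positive vs negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```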
nltk, which stands for Natural Language Toolkit, offers a range of tools that make NLP preprocessing easy. It comes with several stemmers, lemmatizers, tokenizers and de-tokenizers. One advantage of nltk is that it supports several different algorithms for each function, which makes it great for exploration and for understanding what happens behind the scenes.
spaCy is a general purpose library for preprocessing text. spaCy offers support for more languages and outperforms nltk on most benchmarks.
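A short sketch of spaCy handling several of the preprocessing steps above in a single pass (the small English model is assumed to be installed via python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She walked to the old bank in London.")

for token in doc:
    # lemma, part of speech and stop word flag come out of one pass
    print(token.text, token.lemma_, token.pos_, token.is_stop)

# Named entities detected in the document
print([(ent.text, ent.label_) for ent in doc.ents])
```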
Here are some terms that were not fully covered by this glossary. I hope to cover them in more detail in subsequent articles:
- Transformers & Attention
- Contextual Language Embeddings
That’s it for now — stay tuned for next time!