The theory you need to know before you start an NLP project

An overview of the most common natural language processing and machine learning techniques needed to start tackling any project involving text.

Arne Lapõnin
Towards Data Science
14 min read · Jul 27, 2019


Introduction

I have been working on a project about extracting information from unstructured text since January. Before starting this project, I didn’t know anything about the field of natural language processing (NLP). Once I started researching the field, I quickly encountered “Natural Language Processing with Python”, available here. This book proved too theoretical for me, though it gets the fundamentals right, so it’s still an invaluable resource. My next big discovery was “Text Analytics with Python” by Dipanjan Sarkar, which I have read cover to cover. It’s a fantastic book that taught me all the technical skills I needed to start an NLP project. The second, greatly expanded, edition [6] was recently released.

In this article, I want to give an overview of some of the topics I got to know while acquiring NLP skills. I know that there are many great posts out there that cover the same things, such as this awesome series from Sarkar, but writing things down has really helped me in structuring everything I know.

What this post covers

This post will mainly be theoretical for the sake of brevity; I will write more practical articles in the future. It covers the following topics:

  1. Processing text using NLP
  2. Extracting features from text
  3. Supervised learning on text
  4. Unsupervised learning on text

Pre-processing text

A typical pipeline for pre-processing text consists of the following steps:

  1. Sentence segmentation.
  2. Normalization and tokenization.
  3. Part-of-speech (POS) tagging.
  4. Named entity recognition.

Not all of these steps need to be performed in every application. The need for named entity recognition depends on the business requirements of the application, while POS-tagging is typically done automatically by modern tools to improve parts of the normalization and tokenization step.

Sentence segmentation

During the first step of the pre-processing pipeline, the text is segmented into sentences. In many languages, such as English, punctuation marks, especially the full stop/period, the exclamation mark, and the question mark, can be used to identify the end of a sentence. However, the period character is also used in abbreviations, such as Ms. or U.K., in which case it does not signify a sentence boundary. In these cases, a table of abbreviations is used to avoid misclassifying sentence boundaries. When the text includes domain-specific terms, one has to create an additional dictionary of abbreviations to avoid unnatural tokens.
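As a minimal sketch of this step, NLTK's pre-trained Punkt model can be used for sentence segmentation (the example text is my own, not taken from the article's figures):

    import nltk

    nltk.download("punkt", quiet=True)  # pre-trained Punkt sentence boundary model

    text = "I met Ms. Smith in the U.K. last year. She now works in Berlin."
    sentences = nltk.sent_tokenize(text)
    print(sentences)
    # Punkt's abbreviation handling should keep 'Ms.' and 'U.K.' inside
    # the first sentence, yielding two sentences in total.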

Tokenization and normalization

Tokenization corner case

Tokenization is the process of dividing text into words and punctuation marks, i.e. tokens. As with sentence segmentation, punctuation marks can be challenging. For example, U.K. should be treated as one token, while “don’t” should be split into two tokens: “do” and “n’t”.
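A small sketch with NLTK's word_tokenize (which follows Penn Treebank conventions) illustrates both corner cases mentioned above:

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    print(word_tokenize("She doesn't live in the U.K. anymore."))
    # Roughly: ['She', 'does', "n't", 'live', 'in', 'the', 'U.K.', 'anymore', '.']
    # "doesn't" is split into "does" and "n't", while "U.K." stays a single token.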

Stemming and lemmatization are the two main parts of the normalization process. During stemming, the stem of the word is identified by removing suffixes, such as -ed and -ing. The resulting stem is not necessarily a valid word. Lemmatization similarly removes prefixes and suffixes, with the important difference that the result, called a lemma, is a valid word of the language. Examples of stemming and lemmatization are visible in the table below.

Examples of the differences between stemming and lemmatization

Both techniques reduce noise in the text by transforming words to their base form. For most applications, such as text classification or document clustering, where it is important to keep the meaning of the word, it is better to use lemmatization instead of stemming. For example, meeting (noun) and meeting (verb) would both be stemmed to meet, losing the distinction in meaning, while the respective lemmas would be meeting and meet.
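The meeting/meet example can be reproduced with NLTK's Porter stemmer and WordNet lemmatizer; a minimal sketch (note that the lemmatizer needs the POS tag, which is one reason POS-tagging helps lemmatization):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("meeting"))                   # 'meet' regardless of word class
    print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting' (noun)
    print(lemmatizer.lemmatize("meeting", pos="v"))  # 'meet' (verb)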

Other normalization techniques include expanding abbreviations, removing digits and punctuation, correcting typical grammar mistakes, etc. Most of these operations can be accomplished with regular expressions.
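As a rough sketch (not a production-ready cleaner), removing digits and punctuation with regular expressions could look like this:

    import re

    def normalize(text: str) -> str:
        text = text.lower()
        text = re.sub(r"\d+", " ", text)          # drop digits
        text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation
        return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

    print(normalize("In 2019, I read 12 books!"))  # 'in i read books'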

Part-of-speech tagging

This step is about classifying tokens into part-of-speech (POS) classes, also called word classes or lexical categories, based on each word’s context and definition. POS classes include nouns, verbs, prepositions, adverbs, etc. Lexical categories with English examples are shown in the table below. POS-tagging improves lemmatization and is necessary for named-entity recognition.

Examples of the common POS classes

There are three types of POS-taggers: rule-based, statistical, and deep learning based. Rule-based taggers rely on explicit rules, such as “an article has to be followed by a noun”, to tag the tokens. Statistical taggers use probability models to tag individual words or sequences of words. Rule-based taggers are very precise but also language-dependent; extending such a tagger to other languages requires a lot of work. Statistical taggers are easier to create and language-independent, though they sacrifice some precision. Nowadays, hybrids of rule-based and statistical models are used, though most of the industry is slowly moving towards deep learning solutions, where models are trained on pre-tagged sets of sentences. Hybrid and deep learning based approaches improve context-sensitive tagging.
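As a minimal sketch, spaCy's statistical tagger produces both coarse POS classes and fine-grained Penn Treebank tags (assuming the small English model en_core_web_sm is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline

    for token in nlp("He found a happy unicorn"):
        print(token.text, token.pos_, token.tag_)
    # Expected output along the lines of:
    # He PRON PRP / found VERB VBD / a DET DT / happy ADJ JJ / unicorn NOUN NN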

Named-entity Recognition

Before named entities can be recognized, the tokens have to be chunked. Chunking means segmenting and labeling multi-token sequences. One of the most commonly used chunks is the noun phrase chunk, which consists of a determiner, adjectives, and a noun, for example, “a happy unicorn”. The sentence “He found a happy unicorn” contains two noun phrase chunks: “he” and “a happy unicorn”.

Named entities are noun phrases referring to specific objects, such as individuals, organizations, locations, dates, and geopolitical entities. The goal of the named-entity recognition (NER) step is to identify mentions of named entities from the text.

Sentence with NER tags
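Both noun phrase chunks and named entities can be extracted with the same spaCy pipeline; a minimal sketch (the sentence about Google is my own example):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    print([chunk.text for chunk in nlp("He found a happy unicorn").noun_chunks])
    # ['He', 'a happy unicorn']

    doc = nlp("Google was founded by Larry Page and Sergey Brin in California in 1998.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typically: Google ORG, Larry Page PERSON, Sergey Brin PERSON, California GPE, 1998 DATE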

Machine Learning

As Brink et al. define it, machine learning (ML) is about taking advantage of patterns in historical data to make decisions about new data [1], or as Google’s Chief Decision Scientist Cassie Kozyrkov so eloquently put it: “Machine learning is just a thing-labeler, taking your description of something and telling you what label it should get.” Applying ML techniques is useful when the problem at hand is too complicated to solve through explicit programming, such as distinguishing different cat breeds in images, or when the solution needs to adapt over time, such as recognizing hand-written text [2].

Typically, machine learning is divided into supervised and unsupervised machine learning [2]. We can use supervised learning when our historical training data contains labels (for example, “duck” and “not duck” in the figure below). Unsupervised learning, on the other hand, is applied when there are no labels in the data; unsupervised methods aim to summarize or compress the data. An example of this difference is spam detection versus anomaly detection: in the first case, we have training data with spam/not-spam labels available, while in the latter case we have to detect anomalous emails from an unlabeled set of emails.

The difference between supervised and unsupervised learning. Source

Feature extraction

All machine learning algorithms require numerical data as input, which means that textual data has to be converted into numerical form. This is the essence of the feature extraction step in the NLP world.

Count-based strategies

The simplest approach to converting text into numeric vectors is the Bag-of-Words (BoW) method. The principle of BoW is to take all the unique words in the text corpus and build a vocabulary from them. Using the vocabulary, each sentence can be represented as a vector of ones and zeros, depending on whether a word from the vocabulary is present in the sentence or not. The figures below show an example of a matrix created by applying the BoW method to five normalized sentences.

Example sentences
BoW feature matrix created from the sentences above
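A binary BoW matrix like the one in the figure can be built with scikit-learn's CountVectorizer; a small sketch on two made-up sentences (not the exact ones from the figure):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the quick brown fox jumped over the lazy dog",
        "the rabbit jumped over the lazy dog",
    ]
    vectorizer = CountVectorizer(binary=True)  # 1/0 presence of each vocabulary word
    bow = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())  # the vocabulary (get_feature_names in older scikit-learn)
    print(bow.toarray())                       # one row of ones and zeros per sentence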

In order to add more context to the vocabulary, tokens may be grouped together. This is called the N-gram approach. An N-gram is a sequence of N tokens; for example, a 2-gram (bigram) is a sequence of two words, while a trigram is a sequence of three.

Once the vocabulary is chosen, be it based on 1-, 2-, or 3-grams, the occurrences of the grams have to be counted. We could simply count them as in the BoW approach, but the downside is that very frequent words become too important. Thus, the most popular weighting method is term frequency-inverse document frequency (TF-IDF).

High-level explanation of TF-IDF

TF-IDF consists of term frequency (TF), which captures the importance of a gram with respect to the length of the document, and inverse document frequency (IDF), which captures in how many documents of the corpus the gram occurs with respect to the total number of documents, thus highlighting the rarity of the word. Intuitively, a word has a higher TF-IDF score if it occurs frequently in a document but infrequently in the set of all documents. The figure below shows an example of a matrix created using the TF-IDF method on the previously seen example sentences. Notice how the score for the word fox differs from the ones assigned to the more frequent rabbit.

TF-IDF feature matrix created from the example sentences
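The same kind of matrix can be produced with scikit-learn's TfidfVectorizer; the ngram_range parameter also covers the N-gram idea from above (a sketch on made-up sentences, not the article's exact setup):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the quick brown fox jumped over the lazy dog",
        "the rabbit jumped over the lazy dog",
        "the rabbit ate the carrot",
    ]
    tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
    matrix = tfidf.fit_transform(corpus)
    print(matrix.shape)  # (number of sentences, size of the 1- and 2-gram vocabulary)
    # Rare grams such as 'fox' receive higher weights than frequent ones such as 'rabbit'.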

Advanced strategies

Though the count-based approaches can capture the sequence of words (n-grams), they do not capture the semantic context of the words, which is at the core of many NLP applications. Word embedding techniques are used to overcome this problem: the vocabulary is mapped to vectors in such a way that words appearing in similar contexts end up close to each other.

Word2Vec is a framework from Google that uses shallow neural networks to train word embedding models [3]. There are two types of Word2Vec algorithms: Skip-Gram, which predicts the surrounding context given a word, and Continuous Bag of Words (CBOW), which predicts a word given its surrounding context.
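A minimal sketch of training Word2Vec embeddings with gensim (a library not mentioned in the article; the toy corpus and parameters are assumptions):

    from gensim.models import Word2Vec

    sentences = [                       # a toy corpus of already-tokenized sentences
        ["the", "quick", "brown", "fox"],
        ["the", "lazy", "brown", "dog"],
        ["the", "rabbit", "ate", "the", "carrot"],
    ]
    # sg=1 selects Skip-Gram, sg=0 selects CBOW (gensim >= 4; older versions use `size`)
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["fox"].shape)         # (50,) dense vector for 'fox'
    print(model.wv.most_similar("fox"))  # nearest neighbours in the embedding space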

The Global Vectors method, GloVe, uses co-occurrence statistics to create the vector space [4]. It is an alternative to Word2Vec that often gives better word embeddings. The figures below show an example of GloVe word embeddings on the example sentences and a graphical representation of the embeddings. As one would expect, similar concepts are close by.

Feature matrix created using GloVe embeddings
Word vectors projected to a 2D space

Another evolution of Word2Vec, developed by Facebook, is called fastText. fastText takes subword information (character n-grams) into account while constructing the vector space [5].

Supervised learning

Supervised machine learning tasks are divided into two groups based on the format of the label (also called the target). If the target is a categorical value (cat/dog), it is a classification problem; if the target is numerical (the price of a house), it is a regression problem. When dealing with text, we mostly encounter classification problems.

Typical supervised learning pipeline

The figure above shows a typical workflow of a text classification system. We start by dividing the data into a training and a testing set. Both the training and the test data have to be pre-processed and normalized, after which features can be extracted; the most popular feature extraction techniques for text data were covered in the previous sections. Once the text data has been converted into numeric form, machine learning algorithms can be applied to it. This process is called training the model: the model learns patterns from the features to predict the labels. The model can be optimized for better performance by adjusting its parameters through a process called hyperparameter tuning. The resulting model is then evaluated on previously unseen test data. The performance of the model is measured using various metrics, such as accuracy, precision, recall, and F1 score, to name a few. In essence, these scores compare the true labels of the data to the predicted labels.
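A minimal end-to-end sketch of such a pipeline with scikit-learn (toy data, and pre-processing omitted for brevity):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    texts = ["win a free prize now", "meeting at noon", "free money offer", "lunch tomorrow?"]
    labels = ["spam", "ham", "spam", "ham"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=0
    )

    model = Pipeline([
        ("features", TfidfVectorizer()),             # feature extraction
        ("clf", LogisticRegression(max_iter=1000)),  # classifier
    ])
    model.fit(X_train, y_train)                                   # training
    print(classification_report(y_test, model.predict(X_test)))  # evaluation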

Typical algorithms that are used for text classification are:

  • Multinomial Naive Bayes — belongs to the family of Naive Bayes algorithms, which are built on Bayes’ theorem and the assumption that the features are independent of each other. The multinomial variant works with count-based features, such as BoW, and handles classification tasks with more than two different labels (multi-class classification).
  • Logistic Regression — an algorithm that uses the sigmoid function to predict categorical values. The popular sklearn package allows the model’s parameters to be tuned so that the algorithm becomes usable for multi-class classification as well.
  • Support Vector Machines (SVM) — an algorithm that uses a line or a hyper-plane (when there are more than two features, and thus a multidimensional space) to separate the classes.
  • Random Forest — an ensemble method that trains a number of decision trees on various subsets of data in parallel.
  • Gradient Boosting Machine (GBM) — a family of ensemble methods that train a sequence of weak learners, such as decision trees, to achieve accurate results. XGBoost is one of the most popular implementations of this family of algorithms.

The last two items in the list are ensemble methods that combine many predictive models to achieve better generalization. The predictions of ensemble methods are usually more stable than those of individual models, and ensembles tend to work better on larger data sets. Yet, ensemble methods do not necessarily work better on textual data, as Sarkar demonstrates in [6].
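Because scikit-learn exposes a common interface, the algorithms listed above can be swapped into the same pipeline; a rough comparison sketch (toy data again, so the scores mean little):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    texts = ["win a free prize now", "meeting at noon", "free money offer", "lunch tomorrow?"]
    labels = ["spam", "ham", "spam", "ham"]

    for clf in (MultinomialNB(), LinearSVC(), RandomForestClassifier(n_estimators=100)):
        model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
        model.fit(texts, labels)
        print(type(clf).__name__, model.score(texts, labels))  # training-set accuracy only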

Evaluation metrics

Confusion matrix and the metrics derived from it

The confusion matrix is one of the simplest and most intuitive tools available for evaluating machine learning models. It shows the relation between the actual values and the predicted values. Though the confusion matrix is a valuable tool in itself, the terms associated with it serve as the basis for other metrics. The important terms are:

  • True Positives — the cases in which we predicted Positive and the actual output was also Positive.
  • True Negatives — the cases in which we predicted Negative and the actual output was Negative.
  • False Positives — the cases in which we predicted Positive and the actual output was Negative.
  • False Negatives — the cases in which we predicted Negative and the actual output was Positive.

The metrics derived from the confusion matrix are:

  • Accuracy — the number of correct predictions made by the model over all the predictions made.
  • Precision — the number of correct positive cases over all the positive predictions, in other words, how many selected items are relevant.
  • Recall — the number of correct positive cases over all the actual positive occurrences, in other words, how many relevant items are selected.
  • F1 score — a single score combining precision and recall via their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall). The harmonic mean equals the ordinary average when precision and recall are equal, but when they differ, it is pulled towards the smaller of the two values.

Accuracy is a useful metric only when the classes contain approximately equal numbers of data points, i.e. when the data set is balanced. All four metrics range from 0 to 1, with 1 being the best and 0 the worst score.
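All of these metrics are available in scikit-learn; a small sketch with made-up labels and predictions:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # actual labels
    y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # model predictions

    print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]] = [[3, 1], [1, 3]]
    print("accuracy :", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
    print("precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
    print("recall   :", recall_score(y_true, y_pred))     # 3/4 = 0.75
    print("f1       :", f1_score(y_true, y_pred))         # 0.75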

Unsupervised learning

Unsupervised machine learning techniques, such as clustering, can be used when a data set to be analyzed does not have labels. The intention behind clustering, a branch of unsupervised learning, is to group together similar objects.

Examples of clustering. Source

There are several categories of clustering algorithms available:

  • Connectivity-based clustering — also known as hierarchical clustering, it connects data points based on the distance between them. There are two strategies for connecting the points: agglomerative, a “bottom-up” approach where each data point starts as its own cluster and pairs of clusters are merged iteratively, and divisive, a “top-down” approach where the whole data space starts as one cluster that is split recursively. Agglomerative hierarchical clustering requires two additional choices: a distance metric that shows how similar two data points are (Euclidean, Hamming, and cosine distances are typical examples) and a linkage criterion that shows how similar two groups of data points are.
  • Centroid-based clustering — data is divided into clusters based on the points’ closeness to the cluster centroids. K-means is the most popular algorithm of this type (a minimal sketch follows this list). The basic procedure is as follows: (1) select k, the number of clusters; (2) assign the data points to clusters; (3) compute the cluster centroids; (4) reassign each data point to its closest centroid; (5) repeat the previous two steps until the centroids no longer change.
  • Density-based clustering — the data space is divided into regions by density. DBSCAN and OPTICS are two popular algorithms that extract the dense regions of the data space, leaving the “noisy” data points in the sparse regions unclustered. OPTICS tries to overcome DBSCAN’s weakness of performing badly near cluster borders and on data sets of varying density.
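The k-means sketch referenced in the centroid-based item above, applied to TF-IDF features of a few toy documents:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the fox jumped over the dog",
        "a dog chased the fox",
        "the rabbit ate the carrot",
        "rabbits love carrots",
    ]
    X = TfidfVectorizer().fit_transform(docs)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # k = 2 clusters
    print(kmeans.fit_predict(X))  # cluster label assigned to each document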

Text summarization

Text summarization can be split into two branches: topic modeling and automated text summarization. Automated text summarization is the process of using ML algorithms to create summaries of a document or a set of documents. These algorithms perform best with a large number of documents and/or long documents.

Topic modeling, on the other hand, focuses on extracting themes from a collection of documents. Topic models are often called probabilistic statistical models because they use statistical techniques, such as singular value decomposition (SVD), to uncover latent semantic structures in text. SVD is a matrix-factorization technique from linear algebra that decomposes the feature matrix into smaller components. Methods such as Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NNMF) take advantage of such techniques to divide a collection of documents into topics, which are essentially clusters of words, as illustrated below. Topic modeling algorithms tend to produce better results when the texts are diverse.

The essence of topic modeling. Source
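A small LDA sketch with scikit-learn shows how topics emerge as clusters of words (the corpus is made up; the article's own examples use different data):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the stock market fell as investors sold shares",
        "the team won the match in the final minute",
        "shares of the bank rose after strong earnings",
        "the coach praised the players after the game",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)  # two topics
    lda.fit(X)

    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_words = [terms[j] for j in topic.argsort()[-5:][::-1]]  # top 5 words per topic
        print(f"topic {i}:", top_words)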

Conclusion

I have given a brief overview of some of the important topics that you will encounter once you start a project that involves natural language processing and machine learning. I have barely scratched the surface of this field. I didn’t even touch upon the exciting developments in language modeling using transfer learning, about which you can read in this insightful post from Sebastian Ruder.

In my opinion, this is a very exciting time to start practising applied NLP in the industry. As Yoav Goldberg said at a conference I recently attended, most of the industry is still using regular expressions to solve problems. By understanding the theory I have presented in this post and applying it to real-life problems, you can make some people really happy.

Yoav Goldberg presenting the current state of applied NLP at spaCy IRL

References

[1] H. Brink, J. W. Richards, and M. Fetherolf, Real-world Machine Learning (2017), Manning Publications

[2] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms (2014), Cambridge University Press

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013), Advances in Neural Information Processing Systems 26

[4] J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation (2014), In EMNLP.

[5] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information (2016), arXiv preprint

[6] D. Sarkar. Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing (2019), Apress
