Natural Language Processing

This post provides a brief overview of Natural Language Processing. Its intent is not to cover the field exhaustively but rather to offer a collection of leads for additional reading. A key research area for human-computer interaction, Natural Language Processing focuses on the interaction between computers and the natural languages spoken by humans, with the goal of allowing computers to both understand and generate natural language. Natural Language Processing is increasingly related to Machine Learning, as techniques are shifting from manually designing large sets of rules to inferring these rules from large corpora of text.

This post is structured around the following NLP concepts and tasks:

  1. A basic concept: n-grams
  2. Tagging using Hidden Markov Models
  3. Parsing using Probabilistic Context-Free Grammars
  4. Spell Checking
  5. Learning of Word Embeddings

A basic concept: n-grams

An n-gram is simply a contiguous sequence of n elements from a text. These elements can be words, characters, or even phonemes. N-grams can notably be used to build n-gram models: models that predict the next word from the n-1 words that precede it (thus assuming that the prediction is independent of any words beyond that context of n-1 words). The value chosen for n determines the size of the context the resulting model uses to predict the next word.
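
For concreteness, here is a minimal sketch of extracting n-grams in Python; the `ngrams` helper and the toy sentence are illustrative and not taken from any particular library:

```python
# Minimal sketch: extracting word and character n-grams from a text.
def ngrams(elements, n):
    """Return the contiguous n-grams over a sequence of elements."""
    return [tuple(elements[i:i + n]) for i in range(len(elements) - n + 1)]

text = "the cat sat on the mat"
print(ngrams(text.split(), 2))  # word bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(list(text), 3))    # character trigrams: ('t', 'h', 'e'), ('h', 'e', ' '), ...
```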

As an example application of the concept of n-grams, I implemented a very simple next word predictor in Python. The predictions are made using conditional probabilities learned by reading bigrams from a corpus. You can find it in the following GitHub repository:

https://github.com/npapernot/bigram-next-word-predictor
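
The repository holds the actual implementation; purely as a sketch of the underlying idea (this is not the code from that repository), the conditional probability P(next word | previous word) can be estimated from bigram counts as follows:

```python
# Sketch of a bigram next-word predictor: P(next | previous) estimated from
# bigram counts over a toy corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its conditional probability."""
    following = counts[word]
    if not following:          # word never seen as the first element of a bigram
        return None, 0.0
    best, best_count = following.most_common(1)[0]
    return best, best_count / sum(following.values())

print(predict_next("the"))  # ('cat', 0.666...) on this toy corpus
```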

Tagging using Hidden Markov Models

Hidden Markov Models (HMMs) represent probability distributions over sequences of observations sampled at discrete time intervals. At each time step, the observation is assumed to be generated by a process that depends on a discrete hidden state not accessible to the observer. This hidden state is assumed to satisfy the Markov property: at any given time t, it depends only on the previous state at time t-1.

An example application of Hidden Markov Models is the Part-of-Speech Tagging problem. The problem consists of tagging the words of a sentence with their part of speech, in other words the class they belong to according to their grammatical properties: noun, verb, etc. The Viterbi algorithm is a dynamic-programming algorithm based on a Hidden Markov Model that solves this problem in polynomial time (with respect to the number of possible tags). More details can be found in the following note.
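
To make this concrete, here is a small Viterbi decoder on a toy HMM; the tag set and the transition and emission probabilities below are made up for illustration, whereas a real tagger would estimate them from a tagged corpus:

```python
# Toy Viterbi decoder for part-of-speech tagging over a hand-crafted HMM.
tags = ["DET", "NOUN", "VERB"]

start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}   # P(tag at t=0)
trans = {                                        # P(tag_t | tag_{t-1})
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6,  "NOUN": 0.3, "VERB": 0.1},
}
emit = {                                         # P(word | tag)
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    """Return the most probable tag sequence for the observed words."""
    # best[i][t] = probability of the best tag sequence ending in tag t at step i
    best = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans[p][t] * emit[t].get(words[i], 0.0), p)
                for p in tags
            )
            best[i][t], back[i][t] = prob, prev
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

print(viterbi(["the", "dog", "barks"]))  # expected: ['DET', 'NOUN', 'VERB']
```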

Parsing using Probabilistic Context-Free Grammars

Probabilistic Context-Free Grammars (PCFGs) extend Context-Free Grammars (CFGs) by defining a probability distribution over the possible derivations when a grammar is ambiguous. More precisely, CFGs are extended by defining a parameter for each rule `a -> b`, which can be seen as the conditional probability of choosing that rule in a left-most derivation given that the non-terminal being expanded is `a`. The probability of a left-most derivation is then computed as the product of the conditional probabilities of all the rules chosen to yield the derivation. The conditional probability of each rule can be estimated over a corpus of derivations by dividing the number of times the rule was chosen by the number of times the non-terminal it expands was seen. Using PCFGs to parse sentences has applications in relation extraction and translation, for instance. The following document covers PCFGs in more detail.
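
As an illustration of how a derivation is scored, the sketch below multiplies the conditional probabilities of the rules used in a toy left-most derivation; the grammar and its parameters are invented here rather than estimated from a treebank:

```python
# Toy PCFG: q(A -> beta) = count(A -> beta) / count(A), here simply asserted.
rule_probs = {
    ("S",   ("NP", "VP")): 1.0,
    ("NP",  ("DET", "N")): 0.7,
    ("NP",  ("N",)):       0.3,
    ("VP",  ("V", "NP")):  0.6,
    ("VP",  ("V",)):       0.4,
    ("DET", ("the",)):     1.0,
    ("N",   ("dog",)):     0.5,
    ("N",   ("cat",)):     0.5,
    ("V",   ("chases",)):  1.0,
}

def derivation_probability(rules):
    """Product of the conditional probabilities of the rules used in a derivation."""
    p = 1.0
    for lhs, rhs in rules:
        p *= rule_probs[(lhs, rhs)]
    return p

# Left-most derivation of "the dog chases the cat":
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("DET", "N")), ("DET", ("the",)), ("N", ("dog",)),
    ("VP", ("V", "NP")), ("V", ("chases",)),
    ("NP", ("DET", "N")), ("DET", ("the",)), ("N", ("cat",)),
]
print(derivation_probability(derivation))  # 1.0 * 0.7 * 1.0 * 0.5 * ... = 0.0735
```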

Spell Checking

Spell checking is one of the applications of natural language processing that impacts billions of users daily. A good introduction to spell checking can be found on Peter Norvig’s webpage. The article introduces a simple 21-line spell checker implementation in Python combining a simple language model and a simple error model to predict the word a user intended to type. The language model estimates how likely a given word `c` is in the language for which the spell checker is designed; this can be written as `P(c)`. The error model estimates the probability `P(w|c)` of typing the misspelled version `w` conditioned on the intention of typing the correctly spelled word `c`. The spell checker then returns the word `c` with the highest value of `P(w|c)P(c)` among all possible words in the language. More details and source code can be found on Peter Norvig’s webpage.
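
The sketch below condenses that idea; it is not Norvig's code, and it replaces the error model `P(w|c)` with the simplifying assumption that every known word within one edit of the typo is an equally likely intention:

```python
# Condensed spell-checker sketch: language model from word counts, candidates
# generated by single edits (deletes, transposes, replaces, inserts).
from collections import Counter

WORDS = Counter("the quick brown fox jumps over the lazy dog".split())  # toy corpus

def P(word):
    """Language model: relative frequency of `word` in the corpus."""
    return WORDS[word] / sum(WORDS.values())

def edits1(word):
    """All strings one edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Return the known candidate with the highest language-model probability."""
    candidates = {w for w in edits1(word) if w in WORDS} or {word}
    return max(candidates, key=P)

print(correction("teh"))  # 'the'
```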

Learning of Word Embeddings

Deep Neural Networks have enabled very impressive results in automatically learning word embeddings from simple natural language processing tasks. Briefly, word embeddings are functions that map the words of a language to high-dimensional vectors, which makes it possible to define operations on words such as comparison, addition, and subtraction. More details can be found in the following blog post.
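
As a toy illustration of such operations, the sketch below runs the classic king - man + woman analogy over made-up three-dimensional vectors; real embeddings are learned from large corpora and typically have hundreds of dimensions:

```python
# Arithmetic on toy word embeddings: the vectors are placeholders chosen only
# to make the analogy work, not learned embeddings.
import numpy as np

embedding = {
    "king":  np.array([0.8, 0.65, 0.10]),
    "queen": np.array([0.8, 0.05, 0.70]),
    "man":   np.array([0.1, 0.70, 0.05]),
    "woman": np.array([0.1, 0.10, 0.65]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land near queen.
target = embedding["king"] - embedding["man"] + embedding["woman"]
best = max(embedding, key=lambda w: cosine(embedding[w], target))
print(best)  # 'queen' with these toy vectors
```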

An exciting application of word embeddings is machine translation. As introduced by Zou et al. (cf. infra), word embeddings can be learned simultaneously in two languages. By constraining pairs of words from both languages that are known translations of each other to share similar embeddings, one can construct an embedding space common to the two languages. Such bilingual word embeddings allow for automatic translation of words whose translation is unknown, by using their embedding as an intermediary between source and target words.


If you have any suggestions on how to improve this post, feel free to reach out to me on Twitter http://twitter.com/NicolasPapernot or in the comments below.