Natural Language Processing

Nicolas Papernot
Jan 23, 2016

This post provides a brief overview of Natural Language Processing. Its intent is not to cover the field exhaustively but rather to offer a collection of leads for additional reading. A key research area for human-computer interaction, Natural Language Processing studies how computers can understand and generate the natural languages spoken by humans. The field is increasingly tied to Machine Learning, as techniques shift from manually designing large sets of rules to inferring those rules from large corpora of text.

This post is structured around the following NLP concepts and tasks:

  1. A basic concept: n-grams
  2. Tagging using Hidden Markov Models
  3. Parsing using Probabilistic Context-Free Grammars
  4. Spell Checking
  5. Learning of Word Embeddings

A basic concept: n-grams

An n-gram is a contiguous sequence of n items (typically words) drawn from a text; a bigram is the special case n = 2. As an example application of this concept, I implemented a very simple next-word predictor in Python. The predictions are made using conditional probabilities learned by counting bigrams in a corpus. You can find it in the following GitHub repository:

https://github.com/npapernot/bigram-next-word-predictor
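
To make the idea concrete, below is a minimal, self-contained sketch of such a bigram predictor (independent of the repository above). It counts bigrams in a toy corpus and predicts, for a given word, its most frequent follower, i.e. the argmax of the conditional probability P(next word | previous word).

from collections import defaultdict, Counter

def train_bigrams(sentences):
    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    # Return the most likely next word, i.e. the argmax of P(next | word).
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = ["the cat sat on the mat", "the cat ate the mouse"]
counts = train_bigrams(corpus)
print(predict_next(counts, "the"))  # prints 'cat', the most frequent follower of 'the'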

Tagging using Hidden Markov Models

An example application of Hidden Markov Models is the Part-of-Speech Tagging problem. The problem consists of tagging each word of a sentence with its part of speech, that is, the class it belongs to according to its grammatical properties: noun, verb, etc. The Viterbi algorithm is a dynamic-programming algorithm, based on a Hidden Markov Model, that solves this problem in time polynomial in the sentence length and the number of possible tags. More details can be found in the accompanying note.
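
To illustrate, here is a minimal sketch of the Viterbi algorithm. It assumes the HMM parameters (initial, transition, and emission probabilities) have already been estimated from a tagged corpus; the tiny two-tag model at the bottom is made up purely for illustration.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # start_p[t]: P(first tag = t)
    # trans_p[t1][t2]: P(tag t2 | previous tag t1)
    # emit_p[t][w]: P(word w | tag t)
    # best[i][t] = (probability, backpointer) of the best tag sequence
    # ending in tag t at position i.
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-8), None) for t in tags}]
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], 1e-8), p)
                for p in tags
            )
            best[i][t] = (prob, prev)
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy model: two tags and hand-picked probabilities, for illustration only.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "runs": 0.1}, "VERB": {"dogs": 0.1, "runs": 0.6}}
print(viterbi(["dogs", "runs"], tags, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']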

Parsing using Probabilistic Context-Free Grammars

Spell Checking

Learning of Word Embeddings

An exciting application of word embeddings is machine translation. As introduced by Zou et al., word embeddings can be learned simultaneously in two languages. By constraining pairs of words from both languages that are known to be translations of each other to have similar embeddings, one can construct an embedding space common to the two languages. Such bilingual word embeddings make it possible to automatically translate words whose translation is unknown, by using the word embedding as an intermediary between the source word and the translated word.
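
As a rough sketch of that last idea, the snippet below translates a word by nearest-neighbor search in a shared embedding space. The two-dimensional vectors are made up for illustration and would normally be learned jointly from bilingual data.

import numpy as np

# Toy bilingual embedding space: hand-written vectors standing in for
# embeddings learned jointly from monolingual text plus known translation pairs.
english = {"cat": np.array([0.9, 0.1]), "dog": np.array([0.1, 0.9])}
french = {"chat": np.array([0.88, 0.12]), "chien": np.array([0.15, 0.85])}

def translate(word, source, target):
    # Translate `word` by picking the target-language word whose embedding
    # is closest (by cosine similarity) in the shared space.
    v = source[word]
    def cosine(u, w):
        return float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))
    return max(target, key=lambda t: cosine(v, target[t]))

print(translate("cat", english, french))    # 'chat'
print(translate("chien", french, english))  # 'dog'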


If you have any suggestions on how to improve this post, feel free to reach out to me on Twitter http://twitter.com/NicolasPapernot or in the comments below.