The basics of Language Modeling
Notes from CS224n lectures 6 and 7.
Language modeling is one of the benchmark tasks of NLP. In its simplest form, it consists of predicting the most probable next word given the words that precede it.
This task has many applications, such as Google autocomplete and the word suggestions in most modern mobile keyboards.
The systems used to perform this task learn from an extensive corpus of text and follow a supervised learning approach in which the next word at each position serves as the label, so no manual annotation is needed.
N-Gram language models
Intuitively, more common words like “cat” or “dog” should tend to have higher probabilities than less common ones such as “aardvark” or “kingfisher”. Thus, a good starting point could be the frequency of words in the corpus. A system that takes into account only the number of appearances of a word, normalized by the number of words in the corpus, is called a unigram language model. Similarly, bigram language models consider the frequency of pairs of consecutive words. For example, if in our English corpus the pair [united, states] appears more often than [united, the], a bigram language model would assign a higher probability to “states” than to “the” as the word following “united”, despite the much higher overall frequency of “the”. Higher-order n-gram language models also exist, but as the length of the word sequences increases, their frequency in the corpus decreases exponentially. These models thus have a sparsity problem and struggle with infrequent word sequences.
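As a rough illustration of the counting idea, here is a minimal sketch of unigram and bigram estimates. The toy corpus and the function names `unigram_prob` and `bigram_prob` are made up for this example; real n-gram models also add smoothing, which is left out here.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "the united states and the united kingdom and the united states".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # pairs of consecutive words


def unigram_prob(word):
    """Relative frequency of `word` in the corpus."""
    return unigram_counts[word] / len(corpus)


def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]
```

In this corpus “the” is more frequent than “states” overall, yet `bigram_prob("united", "states")` is 2/3 while `bigram_prob("united", "the")` is 0. That zero for a pair that was simply never observed is the sparsity problem in miniature.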
Neural language models
Word embeddings can help with the sparsity problem, since we can ditch the one-hot representation of words in favor of dense vectors. We can therefore build neural language models that take into consideration more words than n-gram models do. Despite this advantage, however, the number of words considered remains fixed, and so neural language models still struggle with long-term dependencies.
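A fixed-window neural language model can be sketched as follows. This is only an untrained forward pass under assumed sizes: the vocabulary, the dimensions, the random weights, and the name `next_word_distribution` are all placeholders, not the architecture from the lectures.

```python
import math
import random

random.seed(0)

vocab = ["the", "united", "states", "kingdom"]  # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}
EMBED_DIM, WINDOW, HIDDEN_DIM = 3, 2, 4  # arbitrary sizes


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


embeddings = random_matrix(len(vocab), EMBED_DIM)         # one dense vector per word
W_hidden = random_matrix(HIDDEN_DIM, WINDOW * EMBED_DIM)  # window -> hidden layer
W_output = random_matrix(len(vocab), HIDDEN_DIM)          # hidden -> word scores


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context):
    """Untrained forward pass over exactly WINDOW previous words."""
    if len(context) != WINDOW:
        raise ValueError("this model only accepts a fixed-size context")
    # Concatenate the embeddings of the context words into one input vector.
    x = [value for word in context for value in embeddings[word_to_id[word]]]
    hidden = [math.tanh(h) for h in matvec(W_hidden, x)]
    return softmax(matvec(W_output, hidden))


probs = next_word_distribution(["the", "united"])
```

The shape of `W_hidden` depends on `WINDOW`, which is why the context size is baked into the model: passing three words instead of two raises an error.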
Recurrent Neural Networks
To increase the context available to our language model, we can use a recurrent neural network (RNN), a type of stateful neural network. For each word, the network maintains a hidden state influenced by both the current word and the previous hidden state. This architecture removes the fixed size of the context, alleviating the problem of simpler neural language models.
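The recurrence can be sketched in a few lines, assuming the common formulation h_t = tanh(W_x x_t + W_h h_{t-1}) with the bias omitted for brevity. The sizes, random weights, and the names `rnn_step` and `encode` are illustrative assumptions.

```python
import math
import random

random.seed(0)
INPUT_DIM, HIDDEN_DIM = 3, 4  # arbitrary sizes


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


W_x = random_matrix(HIDDEN_DIM, INPUT_DIM)   # current input -> hidden
W_h = random_matrix(HIDDEN_DIM, HIDDEN_DIM)  # previous hidden -> hidden


def rnn_step(x, h_prev):
    """One step: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]


def encode(sequence):
    """Run the same step over a sequence of any length, starting from a zero state."""
    h = [0.0] * HIDDEN_DIM
    for x in sequence:
        h = rnn_step(x, h)
    return h


def random_inputs(length):
    return [[random.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(length)]


h_short = encode(random_inputs(2))
h_long = encode(random_inputs(10))
```

The same weights are reused at every step, so sequences of 2 and 10 inputs both produce a hidden state of the same size.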
However, because of the way gradients are calculated through time, the loss at each step is influenced considerably by the steps just before it and very little by distant ones, because those gradients tend to become exponentially small as the distance from the step increases. This is known as the vanishing gradient problem.
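To make the shrinking gradient concrete, here is a sketch using a scalar RNN, h_t = tanh(w * h_{t-1} + x_t). The weight value, the constant inputs, and the name `gradient_through_time` are assumptions chosen for illustration; a real network has weight matrices rather than a single number.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step contributes a factor w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return grad


w = 0.5  # recurrent weight with magnitude below 1 (assumption)
grads = {steps: gradient_through_time(w, [0.1] * steps) for steps in (1, 5, 20)}
```

Each factor has magnitude at most |w|, so after 20 steps the gradient is bounded by 0.5 to the power 20, which is below one in a million.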
LSTM
One of the architectures proposed to solve the vanishing gradient problem is called Long Short-Term Memory (LSTM). It works by keeping both a hidden state and a memory cell, along with three gates that control reading from the cell (into the hidden state), writing to the cell, and deleting from the cell. They are called the output, input, and forget gates, respectively. This architecture provides a dedicated way to maintain long-term dependencies: when the forget gate is set to keep everything, the content of the memory cell is carried forward unchanged.
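A single LSTM step can be sketched as below, assuming the standard formulation in which each gate is a sigmoid over the concatenation of the current input and the previous hidden state. The sizes, random weights, and helper names (`random_layer`, `apply_layer`, `lstm_step`) are illustrative assumptions.

```python
import math
import random

random.seed(0)
INPUT_DIM, HIDDEN_DIM = 3, 2  # arbitrary sizes


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_layer():
    """Weights over [x_t ; h_{t-1}] plus a zero bias, for one gate or the candidate."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(INPUT_DIM + HIDDEN_DIM)]
               for _ in range(HIDDEN_DIM)]
    return weights, [0.0] * HIDDEN_DIM


layers = {name: random_layer() for name in ("forget", "input", "output", "candidate")}


def apply_layer(name, concat, activation):
    weights, bias = layers[name]
    return [activation(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev):
    concat = x + h_prev                              # list concatenation [x_t ; h_{t-1}]
    f = apply_layer("forget", concat, sigmoid)       # how much of the cell to keep
    i = apply_layer("input", concat, sigmoid)        # how much new content to write
    o = apply_layer("output", concat, sigmoid)       # how much of the cell to read out
    g = apply_layer("candidate", concat, math.tanh)  # new content proposed for the cell
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


x, h0, c0 = [1.0, 0.0, -1.0], [0.0] * HIDDEN_DIM, [0.5, -0.5]
h, c = lstm_step(x, h0, c0)

# Push the forget gate towards 1 and the input gate towards 0 via their biases:
# the cell is then carried forward almost unchanged.
layers["forget"] = (layers["forget"][0], [20.0] * HIDDEN_DIM)
layers["input"] = (layers["input"][0], [-20.0] * HIDDEN_DIM)
_, c_kept = lstm_step(x, h0, c0)
```

With the biases forced this way, `c_kept` stays within a tiny margin of the original cell `[0.5, -0.5]`. Because the cell update is additive rather than a repeated squashing, information can survive many steps.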
These notes are intended as a very barebones summary of lectures six and seven from the Stanford CS224n class. Credits for the material belong to the professor, Chris Manning, and the TAs of the course.
More material is available on the course’s site and on YouTube.