In the era of deep models, n-gram models may not seem relevant anymore, but I think understanding the n-gram model is crucial to understanding language models in general.
I’m going to use my favorite language model tool, KenLM, here. To set it up, you can follow these steps:
$ git clone https://github.com/kpu/kenlm.git
$ cd kenlm
$ cmake -DCMAKE_CXX_FLAGS="-O3 -g" -Bbuild
$ make -Cbuild -j
$ sudo make -Cbuild install
Let’s start with the train set. For simplicity, we will consider an all-lowercase, word-level model. Create a train.txt file with the following contents:
to be or not to be that is the question
to be or not to be is not the question
Then, train a 2-gram model with KenLM
$ lmplz --order 2 --discount_fallback < train.txt > 2gram.arpa
Here, --discount_fallback is needed so that we can train a model on this tiny corpus. Now, let’s take a look at the 2gram.arpa language model:
\data\
ngram 1=11
ngram 2=13
\1-grams:
-1.30103 <unk> 0
0 <s> -0.30103
-1.0532455 </s> 0
-0.89645934 to -0.42596874
-1.0532455 be -0.30103
-1.0532455 or -0.30103
-0.89645934 not -0.30103
-1.0532455 that -0.30103
-0.89645934 is -0.30103
-0.89645934 the -0.30103
-1.0532455 question -0.30103
\2-grams:
-0.26421693 question </s>
-0.24913573 <s> to
-0.40143394 not to
-0.18165989 to be
-0.5313119 be or
-0.24913573 or not
-0.50381577 is not
-0.7715207 be that
-0.7247773 be is
-0.24913573 that is
-0.6380301 not the
-0.50381577 is the
-0.26421693 the question
\end\
The model lists n-grams with their probabilities and back-off weights (bow) as a text file. The first number on each line is the conditional probability of the n-gram; the optional last number is the back-off weight. Note that all values are in log10. You can find the ARPA format specification here.
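To make the file layout concrete, here is a minimal Python sketch that parses an ARPA listing like the one above into two dictionaries: n-gram to log10 probability, and n-gram to back-off weight. This only handles the simple whitespace-separated layout shown here, not the full ARPA specification (the helper name parse_arpa and the truncated example data are mine).

```python
def parse_arpa(lines):
    """Parse ARPA-format lines into (logprob, bow) dicts."""
    logprob, bow = {}, {}
    in_ngrams = False
    for line in lines:
        line = line.strip()
        if line.endswith("-grams:"):   # section header like \2-grams:
            in_ngrams = True
            continue
        # Skip \data\, \end\, blank lines, and the "ngram N=..." counts.
        if not line or line.startswith("\\") or line.startswith("ngram"):
            continue
        if not in_ngrams:
            continue
        parts = line.split()
        # Layout: logprob, tokens..., then an optional trailing bow.
        try:
            trailing_bow = float(parts[-1]) if len(parts) >= 3 else None
        except ValueError:
            trailing_bow = None
        if trailing_bow is not None:
            ngram = " ".join(parts[1:-1])
            bow[ngram] = trailing_bow
        else:
            ngram = " ".join(parts[1:])
        logprob[ngram] = float(parts[0])
    return logprob, bow

# A truncated piece of the model above, for illustration.
arpa_text = """\\data\\
ngram 1=3

\\1-grams:
0 <s> -0.30103
-1.0532455 that -0.30103
-0.89645934 is -0.30103

\\2-grams:
-0.24913573 that is

\\end\\
"""
logprob, bow = parse_arpa(arpa_text.splitlines())
print(logprob["that is"])  # -0.24913573
print(bow["<s>"])          # -0.30103
```

One ambiguity this sketch papers over: a line with three fields could in principle be a bigram without a bow rather than a unigram with one, which is why it simply checks whether the last field parses as a float.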
Let’s do an example to understand exactly how to read this. Say we want to calculate the probability of the following sentence:
that is not the question
With n-gram models, there are special tokens <s> and </s> that represent the start and the end of a sentence. We need to insert these into the sentence we want to evaluate:
<s> that is not the question </s>
The first token is <s>, so we would look up the unigram <s> in the model. However, since every sentence starts with <s> anyway, it is customary to skip it and start with the bigram beginning with <s>. In our case, that is the bigram <s> that.
Unfortunately, we can’t find this bigram in our language model, because the model has never seen a sentence that starts with the token that. In this case, we need to take the back-off path:

P(that | <s>) = α(<s>) × P(that)

Here, α is the back-off weight. In our case, α(<s>) corresponds to the line
0 <s> -0.30103
where -0.30103 is the log10 of the back-off weight. Now that we have taken care of the <s> token in the bigram we originally wanted to find, we look up the remaining token, which is the simple unigram that. This corresponds to the line
-1.0532455 that -0.30103
Again, the first number -1.0532455 is the log10 of the probability, which is what we need. Adding the two, we get the log10 probability of the bigram as -0.30103 + -1.0532455 = -1.3542755.
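To make the back-off arithmetic concrete, here is a quick Python check of this step, with the constants copied from the 2gram.arpa file above:

```python
# Back-off for the missing bigram "<s> that":
#   log10 P(that | <s>) = bow(<s>) + log10 P(that)
bow_start = -0.30103    # back-off weight of <s>, from the \1-grams section
logp_that = -1.0532455  # unigram log10 probability of "that"

logp_bigram = bow_start + logp_that
print(round(logp_bigram, 7))  # -1.3542755
```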
Let’s proceed. The next token is is, so we should look for the bigram that is. This is present in the model:
-0.24913573 that is
Let’s continue. All the remaining bigrams exist in the model:
-0.50381577 is not
-0.6380301 not the
-0.26421693 the question
-0.26421693 question </s>
So, the total log10 probability of the sentence is simply the sum of all the terms: -1.3542755 + -0.24913573 + -0.50381577 + -0.6380301 + -0.26421693 + -0.26421693, which turns out to be -3.273691.
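As a sanity check, the sum can be reproduced in a few lines of Python, with the per-bigram scores copied from the walkthrough above:

```python
# Per-bigram log10 scores for "<s> that is not the question </s>"
scores = [
    -1.3542755,   # <s> that     (via back-off)
    -0.24913573,  # that is
    -0.50381577,  # is not
    -0.6380301,   # not the
    -0.26421693,  # the question
    -0.26421693,  # question </s>
]
total = sum(scores)
print(round(total, 6))  # -3.273691
```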
In fact, KenLM provides a tool, query, that does this for us.
$ echo "that is not the question" | query 2gram.arpa
that=7 1 -1.3542756 is=8 2 -0.24913573 not=6 2 -0.50381577 the=9 2 -0.6380301 question=10 2 -0.26421693 </s>=2 2 -0.26421693 Total: -3.273691 OOV: 0
Perplexity including OOVs: 3.512490485315309
Perplexity excluding OOVs: 3.512490485315309
Voila! We get the same total log10 probability of -3.273691. Not only that, we see the exact same per-token probabilities as we calculated.
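The reported perplexity follows directly from this total: it is 10 raised to the negative average log10 probability per scored token (six tokens here, counting </s> but not <s>). A quick sketch:

```python
# Perplexity from the total log10 probability reported by query.
total_log10 = -3.273691
n_tokens = 6  # that, is, not, the, question, </s>
ppl = 10 ** (-total_log10 / n_tokens)
print(ppl)  # ~3.5125, matching query's reported perplexity
```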
Let’s do another example.
$ echo "that is that" | query 2gram.arpa
that=7 1 -1.3542756 is=8 2 -0.24913573 that=7 1 -1.3542756 </s>=2 1 -1.3542756 Total: -4.3119626 OOV: 0
Perplexity including OOVs: 11.967147698691244
Perplexity excluding OOVs: 11.967147698691244
See if you can get the same probability yourself. Note that you will need to use the back-off for both the bigram is that and the bigram that </s>.
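If you want to check your work, here is a sketch of that computation: both is that and that </s> are missing from the model, so each takes the back-off path, a back-off weight plus a unigram probability (all constants copied from 2gram.arpa above):

```python
# Per-token log10 scores for "<s> that is that </s>"
scores = [
    -0.30103 + -1.0532455,  # <s> that:  bow(<s>)  + P(that)  (back-off)
    -0.24913573,            # that is:   present in the model
    -0.30103 + -1.0532455,  # is that:   bow(is)   + P(that)  (back-off)
    -0.30103 + -1.0532455,  # that </s>: bow(that) + P(</s>)  (back-off)
]
total = sum(scores)
print(round(total, 5))  # -4.31196, matching query up to float32 rounding
```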
In the next article, we will visualize this n-gram model as a graph, which should be much more intuitive than the .arpa format.