n-gram language model — arpa

TechHara
Oct 17, 2023

In the era of deep models, n-grams may not seem relevant anymore, but I think understanding the n-gram model is crucial to understanding language models in general.

I’m going to use my favorite language model tool, KenLM. To set it up, you can follow these steps

$ git clone https://github.com/kpu/kenlm.git
$ cd kenlm
$ cmake -DCMAKE_CXX_FLAGS="-O3 -g" -Bbuild
$ make -Cbuild -j
$ sudo make -Cbuild install

Let’s start with the training set. For simplicity, we will build an all-lowercase, word-level model. Create a train.txt file with the following two lines

to be or not to be that is the question
to be or not to be is not the question

Then, train a 2-gram model with KenLM

$ lmplz --order 2 --discount_fallback < train.txt > 2gram.arpa

Here, --discount_fallback is needed so that we can train a model on this tiny corpus. Now, let’s take a look at the 2gram.arpa language model

\data\
ngram 1=11
ngram 2=13

\1-grams:
-1.30103 <unk> 0
0 <s> -0.30103
-1.0532455 </s> 0
-0.89645934 to -0.42596874
-1.0532455 be -0.30103
-1.0532455 or -0.30103
-0.89645934 not -0.30103
-1.0532455 that -0.30103
-0.89645934 is -0.30103
-0.89645934 the -0.30103
-1.0532455 question -0.30103

\2-grams:
-0.26421693 question </s>
-0.24913573 <s> to
-0.40143394 not to
-0.18165989 to be
-0.5313119 be or
-0.24913573 or not
-0.50381577 is not
-0.7715207 be that
-0.7247773 be is
-0.24913573 that is
-0.6380301 not the
-0.50381577 is the
-0.26421693 the question

\end\

The model is a text file that lists n-grams together with their probabilities and back-off weights (bow). On each line, the first number is the conditional probability of the last word given the preceding words of the n-gram, and the trailing number, when present, is the back-off weight. Note that all values are in log10. You can find the arpa format specification here.
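
To make the format concrete, here is a minimal sketch in Python (my own illustration, not part of KenLM) that parses the unigram and bigram sections of a 2-gram arpa file into dictionaries. It assumes the 2gram.arpa file we just trained.

# Minimal 2-gram arpa parser (an illustrative sketch, not KenLM's own code)
def load_arpa(path):
    unigrams = {}  # word -> (log10 probability, back-off weight)
    bigrams = {}   # (w1, w2) -> log10 probability
    section = None
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("\\"):  # \data\, \1-grams:, \2-grams:, \end\
                section = line
                continue
            fields = line.split()
            if section == "\\1-grams:":
                logp, word = float(fields[0]), fields[1]
                bow = float(fields[2]) if len(fields) > 2 else 0.0
                unigrams[word] = (logp, bow)
            elif section == "\\2-grams:":
                bigrams[(fields[1], fields[2])] = float(fields[0])
    return unigrams, bigrams

unigrams, bigrams = load_arpa("2gram.arpa")
print(unigrams["that"])         # (-1.0532455, -0.30103)
print(bigrams[("that", "is")])  # -0.24913573

A real arpa file can also contain \3-grams: and higher-order sections, which KenLM handles in general, but this is enough for our 2-gram model.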

Let’s do an example to understand exactly how we read this. Say we want to calculate the probability of the following sentence

that is not the question

With n-gram models, there are special tokens <s> and </s> that represent the start and the end of a sentence. We need to insert these into the sentence we want to evaluate

<s> that is not the question </s>

The first token is <s>, so we could look up the unigram <s> in the model. However, since every sentence starts with <s> anyway, it is customary to skip it and start with the bigram that begins with <s>. In our case, that would be the bigram <s> that.

Unfortunately, we can’t find this bigram in our language model, because the training corpus has no sentence that starts with the word that. In this case, we need to take the back-off path.

https://en.wikipedia.org/wiki/Katz%27s_back-off_model

Here, α is the back-off weight: when a bigram w1 w2 is missing from the model, we estimate P(w2 | w1) as α(w1) · P(w2), which in log10 space becomes bow(w1) + log P(w2). In our case, α(<s>) corresponds to the line

0 <s> -0.30103

where -0.30103 is the log10 back-off weight of <s>. Now that we have taken care of the <s> token in the bigram we originally wanted to find, we look up the remaining token, which is simply the unigram that. This corresponds to the line

-1.0532455 that -0.30103

Again, the first number, -1.0532455, is the log10 probability, which is what we need. Adding the two, we get the log10 probability of that following <s> as -0.30103 + -1.0532455 = -1.3542755.
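
Since these are log10 values, adding them is the same as multiplying the underlying probabilities. Here is a quick sanity check in Python with the numbers copied from the two lines above.

bow_s = -0.30103        # back-off weight of <s>
logp_that = -1.0532455  # unigram log10 probability of "that"
print(bow_s + logp_that)          # -1.3542755
print(10 ** bow_s)                # ~0.5
print(10 ** logp_that)            # ~0.0885
print(10 ** (bow_s + logp_that))  # ~0.0442 = 0.5 * 0.0885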

Let’s proceed. The next token is is, so we should look for the bigram that is. This is present in the model

-0.24913573 that is

Let’s continue. All the remaining bigrams exist in the model

-0.50381577 is not
-0.6380301 not the
-0.26421693 the question
-0.26421693 question </s>

So the total log10 probability of the sentence is simply the sum of all of these

-1.3542755 + -0.24913573 + -0.50381577 + -0.6380301 + -0.26421693 + -0.26421693

which turns out to be -3.273691.
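
Just to double-check the arithmetic, we can sum the six per-token scores in Python.

scores = [-1.3542755, -0.24913573, -0.50381577, -0.6380301, -0.26421693, -0.26421693]
print(sum(scores))  # -3.27369096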

In fact, KenLM provides a useful tool, query, that does this for us.

$ echo "that is not the question" | query 2gram.arpa
that=7 1 -1.3542756 is=8 2 -0.24913573 not=6 2 -0.50381577 the=9 2 -0.6380301 question=10 2 -0.26421693 </s>=2 2 -0.26421693 Total: -3.273691 OOV: 0
Perplexity including OOVs: 3.512490485315309
Perplexity excluding OOVs: 3.512490485315309

Voilà! We get the same total log10 probability of -3.273691. Not only that, we see the exact same per-token probabilities as we calculated. In each word=id n logprob triple printed by query, n is the order of the n-gram that was matched: 2 means the full bigram was found, while 1 means the model backed off to the unigram.
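
The reported perplexity is closely related to this total. Assuming the usual definition, it is 10 raised to the negative total log10 probability divided by the number of scored tokens, which here is six (five words plus </s>).

total = -3.273691
num_tokens = 6  # that, is, not, the, question, </s>
print(10 ** (-total / num_tokens))  # ~3.5125, matching the query output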

Let’s do another example.

$ echo "that is that" | query 2gram.arpa
that=7 1 -1.3542756 is=8 2 -0.24913573 that=7 1 -1.3542756 </s>=2 1 -1.3542756 Total: -4.3119626 OOV: 0
Perplexity including OOVs: 11.967147698691244
Perplexity excluding OOVs: 11.967147698691244

See if you can get the same probability yourself. Note that you will need to take the back-off path for the bigrams is that and that </s>, which is why query reports them with n-gram length 1.
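
If you want to check your work programmatically, here is a minimal sketch of a bigram scorer that applies the same back-off rule we used by hand. It reuses the load_arpa helper from the parsing sketch above; this is only an illustration, not KenLM’s implementation.

# Bigram scoring with back-off (reuses load_arpa from the earlier sketch)
def score(sentence, unigrams, bigrams):
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        if (prev, word) in bigrams:
            # bigram found: use its log10 probability directly
            total += bigrams[(prev, word)]
        else:
            # back off: bow(prev) plus the unigram log10 probability of word
            bow = unigrams.get(prev, unigrams["<unk>"])[1]
            logp = unigrams.get(word, unigrams["<unk>"])[0]
            total += bow + logp
    return total

unigrams, bigrams = load_arpa("2gram.arpa")
print(score("that is not the question", unigrams, bigrams))  # ~ -3.273691
print(score("that is that", unigrams, bigrams))              # ~ -4.3119626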

In the next article, we will visualize this n-gram model as a graph, which should be much more intuitive than the .arpa format.
