n-gram language model — state machine

TechHara
4 min read · Oct 19, 2023

In the previous article, we learned how to make sense of an n-gram language model and its arpa format. Today, we will visualize the n-gram model as a graph. More precisely, we will use the OpenFst and OpenGrm NGram libraries to convert the model from the arpa format to the finite state transducer (fst) format.

If you don’t already have these installed, you can install the necessary libraries to follow along

$ sudo apt install libfst-tools libngram-tools graphviz -y

What is a finite state transducer? The name may sound scary, but it is just a state machine with transition weights. For each input token, we jump to a new state and collect the transition weight along the way. These transition weights correspond directly to the conditional probabilities of the n-grams and to the back-off weights.
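
To make that concrete, here is a minimal sketch in Python of such a weighted state machine. The states, tokens, and weights here are made up for illustration; they are not the actual 2-gram model we are about to build.

# Toy weighted state machine: (state, token) -> (next_state, weight).
# In the real model, each weight will be -ln of an n-gram probability.
transitions = {
    (1, "to"): (2, 0.42),  # made-up weights, for illustration only
    (2, "be"): (3, 0.57),
}

def score(tokens, start=1):
    # Consume tokens one by one, jumping between states and summing weights.
    state, total = start, 0.0
    for token in tokens:
        state, weight = transitions[(state, token)]
        total += weight
    return total

print(score(["to", "be"]))  # 0.99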

Seeing it in action will make this easier to understand. Let’s take our 2-gram model in the arpa format and convert it to the fst format. As a recap, here is our 2gram.arpa file

\data\
ngram 1=11
ngram 2=13

\1-grams:
-1.30103 <unk> 0
0 <s> -0.30103
-1.0532455 </s> 0
-0.89645934 to -0.42596874
-1.0532455 be -0.30103
-1.0532455 or -0.30103
-0.89645934 not -0.30103
-1.0532455 that -0.30103
-0.89645934 is -0.30103
-0.89645934 the -0.30103
-1.0532455 question -0.30103

\2-grams:
-0.26421693 question </s>
-0.24913573 <s> to
-0.40143394 not to
-0.18165989 to be
-0.5313119 be or
-0.24913573 or not
-0.50381577 is not
-0.7715207 be that
-0.7247773 be is
-0.24913573 that is
-0.6380301 not the
-0.50381577 is the
-0.26421693 the question

\end\

We will invoke the ngramread utility from OpenGrm

$ ngramread --ARPA 2gram.arpa 2gram.fst

This creates a binary model file 2gram.fst that represents the language model as a finite state machine. To visualize the graph, run the following command

$ fstdraw --portrait --acceptor 2gram.fst | dot -Tpdf > 2gram.pdf

(figure: 2-gram model in fst format)

Now, let’s evaluate the probability of the sentence “that is not the question”, just like before. We start from the start state (the bold circle), which is state 1. The first token is that, so we try to find a transition from state 1 with the matching token. Unfortunately, we only see transitions with the tokens <epsilon> and to. What is this <epsilon> token?

The <epsilon> transition corresponds precisely to the back-off in the n-gram model. You can think of it as a fallback transition: an alternate path you take only if you can’t find the one you are looking for. Note the weight 0.69315 along that path.

Before, the back-off weight we applied for the missing bigram <s> that was -0.30103, so why is it 0.69315 here? That is because the arpa format stores log10(prob), whereas the fst format stores -ln(prob). If we multiply the arpa weight by the constant factor -ln(10), we get the same weight

-0.30103 * -ln(10) = 0.69315
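
Here is a tiny Python check of that arithmetic (pure math, nothing specific to OpenFst), applying the same conversion to the back-off weight taken from 2gram.arpa:

import math

# arpa stores log10(prob); the fst stores -ln(prob),
# so fst_weight = arpa_weight * -ln(10)
arpa_backoff = -0.30103               # back-off weight of <s> in 2gram.arpa
fst_weight = arpa_backoff * -math.log(10)
print(round(fst_weight, 5))           # 0.69315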

Next, we are in state 0, and we still haven’t consumed the first token that. Let’s again look for a transition from state 0 associated with the token that. This time we find one: a transition to state 6 with a weight of 2.4252.

We continue doing this until the end of the sentence. The table below summarizes the transitions and the weights we collect along the way

(table: calculating the probability of “that is not the question”)

You may wonder what the row right above the sum is. That is essentially the </s> token, which you won’t find in the fst format. Instead, the fst format uses an exit weight, or final weight, which is the number attached to the states drawn with a double ring, such as states 0 and 9. This weight is added when we exit the state. If you are at a state that is not an exit state (i.e., not a double ring), then you have to keep taking epsilon transitions until you land on an exit state. Only at an exit state can you complete the journey.

If we add up all the weights along the path and convert the sum to log10, we get 3.2737, exactly what we got using the arpa format. Not only does the sum match, but each individual step also matches what we calculated from the arpa format. That is because there is a one-to-one correspondence between the arpa and fst language model formats.
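
If you would rather not read the weights off the graph, here is a short Python script that recomputes every step of the table from the arpa values listed above (it assumes only those numbers, not the fst itself):

import math

LN10 = math.log(10)

# log10 weights from 2gram.arpa for "<s> that is not the question </s>":
# the back-off at <s>, the unigram "that", the bigrams, and "question </s>",
# whose weight shows up as the final (exit) weight in the fst.
steps = [
    ("<s> back-off", -0.30103),
    ("that (unigram)", -1.0532455),
    ("that is", -0.24913573),
    ("is not", -0.50381577),
    ("not the", -0.6380301),
    ("the question", -0.26421693),
    ("question </s>", -0.26421693),
]

total = 0.0
for name, log10_weight in steps:
    fst_weight = log10_weight * -LN10   # convert log10(prob) to -ln(prob)
    total += fst_weight
    print(f"{name:16s} {fst_weight:7.4f}")

print(f"sum of fst weights: {total:.4f}")    # ~7.5380
print(f"sum in -log10:      {total / LN10:.4f}")  # ~3.2737, matching the arpa result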

As an exercise, try to evaluate the probability of the sentence “that is that” using the graph and see if the result matches what we obtained before.

(table: calculating the probability of “that is that”)
