Exploring N-Grams: The Building Blocks of Natural Language Understanding

om pramod
11 min read · Mar 8, 2024


Part 6: Strategies for Dealing with Out-of-Vocabulary Words.

Handling Out-of-Vocabulary (OOV) words: Out-of-Vocabulary (OOV) words are words that do not appear in the training data or the predefined vocabulary but may appear in the test data. Because the model never saw them during training, they have no corresponding representation in the vocabulary built at training time, and they can cause errors or degrade performance in tasks such as text classification, machine translation, or sentiment analysis. Handling OOV words is therefore a common and important problem in N-gram models. Here’s how it can be addressed:

Vocabulary Expansion (Dynamic Vocabulary): Continuously update the vocabulary with new words encountered during testing. Example: if the model sees “forest” during testing, it adds it to the vocabulary for future predictions. Note, however, that while this approach can improve coverage, it also grows the vocabulary, which can lead to higher computational costs. It’s therefore important to balance expanding the vocabulary to handle OOV words against managing computational resources effectively.
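
As a rough illustration of this idea (not tied to any particular library — the Vocabulary class and its add/encode methods are invented for this sketch), a dynamically growing vocabulary could look like this in Python:

# A minimal sketch of a dynamically growing vocabulary (illustrative only).
class Vocabulary:
    def __init__(self):
        self.word_to_id = {}

    def add(self, word):
        # Assign a new id the first time a word is seen.
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.word_to_id)
        return self.word_to_id[word]

    def encode(self, tokens):
        # Unknown words are added on the fly instead of being rejected.
        return [self.add(token) for token in tokens]

vocab = Vocabulary()
print(vocab.encode(["I", "walked", "in", "the", "park"]))
print(vocab.encode(["I", "walked", "in", "the", "forest"]))  # "forest" is added dynamically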

Microsoft Word’s spell checker is a great example of vocabulary expansion in practice. When it encounters a word it doesn’t recognize, it underlines it as a potential spelling error. However, if the word is correct (for example, it could be a technical term or a name), you can right-click the word and choose to add it to the dictionary. This way, the spell checker expands its vocabulary and will not flag that word as an error in the future. This is a practical example of how systems can learn and adapt to new words or terms they haven’t encountered before.

Character-level or subword-level N-grams:

1. Subword-level Models: These models represent words as combinations of subword units. Subwords are parts of words that may or may not be valid words on their own. For example, “unhappiness” could be split into the subwords “un” and “happiness”. This approach captures the internal structure of words, such as prefixes and suffixes. In a subword-level bigram model, predicting the next subword after “un” means looking at the probability of “happiness” given “un”, i.e., P(happiness|un). Working at the subword level reduces the vocabulary size and increases the coverage of rare or unknown words. Subword tokenization techniques like Byte-Pair Encoding (BPE) or SentencePiece segment words into smaller, meaningful components based on how frequently those components occur in the dataset. For example, the word ‘unbelievable’ can be split into subword units like ‘un’, ‘believ’, and ‘able’.

2. Character-level Models: These models treat text as a sequence of characters rather than a sequence of words. For example, the word “cat” would be represented as the sequence of characters ‘c’, ‘a’, ‘t’. The bigrams in this sequence are: (“c”, “a”) and (“a”, “t”). If we want to predict the next character after “ca”, the model would look at the probability of “t” given “a”, i.e., P(t|a). Now, let’s say we have another sequence: “car”. The bigrams are: (“c”, “a”) and (“a”, “r”). To predict the next character after “ca” in this sequence, the model would look at the probability of “r” given “a”, i.e., P(r|a). So, if we have a new sequence “ca”, and we want to predict the next character, the model would consider both P(t|a) and P(r|a), and choose the character with the higher probability.

However, character-level models often require more computational resources because they operate on longer sequences (a given text contains many more characters than words). For instance, in a character-level bigram model, predicting the next character after ‘c’ means looking at the probability of ‘a’ given ‘c’, i.e., P(a|c), and so on. (A short code sketch of character-level bigram prediction appears after point 3 below.)

3. UNK Token Replacement: The “UNK” token is a common strategy used in Natural Language Processing (NLP) to handle Out-of-Vocabulary (OOV) words. The basic idea is to replace infrequent words or words that are not in the training set vocabulary with a special token like “UNK” (unknown) during training and testing. This way, the model can learn a representation for unknown words. Suppose the sentence “I went to the park” is in the training corpus. During testing, if the model encounters “forest,” an unseen word, it’s replaced by “<UNK>”.

Let’s consider another example. Suppose we have a sentence: “I have a pet dinosaur”. If our model’s training data never included the word “dinosaur”, this word would be an OOV word when we try to use the model on this sentence. Without handling for OOV words, our model would not know what to do with “dinosaur”. But if we replace all OOV words with “UNK”, our sentence becomes: “I have a pet UNK”. Now our model can process this sentence, because it has learned a representation for “UNK” during training.

It’s important to note that the choice of which words to replace with “UNK” can have a big impact on performance. If we replace too many words, we lose valuable information. If we replace too few, we may still have issues with OOV words. A common strategy is to replace words that appear below a certain frequency threshold in the training data.
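
To make this frequency-threshold idea concrete, here is a minimal, illustrative Python sketch (the tiny training text, the min_count threshold, and the replace_oov helper are invented for this example):

from collections import Counter
from nltk import ngrams

# Build a vocabulary from training text, keeping only words seen at least min_count times.
train_text = "I went to the park . I went to the park with a friend ."
min_count = 2
counts = Counter(train_text.split())
vocab = {word for word, count in counts.items() if count >= min_count}

def replace_oov(tokens, vocab):
    # Map anything outside the vocabulary to the special <UNK> token.
    return [token if token in vocab else "<UNK>" for token in tokens]

test_tokens = "I went to the forest".split()
print(replace_oov(test_tokens, vocab))
# ['I', 'went', 'to', 'the', '<UNK>']
print(list(ngrams(replace_oov(test_tokens, vocab), 2)))
# [('I', 'went'), ('went', 'to'), ('to', 'the'), ('the', '<UNK>')]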
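
And, going back to point 2 above, here is the promised sketch of character-level bigram prediction (the tiny corpus and the predict_next_char helper are likewise invented):

from collections import Counter, defaultdict

# Count character bigrams in a tiny corpus (illustrative only).
corpus = ["cat", "car", "care", "cart"]
bigram_counts = defaultdict(Counter)
for word in corpus:
    for first, second in zip(word, word[1:]):
        bigram_counts[first][second] += 1

def predict_next_char(prev_char):
    # Pick the character that most often follows prev_char in the training corpus.
    followers = bigram_counts.get(prev_char)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next_char("a"))  # compares the counts behind P(t|a), P(r|a), ... and returns 'r' here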

Before proceeding further, let’s delve into the concept of padding, which is a prerequisite for this discussion.

Imagine trying to feed sentences of varying lengths into a machine learning model. Most machine learning models require fixed-size inputs, which is a problem for text data because sentences vary in length. The model requires uniformity in the input data, and that’s where padding comes to our rescue. Padding is a technique that makes all sequences in a batch the same length by adding a special symbol (often 0 or a special token like ‘<PAD>’) at the end (post-padding) or the start (pre-padding) of the shorter sequences. Even though the original sentences vary in length, they now appear to the model to be the same size.

Let’s consider three different sentences:

  1. “I like to read.”
  2. “Books are interesting.”
  3. “Reading helps in gaining knowledge and improving vocabulary.”

After tokenization, these sentences might look like:

  1. [“I”, “like”, “to”, “read”]
  2. [“Books”, “are”, “interesting”]
  3. [“Reading”, “helps”, “in”, “gaining”, “knowledge”, “and”, “improving”, “vocabulary”]

As you can see, the sentences have different lengths. To make them the same length for a machine learning model, we can use padding. Let’s pad these tokenized sentences to match the length of the longest sentence:

  1. [“I”, “like”, “to”, “read”, 0, 0, 0, 0]
  2. [“Books”, “are”, “interesting”, 0, 0, 0, 0, 0]
  3. [“Reading”, “helps”, “in”, “gaining”, “knowledge”, “and”, “improving”, “vocabulary”]

Now, all the sentences are of the same length and can be used as input to a machine learning model. The ‘0’ is a padding token that we’ve used to fill in the extra spaces in the shorter sentences.

Let’s consider another example. To ensure uniformity, we’ll pad the sentences to match the length of the longest sequence.

  • Original sequences:
      • “I love machine learning.” (4 words)
      • “The cat jumped.” (3 words)
      • “Deep learning is fascinating and complex.” (6 words)
  • Padded to the maximum length (6 words):
      • “I love machine learning. <PAD> <PAD>”
      • “The cat jumped. <PAD> <PAD> <PAD>”
      • “Deep learning is fascinating and complex.”

Now all sequences have equal length (6 words), because ‘<PAD>’ tokens were added at the end of the shorter ones.

Here’s how you can perform padding in Python using the tensorflow library:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example sequences with varying lengths
sequences = [
    [10, 15, 20],
    [5, 8, 12, 18, 25],
    [30, 35]
]

# Padding sequences to a maximum length of 5 elements
max_length = 5
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

print("Original Sequences:")
print(sequences)

print("\nPadded Sequences:")
print(padded_sequences)

Output looks like:

Original Sequences:
[[10, 15, 20], [5, 8, 12, 18, 25], [30, 35]]

Padded Sequences:
[[10 15 20  0  0]
 [ 5  8 12 18 25]
 [30 35  0  0  0]]

Note — When padding sequences of strings rather than integers, it’s essential to specify dtype=object so that pad_sequences can handle string elements, and to pass a string such as ‘<PAD>’ as the padding value.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define sequences of words (strings) with varying lengths
sequences = [
    ['I', 'love', 'programming'],
    ['I', 'am', 'a', 'programmer'],
    ['I', 'like', 'AI']
]

# Padding sequences to a maximum length of 5 elements, using '<PAD>' as the padding value
max_length = 5
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', value='<PAD>', dtype=object)

print("Original Sequences:")
print(sequences)

print("\nPadded Sequences:")
for seq in padded_sequences:
    print(seq)

The output of the above code looks like:

Original Sequences:
[['I', 'love', 'programming'], ['I', 'am', 'a', 'programmer'], ['I', 'like', 'AI']]

Padded Sequences:
['I' 'love' 'programming' '<PAD>' '<PAD>']
['I' 'am' 'a' 'programmer' '<PAD>']
['I' 'like' 'AI' '<PAD>' '<PAD>']

Now let’s see an example of generating bigrams while handling OOV words using NLTK:

import nltk

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer models

text = "The cat sat on the mat. The mat was comfortable."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Generate bigrams, padding both ends of the token sequence with the '<PAD>' symbol
bigrams = list(nltk.ngrams(tokens, 2, pad_left=True, pad_right=True, left_pad_symbol='<PAD>', right_pad_symbol='<PAD>'))
print(bigrams)

The output looks like:

[('<PAD>', 'The'), ('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat'), ('mat', '.'), ('.', 'The'), ('The', 'mat'), ('mat', 'was'), ('was', 'comfortable'), ('comfortable', '.'), ('.', '<PAD>')]

Note — The parameters pad_left, pad_right, left_pad_symbol, and right_pad_symbol in NLTK’s ngrams function serve specific purposes: pad_left and pad_right control whether padding symbols are added before the first token and after the last token, while left_pad_symbol and right_pad_symbol specify which symbol to use for that padding (here, ‘<PAD>’).

Note — In n-gram models, <START> and <END> are special symbols that mark the beginning and the end of a sentence or a document. For example, in a bigram model the sentence “I love playing football.” would be represented as (<START>, I), (I, love), (love, playing), (playing, football), (football, <END>). In higher-order n-gram models, words near the start of each sentence do not have enough preceding context to apply the formula for the n-gram probability, so the sequence is padded with sentence-start symbols <START> to make the formula consistent. For example, in a trigram model (n = 3), the sentence “I love dogs” yields the trigrams (<START>, <START>, I), (<START>, I, love), (I, love, dogs), (love, dogs, <END>). Here, the <START> tokens provide context for the word “I”, and the <END> token indicates that “dogs” is the last word of the sentence.
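
For instance, the same NLTK ngrams padding options used above can produce this kind of boundary padding directly. A small illustrative sketch follows; note that NLTK pads each side with n-1 symbols, so an extra (dogs, <END>, <END>) trigram appears at the end:

import nltk

tokens = "I love dogs".split()

# For n = 3, NLTK pads with two <START> symbols on the left and two <END> symbols on the right.
trigrams = list(nltk.ngrams(tokens, 3,
                            pad_left=True, pad_right=True,
                            left_pad_symbol='<START>', right_pad_symbol='<END>'))
print(trigrams)
# [('<START>', '<START>', 'I'), ('<START>', 'I', 'love'), ('I', 'love', 'dogs'),
#  ('love', 'dogs', '<END>'), ('dogs', '<END>', '<END>')]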


Back-off strategy — This strategy is employed when the probability of a higher-order N-gram is not available due to insufficient training data, and it involves “backing off” to lower-order N-grams. Suppose we have the trigram “cats chase mice”, but it never occurred in our training corpus. In this case, we cannot compute the probability of “mice” given “cats chase”. So, we back off to the bigram “chase mice” and compute the probability of “mice” given “chase” instead. The back-off strategy operates under the principle that a sequence of words is likely to occur if a similar sequence of words has occurred before. Therefore, if a specific n-gram is not observed during training, the model can fall back to a lower order n-gram that has been seen before. For instance, if the sequence “the cat sat” is not found in the training data, the model can use the information from “the cat” to predict the next word.

Let’s consider a trigram language model. In this model, the probability of a word wi is determined by the two preceding words wi-2 and wi-1. The maximum likelihood estimate of the conditional probability of wi given wi-2 and wi-1 is the count of occurrences of the trigram divided by the count of occurrences of its bigram prefix (wi-2, wi-1).

However, there might be cases where a particular trigram has never occurred in the training data. This is where the back-off strategy comes into play. If a trigram (wi-2, wi-1, wi) has not been seen in the training data, we can back off to a bigram model and estimate the probability based on (wi-1, wi). If the bigram has also not been seen, we can back off further to a unigram model and estimate the probability based on wi alone.

Here’s a mathematical representation of this (in its simplest, undiscounted form):

Let’s denote the count of a trigram (wi-2, wi-1, wi) as c(wi-2, wi-1, wi), the count of a bigram (wi-1, wi) as c(wi-1, wi), and the count of a unigram wi as c(wi). Then:

P(wi | wi-2, wi-1) = c(wi-2, wi-1, wi) / c(wi-2, wi-1), if c(wi-2, wi-1, wi) > 0
P(wi | wi-2, wi-1) = c(wi-1, wi) / c(wi-1), if c(wi-2, wi-1, wi) = 0 but c(wi-1, wi) > 0
P(wi | wi-2, wi-1) = c(wi) / N, otherwise

where N is the total number of words in the training data.

This way, the back-off strategy allows us to estimate probabilities even for unseen n-grams by backing off to smaller n-grams.
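
A minimal Python sketch of this undiscounted back-off idea (the toy corpus and the backoff_prob helper are invented for illustration):

from collections import Counter
from nltk import ngrams

# Toy training corpus (already tokenized).
tokens = "the cat sat on the mat the cat slept".split()
N = len(tokens)

unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

def backoff_prob(w2, w1, w):
    # Use the trigram estimate if the trigram was seen during training...
    if trigram_counts[(w2, w1, w)] > 0:
        return trigram_counts[(w2, w1, w)] / bigram_counts[(w2, w1)]
    # ...otherwise back off to the bigram estimate...
    if bigram_counts[(w1, w)] > 0:
        return bigram_counts[(w1, w)] / unigram_counts[w1]
    # ...otherwise fall back to the unigram estimate.
    return unigram_counts[w] / N

print(backoff_prob("the", "cat", "sat"))  # trigram "the cat sat" was seen: 1/2 = 0.5
print(backoff_prob("big", "cat", "sat"))  # unseen trigram: backs off to P(sat | cat) = 1/2 = 0.5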

Relying solely on a single n-gram order can indeed be limiting. Each n-gram order comes with its own set of advantages and limitations. Higher-order n-grams, such as trigrams, are capable of capturing longer-range dependencies and intricate patterns in the text. However, they are also more susceptible to data sparsity issues. Since they require occurrences of specific sequences to make accurate predictions, they may struggle with infrequent or unseen combinations of words.

On the other hand, lower-order n-grams, like unigrams or bigrams, are more robust against data sparsity because they rely on shorter contexts and occur more frequently in the training data. However, they lack the rich context provided by higher-order n-grams, which can limit their ability to capture complex language patterns and long-range dependencies.

To address these challenges, language models often use interpolation, a technique that combines information from different n-gram orders. Interpolation assigns a weight to the probability estimate from each n-gram order and combines them as a weighted linear combination, allowing the model to leverage information from both higher- and lower-order n-grams. The weights are learned from data (typically tuned on held-out text) and determine the relative influence of each n-gram order on the final probability.
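
For example, with trigram, bigram, and unigram estimates, linear interpolation can be written as P(wi | wi-2, wi-1) = λ3·P_trigram + λ2·P_bigram + λ1·P_unigram, with λ1 + λ2 + λ3 = 1. A minimal sketch with made-up weights and probability estimates:

# Linearly interpolate trigram, bigram, and unigram estimates (all values are made up).
lambda_trigram, lambda_bigram, lambda_unigram = 0.6, 0.3, 0.1   # weights that sum to 1
p_trigram, p_bigram, p_unigram = 0.50, 0.20, 0.05               # estimates for one candidate word

p_interpolated = (lambda_trigram * p_trigram
                  + lambda_bigram * p_bigram
                  + lambda_unigram * p_unigram)
print(p_interpolated)  # 0.6*0.50 + 0.3*0.20 + 0.1*0.05 = 0.365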

After using interpolation to combine probabilities from different n-gram orders, the resulting combined scores might not sum to exactly 1, for example because of rounding errors or slight inconsistencies in the weights. To ensure the final result is a valid probability distribution, normalization is applied: the combined scores are rescaled so that they add up to 1 over the possible outcomes.
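
A tiny illustration of this normalization step (the scores are made up):

# Rescale combined scores so they form a valid probability distribution.
scores = {"cat": 0.42, "dog": 0.31, "mat": 0.20}  # made-up interpolated scores for candidate next words
total = sum(scores.values())
normalized = {word: score / total for word, score in scores.items()}
print(normalized)                 # each score divided by 0.93
print(sum(normalized.values()))   # 1.0 (up to floating-point precision)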

Closing note — As we’ve journeyed through part 6, take a moment to reflect on the insights we’ve uncovered. Your journey of learning and growth is as unique as the path you choose. As we bid adieu to this section, let’s keep our curiosity alive and thriving. Let’s continue this exploration together in Part 7 -Evaluating Language Model Performance.
