Tokenization algorithms in Natural Language Processing (NLP)

Mehul Gupta
Data Science in your pocket
7 min read · Mar 31, 2020


Natural Language Processing has been a hot field, as most of the data coming from the user side is in unstructured form, i.e. free text, whether it is user comments (Facebook, Instagram), chats (WhatsApp, Hike), product reviews (Amazon, Flipkart), and the list goes on. There are a number of problems with this sort of data: wrong spellings, slang, emojis, etc., which makes preprocessing text data an extremely difficult task.

Hence, this time I am starting off with the very first task in preprocessing text data, i.e. sentence/word segmentation. I will be going through Penn Treebank tokenization, Byte Pair Encoding, WordPiece, the Unigram Language Model & SentencePiece.

But before moving on!! What is tokenization and why is it required?

Tokenization is the process of breaking a chunk of text into smaller parts, whether that means breaking a paragraph into sentences, a sentence into words, or a word into characters.

How do we understand the meaning of a sentence in English?

We read each word, interpret its meaning, and read the next word until we encounter a full stop. Hence, the model also needs the individual words that make up the sentence. So, if we are given a paragraph, we need to get all its sentences; from all these sentences we need the words, and only then can we move on to any sort of prediction.

Now, as we are clear with our aim & motive, let’s begin:

1. Penn Treebank:

It is a rule-based tokenization method that separates out clitics (words that normally occur only in combination with another word, for example the ’m in I’m), keeps hyphenated words together, and separates out all punctuation.

It is mainly used with the treebanks released by the Linguistic Data Consortium.
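As a quick illustration, NLTK ships a Penn Treebank-style tokenizer; a minimal sketch, assuming NLTK is installed:

```python
# Penn Treebank-style tokenization via NLTK: clitics are split off,
# punctuation is separated, and hyphenated words stay together.
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I'm buying a state-of-the-art phone, aren't I?"))
# roughly: ['I', "'m", 'buying', 'a', 'state-of-the-art', 'phone', ',', 'are', "n't", 'I', '?']
```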

2. Byte Pair Encoding:

Have you ever seen how Chinese is written?

  • 姚明进入总决赛
    “Yao Ming reaches the finals”

Spaces between words are rare in certain languages!!

Hence, at times, each character is taken as a token when tokenizing Chinese.

But wait a minute!!

Because of this, we may need different tokenizers for different languages: in English we take space-separated words as tokens, but we need something else for languages like Chinese. And at times we might even want ‘South Africa’ to be read as a single token!!

This brought up the idea of subword tokenization, i.e. most tokens are words, but some tokens are subwords like -er (few-er, light-er), -ed, etc. So the tokens learned can be characters, subwords, or whole words, depending upon the corpus.

This helps us in other ways as well: if we can tokenize ‘fewer’ and ‘fewest’ as few-er & few-est, they won’t be treated as two completely unrelated words by the ML model we train after preprocessing. Hence, such tokenization can also help us handle unknown words in the test dataset.

P.S. It can’t solve the ‘South Africa’ problem, as the algorithm runs inside words (we don’t merge across word boundaries).

Let’s have a quick example and consider the below scenario:

Here,

  • 1st column: the frequency of each word in the corpus.
  • 2nd column: the dictionary, i.e. the words corresponding to those frequencies.
  • 3rd column: the vocabulary, i.e. the initial set of tokens, made up of all the unique characters used in the dataset. The end of each word is marked by the underscore ‘_’.
  • Now, using this vocabulary, we will iteratively merge tokens to get new tokens, depending upon how frequently the pairs appear in the data.

STEP 1 :

As the pair ‘r’ + ‘_’ has the highest frequency (6 + 3 = 9), we combine it to form the new token ‘r_’.

STEP 2:

As ‘e’ + ‘r_’ now has the highest frequency (6 + 3 = 9), we get ‘er_’.

STEP 3:

As ‘e’ + ‘w’ now has the highest frequency (6 + 2 = 8), we get ‘ew’.

Likewise:

We will continue merging until we reach a predefined number of tokens (a hyperparameter).
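To make the merge loop concrete, here is a minimal sketch of BPE learning in Python. The toy word counts and the number of merges are illustrative assumptions, but the loop is exactly the one described: count adjacent pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[' '.join(merged)] = freq
    return new_vocab

# Toy word counts, with each word pre-split into characters and '_' marking
# the end of the word (illustrative numbers, not the article's exact figure).
vocab = {'n e w e r _': 6, 'w i d e r _': 3, 'n e w _': 2}
num_merges = 4  # hyperparameter: how many merges (new tokens) to learn

for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair (ties broken arbitrarily)
    vocab = merge_pair(best, vocab)
    print(best)
```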

3. WordPiece:

Byte Pair Encoding falters on rare tokens, as it always merges the token combination with the maximum frequency.

What can be done? WordPiece

WordPiece tokenization is very similar to Byte Pair Encoding. The only difference lies in the way we merge tokens to produce new tokens. In BPE, we merge the pair whose merged token has the maximum frequency, but in WordPiece we also take into consideration the separate frequencies of the two tokens being merged, apart from the frequency of the new token. This means that for two tokens A & B, we calculate:

Score(A, B) = Frequency(A, B) / (Frequency(A) * Frequency(B)), and the pair with the maximum score is selected for merging.

This definitely has an edge over Byte Pair Encoding.

It might be the case that Frequency(‘so’, ‘on’) is very high, but their separate frequencies are also high; hence, using WordPiece, we won’t merge them into ‘soon’ because the overall score comes out low. On the other hand, Frequency(‘jag’, ‘gery’) might be low, but if their separate frequencies are also low, we will merge them to form ‘jaggery’.
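As a rough illustration with made-up counts (the numbers below are not from any real corpus), the score behaves like this:

```python
def score(pair_count, count_a, count_b):
    # Score(A, B) = Frequency(A, B) / (Frequency(A) * Frequency(B))
    return pair_count / (count_a * count_b)

# 'so' and 'on' appear together often, but each is also very common on its
# own, so the candidate merge 'soon' gets a low score.
print(score(pair_count=500, count_a=10_000, count_b=12_000))  # ~4.2e-06

# 'jag' and 'gery' are rare, but they almost always occur together,
# so the candidate merge 'jaggery' gets a much higher score.
print(score(pair_count=40, count_a=45, count_b=42))           # ~0.021
```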

4. Unigram Language Model:

Let’s see the below example:

Even if we merged ‘face’+’d’ rather than ‘no’+’d’, it doesn’t mean we will never encounter ‘nod’. It just means that the probability of ‘nod’ is lower, not 0. But if we use WordPiece, we might miss this.

So, what to do now?

What we need is a model that gives us the probability of the different tokenizations that can be generated. This simply means that for the word ‘nod’ we should get probabilities for ‘no’+’d’, ‘n’+’od’, and ‘nod’, each with a different probability, provided these tokens (no, d, n, od, nod) are present in the initial vocabulary.

Let’s explore the Unigram Language Model now.

Under this model, the probability of a sentence is simply the product of the probabilities of the subwords that make it up:

P(x) = p(x1) * p(x2) * … * p(xn)

where

x: the sentence

xi: the i-th subword forming the sentence (each xi must belong to the vocabulary V)

V: the vocabulary

So what we are doing is calculating Probability(sentence) by multiplying the probabilities of its subwords.

Take an example (just assume the mentioned subwords are in the vocabulary V):

P(‘I cried in the courtyard’) = P(‘I’) * P(‘cri’) * P(‘ed’) * … * P(‘court’) * P(‘yard’)

But as you can see, the same sentence can be segmented in a number of ways: we could use P(‘cried’) instead of P(‘cri’) * P(‘ed’) if ‘cried’ is in V.

Which sequence to use?

The Viterbi algorithm helps us choose the highest-probability sequence.
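Here is a minimal sketch of that Viterbi search, assuming we already have a table of subword probabilities (the numbers below are made up for illustration). It uses dynamic programming to pick the segmentation with the highest product of probabilities.

```python
import math

def viterbi_segment(word, subword_probs):
    """Return the highest-probability segmentation of `word` into known subwords."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best log-probability of word[:end]
    back = [0] * (n + 1)             # start index of the last subword on the best path
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in subword_probs:
                cand = best[start] + math.log(subword_probs[piece])
                if cand > best[end]:
                    best[end], back[end] = cand, start
    # Walk the backpointers to recover the segmentation.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Made-up probabilities for the subwords discussed above.
probs = {'no': 0.08, 'd': 0.05, 'n': 0.04, 'od': 0.001, 'nod': 0.006}
print(viterbi_segment('nod', probs))   # ['nod'] beats ['no', 'd'] and ['n', 'od'] here
```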

But how do we get the probabilities of subwords like ‘ed’, ’er’, etc.?

We are back to the same question we started with before the Unigram Language Model.

Have you heard of the Expectation-Maximization (EM) algorithm? If not, it is worth getting a basic idea of it first.

What we do is treat P(subword) as a hidden variable (a variable that can’t be calculated directly but can only be inferred from observed data), and hence we try to maximize the function below, which helps us estimate P(subword):

L = Σ over all sentences X in the corpus of log ( Σ over x ∈ S(X) of P(x) )

S(X): the set of candidate segmentations (token sequences) of a sentence X.

The inner summation adds up P(x) for the different ways the subwords can form a sentence, and the outer summation runs over all sentences in the corpus.

Very tough!!! urghhhhhhhh
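To make the inner summation a bit more concrete, here is a tiny sketch with made-up subword probabilities that sums P(x) over every possible segmentation of a single word:

```python
def total_prob(word, probs):
    """Sum of P(segmentation) over all segmentations of `word` into known subwords."""
    if word == '':
        return 1.0
    total = 0.0
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in probs:
            # P(piece) times the total probability of segmenting the rest.
            total += probs[piece] * total_prob(word[i:], probs)
    return total

probs = {'no': 0.08, 'd': 0.05, 'n': 0.04, 'od': 0.001, 'nod': 0.006}
# 'nod' + 'no'+'d' + 'n'+'od' = 0.006 + 0.08*0.05 + 0.04*0.001 = 0.01004
print(total_prob('nod', probs))
```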

But wait, where are we going to get the vocabulary from in the first place?

The below steps can be followed for creating the vocabulary. Here p(x) refers to P(subword), V is the vocabulary, and L is the likelihood to be maximized:

  • Start with a seed vocabulary V that is much larger than the final vocabulary size (for example, all characters plus the most frequent substrings in the corpus).
  • Estimate p(x) for the current V using the EM algorithm, i.e. by maximizing L above.
  • For each subword, compute how much L drops if that subword is removed from V.
  • Keep the subwords whose removal hurts L the most (always retaining single characters so that any word can still be segmented), drop the rest, and repeat until V shrinks to the desired size.

Note: an issue with the Unigram language model is that it assumes each subword occurs independently of the others.

5. SentencePiece:

Let us explore the below problems:

  • Have we got an answer to tokenizing Chinese, Japanese, & other languages which rarely use punctuation?
  • If I tokenize ‘I____am____legend’ (read ‘_’ as a space) using the above methods, we will get ‘I, am, legend’. The tokens remain exactly the same for ‘I_am_legend’. As the information about the extra spaces isn’t preserved in the generated tokens, this is called lossy tokenization: we are losing some information. How do we cope with this?

SentencePiece comes to the rescue!!

BPE and wordpiece both assume that we already have some initial tokenization of words (such as by spaces, or from some initial dictionary). By contrast, the SentencePiece model works from the raw text; even whitespace is handled as a normal symbol. Thus it doesn’t need an initial tokenization or word-list and can be used in languages like Chinese or Japanese that don’t have spaces.

It mainly has 4 parts:

  • Normalizer: Normalizes semantically equivalent (similar meaning) Unicode characters to a standard form.
  • Trainer: It trains a model for subword segmentation given the corpus.
  • Encoder: It internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer.
  • Decoder: It converts the subword sequence into the normalized text.

Hence, we can say :

Decoder( Encoder (Normalizer(corpus))) = Normalizer (corpus)

We call this design lossless tokenization, in which all the information to reproduce the normalized text is preserved in the encoder’s output.

Hence, in ‘I___am___legend’, the extra spaces will be preserved by the encoder, which can be handy.
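For reference, here is a quick sketch of the SentencePiece Python package in action (pip install sentencepiece); the corpus file name, model prefix, and vocabulary size below are placeholder assumptions, not anything specific to this article:

```python
import sentencepiece as spm

# Trainer: learns a subword model (the unigram model here) directly from raw text.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='mymodel',
    vocab_size=8000, model_type='unigram'
)

# Encoder: normalizes the text and splits it into subword pieces;
# whitespace is treated as an ordinary symbol.
sp = spm.SentencePieceProcessor(model_file='mymodel.model')
pieces = sp.encode('I am legend', out_type=str)
print(pieces)             # a list of subword strings, e.g. ['▁I', '▁am', '▁legend']

# Decoder: reconstructs the normalized text exactly from those pieces.
print(sp.decode(pieces))  # 'I am legend'
```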

And this will be a wrap on my side. Next, I will try to go through how a POS tagger works using HMMs (Hidden Markov Models) and the Viterbi algorithm.
