Tokenizers — Your first step into NLP


One of the starting points of an NLP pipeline explained


Introduction: Machines vs. Humans — How do we read?

For us, language is trivial. We’ve heard it since the day we were born. We understand:

  1. Phonetics: The sound of the speech helps us perceive what it means.
  2. Syntax: The combination of words and phrases that make up a sentence.
  3. Semantics: The meaning of each of the words, phrases & sentences

Now, once you show a computer a sentence, it has no idea about the structure and the meaning of the sentence. Somehow, we have to encode the meaning at a word, phrase, and sentence level.

Can we encode rules? Can we say, “Nouns must agree with their verbs, which means that a singular noun requires a singular verb, and a plural noun requires a plural verb,” and so on and so forth? Well, that would take a long list of rules and make our jobs harder. With the advent of ML & deep learning, it turns out models can figure out these rules and meanings on their own.

Word-Based Tokenization

The first step would be to break down the text into “chunks” and encode them numerically. Each of these numerical representations would then have a vector representation (Word Embeddings, which I wrote an entire article about) which the model would learn.

Whitespace Tokenization:

In its simplest form, we split the entire text on whitespace, which separates it into words.
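As a quick illustrative sketch (the variable text below is just a stand-in for the raw corpus), whitespace tokenization is simply Python’s str.split():

# A minimal whitespace tokenizer: split the raw text on runs of whitespace.
text = "Mr. and Mrs. Dursley, of number four, Privet Drive"

tokens = text.split()   # splits on spaces, tabs and newlines
print(tokens)
# ['Mr.', 'and', 'Mrs.', 'Dursley,', 'of', 'number', 'four,', 'Privet', 'Drive']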

Let’s take an example from the Harry Potter novels for our vocabulary. All the words our model would know would come from this piece of text.

Here’s an excerpt.

/ \n\n\n\n\nTHE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of number four, Privet Drive, \nwere proud to say that they were perfectly normal, \nthank you very much. They were the last people you’d \nexpect to be involved in anything strange or \nmysterious, because they just didn’t hold with such \nnonsense. \n\nMr. Dursley was the director of a firm called \nGrunnings, which made drills. He was a big, beefy \nman with hardly any neck, although he did have a \nvery large mustache...

At this point, for simplicity, we would want to get rid of the punctuation and the \n characters to clean up the text.

import re

# string holds the raw text of the excerpt above.
# Match any non-alphanumeric, non-whitespace character (i.e. punctuation) and replace it with ''.
punc = re.sub(r'[^\w\s]', '', string)

# Remove the newlines (\n) and apply lowercase.
newlines = re.sub(r'\n+', '', punc).lower()

The code above converts the paragraph to something like this.

c = Corpus(path)
c.tokenize()

# 2D list where each element is a sentence; [0, 1, 2, 3] corresponds to "the boy who lived"
--------------------------------------------------------------------
[[], [], [], [], [], [0, 1, 2, 3], [], [4, 5, 6, 7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18, 13, 19, 20], [21, 22, 23, 24, 18, 13, 0, 25, 26, 27], [28, 15, 29, 30, 31, 32, 33, 34], [35, ...

Each number will correspond to some d-dimensional vector which the model will learn as its representation.

Example with an embedding dimension of 3
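To make this concrete, here is a small sketch using PyTorch’s nn.Embedding (an illustration, not the code behind the figure above): each token id indexes into a table of learnable 3-dimensional vectors.

import torch
import torch.nn as nn

vocab_size = 36        # illustrative: number of unique words in our toy vocabulary
embedding_dim = 3      # each word id maps to a learnable 3-d vector

embedding = nn.Embedding(vocab_size, embedding_dim)

# "the boy who lived" -> token ids [0, 1, 2, 3] from the corpus above
token_ids = torch.tensor([0, 1, 2, 3])
vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([4, 3])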

But here’s the problem

df[df.word.str.startswith('brave')]
Words in the vocabulary starting with the base word brave

For words like brave, bravery, bravely, and braver, the model assumes separate embeddings, which is probably a waste of space. Also, if every single unique word is added to our vocabulary, it might get excessively large.

There are also the complications of punctuation, which I avoided here. Ideally, you would like to keep punctuation that adds meaning to a sentence. Characters like ? could be stripped out and tokenized individually.

Libraries like nltk, which have built-in tokenizers, handle punctuation as follows.

Rule-based tokenization:

from nltk.tokenize import word_tokenize
word_tokenize("Do you like this? I don't")
--------------------------------------------------------------
['Do', 'you', 'like', 'this', '?', 'I', 'do', "n't"]

Notice how ‘?’ is a separate token. And “don’t” is split into “do” & “n’t”. This is due to the fact that “don’t” comes from “do not”. Such splitting is often referred to as rule-based tokenization.

Punctuation-based tokenization

If you split on punctuation, don’t will be split into don and t, and the punctuation ' will become a separate token.
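A rough sketch of this kind of splitting with a regex (nltk’s WordPunctTokenizer behaves much the same way):

import re

# Keep runs of word characters as tokens, and every punctuation character as its own token.
def punct_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(punct_tokenize("Do you like this? I don't"))
# ['Do', 'you', 'like', 'this', '?', 'I', 'don', "'", 't']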

All these word-based tokenizers face the fundamental issue of massive vocabulary size. And if you incorporate rules to shorten your vocabulary, like only keeping frequent words, you are faced with the issue of the unk token for unknown words at testing/generation time.

But there’s a simple solution, right? Characters! At least for the English language, there are far fewer characters than words.

Character-Based Tokenization:

I’ve covered a little bit of this in this article. But the idea is simply to map every character (including special characters & punctuation) to an index and learn an embedding for each character.

Image Credits: https://blog.floydhub.com/tokenization-nlp/

As shown in the example above, the model would learn a representation for each character (I, s, n, t, ...) separately. This would solve both of our initial problems.

  1. Vocab size: There are only a few characters we need to consider. We can mix all of them to form any word.
  2. Out of vocabulary words: With purely word-based tokenization, there’s a chance that at test time we will encounter words we haven’t seen before. For instance, if we have a rare word like Gobbledygook in our test set, instead of assigning the unk token to it, the model would combine all the individual characters and form a representation for it (see the short sketch after this list).
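Here is a tiny sketch of character-level tokenization; the toy corpus and names are purely illustrative:

# Build a character vocabulary from the corpus and encode any word with it.
corpus = "mr and mrs dursley of number four privet drive"
char2id = {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def char_tokenize(word):
    # Unseen words are fine as long as their characters appear in the vocabulary.
    return [char2id[ch] for ch in word]

print(len(char2id))            # 18 -> tiny vocabulary
print(char_tokenize("drive"))  # [3, 12, 6, 16, 4] for this toy corpus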

But obviously, there are problems. In the case of words, each representation had a meaning. What does the vector representation of a single character really mean?

Also, our sequences get unnecessarily long. For a single sentence of 7 words, our word-based tokenizer would produce 7 tokens. For the same sequence, a character-based tokenizer would produce ~35–40 tokens, depending on how big the words were. (This would have been a bigger issue with sequential models, where inputs need to be processed one after the other. But now we have Transformers, where inputs are processed in parallel.)

So, we move on to the sweet spot: subword-based tokenization, where we (kind of) take a combination of both character-based and word-based tokens.

Sub-word Based Tokenization:

The main idea here is that rarer words are broken down into sub-words, while the others remain as they are. For example, a common word in the corpus such as Harry would be left alone as it is, while a rarer word like gryffindor would be broken up into subwords such as gr, y, ffin, dor.

Do you notice something? The suffixes y, r, ly, est, er, when added to other words, may mean something too. For example, tallest/strongest share the common suffix est, which makes them superlatives. So perhaps learning a representation for est would help us.

Byte Pair Encoding:

Byte pair encoding is a form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data.

For instance, if we had bytes like bbdcdcae

  1. Look at the pair dc, repeated twice.
  2. We replace dc with X, so we have bbXXae.
  3. Next, we have the pair bb, which we can replace with, say, Y, giving YXXae.
  4. Now no suitable pair remains, and we can stop. (A line-by-line sketch of these replacements follows this list.)
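Here is that sequence of replacements written out in Python, purely for illustration:

data = "bbdcdcae"

# The pair "dc" occurs twice, so replace it with an unused symbol X.
data = data.replace("dc", "X")   # 'bbXXae'

# Replace the pair "bb" with another unused symbol Y.
data = data.replace("bb", "Y")   # 'YXXae'

print(data)                      # 'YXXae'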

For tokenization using BPE, we apply an extremely similar algorithm.

  1. Start with the character vocabulary.
All tokens: dict_keys(['/', '</w>', 'T', 'H', 'E', 'B', 'O', 'Y', 'W', 'L', 'I', 'V', 'D', 'M', 'r', '.', 'a', 'n', 'd', 's', 'u', 'l', 'e', 'y', ',', 'o', 'f', 'm', 'b', 'P', 'i', 'v', 't', 'w', 'p', 'h', 'c', 'k', '’', 'x', 'g', 'j', 'G', '|', '2', 'S', '-', 'J', 'K', 'R', ';', 'N', 'A', '“', '”', '—', 'F', 'z', '?', '3', '!', 'q', '4', 'C', '5', '6', '(', ')', ':', '7', '8', '9', '1', '0', 'U', '"', '\\', '‘', '•', 'Z', 'Q', '■', "'"])
No. of tokens: 83

Calculate the frequency of each token.

....'c': 6372,            
'd': 16279,
'e': 41351,
'f': 6429,
....

2. Split all the words according to the base vocabulary (which contains only characters at the moment) and mark the end of each word with a </w> token, so that suffixes and prefixes can be told apart. For instance, given Real</w> & algebra</w>, if al</w> were a token, it must have come from the end of Real</w> rather than from algebra.

T H E </w> : 17 
B O Y </w> : 1
W H O </w> : 1
L I V E D </w> : 1
M r . </w> : 79
a n d </w> : 2139
M r s . </w> : 44
D u r s l e y , </w> : 6
o f </w> : 1233
n u m b e r </w> : 14
...

3. Calculate the pairwise frequency for the current words according to the current vocab.

('/', '</w>') : 4 
('T', 'H') : 28
('H', 'E') : 21
('E', '</w>') : 33
('B', 'O') : 4
('O', 'Y') : 1
('Y', '</w>') : 8
('W', 'H') : 6
('H', 'O') : 7
('O', '</w>') : 6
('L', 'I') : 3
('I', 'V') : 2
('V', 'E') : 6
('E', 'D') : 9
('D', '</w>')
...

Each pair of symbols occurs with the frequency shown to its right.

4. Calculate the pair with the maximum frequency and merge. Add this to the current vocab.

('e', '</w>'): 13356   # these two will be merged in this case

All tokens: dict_keys(['/', '</w>', 'T', 'H', 'E', 'B', 'O', 'Y', 'W', 'L', 'I', 'V', 'D', 'M', 'r', '.', 'a', 'n', 'd', 's', 'u', 'l', 'e', 'y', ',', 'o', 'f', 'm', 'b', 'P', 'i', 'v', 't', 'w', 'e</w>', 'p', 'h', 'c', ...
No. of tokens: 84

Current list of words:
'T h e y </w>': 155,
't h e</w>': 3654,
'l a s t </w>': 61,

Now we can combine e</w> with other characters to form the next pairs. We also need to recalculate the token frequencies.

...'c': 6372,         
'd': 16279,
'e': 27995,
'e</w>': 13356,
...

Notice how the frequency of e has changed.

5. Repeat. Choose the number of merges as a hyper-parameter to your liking. This way, after a lot of merges, the common words appear unchanged, along with sub-words that can then be combined to help form unknown words. (A compact sketch of this merge loop follows.)
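Putting steps 1-5 together, here is a compact sketch of the merge loop, along the lines of the reference implementation from the original BPE paper (the toy vocab below is a tiny subset of the corpus, just for illustration):

import re
from collections import Counter

def get_stats(vocab):
    # Step 3: count each adjacent pair of symbols, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Step 4: merge the chosen pair everywhere it occurs.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with </w> marking the end of each word.
vocab = {"t h e </w>": 3654, "T h e y </w>": 155, "l a s t </w>": 61}

num_merges = 3   # Step 5: the number of merges is a hyper-parameter
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best, vocab)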

Once you have the list of tokens, you can combine the subwords found in this process to form a representation for an unknown word.
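For instance, here is a sketch of how learned merges could be applied to segment a new word (the merge list below is made up for illustration, not the one actually learned from the corpus above):

def segment(word, merges):
    # Greedily apply the learned merges, in the order they were learned,
    # to split a new word into known subword tokens.
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the pair in place
            else:
                i += 1
    return symbols

# Toy merge list, in the order the merges were "learned".
merges = [("e", "</w>"), ("t", "h"), ("th", "e</w>")]
print(segment("the", merges))     # ['the</w>']
print(segment("lathe", merges))   # ['l', 'a', 'the</w>']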

WordPiece:

Once you’ve learned about BPE, there are only a few tweaks needed to learn WordPiece. This is the tokenization method used in BERT, one of the most successful NLP papers in recent times.

The main difference comes with the idea of merging tokens. In BPE, we just take the most frequent pair & merge it.

In contrast, WordPiece merges a pair only when the merge increases the likelihood of the training data. For example, if er is more likely to appear than e & r separately, they are merged. You can read more about this here.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("Slytherin is the best house ever!")
Output:
['sly', '##ther', '##in', 'is', 'the', 'best', 'house', 'ever', '!']

The ## indicates that the token does not start the word. “Slytherin” was an unknown word in BERT’s vocabulary, but the vocabulary had tokens corresponding to sly, ##ther & ##in. Thus, it broke the word down using the tokens it knew.

Resources:

  1. https://huggingface.co/transformers/tokenizer_summary.html
  2. https://blog.floydhub.com/tokenization-nlp/
  3. https://leimao.github.io/blog/Byte-Pair-Encoding/
