Understanding NLP Keras Tokenizer Class Arguments with Examples

Akash Deep · Analytics Vidhya · Aug 22, 2020

As we all know, preparing the input is a very important step in any deep learning pipeline, for both image and text problems. In this blog we will try to understand one of the most important text preprocessing tools, the Tokenizer, along with the parameters available for it in Keras.

First, we will try to understand what a tokenizer basically does with a simple example.

A machine doesn’t understand text, so we need to convert the text into a machine-readable form, and that is nothing but numbers. To convert text into numbers we have a class in Keras called Tokenizer. Have a look at the simple example below to understand the idea more clearly.

The sentence “I love deep learning” will be assigned numbers by the Keras Tokenizer roughly as below:

i — 1, love — 2, deep — 3, learning — 4 (Note: the Tokenizer assigns indices by word frequency, with ties broken by order of first appearance, and it lowercases the text by default.)

Like this, if we pass a huge dataset with lots of documents into the Keras Tokenizer, it will convert all the words in the text into sequences of numbers. Now that we understand the concept of a token, we will try to understand how we can implement it using the Keras API on a larger dataset.

For this we first need to import the Tokenizer class from the Keras text preprocessing module using the code below:

from tensorflow.keras.preprocessing.text import Tokenizer

As soon as we have imported the Tokenizer class, we create an object instance of it. After creating the instance we call the method “fit_on_texts” on it, passing the sentences or the large dataset as a parameter. One thing to note: fit_on_texts accepts a list, so we need to convert the sentences, or the dataset columns we plan to tokenize, into a list. This can be done with the code below.

Code to apply the tokenizer to the sentences
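The original snippet isn’t reproduced here, so below is a minimal sketch of what it likely contained, using the two example sentences from this post (the exact word_index values may differ slightly between Keras versions):

from tensorflow.keras.preprocessing.text import Tokenizer

# the two example sentences used throughout this post
sentences = ['I love deep learning', 'do you like deep learning']

tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")  # arguments explained below
tokenizer.fit_on_texts(sentences)                        # builds the vocabulary

print(tokenizer.word_index)
# e.g. {'<OOV>': 1, 'deep': 2, 'learning': 3, 'i': 4, 'love': 5, 'do': 6, 'you': 7, 'like': 8}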

Now we will try to understand the important class arguments of the Tokenizer constructor used in the code above. Below is the list of arguments:

Keras Tokenizer arguments
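For reference, the constructor and its default argument values look roughly like this (check the official Keras documentation for the authoritative signature):

tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' ',
    char_level=False,
    oov_token=None,
)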

The first argument is num_words. In our example we have used num_words=10. num_words is essentially your vocabulary size: only the most frequent num_words - 1 words are kept when converting texts to sequences. We need to be careful while selecting this parameter because it affects the performance of the model. By default the value of num_words is None, which means no limit. A common choice is len(tokenizer.word_index) + 1, which keeps the whole vocabulary (the +1 accounts for the reserved padding index 0).
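A small sketch of the effect of num_words, reusing the sentences from above. Note that word_index always contains every word seen during fit_on_texts; num_words only limits which indices are used by calls such as texts_to_sequences:

# word_index holds the full vocabulary regardless of num_words
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding

tokenizer_small = Tokenizer(num_words=3, oov_token="<OOV>")
tokenizer_small.fit_on_texts(sentences)
print(tokenizer_small.texts_to_sequences(['I love deep learning']))
# only the (num_words - 1) most frequent words keep their index, the rest fall back
# to the OOV index, e.g. [[1, 1, 2, 1]]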

The second is filters, a string in which every character will be filtered out of the input text. By default it contains all punctuation, plus tabs and line breaks, minus the ' character.

The third is lower, a boolean stating whether to convert all the passed text to lower case. By default it is set to True.

The fourth is split, which specifies the separator the text will be split on (a space by default). For example, for the sentence “I/love/deep/learning”, if I select split=“/” the resulting tokens will be [‘i’, ‘love’, ‘deep’, ‘learning’].
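A quick sketch of the split argument on the slash-separated example (lower=True by default, so the tokens come out lowercased):

tok = Tokenizer(split='/')
tok.fit_on_texts(['I/love/deep/learning'])
print(list(tok.word_index.keys()))   # ['i', 'love', 'deep', 'learning']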

The fifth is char_level, which is False by default. It controls whether tokenization happens at the character level or the word level. If we set char_level=True then our example will be tokenized character by character, so ‘i’, ‘l’, ‘o’, ‘v’, ‘e’ and so on each get their own index. For word-level models we generally leave this parameter at False.
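A minimal sketch of character-level tokenization on our example sentence:

char_tok = Tokenizer(char_level=True)
char_tok.fit_on_texts(['I love deep learning'])
print(char_tok.word_index)
# every character (including the space) gets its own index, ordered by how often it occurs,
# so frequent characters like 'e' end up with the lowest indices here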

The sixth is oov_token. It is one of the most important arguments and by default it is None, but it is usually a good idea to specify something like “<OOV>”, because when we later call texts_to_sequences on the tokenizer object we created, every out-of-vocabulary word will be replaced with the “<OOV>” token instead of being silently dropped. texts_to_sequences simply converts the sentences we feed it into sequences of the token indices generated during fit_on_texts. It will be clearer from the snippet below.

texts_to_sequences
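The snippet referred to above isn’t reproduced here; a minimal sketch of what it likely showed, reusing the tokenizer fitted earlier:

# convert the fitted sentences into sequences of indices
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# e.g. [[4, 5, 2, 3], [6, 7, 8, 2, 3]]  (compare with the word_index printed earlier)

# a sentence containing a word never seen during fit_on_texts
test = tokenizer.texts_to_sequences(['I really love deep learning'])
print(test)
# 'really' is out of vocabulary, so it is mapped to the <OOV> index, e.g. [[4, 1, 5, 2, 3]]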

From the above example we can clearly see that the text we fed to the Tokenizer is now converted into sequences of numbers. We can compare against the word_index output from the earlier snippet.

We are very close to making the data ready as input for our neural network layers, as we now understand how to convert words into tokens and how to convert a sequence of words into a sequence of numbers.

We need to know one very important thing: our neural network expects input sequences of the same length, but real-world datasets will have sentences of different lengths in 99.9% of cases. So our last goal is to make all the sequences the same length, and we can achieve this using the Keras “padding” and “truncating” logic. To pad and truncate we need to import the “pad_sequences” utility from “tensorflow.keras.preprocessing.sequence”.

Padding: We can add 0s at the beginning or at the end of a sequence to make all sequences the same length. Typically we find the longest sentence and pad the others with 0s to match its length. This can be done with the code below.

padding
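The padding snippet isn’t reproduced here either; a minimal sketch with the two example sequences from above:

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, maxlen=5)
print(padded)
# zeros are added at the start ('pre' padding) of the shorter sequence, e.g.
# [[0 4 5 2 3]
#  [6 7 8 2 3]]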

We need to specify the sequences and maxlen as arguments to pad_sequences. In our example we have 2 sentences and the length of the longest one is 5, so we passed maxlen=5 and got two output sequences, both of length 5. That is padding; now we will look at truncating.

Truncating: Suppose we have 200 sentences, one of length 100 and the other 199 of length between 15 and 20. In this case padding everything to length 100 is not a good choice, so we go for truncating and specify maxlen as, say, 20 or 25. With maxlen=25, the 199 shorter sentences are padded to length 25 and the longest sentence is truncated from 100 down to 25 (see the sketch after the list below). With this we have made all input sequences the same length and, congratulations, our dataset is ready to be fed to a neural network. Below are a few pointers to keep in mind when padding and truncating:

  1. By default pad_sequences pads to the longest sequence
  2. Specify maxlen to set the length of the sequences
  3. By default, sequences are padded or truncated from the start of the sequence
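And a minimal sketch of truncating; the sequences here are made up purely for illustration:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# one artificially long sequence next to a short one
long_and_short = [[7] * 100, [4, 5, 2, 3]]

out = pad_sequences(long_and_short, maxlen=25)
print(out.shape)   # (2, 25): the long sequence is truncated, the short one is padded
# by default both padding and truncating happen at the start ('pre');
# pass padding='post' / truncating='post' to work from the end instead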

Congratulations, we have now understood the concepts of tokenization step by step and finally made the dataset ready as input to a neural network. In the next blog we will learn more about embeddings and pre-trained embeddings like Word2Vec, GloVe and BERT, step by step.

For more reference I suggest going through the official Keras documentation. If you have any doubts about tokenization, comment below.

Stay Tuned

