Why the Tokenizer in Keras reserves word_index 0

Xu LIANG
2 min read · Jul 5, 2018


You can find my answer here too, and the notebook is here.

Did you ever wonder why the Tokenizer in Keras reserves word_index 0? As a beginner in the NLP field, I did.

from keras.preprocessing.text import Tokenizer

num_words = 3
# Reserve one extra slot so the UNK token fits inside num_words.
tk = Tokenizer(oov_token='UNK', num_words=num_words+1)
texts = ["my name is far faraway asdasd", "my name is", "your name is"]
tk.fit_on_texts(texts)
print(tk.word_index)
print(tk.texts_to_sequences(texts))
#output
{'your': 7, 'my': 3, 'name': 1, 'far': 4, 'faraway': 5, 'is': 2, 'UNK': 8, 'asdasd': 6}
[[3, 1, 2], [3, 1, 2], [1, 2]]

We can see that Keras assigns the UNK (unknown) token the index word_count + 1. When I read preprocessing code written with TensorFlow, many projects set the UNK token index to 0 in the vocabulary. So why does Keras choose word_count + 1 as the index for UNK?

The reason is to distinguish between PAD and UNK. If we use pad_sequences to pad the sentences to a fixed length, the padding value will be 0. For example, let's take 10 as the sequence length.
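
The first output above still has UNK at index 8, while the padded sequences below show UNK at index 4, so one manual step sits in between: the word index is trimmed to the top num_words words and UNK is moved to num_words + 1. This trimming is a hand-written convention rather than a Tokenizer API call, so treat it as a minimal sketch:

# Keep only the num_words most frequent words (indices 1..num_words)
# and move UNK to num_words + 1, leaving index 0 free for padding.
tk.word_index = {word: i for word, i in tk.word_index.items() if i <= num_words}
tk.word_index[tk.oov_token] = num_words + 1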

from keras.preprocessing.sequence import pad_sequences
sequences = tk.texts_to_sequences(texts)
print(sequences)
data = pad_sequences(sequences, maxlen=10, padding='post')
print(data)

#output
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
[[3 1 2 4 4 4 0 0 0 0]
[3 1 2 0 0 0 0 0 0 0]
[4 1 2 0 0 0 0 0 0 0]]

Because the UNK token index is 4 and the PAD value is 0, we can clearly distinguish the two.
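
To make the distinction visible, we can map the ids of the first padded row back to tokens. This is just a quick check that reuses the trimmed word_index from above; the 'PAD' label is only for display, since index 0 has no word attached to it.

# Reverse the word index; id 0 maps to no word because it is reserved for padding.
index_word = {i: w for w, i in tk.word_index.items()}
print([index_word.get(i, 'PAD') for i in data[0]])

#output
['my', 'name', 'is', 'UNK', 'UNK', 'UNK', 'PAD', 'PAD', 'PAD', 'PAD']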

In the end, this is just the text preprocessing pipeline in Keras. If you do the preprocessing yourself, you are free to set any index you like, but if you use Keras for preprocessing, it is better to follow this convention.
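
For example, here is a hypothetical hand-rolled version that uses the other common convention, PAD = 0 and UNK = 1 (the names vocab and encode are made up for this sketch, not part of any library):

# Hand-rolled preprocessing with PAD = 0 and UNK = 1; real words start at 2.
PAD, UNK = 0, 1
vocab = {'name': 2, 'is': 3, 'my': 4}

def encode(text, maxlen=10):
    ids = [vocab.get(w, UNK) for w in text.split()]
    return (ids + [PAD] * maxlen)[:maxlen]

print(encode("my name is far faraway asdasd"))

#output
[4, 2, 3, 1, 1, 1, 0, 0, 0, 0]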


Xu LIANG

I’m an engineer focusing on NLP and data science. I write to give back to the engineering community. You can find me on linkedin.com/in/xu-liang-99356891/