Have you ever wondered why the Tokenizer in Keras reserves word_index 0? As a beginner in the NLP field, I did.
from keras.preprocessing.text import Tokenizer

num_words = 3
tk = Tokenizer(oov_token='UNK', num_words=num_words+1)
texts = ["my name is far faraway asdasd", "my name is","your name is"]
tk.fit_on_texts(texts)
print(tk.word_index)
print(tk.texts_to_sequences(texts))

# output
{'your': 7, 'my': 3, 'name': 1, 'far': 4, 'faraway': 5, 'is': 2, 'UNK': 8, 'asdasd': 6}
[[3, 1, 2], [3, 1, 2], [1, 2]]
We can see that Keras assigns the UNK (unknown) token the index word_count + 1. In a lot of TensorFlow preprocessing code, however, the UNK token is given index 0 in the vocabulary. So why does Keras choose word_count + 1 as the index for UNK?
The reason is to distinguish PAD from UNK: when we use pad_sequences
to pad the sentences to a fixed length, the pad value is 0. For example, let's take 10 as the sequence length.
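One wrinkle before running the padding code: in the word_index printed above, UNK has index 8, yet the sequences printed below use 4 for the out-of-vocabulary words. In the Keras version used here, the OOV token is simply appended to word_index after fitting, so it has to be remapped by hand to num_words + 1. The following is a minimal sketch of that remapping using plain dicts (so it runs without Keras); the exact workaround is an assumption, reconstructed to match the printed output:

```python
num_words = 3
oov = 'UNK'

# word_index as produced by fit_on_texts in this Keras version
word_index = {'your': 7, 'my': 3, 'name': 1, 'far': 4,
              'faraway': 5, 'is': 2, 'UNK': 8, 'asdasd': 6}

# keep only the num_words most frequent words (indices 1..num_words),
# then remap the OOV token to num_words + 1
word_index = {w: i for w, i in word_index.items() if i <= num_words}
word_index[oov] = num_words + 1
# word_index is now {'my': 3, 'name': 1, 'is': 2, 'UNK': 4}

# encoding with this remapped vocabulary reproduces the sequences below
texts = ["my name is far faraway asdasd", "my name is", "your name is"]
sequences = [[word_index.get(w, word_index[oov]) for w in t.split()]
             for t in texts]
# sequences == [[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
```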
from keras.preprocessing.sequence import pad_sequences
sequences = tk.texts_to_sequences(texts)
print(sequences)
data = pad_sequences(sequences, maxlen=10, padding='post')
print(data)
#output
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
[[3 1 2 4 4 4 0 0 0 0]
[3 1 2 0 0 0 0 0 0 0]
[4 1 2 0 0 0 0 0 0 0]]
Because the UNK token index is 4 and the PAD value is 0, we can clearly distinguish the two.
Of course, this is just the text preprocessing pipeline in Keras; if you do the preprocessing yourself, you are free to choose any indices you like. But if you use Keras for preprocessing, it is better to follow this convention.
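For illustration, here is a hand-rolled pipeline in plain Python (no Keras) that follows the other common convention mentioned earlier, reserving low indices for the special tokens, e.g. PAD = 0 and UNK = 1. The vocabulary and index choices are made up for this sketch:

```python
PAD, UNK = 0, 1  # our own convention: padding at 0, unknown at 1
maxlen = 10

# hypothetical vocabulary: real-word indices start after the special tokens
vocab = {'name': 2, 'is': 3, 'my': 4}

def encode(text):
    # map each word to its index, unknown words to UNK
    seq = [vocab.get(w, UNK) for w in text.split()]
    # post-pad with PAD up to maxlen, then truncate to exactly maxlen
    return (seq + [PAD] * maxlen)[:maxlen]

texts = ["my name is far faraway asdasd", "my name is", "your name is"]
print([encode(t) for t in texts])
# [[4, 2, 3, 1, 1, 1, 0, 0, 0, 0],
#  [4, 2, 3, 0, 0, 0, 0, 0, 0, 0],
#  [1, 2, 3, 0, 0, 0, 0, 0, 0, 0]]
```

Since you control every index here, PAD and UNK stay distinguishable no matter which slots you assign them, as long as the two are different.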