Build a simple predictive keyboard using Python and Keras
Keyboards are part of everyday life; we use them in every computing environment. To reduce typing effort, most keyboards today offer advanced prediction: they predict the next character or the next word, or even autocomplete an entire sentence. So let’s discuss a few techniques for building a simple next-word prediction keyboard app using Keras in Python. This tutorial is inspired by Venelin Valkov’s blog post on a next-character prediction keyboard.
We use a recurrent neural network (RNN) for this purpose, because it gives the model a way to take previous inputs into account. Specifically, we use an LSTM, a special kind of RNN. The LSTM preserves the error signal as it is backpropagated through time and layers, which helps to reduce the vanishing gradient problem.
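To make the vanishing gradient remark a little more concrete, here is a toy calculation rather than a real RNN. It assumes, purely for illustration, that the gradient is multiplied by the same scalar factor at every time step; the function name `gradient_after` is made up for this sketch.

```python
# Toy illustration only: a real RNN multiplies by Jacobian matrices,
# here we assume one constant scalar factor per time step.
def gradient_after(steps, factor):
    """Return the scale of a gradient after `steps` multiplications by `factor`."""
    return factor ** steps

for factor in (0.5, 0.9, 1.0):
    print(factor, gradient_after(50, factor))
```

With a factor below 1, the signal from 50 steps back has almost disappeared. Roughly speaking, the LSTM's cell state gives the error a path along which that factor can stay close to 1.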
Let’s Code!
To start with, we need to install a few libraries.
pip install numpy
pip install tensorflow
pip install keras
pip install nltk
pip install matplotlib
Now let’s import the required libraries.
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq
The next important step is loading the dataset. Here we use The Adventures of Sherlock Holmes as the dataset.
path = '1661-0.txt'
text = open(path).read().lower()
print('corpus length:', len(text))
Output
corpus length: 581887
Now we split the entire dataset into individual words, keeping their order and dropping special characters.
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
Output
['project', 'gutenberg', 's', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', ............................... , 'our', 'email', 'newsletter', 'to', 'hear', 'about', 'new', 'ebooks']
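To see what the pattern \w+ keeps and what it drops, we can try the same regular expression with Python's standard re module. For this particular pattern, re.findall should give the same tokens as RegexpTokenizer. The sample sentence below is made up for illustration and is not read from the dataset file.

```python
import re

# Made-up sample line, lowercased the same way as the corpus
sample = "Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle".lower()

# \w+ matches runs of letters, digits and underscores; everything else is a separator
tokens = re.findall(r"\w+", sample)
print(tokens)
```

Notice that the apostrophe splits "gutenberg's" into 'gutenberg' and 's', which is why a lone 's' shows up in the word list above.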
Next, for the feature engineering part, we need a sorted list of the unique words. We also need a dictionary (<key: value>) with each word from the unique_words list as the key and its position in that list as the value.
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
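As a quick sanity check, here is the same idea on a toy word list. The names toy_words, toy_unique and toy_index are made up for this sketch; the real code uses words, unique_words and unique_word_index.

```python
import numpy as np

toy_words = ["the", "dog", "saw", "the", "cat"]

# np.unique removes duplicates and returns the words in sorted order
toy_unique = np.unique(toy_words)

# Map every unique word to its position in the sorted array
toy_index = {word: i for i, word in enumerate(toy_unique)}

print(list(toy_unique))
print(toy_index["the"])
```

Because the array is sorted, the index of a word is stable for a given dataset, and unique_words[idx] later turns a predicted index back into a word.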
Feature engineering
According to Wikipedia, feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning and is both difficult and expensive.
We define WORD_LENGTH, the number of previous words used to predict the next word. We also create an empty list called prev_words to store each set of five previous words, and a list called next_words to store the corresponding next word. We fill these lists by looping over a range that is 5 less than the length of words.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])
Output
['project', 'gutenberg', 's', 'the', 'adventures']
of
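The sliding window is easier to follow on a tiny example. The sketch below uses a made-up list of letters and a window of 3 in place of WORD_LENGTH = 5.

```python
toy_words = ["a", "b", "c", "d", "e", "f", "g"]
window = 3  # stands in for WORD_LENGTH

# Each pair holds `window` consecutive words and the word that follows them
pairs = [(toy_words[i:i + window], toy_words[i + window])
         for i in range(len(toy_words) - window)]

for context, target in pairs:
    print(context, "->", target)
```

Seven words with a window of 3 give 7 - 3 = 4 training pairs, which is why the loop stops WORD_LENGTH words before the end of the list.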
Now it’s time to generate the feature vectors. For this we use one-hot encoding.
Here, we create two numpy arrays: X for storing the features and Y for storing the corresponding label (here, the next word). We iterate over prev_words and next_words and set the position that corresponds to each word to 1.
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1
Let’s look at a single sequence:
print(X[0][0])
Output
[False False False … False False False]
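The row above looks like it is all False because only one position out of the whole vocabulary is True. A toy version with a four-word vocabulary makes the pattern visible; all names and data here are invented for the sketch.

```python
import numpy as np

vocab = ["cat", "dog", "saw", "the"]           # toy vocabulary, already sorted
index = {w: i for i, w in enumerate(vocab)}
contexts = [["the", "dog"], ["dog", "saw"]]    # toy prev_words with a window of 2
targets = ["saw", "the"]                       # toy next_words

X_toy = np.zeros((len(contexts), 2, len(vocab)), dtype=bool)
Y_toy = np.zeros((len(targets), len(vocab)), dtype=bool)
for i, context in enumerate(contexts):
    for j, word in enumerate(context):
        X_toy[i, j, index[word]] = True
    Y_toy[i, index[targets[i]]] = True

print(X_toy[0].astype(int))   # one row per word, a single 1 in each row
print(X_toy.sum(axis=2))      # counts the True values at every position
```

On the real data, np.flatnonzero(X[0][0]) is one way to find the single index that is set for the first word.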
Building the model
We use a single-layer LSTM model with 128 neurons, a fully connected layer, and a softmax function for activation.
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))
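To get a feeling for the size of this model, we can count its trainable parameters by hand. The sketch below assumes the default Keras LSTM and Dense layers with biases, and a hypothetical vocabulary of 8000 words, since the real value of len(unique_words) is not shown here. model.summary() should report the actual numbers.

```python
def lstm_params(units, input_dim):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input/output pair plus one bias per output
    return inputs * outputs + outputs

vocab_size = 8000  # hypothetical; use len(unique_words) in practice
total = lstm_params(128, vocab_size) + dense_params(128, vocab_size)
print(total)
```

Both terms grow linearly with the vocabulary size, so with one-hot inputs the vocabulary dominates the parameter count.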
Training
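The model is trained with the RMSprop optimizer. The plan is 20 epochs, but the code below runs only 2, which is what produced the sample history in the Evaluation section, so raise epochs for a better fit.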
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=2, shuffle=True).history
After successful training, we will save the trained model and just load it back as needed.
model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))

model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))
Evaluation
During training, Keras reports the loss and accuracy for each epoch. The same values are also available from the history variable.
{'val_loss': [6.99377903472107, 7.873811178441364], 'val_accuracy': [0.1050897091627121, 0.10563895851373672], 'loss': [6.0041207935270124, 5.785401324014241], 'accuracy': [0.10772078, 0.14732216]}
# sample evaluation (only 2 epochs)
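A small helper loop makes these numbers easier to read. The sketch below copies the sample values from above into a dictionary called sample_history (a name chosen for this example) and prints the gap between training and validation loss for each epoch.

```python
sample_history = {
    "val_loss": [6.99377903472107, 7.873811178441364],
    "val_accuracy": [0.1050897091627121, 0.10563895851373672],
    "loss": [6.0041207935270124, 5.785401324014241],
    "accuracy": [0.10772078, 0.14732216],
}

# Gap between validation and training loss, one value per epoch
gaps = [v - l for l, v in zip(sample_history["loss"], sample_history["val_loss"])]
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: val_loss - loss = {gap:.3f}")

# Epoch (1-based) with the lowest validation loss
val_loss = sample_history["val_loss"]
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
print("lowest validation loss at epoch", best_epoch)
```

Here the training loss falls while the validation loss rises, which can be an early sign of overfitting, although two epochs are too few to draw a firm conclusion.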
Prediction
Now we need to predict new words using this model. To do that, we convert the input string into a single feature vector and pass it to the model.
def prepare_input(text):
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)
        x[0, t, unique_word_index[word]] = 1
    return x

prepare_input("It is not a lack".lower())
Output
array([[[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]]])
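One thing to keep in mind: prepare_input looks every word up in unique_word_index, so a word that never appeared in the dataset raises a KeyError, and an input longer than WORD_LENGTH words runs past the array. A more forgiving variant could look like the sketch below. The name prepare_input_safe, the toy vocabulary and the choice to leave unknown words as all-zero rows are assumptions made for this example, not part of the original code.

```python
import numpy as np

WINDOW = 5                                       # mirrors WORD_LENGTH
toy_vocab = ["a", "is", "it", "lack", "not"]     # toy stand-in for unique_words
toy_index = {w: i for i, w in enumerate(toy_vocab)}

def prepare_input_safe(text, index=toy_index, window=WINDOW):
    """One-hot encode the last `window` words; unknown words stay all-zero."""
    x = np.zeros((1, window, len(index)))
    words = text.lower().split()[-window:]       # keep at most `window` words
    for t, word in enumerate(words):
        if word in index:                        # skip out-of-vocabulary words
            x[0, t, index[word]] = 1
    return x

x = prepare_input_safe("It is not a lack")
print(x.shape, x.sum())
```

Whether an all-zero row is a sensible stand-in for an unknown word depends on the model; a dedicated "unknown" token in the vocabulary is another common option.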
The sample function picks the n most likely words from the model’s prediction.
def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return heapq.nlargest(top_n, range(len(preds)), preds.take)
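Here is the same function run on a made-up probability vector, so we can see what it returns. The function body is repeated (slightly condensed) only to keep the sketch self-contained; toy_preds is invented for the example.

```python
import heapq
import numpy as np

def sample(preds, top_n=3):
    preds = np.asarray(preds).astype("float64")
    preds = np.exp(np.log(preds))        # log followed by exp leaves the values unchanged
    preds = preds / np.sum(preds)        # renormalise so the values sum to 1
    # Indices of the top_n largest probabilities, most likely first
    return heapq.nlargest(top_n, range(len(preds)), preds.take)

toy_preds = [0.05, 0.50, 0.10, 0.30, 0.05]   # made-up softmax output
print(sample(toy_preds, top_n=3))
```

Since the log and exp steps cancel out, the function effectively returns the indices of the top_n highest probabilities. Those two steps would only matter if a temperature were applied between them.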
Finally, for prediction, we use the function predict_completions, which uses the model to predict and return a list of the n most likely next words.
def predict_completions(text, n=3):
    if text == "":
        return("0")
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [unique_words[idx] for idx in next_indices]
Now let’s see how it predicts. We use tokenizer.tokenize to remove the punctuation, and we keep only the first 5 words because our prediction is based on the 5 previous words.
q = "Your life will never be the same again"
print("correct sentence: ",q)
seq = " ".join(tokenizer.tokenize(q.lower())[0:5])
print("Sequence: ",seq)
print("next possible words: ", predict_completions(seq, 5))
Output
correct sentence: Your life will never be the same again
Sequence: your life will never be
next possible words: ['the', 'of', 'very', 'no', 'in']
Drawbacks
- While preparing the unique words, we only collected words from the input dataset, not from an English dictionary, so many words are missing from the vocabulary. Covering such a large vocabulary (the NLTK words corpus lists roughly 236,000 English words) would require batch processing.
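One possible shape for that batch processing is a generator that builds the one-hot arrays a slice at a time instead of allocating X and Y for the whole dataset. This is only a sketch under that assumption; the name one_hot_batches and the toy data are invented for the example.

```python
import numpy as np

def one_hot_batches(prev_words, next_words, word_index, batch_size):
    """Yield (X, Y) one-hot batches so the full arrays never sit in memory."""
    vocab_size = len(word_index)
    window = len(prev_words[0])
    for start in range(0, len(prev_words), batch_size):
        contexts = prev_words[start:start + batch_size]
        targets = next_words[start:start + batch_size]
        X = np.zeros((len(contexts), window, vocab_size), dtype=bool)
        Y = np.zeros((len(contexts), vocab_size), dtype=bool)
        for i, context in enumerate(contexts):
            for j, word in enumerate(context):
                X[i, j, word_index[word]] = True
            Y[i, word_index[targets[i]]] = True
        yield X, Y

# Toy data, not the Sherlock Holmes corpus
toy_index = {"a": 0, "b": 1, "c": 2, "d": 3}
toy_prev = [["a", "b"], ["b", "c"], ["c", "d"]]
toy_next = ["c", "d", "a"]
shapes = [(X.shape, Y.shape)
          for X, Y in one_hot_batches(toy_prev, toy_next, toy_index, batch_size=2)]
print(shapes)
```

This generator makes a single pass over the data. Using it for training would need extra wiring, such as looping over it once per epoch, which is left out here.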
References