It started with a simple quest to code a VAE model for topic modeling in Keras. I’d seen some examples of VAEs on MNIST data, and I thought I could just take one and repurpose it for my text problem. I know how foolish that sounds now. Papers such as this and this give an idea of how difficult the task is. The aim, however, since these papers show it is possible, is to replicate their results or achieve similar ones.
So down the bunny trail I went, and I realized the only way to go forward was to go back.
So here I am, writing up a basic sequence-to-sequence (seq2seq) model on a toy problem: autoencoding text data to get back the original data.
Variable-length sequences are padded to a fixed sequence length, and setting mask_zero=True on the Embedding layer tells the layers after it to disregard zero values (typically the PAD value).
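As a minimal sketch of what padding and masking mean here, the plain-Python snippet below left-pads integer sequences with 0 and derives the boolean mask that marks real tokens. The helper names `pad_pre` and `nonzero_mask` and the word ids are made up for illustration; in Keras the padding step is normally done with `pad_sequences`, which also pads and truncates at the front by default.

```python
def pad_pre(sequences, seq_len, pad_value=0):
    """Left-pad (or left-truncate) each integer sequence to exactly seq_len."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        seq = seq[max(len(seq) - seq_len, 0):]  # keep the last seq_len tokens
        padded.append([pad_value] * (seq_len - len(seq)) + seq)
    return padded


def nonzero_mask(padded, pad_value=0):
    """True where a real token sits, False at PAD positions."""
    return [[token != pad_value for token in row] for row in padded]


# Hypothetical word ids; 0 is reserved for PAD.
batch = pad_pre([[5, 2, 9], [7, 1, 4, 3, 8, 6]], seq_len=5)
mask = nonzero_mask(batch)
print(batch)  # [[0, 0, 5, 2, 9], [1, 4, 3, 8, 6]]
print(mask)
```

Padding at the front matches the example sentences further down, where the PAD tokens appear before the first real word.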
Briefly, the steps are as follows:
import numpy as np

# read data sets...
# tokenize and convert to integer sequences
# pad sequences to make of same length

# convert y_train to one-hot encoded target
# (X_train.shape is a tuple, so use shape[0] for the number of samples)
y_train = np.zeros((X_train.shape[0], seq_len, len(vocab)))
for i in range(X_train.shape[0]):
    for j, word in enumerate(X_train[i]):
        y_train[i, j, word] = 1.
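from keras.models import Sequential
from keras.layers import Embedding

# seq2seq model
model = Sequential()
# mask_zero=True makes the layers after the embedding skip PAD (0) positions
model.add(Embedding(len(vocab) + 1, EMBEDDING_DIM, mask_zero=True))
# ... encoder and decoder layers omitted here ...
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# the argument is called epochs in Keras 2 and later (nb_epoch in Keras 1)
model.fit(X_train, y_train, epochs=nb_epoch, batch_size=batch_size)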
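The first two commented steps are not spelled out in the code. One possible way to do them, sketched in plain Python under the assumption that titles are lower-cased and split on whitespace, is shown here. The names `build_vocab` and `encode` and the sample titles are illustrative only, and index 0 is kept free for PAD so that it can be masked.

```python
from collections import Counter


def build_vocab(texts):
    """Map each word to an integer id, most frequent first; 0 is left for PAD."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def encode(texts, vocab):
    """Turn each text into a list of word ids, skipping unknown words."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]


titles = ["Congress to hold hearings", "GM recall hearings"]  # made-up sample
vocab = build_vocab(titles)
sequences = encode(titles, vocab)
print(sequences)
```

With this particular scheme the ids run from 1 to `len(vocab)`, so a one-hot target built from them would need `len(vocab) + 1` columns to have room for PAD at index 0.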
After 1000 epochs on training data of ~3000 news-title sentences, the training loss is 0.22 and the accuracy is 0.95.
Some example input-output reconstructions on test data are:
Input: “PAD ios 7 1 update released adds carplay improves siri and fixes bugs”
Output: “apple ios 7 7 1 update boost support ios ios ios ios and”
Input: “PAD PAD PAD PAD PAD congress to hold hearings no belated gm recall”
Output: “PAD PAD PAD PAD congress congress recall gm gm lower gm recall recall”
Input: “PAD PAD pound forecasts boosted to highest since 2011 on boe rate bets”
Output: “PAD PAD PAD google prices jump rate to valukas valukas since major policy”
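Strings like these are presumably obtained by taking the most probable word at each position of the model's output and mapping the ids back to words. The post does not show that step, so the following is only a sketch: `decode`, `index_to_word`, and the tiny probability table are hypothetical.

```python
def decode(probabilities, index_to_word, pad_token="PAD"):
    """Pick the highest-scoring word id at each time step and join the words."""
    words = []
    for step in probabilities:
        best = max(range(len(step)), key=step.__getitem__)  # argmax over the vocabulary
        words.append(index_to_word.get(best, pad_token))
    return " ".join(words)


# Hypothetical 4-entry vocabulary; id 0 is PAD.
index_to_word = {0: "PAD", 1: "gm", 2: "recall", 3: "congress"}
prediction = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.20, 0.10, 0.60],
    [0.05, 0.70, 0.15, 0.10],
    [0.05, 0.15, 0.70, 0.10],
]
print(decode(prediction, index_to_word))  # PAD congress gm recall
```

In a real run, `prediction` would be one row of the array returned by `model.predict` on a padded test sentence, with shape (seq_len, vocabulary size).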
The loss could be reduced further, but the model is already predicting some words from the input sentence.
Of course, this model does not solve any useful problem by itself. It would be more useful to apply seq2seq learning to tasks like question answering, text summarization, or translation. This is just an example of how to get it working in Keras, and a base for more interesting models such as VAEs.