Implementing MaLSTM on Kaggle’s Quora Question Pairs competition

This article is about the MaLSTM Siamese LSTM network (link to article on the second paragraph) for sentence similarity and its appliance to Kaggle’s Quora Pairs competition.
I will do my best to explain the network and go through the Keras code (if you are only here for the code, scroll down :)
Full code on github


In the past few years, deep learning is all the fuss in the tech industry.
To keep up on things I like to get my hands dirty implementing interesting network architectures I come across in article readings.

Few months ago I came across a very nice article called Siamese Recurrent Architectures for Learning Sentence Similarity which offers a pretty straightforward approach at the common problem of sentence similarity.
Named MaLSTM (“Ma” for Manhattan distance), its architecture is depicted in figure 1 (diagram excludes the sentence preprocessing part)
Notice that since this is a Siamese network, it is easier to train because it shares weights on both sides.

Figure 1 MaLSTM’s architecture - Similar color means the weights are shared between the same-colored elements

Network explained

(I will be using Keras, so some technical details are related to the implementation)

So first of all, what is a “Siamese network”? 
Siamese networks are networks that have two or more identical sub-networks in them.
Siamese network seem to perform good on similarity tasks and have been used for tasks like sentence semantic similarity, recognizing forged signatures and many more.

In MaLSTM the identical sub-network is all the way from the embedding up to the last LSTM hidden state.
Word embedding is a modern way to represent words in deep learning models, more about it can be found in this nice blog post.
Essentially it’s a method to give words semantic meaning in a vector representation.

Inputs to the network are zero padded sequences of word indices, these inputs are vectors of fixed length, where the first zeros are being ignored and the non zeros are indices that uniquely identify words.
Those vectors are then fed into the the embedding layer, which looks ups the corresponding embedding for each word and encapsulates all of them into a matrix which represents the given text as a series of embeddings.
The embedding used are Google’s word2vec as in the original paper.
The process is depicted in figure 2.

Figure 2 Embedding process

Then the two embedded matrices representing the candidate similar questions are fed into the LSTM (practically, there is only one LSTM) and the final state of the LSTM for each question is a 50 dimensional vector which is trained to capture the semantic meaning of the question.
In figure 1, this vector is denoted by the letter h.
If you don’t entirely understand LSTMs, I suggest reading this wonderful post.

By now we have the two vectors that hold the semantic meaning of each question, now we put them through the defined similarity function (below)

MaLSTM similarity function

and since we have an exponent of a negative the output (the prediction in our case) will be between 0 and 1.

Training

The optimizer of choice in the article is the Adadelta optimizer, which can be read about in this article, using gradient clipping to avoid the exploding gradient problem.
Nice explanation taken from the Udacity deep learning course about gradient clipping in this video.

This is where I will diverge a little from the original paper for the sake of simplicity — in the original paper the LSTM weights are initialized in a specific manner and then (pre)trained on some other task.

Other parameters such as batch size, epochs and the gradient clipping norm value are chosen by me.


CODE

My full implementation can be found in this Jupyter notebook — keep following this post to see only the significant parts.

Here (and in the notebook) I’ve excluded all of the data analysis part, again to keep things simple and the article readable.

Preprocessing

We get the data as raw text, so our first mission is to take the text and convert it into lists of word indices.
When first opening the training data files in pandas, this is what you get.

Figure 3 First lines of the raw training dataset

Our columns of interest are question1, question2 and is_duplicate which are self explanatory.
The training data is stored in train_df and the test data in test_df and both are pandas DataFrames.
Only difference between train_df and test_df is that the latter doesn’t have the is_duplicate column

train_df = pd.read_csv(TRAIN_CSV)
test_df = pd.read_csv(TEST_CSV)

Next I created a helper function named text_to_word_list(text) which takes string as input and outputs a list where each entry is a single word from the text and does some preprocessing (removing specific signs etc).

Now our aim is to have the ability to turn a word into its embedding given by word2vec, in order to do that we will need to build:

  • vocabulary which is a dict where the keys are words (str) and values are the corresponding indices (a unique id as int).
  • inverse_vocabulary which is a list of words (str) where the index in the list is the matching id (fromvocabulary).
    We reserve the first place for an all zeros embedding — this is needed for the zero padding to be ignored.

We also use gensim.models.KeyedVectors to load the word2vec embeddings.
Throughout the code only 2 functions of this class will be used, .vocab which will hold all of the word2vec words and .word_vec(word) which takes a word and returns its embedding.
Finally we will use nltk's English stopwords and store them in stops.

Creating vocabulary and inverse_vocabulary:

vocabulary = dict()
inverse_vocabulary = ['<unk>']  
# '<unk>' will never be used, it is only a placeholder for the
# [0, 0, ....0] embedding
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE,binary=True)

questions_cols = ['question1', 'question2']

# Iterate over the questions only of both training and test datasets
for dataset in [train_df, test_df]:
for index, row in dataset.iterrows():

# Iterate through the text of both questions of the row
for question in questions_cols:

q2n = [] # q2n -> question numbers representation
for word in text_to_word_list(row[question]):

# Check for unwanted words
if word in stops and word not in word2vec.vocab:
continue

if
word not in vocabulary:
vocabulary[word] = len(inverse_vocabulary)
q2n.append(len(inverse_vocabulary))
inverse_vocabulary.append(word)
else:
q2n.append(vocabulary[word])

# Replace questions with lists of word indices
dataset.set_value(index, question, q2n)

So now we have vocabulary, inverse_vocabulary and both train_df and test_df converted to word indices, screenshot below.

Figure 4 First lines of the converted to word indices training dataset.

Notice we start at index 1, index 0 is reserved for the zero padding.
Also notice I do not exclude stopwords if they have embeddings, I will later give them a random representation — this is done for the sake of simplicity, a far better approach will be to train your own embeddings to better capture context of the problem.

Embedding matrix

Our next goal is to create the embedding matrix.
We will assign each word its word2vec embedding and leave the unrecognized ones (less than 0.5%) random.
Also we keep the first index all zeros.

embedding_dim = 300
# This will be the embedding matrix
embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)
embeddings[0] = 0 # So that the padding will be ignored


# Build the embedding matrix
for word, index in vocabulary.items():
if word in word2vec.vocab:
embeddings[index] = word2vec.word_vec(word)

Great, we have our embedding matrix in place.

Data preparation

In order to prepare our data for use in Keras we have to do two things:

  • Split our data to ‘left’ and ‘right’ inputs (one for each side of the MaLSTM)
  • Pad all of the word number sequences with zeros

We will also create a validation dataset, to measure our model using scikit-learn’s train_test_split function — it keeps the labels distribution between the datasets by default. 
In max_seq_length we have the length of the longest question, and here is the code

# Split to train validation
validation_size = 40000
training_size = len(train_df) - validation_size

X = train_df[questions_cols]
Y = train_df['is_duplicate']

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size)

# Split to dicts
X_train = {'left': X_train.question1, 'right': X_train.question2}
X_validation = {'left': X_validation.question1, 'right': X_validation.question2}
X_test = {'left': test_df.question1, 'right': test_df.question2}

# Convert labels to their numpy representations
Y_train = Y_train.values
Y_validation = Y_validation.values

# Zero padding
for dataset, side in itertools.product([X_train, X_validation], ['left', 'right']):
dataset[side] = pad_sequences(dataset[side], maxlen=max_seq_length)

itertools.product simply gives all the combinations between the two lists.

Model

Now we create the model itself.
Most of the code is pretty clear just by reading it but I would like to take a moment to talk about Keras Merge layer.
The Merge layer allows us to merge elements with some built in methods, but also supports custom methods and this is where it comes in handy since we need to “merge” our two LSTMs output using the MaLSTM similarity function.
First lets define the MaLSTM similarity function.

def exponent_neg_manhattan_distance(left, right):
return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))

Now lets build the model (using functional API)

# The visible layer
left_input = Input(shape=(max_seq_length,), dtype='int32')
right_input = Input(shape=(max_seq_length,), dtype='int32')

embedding_layer = Embedding(len(embeddings), embedding_dim, weights=[embeddings], input_length=max_seq_length, trainable=False)

# Embedded version of the inputs
encoded_left = embedding_layer(left_input)
encoded_right = embedding_layer(right_input)

# Since this is a siamese network, both sides share the same LSTM
shared_lstm = LSTM(n_hidden)

left_output = shared_lstm(encoded_left)
right_output = shared_lstm(encoded_right)

# Calculates the distance as defined by the MaLSTM model
malstm_distance = Merge(mode=lambda x: exponent_neg_manhattan_distance(x[0], x[1]), output_shape=lambda x: (x[0][0], 1))([left_output, right_output])

# Pack it all up into a model
malstm = Model([left_input, right_input], [malstm_distance])

Don’t you love it how simple Keras is?
Next we define the optimizer and compile our model.

# Adadelta optimizer, with gradient clipping by norm
optimizer = Adadelta(clipnorm=gradient_clipping_norm)

malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

Now all that is left is to train it!

Training and results

I launched it on my local machine, which has a GTX 1070.
The whole script and including preparation and training took about 21 hours.

malstm_trained = malstm.fit([X_train['left'], X_train['right']], Y_train, batch_size=batch_size, nb_epoch=n_epoch,
validation_data=([X_validation['left'], X_validation['right']], Y_validation),
callbacks=[checkpointer])

To properly evaluate the model performance, lets plot training data vs validation data accuracy and loss

Accuracy:

Loss:

So just like that out of the box, seems that MaLSTM is doing OK, getting an 82.5% accuracy rate on the validation data.

Improvements

This article and the code as well was written with simplicity in mind.
To achieve state of the art results tuning and adjusting to your specific use case will always be needed.

Here are some thoughts of mine where can we go from here:

  • Use transfer learning just like in the article, to pretrain your LSTM
  • Train word embeddings, on the data questions.
  • Create new data, from my data exploration (is not present in this article) there are some duplicate questions that appear twice against different questions, it is possible that using this we can create more data.
  • Create new data using data augmentation swapping words from the text with synonyms
  • Choose a different optimizer, I’ve read a lot that Adadelta doesn’t perform as good other methods when finely tuned.

The double edged sword of deep learning is that this list is infinite, I’ve put there a tiny bit of possible things that keep us within the MaLSTM architecture.

If you want to explore further and to get the best results possible, I advise you to look at the discussion about the competition — they achieved some really impressive results there using various models and ensembling.


The purpose of this post was to put into work a good article that implements some things that you don’t really see in tutorials and stuff, I hope this post and the code has taught you a thing or two. Happy coding :)