Manhattan LSTM model for text similarity

Gautam Karmakar
Mar 31, 2018 · 6 min read


A Brief Summary of "Siamese Recurrent Architectures for Learning Sentence Similarity" (Mueller & Thyagarajan, 2016):

One of the important tasks in language understanding and information retrieval is modelling the underlying semantic similarity between words, phrases, or sentences. The problem remains hard because labelled data is scarce and variable-length, structurally complex inputs are difficult to model. TF-IDF models dominated natural language processing for many years, but their inherent term-specificity limits their ability to capture context.

In 2013, Mikolov et al. showed how effectively the semantic meaning of words can be learned from the contexts in which they are used, following the famous quote, "You shall know a word by the company it keeps" (Firth, J. R. 1957:11). Mikolov's word2vec models (skip-gram and CBOW) proved the effectiveness of distributional representations of words in context learned with a neural network, and revolutionized our ability to capture the expressiveness of natural language. More recently, the field has progressed from that individual word-level understanding to sentence- or text-level understanding, representing each sentence as a fixed-length vector (Socher & Manning 2015).

The Long Short-Term Memory model (Hochreiter & Schmidhuber, 1997) has been particularly successful in language translation and text classification tasks. LSTM builds on the basic RNN but avoids one of its key limitations: RNNs struggle with long sequences because of vanishing gradients. An LSTM maintains a memory cell and uses gates to decide how much information should be forgotten and how much should flow through the time steps. In this way, useful information can be kept and unnecessary information can be dropped. Like RNNs, LSTMs are trained with backpropagation through time (BPTT).
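To make "encoding a sequence into a fixed-length vector" concrete, here is a toy Keras sketch (not from the original post): an LSTM reads padded sequences of word indices and returns one fixed-size hidden state per sequence.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM
import numpy as np

# Two toy sequences of word indices, zero-padded to length 5
seqs = np.array([[1, 4, 2, 0, 0],
                 [3, 3, 5, 1, 2]])

inp = Input(shape=(5,), dtype='int32')
x = Embedding(input_dim=10, output_dim=8)(inp)  # toy vocabulary of 10 words
h = LSTM(16)(x)                                 # final hidden state: a fixed 16-d vector
encoder = Model(inp, h)

print(encoder.predict(seqs).shape)  # (2, 16): one fixed-length vector per sequence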

The LSTM model and its variants, such as the Gated Recurrent Unit (GRU) of Cho et al. (2014), have shown that, when trained effectively, they can encode the meaning of a sentence into a fixed-length vector representation. The Siamese Neural Network for one-shot image recognition proposed by Koch et al. takes a different approach to classification: instead of training one model to classify image inputs, it trains two networks with shared weights that jointly learn a similarity between images. The Siamese LSTM extends the same idea to text data.

The original paper describes the model as a supervised learning setup in which the input is a pair of sentences with different sequence lengths, together with a label describing the underlying similarity of the pair. It shows that the algorithm produces a mapping from a general space of variable-length sequences into an interpretable, fixed-dimensional vector space.

One of the motivating tasks for this representation is scoring the similarity between different sentences based on how close they are in semantic meaning.

MaLSTM model: if you want to save time, you can jump straight to the code here: https://github.com/GKarmakar

Manhattan LSTM Model

In this model there are two identical LSTM networks. Each LSTM is passed the vector representation of one sentence and outputs a hidden state encoding the semantic meaning of that sentence. These hidden states are then compared using a similarity measure to output a similarity score.

The LSTM networks jointly learn a mapping from the space of variable-length sequences to a fixed-dimensional hidden-state representation. Similarities in that representation space are then used to infer the sentences' underlying semantic similarity. The similarity function used in this model is

g(h(a)_Ta, h(b)_Tb) = exp(−‖h(a)_Ta − h(b)_Tb‖₁) ∈ [0, 1],

where h(a)_Ta and h(b)_Tb are the final hidden states of the two LSTMs and ‖·‖₁ is the L1 (Manhattan) norm.
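As a quick numeric illustration of this formula (toy numbers of my own, not from the paper): identical hidden states give exp(0) = 1, and the score decays toward 0 as the L1 distance between the states grows.

import numpy as np

h_a = np.array([0.2, -0.5, 0.1])   # toy final hidden state for sentence a
h_b = np.array([0.1, -0.5, 0.4])   # toy final hidden state for sentence b

similarity = np.exp(-np.sum(np.abs(h_a - h_b)))  # exp(-L1 distance)
print(similarity)  # exp(-0.4) ≈ 0.67; would be 1.0 if h_a == h_b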

Sentence pair | G | S | M

A little girl is looking at a woman in costume.
A young girl is looking at a woman in costume. | 4.7 | 4.5 | 4.8

A person is performing tricks on a motorcycle.
The performer is tricking a person on a motorcycle. | 2.6 | 4.4 | 2.9

Someone is pouring ingredients into a pot.
A man is removing vegetables from a pot. | 2.4 | 3.6 | 2.5

Nobody is pouring ingredients into a pot.
Someone is pouring ingredients into a pot. | 3.5 | 4.2 | 3.7

G = ground-truth relatedness, S = Skip-Thought, M = MaLSTM (this model)

Let's get to the implementation of the MaLSTM model using Keras to find the distance between two text inputs:

# Load required libraries (re, numpy and time are needed by the code below)
import re
import numpy as np
from time import time

from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
# Note: Merge is part of the older Keras 1.x API; newer versions use a Lambda layer instead
from keras.layers import Input, Embedding, LSTM, Merge
import keras.backend as K
from keras.optimizers import Adadelta
from keras.callbacks import ModelCheckpoint
I wrote a routine to clean the text data:
def text_to_word_list(text):
    ''' Pre-process and convert texts to a list of words '''
    text = str(text)
    text = text.lower()

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    text = text.split()
    return text
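The embedding-preparation loop below relies on a few objects created earlier in the full script: the Quora question-pair dataframes, a pre-trained word2vec model, a stop-word list, and empty vocabulary containers. A minimal sketch of that setup, with placeholder file paths, might look like this:

import pandas as pd
import gensim
from nltk.corpus import stopwords

# Quora question pairs (placeholder paths)
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
questions_cols = ['question1', 'question2']

# Pre-trained Google News vectors, loaded as gensim 3.x KeyedVectors
# (the .vocab attribute used below was removed in gensim 4)
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

stops = set(stopwords.words('english'))

vocabulary = dict()              # word -> integer index
inverse_vocabulary = ['<unk>']   # index 0 is reserved for the zero padding vector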
# Prepare the embedding of the data - I am using the Quora question pairs dataset
for dataset in [train_df, test_df]:
    for index, row in dataset.iterrows():
        # Iterate through the text of both questions of the row
        for question in questions_cols:
            q2n = []  # q2n -> question as a list of word indices
            for word in text_to_word_list(row[question]):
                # Skip stop words that are not in the word2vec vocabulary
                if word in stops and word not in word2vec.vocab:
                    continue
                if word not in vocabulary:
                    vocabulary[word] = len(inverse_vocabulary)
                    q2n.append(len(inverse_vocabulary))
                    inverse_vocabulary.append(word)
                else:
                    q2n.append(vocabulary[word])
            # Replace the question text with its word-index representation
            dataset.set_value(index, question, q2n)
I am using 300 dimensions for the embedding, i.e. each word in the corpus is represented to the neural network model by a 300-dimensional vector.
embedding_dim = 300
embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)  # embedding matrix
embeddings[0] = 0  # index 0 is the padding token and will be ignored

# Build the embedding matrix
for word, index in vocabulary.items():
    if word in word2vec.vocab:
        embeddings[index] = word2vec.word_vec(word)
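Before building the model, the question lists also need to be split into training and validation sets and padded to a common length, and the hyperparameters used below (max_seq_length, n_hidden, batch_size, n_epoch, gradient_clipping_norm) have to be defined. Here is a minimal sketch with illustrative values, not necessarily the ones used in my actual run:

from sklearn.model_selection import train_test_split

# Hyperparameters: illustrative values, tune them for your own run
n_hidden = 50
batch_size = 64
n_epoch = 25
gradient_clipping_norm = 1.25

# The longest question (in words) across both columns sets the padding length
max_seq_length = max(train_df[q].map(len).max() for q in questions_cols)

# is_duplicate is the Quora label column (1 = same meaning, 0 = different)
X = train_df[questions_cols]
Y = train_df['is_duplicate']
X_train_raw, X_val_raw, Y_train, Y_validation = train_test_split(X, Y, test_size=0.1)

# Zero-pad every question to max_seq_length
X_train = {'left': pad_sequences(X_train_raw['question1'], maxlen=max_seq_length),
           'right': pad_sequences(X_train_raw['question2'], maxlen=max_seq_length)}
X_validation = {'left': pad_sequences(X_val_raw['question1'], maxlen=max_seq_length),
                'right': pad_sequences(X_val_raw['question2'], maxlen=max_seq_length)}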
Keras doesn't come with a Manhattan distance calculation, so we need to write a small routine to do that for us.
def exponent_neg_manhattan_distance(left, right):
    ''' Helper function for the similarity estimate of the LSTMs' outputs '''
    return K.exp(-K.sum(K.abs(left - right), axis=1, keepdims=True))
Let’s build the model now:
# The visible layer
left_input = Input(shape=(max_seq_length,), dtype='int32')
right_input = Input(shape=(max_seq_length,), dtype='int32')

embedding_layer = Embedding(len(embeddings), embedding_dim, weights=[embeddings],
                            input_length=max_seq_length, trainable=False)

# Embedded version of the inputs
encoded_left = embedding_layer(left_input)
encoded_right = embedding_layer(right_input)

# Since this is a siamese network, both sides share the same LSTM
shared_lstm = LSTM(n_hidden)
left_output = shared_lstm(encoded_left)
right_output = shared_lstm(encoded_right)

# Calculates the MaLSTM similarity from the two final hidden states
# (Merge with a lambda mode is Keras 1.x style; on Keras 2.x a Lambda layer does the same job)
malstm_distance = Merge(mode=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),
                        output_shape=lambda x: (x[0][0], 1))([left_output, right_output])

# Pack it all up into a model
malstm = Model([left_input, right_input], [malstm_distance])
We need to set an optimizer. I am using Adadelta, but any other popular optimizer such as RMSProp, Adam, or even SGD could be tested to see whether it increases accuracy or reduces training time by finding a better local minimum (yes, the global minimum is still an elusive goal).
# Adadelta optimizer, with gradient clipping by norm
optimizer = Adadelta(clipnorm=gradient_clipping_norm)
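If you want to experiment with a different optimizer, it is a drop-in swap; for example (not something used in the run described here):

from keras.optimizers import Adam
# optimizer = Adam(clipnorm=gradient_clipping_norm)  # alternative to Adadelta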
Now we will compile and train the model.
malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

# Start training (nb_epoch is the Keras 1.x argument name; newer versions call it epochs)
training_start_time = time()
malstm_trained = malstm.fit([X_train['left'], X_train['right']], Y_train,
                            batch_size=batch_size, nb_epoch=n_epoch,
                            validation_data=([X_validation['left'], X_validation['right']], Y_validation))
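Once training finishes, the model returns a similarity in [0, 1] for any pair of padded questions. A quick usage sketch (not part of the original training script):

# Predicted MaLSTM similarity for the first few validation pairs, each in [0, 1]
preds = malstm.predict([X_validation['left'], X_validation['right']])
print(preds[:5])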

That's all for now. I will share the code on my GitHub here: https://github.com/GKarmakar or https://gist.github.com/GKarmakar/3aa0c643ddb0688a9bfc44b43b84edd8
