Using Siamese Network for Duplicate Detection in MxNet

Smruthi Mukund
8 min read · Aug 15, 2019

There are a ton of articles that talk about how to build a simple Siamese network for duplicate detection.

  1. https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-keras-17f34e75bb3d
  2. https://github.com/MahmoudWahdan/Siamese-Sentence-Similarity
  3. https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07

Most of these articles, however, come with supporting code in either Keras or TensorFlow. An MxNet-based workflow for Siamese networks is not well documented, especially for text. In this article I provide code to build a simple Siamese network in MxNet, trained and tested on the Quora Question Pairs dataset. I also show how MxNet can be run on GPU machines, which is not well documented either.

Siamese Network

Inspired by the winning team in the Kaggle competition for identifying similar sentence pairs in the Quora question bank, this work outlines a simple Siamese network model built in MxNet that finds similar sentence pairs for the purpose of deduplication.

What is considered a Siamese Network?

Any network that has two or more identical subnetworks is called a Siamese network. Such networks tend to perform well on similarity tasks such as sentence similarity and recognizing forged signatures.

Siamese networks are also used to perform what is called “One Shot learning”. The advantage of using a siamese network for one shot learning is that it helps to learn a similarity function using limited labeled data. There are many applications that can benefit from these kinds of models.

Data Set

To explain the Siamese network, I use the Quora data set released as part of the Kaggle data challenge. This data set consists of two text files: one for training and the other for testing. For each row, three columns matter here: question1, question2, and an indicator column (is_duplicate) that is 1 when question 1 is similar to question 2 and 0 otherwise.

Siamese Architecture

A simple Siamese architecture learns to compute a distance measure between two sentences in the embedding space. The architecture dictates that the weights of the intermediate hidden layers are shared between the two sentences that go into the network for learning. Any distance measure (Euclidean distance, cosine distance, Manhattan distance) can be used to learn the similarity threshold.
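
To make the three distance measures concrete, here is a minimal sketch in plain Python. The two vectors are made-up stand-ins for sentence encodings, and the function names are my own for illustration, not part of MxNet.

```python
import math

def manhattan(u, v):
    # Sum of absolute coordinate differences (L1 distance)
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    # Square root of the sum of squared differences (L2 distance)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # One minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 3-dimensional encodings of two sentences
left = [1.0, 0.0, 2.0]
right = [0.0, 0.0, 4.0]

print(manhattan(left, right))        # |1| + |0| + |-2| = 3.0
print(euclidean(left, right))        # sqrt(1 + 0 + 4)
print(cosine_distance(left, right))  # 1 - 8 / (sqrt(5) * 4)
```

The model later in this article uses the Manhattan distance, but any of these could be swapped in.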

The architecture of a simple Siamese network is as follows

The embedding layer, the LSTM layer and the Dense layer share the weights. Both sentences are passed through the same layers.
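
Weight sharing simply means that one set of parameters encodes both sentences. The toy sketch below illustrates the idea with a hypothetical encoder (an embedding table followed by mean pooling). It is not the LSTM model used in this article, only an illustration of the sharing.

```python
class ToyEncoder:
    """Hypothetical encoder: embedding lookup followed by mean pooling."""

    def __init__(self, embedding_table):
        # The single set of weights used for every input
        self.embedding_table = embedding_table

    def encode(self, token_ids):
        vectors = [self.embedding_table[i] for i in token_ids]
        dim = len(vectors[0])
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


class ToySiamese:
    def __init__(self, encoder):
        # One encoder object, applied to both the left and the right input
        self.encoder = encoder

    def forward(self, left_ids, right_ids):
        return self.encoder.encode(left_ids), self.encoder.encode(right_ids)


table = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
model = ToySiamese(ToyEncoder(table))

# The same words in a different order get the same encoding,
# because both sides go through the same weights.
left_vec, right_vec = model.forward([1, 2], [2, 1])
print(left_vec, right_vec)
```

Because there is only one encoder, any update to its weights during training changes how both sentences are encoded.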

Although the architecture itself is simple to implement, there are several preprocessing steps that the data needs to undergo before being fed as input to the network.

Preprocessing Steps

Outlined below are some basic steps taken to clean up the data. The method to generate word embeddings from pretrained vectors is also shown (the code below loads the GoogleNews word2vec vectors). These steps are by no means perfect. Several improvements can and should be made on top of them to improve model performance. One, for instance, is slotting the entities. Another is a better way to compute embeddings for unknown words.

Note: The code below is attributed to https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07

import re
import itertools

import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

# This function cleans the text of punctuation and expands the short forms
def text_to_word_list(text):
    ''' Pre process and convert texts to a list of words '''
    text = str(text)
    text = text.lower()
    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    text = text.split()
    return text

EMBEDDING_FILE = 'GoogleNews-vectors-negative300.bin.gz'
# load stopwords from nltk
stops = set(stopwords.words('english'))
# load word vectors using gensim
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
# Prepare embedding
vocabulary = dict()
inverse_vocabulary = ['<unk>']  # '<unk>' will never be used, it is only a placeholder for the [0, 0, ....0] embedding
questions_cols = ['question1', 'question2']
# train_df and test_df are assumed to be pandas DataFrames already loaded from the Kaggle training and test files
# Iterate over the questions only of both training and test datasets
for dataset in [train_df, test_df]:
    for index, row in dataset.iterrows():
        # Iterate through the text of both questions of the row
        for question in questions_cols:
            q2n = []  # q2n -> question numbers representation
            for word in text_to_word_list(row[question]):
                # Skip stop words that have no pretrained vector
                # (word2vec.vocab is the gensim < 4 API; gensim 4 renamed it to key_to_index)
                if word in stops and word not in word2vec.vocab:
                    continue
                if word not in vocabulary:
                    vocabulary[word] = len(inverse_vocabulary)
                    q2n.append(len(inverse_vocabulary))
                    inverse_vocabulary.append(word)
                else:
                    q2n.append(vocabulary[word])
            # Replace the question text with its number representation
            # (DataFrame.set_value was removed in pandas 1.0; .at is the replacement)
            dataset.at[index, question] = q2n
# Each token in a sentence now needs to map to its embedding, and the
# sequences need to be padded to a common length.
# Pad the sequences to maxlen:
# if a sentence is longer than maxlen, it is truncated;
# if it is shorter, it is padded with value 0 (the placeholder index whose embedding is all zeros)
def pad_sequences(sentences, maxlen=500, value=0):
    """
    Pads all sentences to the same length. The length is defined by maxlen.
    Returns the padded sentences as a 2-D numpy array.
    """
    padded_sentences = []
    for sen in sentences:
        if len(sen) > maxlen:
            new_sentence = list(sen[:maxlen])
        else:
            num_padding = maxlen - len(sen)
            new_sentence = list(sen) + [value] * num_padding
        padded_sentences.append(new_sentence)
    return np.array(padded_sentences)
# generate the embeddings of all the words in the vocabulary
embedding_dim = 300
embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)  # This will be the embedding matrix
embeddings[0] = 0  # So that the padding will be ignored
# Build the embedding matrix
for word, index in vocabulary.items():
    if word in word2vec.vocab:
        embeddings[index] = word2vec.word_vec(word)
# make sure you release the memory held by word2vec
del word2vec

# compute the maximum length of the sequences
max_seq_length = max(train_df.question1.map(lambda x: len(x)).max(),
                     train_df.question2.map(lambda x: len(x)).max(),
                     test_df.question1.map(lambda x: len(x)).max(),
                     test_df.question2.map(lambda x: len(x)).max())
print(max_seq_length)
# Split to train validation
validation_size = 80000
training_size = len(train_df) - validation_size
X = train_df[questions_cols]
Y = train_df['is_duplicate']
# split into training and validation
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size)
# Split to dicts
X_train = {'left': X_train.question1, 'right': X_train.question2}
X_validation = {'left': X_validation.question1, 'right': X_validation.question2}
X_test = {'left': test_df.question1, 'right': test_df.question2}
# Convert labels to their numpy representations
Y_train = Y_train.values
Y_validation = Y_validation.values
# Zero padding
for dataset, side in itertools.product([X_train, X_validation], ['left', 'right']):
    dataset[side] = pad_sequences(dataset[side], maxlen=max_seq_length)

Y_net_train = {'label' : Y_train}
Y_net_validation = {'label' : Y_validation}

After running the steps above, you have the training, validation and test data sets that feed our architecture: X_train, Y_net_train, X_test, X_validation, Y_net_validation.
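
To see what the indexing and padding steps produce, here is a small self-contained sketch on two made-up questions. The helper names `index_tokens` and `pad_or_truncate` are hypothetical, and the stop-word filtering from the real preprocessing is left out for brevity.

```python
def index_tokens(tokens, vocabulary, inverse_vocabulary):
    # Assign the next free index to every unseen word; index 0 stays reserved for padding
    ids = []
    for word in tokens:
        if word not in vocabulary:
            vocabulary[word] = len(inverse_vocabulary)
            inverse_vocabulary.append(word)
        ids.append(vocabulary[word])
    return ids

def pad_or_truncate(ids, maxlen, value=0):
    # Cut long sequences and right-pad short ones with the padding index
    return ids[:maxlen] + [value] * max(0, maxlen - len(ids))

vocabulary = {}
inverse_vocabulary = ['<unk>']  # placeholder so that real words start at index 1

q1 = index_tokens("how do i learn python".split(), vocabulary, inverse_vocabulary)
q2 = index_tokens("how can i learn python fast".split(), vocabulary, inverse_vocabulary)

print(pad_or_truncate(q1, 6))  # [1, 2, 3, 4, 5, 0]
print(pad_or_truncate(q2, 6))  # [1, 6, 3, 4, 5, 7]
print(pad_or_truncate(q2, 4))  # [1, 6, 3, 4]
```

Words shared by the two questions map to the same index, which is what lets both sides of the network look up the same embedding row.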

Training and Validation

The Siamese architecture that we are working with is constructed in MxNet as follows.

Note: the two embeddings share the same weights. The distance is computed in the forward pass.

import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn

class Siamese(gluon.Block):
    def __init__(self, input_dim, embedding_dim, **kwargs):
        super(Siamese, self).__init__(**kwargs)

        # Shared layers: both questions pass through the same embedding, LSTM and dense layer
        self.embedding = nn.Embedding(input_dim, embedding_dim)

        # layout='NTC' because the embedding output is (batch, seq_len, embedding_dim);
        # the Gluon default layout is 'TNC'
        self.encoder = gluon.rnn.LSTM(50,
                                      bidirectional=True, input_size=embedding_dim,
                                      layout='NTC')
        self.dropout = gluon.nn.Dropout(0.3)
        self.dense = gluon.nn.Dense(32, activation="relu")

    def forward(self, input0, input1):

        out0emb = self.embedding(input0)
        out0 = self.encoder(out0emb)

        out1emb = self.embedding(input1)
        out1 = self.encoder(out1emb)

        out0 = self.dense(self.dropout(out0))
        out1 = self.dense(self.dropout(out1))

        batchsize = out1.shape[0]

        xx = out0.reshape(batchsize, -1)
        yy = out1.reshape(batchsize, -1)
        # Similarity = exp(-Manhattan distance), plus a small constant
        manhattan_dis = nd.exp(-nd.sum(nd.abs(xx - yy), axis=1, keepdims=True)) + 0.0001
        return manhattan_dis
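
The value returned by the forward pass is a similarity rather than a raw distance: the exponential of the negative Manhattan distance lies between 0 and 1. A plain-Python sketch on made-up encodings shows the behaviour (the small constant added in the model is left out here):

```python
import math

def manhattan_similarity(x, y):
    # exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they move apart
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-l1)

same = manhattan_similarity([0.2, 0.5], [0.2, 0.5])
near = manhattan_similarity([0.2, 0.5], [0.9, 0.1])
far = manhattan_similarity([0.0, 0.0], [5.0, 5.0])

print(same)  # identical encodings
print(near)  # L1 distance of about 1.1
print(far)   # L1 distance of 10
```

Since the labels are 0 or 1, this output can be compared to them directly by a regression-style loss.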

So far, nothing in the architecture code indicates whether to train or evaluate on CPU or GPU. That preference is specified through the context when we define the model and initialize the network.

Note: The architecture itself is very simple; the challenge is to parallelize the learning across multiple GPUs. This process is not well documented for MxNet and required proofreading by folks from Alex Smola’s team (Sheng and Liu). The code below details the parallelization effort.

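from d2l import mxnet as d2l  # older d2l releases use: import d2l

# initialize the network
input_dim = len(embeddings)  # vocabulary size plus the padding row at index 0
net = Siamese(input_dim, embedding_dim)
ctx = d2l.try_all_gpus()
# check that you see all 8 GPUs if you are on a p2.8xlarge instance
print(ctx)
# initialize the network using the GPU context
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)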

The word embedding matrix needs to be made available on all GPUs. The code below shows how to set the context of the embedding matrix and make it available to the model. The last line sets grad_req to 'null' so that the pretrained embeddings are not updated during training.

gpuembeddings = (nd.array(embeddings)).as_in_context(mx.gpu())
#adding pretrained embeddings
net.embedding.weight.set_data(gpuembeddings)
net.embedding.collect_params().setattr('grad_req', 'null')

The loss function that has been shown to do well for this task is L2Loss. The optimizer I have chosen here is “adadelta”. One can experiment with other losses and optimizers.

trainer = gluon.Trainer(net.collect_params(), 'adadelta', {'clip_gradient': 1.25})
loss = gluon.loss.L2Loss()
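
As a quick worked example of what this loss measures: Gluon's L2Loss is, to my understanding, half the squared difference between prediction and label for each sample. The sketch below reproduces that arithmetic in plain Python on made-up similarity scores.

```python
def l2_loss(pred, label):
    # Half the squared error for a single sample
    return 0.5 * (pred - label) ** 2

# (predicted similarity, is_duplicate label) for three hypothetical pairs
pairs = [(0.9, 1.0), (0.2, 0.0), (0.1, 1.0)]

losses = [l2_loss(pred, label) for pred, label in pairs]
for (pred, label), value in zip(pairs, losses):
    print(pred, label, value)
```

A confident wrong prediction (0.1 for a duplicate pair) costs far more than the two nearly correct ones.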

Training process:

The training data that is sent to the net has to be distributed across all the available GPUs. This is accomplished by gluon’s split_and_load function.
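
The idea behind splitting a batch across devices can be sketched in plain Python. This is only an illustration of the slicing, not Gluon's implementation: `split_batch` is a hypothetical helper, and no data is actually copied to any device.

```python
def split_batch(batch, num_devices, even_split=True):
    # Cut one batch into num_devices contiguous slices, one per device
    size = len(batch)
    if even_split and size % num_devices != 0:
        raise ValueError(
            "batch of size %d cannot be split evenly across %d devices"
            % (size, num_devices))
    step = size // num_devices
    slices = [batch[i * step:(i + 1) * step] for i in range(num_devices - 1)]
    # The last slice takes whatever is left over
    slices.append(batch[(num_devices - 1) * step:])
    return slices

batch = list(range(1000))  # stand-in for a batch of 1000 question pairs
parts = split_batch(batch, 8)
print([len(p) for p in parts])  # 8 slices of 125 rows each

uneven = split_batch(list(range(10)), 4, even_split=False)
print([len(p) for p in uneven])  # [2, 2, 2, 4]
```

With a batch size of 1000 and 8 GPUs the split is exact, which is why an even split can be requested during training.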

The loss can also be accumulated with the d2l Accumulator class. For now, however, I simply add up the loss as each batch finishes.

def train_model(dataiter, epoch):

    train_loss = 0
    total_size = 0

    for i, batch in enumerate(dataiter):

        # Split each batch evenly across the available GPUs
        data_list1 = gluon.utils.split_and_load(batch.data[0], ctx, even_split=True)
        data_list2 = gluon.utils.split_and_load(batch.data[1], ctx, even_split=True)
        label_list = gluon.utils.split_and_load(batch.label[0], ctx, even_split=True)

        with autograd.record():  # Start recording the derivatives

            losses = [loss(net(X1, X2), Y) for X1, X2, Y in zip(data_list1, data_list2, label_list)]

        for l in losses:
            l.backward()
        trainer.step(batch.data[0].shape[0])
        total_size += batch.data[0].shape[0]
        train_loss += sum([l.sum().asscalar() for l in losses])

    nd.waitall()

    return train_loss / total_size

For every epoch, we validate the trained net against the validation set.

def validate_model(valdataiter):
    test_loss = 0.
    total_size = 0

    for batch in valdataiter:
        # Do forward pass on a batch of validation data

        data_list1 = gluon.utils.split_and_load(batch.data[0], ctx, even_split=False)
        data_list2 = gluon.utils.split_and_load(batch.data[1], ctx, even_split=False)
        labels = gluon.utils.split_and_load(batch.label[0], ctx, even_split=False)

        pys = [loss(net(X1, X2), Y) for X1, X2, Y in zip(data_list1, data_list2, labels)]
        test_loss += sum([l.sum().asscalar() for l in pys])

        total_size += batch.data[0].shape[0]

    return test_loss / total_size

Putting all of this together, the training and validation process looks like this:

training_loss = []
validation_loss = []
BATCH_SIZE = 1000
LEARNING_R = 0.001
EPOCHS = 10
THRESHOLD = 0.5
dataiter = mx.io.NDArrayIter(X_train, Y_net_train, BATCH_SIZE, True, last_batch_handle='discard')
valdataiter = mx.io.NDArrayIter(X_validation, Y_net_validation, BATCH_SIZE, True, last_batch_handle='discard')
animator = d2l.Animator('epoch', legend=['train loss','validation loss'], xlim=[1, EPOCHS])
accuracy_lst = []
timer = d2l.Timer()
for epoch in range(EPOCHS):
    timer.start()
    dataiter.reset()
    valdataiter.reset()

    train_loss = train_model(dataiter, epoch)
    timer.stop()

    animator.add(epoch + 1, (train_loss, validate_model(valdataiter)))
    print('train loss: %.2f, %.1f sec/epoch on %s' % (
        animator.Y[0][-1], timer.avg(), ctx))

Plot of training loss Vs validation loss

ROC curve over the validation set
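
Each point on a ROC curve comes from turning the similarity scores into duplicate / not-duplicate decisions at one threshold. The sketch below shows that computation in plain Python. The scores and labels are made up for illustration and are not results from this model.

```python
def confusion_at(scores, labels, threshold):
    # Count true/false positives and negatives when score >= threshold means "duplicate"
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_point(scores, labels, threshold):
    # Returns (false positive rate, true positive rate) at the given threshold
    tp, fp, tn, fn = confusion_at(scores, labels, threshold)
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical similarity scores and their is_duplicate labels
scores = [0.92, 0.81, 0.40, 0.65, 0.10, 0.30]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, roc_point(scores, labels, threshold))
```

Sweeping the threshold from 0 to 1 and plotting these pairs traces out the curve. A single value such as 0.5 picks one operating point on it.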

Stats:

Training data:
204544 are not duplicates
119746 are duplicates

Validation data:
50483 are not duplicates
29517 are duplicates
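
These counts also give the class balance, which is useful context when reading the loss and ROC plots. A short calculation using only the numbers above:

```python
train_counts = {"not_duplicate": 204544, "duplicate": 119746}
val_counts = {"not_duplicate": 50483, "duplicate": 29517}

def duplicate_fraction(counts):
    # Share of pairs labelled as duplicates
    total = counts["not_duplicate"] + counts["duplicate"]
    return counts["duplicate"] / total

train_fraction = duplicate_fraction(train_counts)
val_fraction = duplicate_fraction(val_counts)

print(round(train_fraction, 3))  # 0.369
print(round(val_fraction, 3))    # 0.369
```

Roughly 37% of the pairs are duplicates in both splits, so a model that always answers "not duplicate" would already be right about 63% of the time.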

Appendix

The model was run on a p2.8xlarge EC2 instance, which has 8 GPUs.
