Compute Document Similarity Using Autoencoder With Triplet Loss

Louis Kit Lung Law
Deep Learning HK
Feb 12, 2019

In this blog post, we will examine how to compute document similarity for news articles using a Denoising Autoencoder (DAE) combined with a triplet loss function. This approach is presented in Article De-duplication Using Distributed Representations, published by Yahoo! JAPAN, and my TensorFlow implementation can be found here.

Introduction

The most common way of computing document similarity is to transform documents into TFIDF vectors and then apply a similarity measure, e.g. cosine similarity, to these vectors.
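For example, with scikit-learn this baseline can be sketched roughly like this (toy documents for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy documents; a real corpus would contain thousands of articles.
documents = ["Stocks rally as markets rebound",
             "Shares climb after market recovery"]

tfidf = TfidfVectorizer().fit_transform(documents)        # sparse TFIDF matrix
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]  # cosine similarity in [0, 1]
print(similarity)
```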

However, this approach has two limitations:

  1. Every document must be transformed into a high-dimensional vector (~10k dimensions), and computing similarity between such vectors may be too slow under the tight time constraints of the news domain, even when the vectors are sparse.
  2. For news articles covering the same or a very similar event, this approach gives a high similarity value.
    But articles of the same category covering different events may end up with a similarity of zero, since they share no keywords.
    In other words, this approach does not preserve categorical similarity.

This is how Yahoo! JAPAN addresses these limitations:

  1. By using a DAE, we can compress the high-dimensional TFIDF vectors into low-dimensional embeddings.
  2. By using a triplet loss function, we can preserve categorical similarity between documents.

Now let's go through the DAE and the triplet loss separately.

Denoising Autoencoder

An autoencoder is a neural network trained in an unsupervised manner: an encoder compresses the input vector into an embedding, and a decoder tries to reconstruct the input vector from that compressed embedding.

Photo credit: http://deeplearning.stanford.edu/wiki/images/f/f9/Autoencoder636.png

In the diagram, layer L1 is the input vector, hidden layer L2 is the embedding, and layer L3 is the reconstructed input. The mapping from L1 to L2 is usually called the encoder, and the mapping from L2 to L3 the decoder.

The idea is that the autoencoder filters out noise and preserves only useful information during encoding, so that the decoder can reconstruct the input from the compressed representation.

A denoising autoencoder additionally corrupts the input stochastically by setting some percentage of the input elements to zero during the training phase. This makes the embeddings more robust to small, irrelevant changes in the input.

Below is the formulation of a traditional DAE:

Formulation of traditional denoising autoencoder
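Written out, the standard DAE of Vincent et al. (2008) can be sketched as follows, where x̃ is the corrupted input and W, b, W', b' are the encoder / decoder parameters:

```latex
\begin{aligned}
\tilde{x} &\sim q(\tilde{x} \mid x)        &&\text{stochastic corruption of the input} \\
h &= f(W\tilde{x} + b)                     &&\text{encoder: compressed embedding} \\
y &= f(W'h + b')                           &&\text{decoder: reconstruction} \\
\theta^{*} &= \arg\min_{W, W', b, b'} \sum_{x} L(y, x) &&\text{minimise the reconstruction loss}
\end{aligned}
```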

The loss function used is usually squared loss or cross-entropy.

Triplet Loss

Triplet loss was first introduced in FaceNet: A Unified Embedding for Face Recognition and Clustering by Google, where it was used to train face embeddings.

With triplet loss, three inputs are required during the training phase:

  1. Anchor
  2. Positive (an item with the same label as the anchor)
  3. Negative (an item with a different label from the anchor)

During training, the triplet loss tries to minimise the distance (maximise the similarity) between the anchor and the positive, while maximising the distance (minimising the similarity) between the anchor and the negative.
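For intuition, here is a minimal sketch of FaceNet's margin-based triplet loss in TensorFlow (the margin value is illustrative; the loss used later in this post has a different form):

```python
import tensorflow as tf

def facenet_triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (FaceNet style), shown for intuition only."""
    # Squared Euclidean distances between the embeddings.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Penalise triplets where the positive is not closer than the negative by `margin`.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```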

Euclidean distance between images’ embeddings

The example above is from FaceNet: images of the same person have a lower Euclidean distance between their embeddings. This shows that the learned embeddings preserve the distance / similarity structure of the original images.

Yahoo! JAPAN applied the same idea to news article embeddings to preserve categorical similarity, although the loss function differs from the one used by FaceNet.

Denoising Autoencoder + Triplet Loss

The approach presented in Article De-duplication Using Distributed Representations uses a DAE with triplet loss to generate embeddings that preserve categorical similarity. Here is how it is done.

A binary word-count vector is used as the input to the DAE, and for every news article (anchor) we need another news article with the same category label (positive) and one with a different category label (negative).

Notation
The input vectors of (anchor, positive, negative) are denoted as (x1, x2, x3).
The embeddings / hidden layers of (anchor, positive, negative) are denoted as (h1, h2, h3).

Loss function
To preserve categorical similarity, we want h1·h2 > h1·h3, since x1 is more similar to x2 than to x3. Hence the loss function of the DAE is updated as follows:

Triplet Loss function of autoencoder
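In the notation above, this can be written roughly as follows (my reconstruction: the softplus form of φ is an assumption consistent with the description below, and α weights the penalty against the reconstruction loss L):

```latex
\begin{aligned}
&(1)\quad h_{n} = f(W\tilde{x}_{n} + b) - f(b), \qquad n \in \{1, 2, 3\} \\
&(2)\quad \phi(h_{1}, h_{2}, h_{3}) = \log\!\bigl(1 + \exp(h_{1} \cdot h_{3} - h_{1} \cdot h_{2})\bigr) \\
&\qquad \text{objective:}\quad \min \sum_{(x_{1}, x_{2}, x_{3})} \Bigl[\, \sum_{n=1}^{3} L(y_{n}, x_{n}) + \alpha\, \phi(h_{1}, h_{2}, h_{3}) \Bigr]
\end{aligned}
```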

From (1), h satisfies the property that x = 0 implies h = f(b) - f(b) = 0, which means that an article with no available information is not similar to any other article.

From (2), φ is the penalty function for article similarity; minimising φ means making h1·h2 larger than h1·h3.

Note that the overall objective is not only to minimise φ but to minimise L and φ together, where L is the element-wise cross-entropy loss. Without L, the network may end up with zero weights so that h = 0 for all articles. Hence L is necessary to ensure that the embeddings h preserve useful information about the articles, while φ ensures that similar articles have similar embeddings.
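As a rough TensorFlow sketch (not the exact code from my repo), the combined loss could look like this, where recon1..3 are the decoder logits and h1..3 the embeddings of the anchor, positive and negative:

```python
import tensorflow as tf

def dae_triplet_loss(x1, x2, x3, recon1, recon2, recon3, h1, h2, h3, alpha=10.0):
    """Element-wise cross-entropy reconstruction loss plus the similarity penalty (a sketch)."""
    # Reconstruction term L: cross-entropy between each binary input and its decoder logits.
    recon_loss = 0.0
    for x, logits in ((x1, recon1), (x2, recon2), (x3, recon3)):
        recon_loss += tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits))
    # Penalty term: small when h1.h2 > h1.h3, i.e. the anchor is closer to the positive.
    penalty = tf.nn.softplus(
        tf.reduce_sum(h1 * h3, axis=-1) - tf.reduce_sum(h1 * h2, axis=-1))
    return recon_loss + alpha * tf.reduce_mean(penalty)
```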

After training, we can apply cosine similarity to the embeddings to check how similar two articles are.

Triplet Mining

As mentioned before, to train the DAE with triplet loss we need to prepare a set of triplets (anchor, positive, negative) as training data. So for every anchor we need to find a positive and a negative, which could be a lot of work if done manually.

This could actually be done automatically with triplet mining.

With triplet mining, we only need to provide the anchors and their labels to the DAE during training, and it will find all possible triplets by constructing a 3-dimensional matrix. A much more detailed explanation and TensorFlow implementation can be found here.
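As a sketch, the mask of valid triplets within a batch can be built like this (following the batch-all strategy described in the tutorial linked above; `labels` is assumed to be the 1-D tensor of category labels for the batch):

```python
import tensorflow as tf

def valid_triplet_mask(labels):
    """Boolean mask of shape (batch, batch, batch) where entry (a, p, n) is True
    iff a, p, n are distinct, label[a] == label[p] and label[a] != label[n]."""
    batch = tf.size(labels)
    idx = tf.range(batch)
    distinct = tf.logical_and(
        tf.not_equal(idx[:, None, None], idx[None, :, None]),                  # a != p
        tf.logical_and(tf.not_equal(idx[:, None, None], idx[None, None, :]),   # a != n
                       tf.not_equal(idx[None, :, None], idx[None, None, :])))  # p != n
    same = tf.equal(labels[:, None], labels[None, :])
    label_ok = tf.logical_and(same[:, :, None],                  # label[a] == label[p]
                              tf.logical_not(same[:, None, :]))  # label[a] != label[n]
    return tf.logical_and(distinct, label_ok)
```

The mask is then used to keep only the loss terms that correspond to valid (anchor, positive, negative) combinations in the batch.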

Note that triplet mining is not mentioned in the paper, but I found it saves a lot of effort compared with creating the triplets manually.

Evaluation

Let's test the performance of this approach on an open dataset.

Dataset
A dataset containing 10,348 news articles with category labels and story labels is used for evaluation.

The dataset is available at https://www.kaggle.com/louislung/uci-news-aggregator-dataset-with-content

Code
I have implemented this autoencoder using TensorFlow and you can find the implementation here on my GitHub.

Settings
The embeddings are trained with the following autoencoder settings:

  • Input vector = Binary vector corresponding to each word in the vocabulary (scikit-learn's CountVectorizer is used to prepare the input vectors; see the sketch after this list)
  • Dimension of Input layer = 10000, Embeddings layer = 50, Output layer = 10000
  • Optimization Algorithm = Adam
  • Corruption Function = Every element of input vector is masked (set to zero) randomly with probability 0.3
  • Alpha = 10 (hyper-parameter of the loss function)
  • Loss function of autoencoder = Cross-Entropy
  • Batch size = 100
  • Number of epochs = 200
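For reference, the input vectors and the corruption step can be prepared roughly like this (a sketch only; `train_texts` is assumed to be the list of article texts):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words input restricted to the 10000 most frequent words.
vectorizer = CountVectorizer(max_features=10000, binary=True)
X = vectorizer.fit_transform(train_texts).toarray().astype(np.float32)

def corrupt(x, drop_prob=0.3, seed=None):
    """Masking noise: set each element to zero independently with probability 0.3."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= drop_prob
    return x * mask
```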

The dataset is split into two parts: 5,000 articles for training and 5,348 for testing.

Results
We can evaluate the performance of the embeddings by checking the AUROC of the cosine similarity for similar articles (articles of the same category or the same story).

We use both the embeddings and the TFIDF vectors to calculate cosine similarity and plot the ROC curves below.
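The evaluation itself can be sketched as follows (cosine similarity as the score, "same label or not" as the binary target; `vectors` and `labels` are placeholders for either the embeddings or the TFIDF vectors and the NumPy array of category / story labels):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_auroc(vectors, labels):
    """AUROC of cosine similarity for predicting whether two articles share a label."""
    sim = cosine_similarity(vectors)                          # (n, n) similarity matrix
    same = (labels[:, None] == labels[None, :]).astype(int)   # 1 if the pair shares a label
    iu = np.triu_indices_from(sim, k=1)                       # each unordered pair once
    return roc_auc_score(same[iu], sim[iu])
```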

The trained embeddings are able to preserve categorical similarity between articles (higher AUROC is achieved on both the training and testing sets).

TFIDF, however, is still better at identifying similar articles at the story level, as the embeddings lose some information due to compression.

We can see that embeddings trained with this approach are highly compressed (dimensionality reduced from 10000 to 50) yet still able to preserve categorical similarity.

Wrapping Up

We have seen how to train embeddings that preserve categorical similarity using a variant of the denoising autoencoder obtained by introducing a triplet loss into the model.

This approach to finding similar articles may be preferred in the news domain, where there are tight time constraints (the latest news articles should be displayed to users within a short time).

Moreover, many tasks can be built on top of these embeddings once they exist for every news article:

  1. Content-based recommendation (Embedding-based News Recommendation for Millions of Users)
  2. Topic modelling
  3. Article grouping

Thank you for reading! If you have any feedback, please let me know in the comments section.

References

Extracting and composing robust features with denoising autoencoders; P. Vincent et al., 2008

Article De-duplication Using Distributed Representations; Akira Tajima et al., 2016

FaceNet: A Unified Embedding for Face Recognition and Clustering; Florian Schroff et al., 2015
