Semantic Similarity in Sentences and BERT

Siddharth Narayanan
Published in Analytics Vidhya · 9 min read · Sep 24, 2019

Bidirectional Encoder Representations from Transformers, or BERT, has been a popular technique in NLP since Google open-sourced it in 2018. With minimal task-specific fine-tuning, researchers have been able to surpass multiple benchmarks by leveraging pre-trained models that can easily be adapted to produce state-of-the-art results. This article consolidates prevalent knowledge on sentence similarity and covers the salient aspects of BERT in this context, along with its improvements over previous techniques.

The Data Problem

One of the biggest challenges in natural language processing (NLP) continues to be a shortage of training data. NLP models built from scratch often need prohibitively large datasets to train their underlying neural networks to reasonable accuracy. This is not always viable, given how time- and effort-intensive dataset creation can be, not to mention the simple lack of available data for specific applications.

Given this scarcity of data and diversity in applications of language, most task-specific datasets contain relatively small (10^3 - 10^5 records) sets of human-labeled training examples. Deep learning models, however, require much larger amounts of annotated training examples (10^6 - 10^9 records) to attain useful levels of accuracy.

To close this gap, techniques were devised for pre-training general-purpose language representation models on enormous amounts of unannotated text (the Wikipedia corpus in the case of BERT). The pre-trained models can then be fine-tuned for specific NLP tasks such as question answering, sentiment analysis and sequence similarity.

Transfer Learning & Fine Tuning

How do these models fare in comparison to corpus-specific CNN or BiLSTM models? By leveraging transfer learning, fine-tuned BERT models have been shown to perform at least as well as (and often better than) bespoke implementations of specific tasks, which at times rely on obscure architectures.

The introduction of deep pre-trained language models in 2018 (ELMo, BERT, ULMFiT, OpenAI GPT, etc.) signals the same shift to transfer learning in NLP that occurred previously in computer vision, where researchers were tackling millions of parameters and computationally expensive training tasks to create accurate deep learning networks.

Computer vision researchers discovered that deep networks learn hierarchical feature representations (simple features like edges at the lowest layers, with gradually more complex features at higher layers). Rather than training a new network from scratch each time, the lower layers of a trained network, which hold generalized image features, could be copied and transferred for use in another network with a different task. It soon became common practice to download a pre-trained deep network and quickly retrain it, or add additional layers on top, for the new task.

BERT Fine-tuning

In the case of BERT, having been trained on the Wiki corpus, the pre-trained model weights already encode a lot of information about our language. Consequently, it takes much less time to train a fine-tuned model. The authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch). This is vastly preferable to the expensive process of training a network from scratch.

As mentioned before, an equally important consideration is the availability of data. The fine-tuning process requires a much smaller dataset than the one required to build a model from scratch and yields good performance despite the smaller amount of training data.

Because the pre-trained BERT layers already encode a lot of language information, training the classifier is relatively inexpensive. We can focus on training the top layer(s), since the bottom layers now only need a few tweaks to accommodate our task. We may even choose to freeze certain layers when fine-tuning, or apply different or gradually decaying learning rates, to preserve the good-quality weights the network has already learned and to speed up training.

Recent research on BERT has demonstrated that freezing a majority of the weights results in only a minimal drop in accuracy. There are, however, exceptions and broader rules of transfer learning to consider, such as how similar the target task and fine-tuning data are to the task and data used to pre-train the model: freezing the weights may not be the best approach if the fine-tuning task and dataset are very different from those used to train the transfer-learning model.
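As a concrete illustration, the sketch below freezes the embeddings and the lower encoder layers of a pre-trained BERT classifier and assigns a smaller learning rate to the remaining BERT layers than to the fresh classification head. It uses the Hugging Face transformers library and PyTorch, which this article does not otherwise reference; the checkpoint name, the 8/4 layer split and the learning rates are illustrative assumptions, not recommendations.

# Sketch: freeze lower BERT layers and use discriminative learning rates for fine-tuning.
from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the lower 8 of the 12 encoder layers;
# the top layers, pooler and classification head stay trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# Smaller learning rate for the remaining pre-trained layers than for the new head.
optimizer = AdamW([
    {"params": model.bert.encoder.layer[8:].parameters(), "lr": 2e-5},
    {"params": model.bert.pooler.parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 1e-4},
])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing: {trainable:,}")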

The task at hand: Semantic Similarity between Sentences

Many NLP applications need to compute the similarity in meaning between two short texts. Modern search engines compute the relevance of a document to a query, and not just the simple overlap in words between the two. Popular question-answer sites need to compute the likeness of a newly asked query to previously asked questions (and even answers). This is accomplished by creating useful embeddings of the short texts and calculating the cosine similarity between them.

Word2vec and GloVe produce word embeddings in a similar fashion and have become popular models for finding the semantic similarity between two words. Sentences, however, inherently contain more information, with relationships between multiple words, and no single approach to computing sentence embeddings has yet emerged.

The most common method of estimating a baseline semantic similarity between a pair of sentences is to average the word embeddings of all words in the two sentences and calculate the cosine similarity between the resulting embeddings (a short sketch of this baseline appears after the list below). This simple baseline can be improved by, for example, ignoring stopwords or computing averages weighted by TF-IDF. Alternatively, techniques such as Word Mover’s Distance (WMD) and Smooth Inverse Frequency (SIF) can be employed instead of the baseline approach for better accuracy. However, all these methods share two important characteristics:

  1. They do not take word order into account, as they are based on the bag-of-words approach. This is a major drawback, since differences in word order can completely alter the meaning of a sentence, and sentence embeddings must capture these variations. Capturing references, relationships and the sequence of words in sentences is vital for a machine to understand natural language.
  2. Their word embeddings have been learned in an unsupervised manner. While this isn’t a significant problem in itself, it has been observed that supervised training can help sentence embeddings learn the meaning of a sentence more directly.
Evolution of sentence similarity models
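A minimal sketch of the averaged-embedding baseline described above is shown here. It uses gensim’s downloader with pre-trained GloVe vectors; the model name, the naive whitespace tokenization and the example sentences are illustrative assumptions rather than part of the article.

# Baseline: average pre-trained GloVe word vectors and compare with cosine similarity.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

def sentence_embedding(sentence):
    # Average the vectors of all in-vocabulary tokens (naive whitespace tokenization).
    vectors = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_embedding("the goods train arrived late")
s2 = sentence_embedding("the freight train was delayed")
s3 = sentence_embedding("we will train the model tomorrow")
print(cosine(s1, s2))  # similar sentences score higher
print(cosine(s1, s3))  # "train" means something else here, but word order is ignored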

Pre-trained Sentence Encoders

Pre-trained sentence encoders aim to play the same role for sentences that word2vec and GloVe play for words. They are trained on a range of supervised and unsupervised tasks to capture as much universal semantic information as possible, and their embeddings can be used in a variety of applications, such as text classification and paraphrase detection. Google’s Universal Sentence Encoder is a good example; it is available in a simpler version that uses a Deep Averaging Network (DAN), where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network.
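As an illustration, the sketch below embeds two sentences with the publicly published Universal Sentence Encoder module on TensorFlow Hub and scores their similarity; the example sentences and the use of the inner product as the similarity score are assumptions made for the sketch.

# Sentence similarity with the Universal Sentence Encoder from TensorFlow Hub.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["How old are you?", "What is your age?"]).numpy()

# USE vectors are approximately unit length, so the inner product works as a similarity score.
print(np.inner(vectors[0], vectors[1]))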

Recurrent Neural Network (RNN) based sequence-to-sequence models have also been popular and widely applicable, since a significant amount of real-world data exists in the form of sequences (numbers, text, video frames, audio, etc.). Their performance was further improved with the introduction of the attention mechanism.

However, despite these advancements, some challenges persist. The most notable are dealing with long-range dependencies and the sequential nature of the architecture, which resists parallelization. The Transformer was introduced to address these problems by replacing recurrence with attention, which can be computed in parallel.

BERT

Bidirectional Representations: BERT builds upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).

Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary and do not capture polysemy.

For example, the word “train” would have the same context-free representation in “goods train” and “train the model”. Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “we will train the model”, a unidirectional contextual model would represent “train” based on “we will” but not “the model”. BERT, however, represents “train” using both its previous and next context, “we will … the model”, starting from the very bottom of a deep neural network, making it deeply bidirectional.
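To make this concrete, the sketch below extracts the vector for “train” from two different sentences with a pre-trained BERT model. It relies on the Hugging Face transformers library and PyTorch, which the article does not reference; the checkpoint name and the example sentences are illustrative assumptions.

# Show that BERT assigns different vectors to the same word in different contexts.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def vector_for(word, sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("train", "the goods train arrived on time")
v2 = vector_for("train", "we will train the model tomorrow")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: context changes the vector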

Transformer Encoder: The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The concept of self-attention allows the model to look at other words in the input sequence to get a better understanding of a certain word in the sequence. Additionally, self-attention is computed not once but multiple times in the transformer’s architecture, in parallel and independently. It is therefore referred to as Multi-head Attention.
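The toy sketch below computes scaled dot-product self-attention in plain numpy and runs two independent heads, purely to illustrate the mechanism described above; the dimensions and random weights are arbitrary.

# Toy scaled dot-product self-attention with two independent heads (multi-head attention).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token attends to every other token
    return softmax(scores) @ V               # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # 5 tokens, toy model width 16
heads = [self_attention(X, *rng.normal(size=(3, 16, 8))) for _ in range(2)]
out = np.concatenate(heads, axis=-1)         # run heads independently, then concatenate
print(out.shape)                             # (5, 16)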

An important note here is that BERT is not trained for semantic sentence similarity directly, the way the Universal Sentence Encoder or InferSent models are. Therefore, raw BERT embeddings cannot simply be compared with cosine distance to measure similarity. However, there are easy wrapper services and implementations, like the popular bert-as-service, that can be used to that effect.

Simple implementation: bert-as-service

This is a simple example of the popular bert-as-service. The BERT_test.py file is a simple modification of example8.py.

1. Install the required server and client

pip install bert-serving-server  # server
pip install bert-serving-client # client, independent of `bert-serving-server`

2. In a new console window, start the BERT service. Note that you will have to choose the correct path and pre-trained model name for BERT

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4

3. Switch back to the other console window and run your test script

python BERT_test.py
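The article does not reproduce BERT_test.py itself, but a minimal client script in the same spirit might look like the sketch below; the example sentences are illustrative, and the server is assumed to be running as started in step 2.

# BERT_test.py (sketch): get sentence embeddings from the running server and compare them.
import numpy as np
from bert_serving.client import BertClient

bc = BertClient()  # connects to the server started with `bert-serving-start`

sentences = [
    "the goods train arrived late",
    "the freight train was delayed",
    "we will train the model tomorrow",
]
vecs = bc.encode(sentences)  # shape (3, 768) for the base uncased model

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # similar sentences -> higher score
print(cosine(vecs[0], vecs[2]))  # different meaning -> lower score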

Transformer-XL

While the Transformer architecture is a huge improvement over existing NN-based seq2seq models due to its ability to learn longer-term dependencies, it comes with its own challenges. Transformers cannot stretch beyond a certain point because they use a fixed-length context (input text segments). Attention can only deal with fixed-length text strings and requires that the text be split into a certain number of segments before being fed in as input. This chunking causes context fragmentation when the text is split without respecting sentence or other semantic boundaries; if a sentence is split down the middle, a significant amount of context may be lost.

The Transformer-XL architecture was proposed to overcome this shortcoming: it takes the hidden states obtained for previous segments and reuses them as a source of information for the current segment. This enables modeling of longer-term dependencies, as information can flow from one segment to the next. The latest models, such as XLNet, build on current state-of-the-art methods by using Transformer-XL as their base architecture.
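The sketch below illustrates this segment-level recurrence using the Transformer-XL implementation shipped in (older versions of) the Hugging Face transformers library, which the article does not reference; the checkpoint name and the two example segments are assumptions made for illustration.

# Reuse hidden states ("mems") from one text segment when processing the next.
import torch
from transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLModel.from_pretrained("transfo-xl-wt103")
model.eval()

segment1 = tokenizer("The quick brown fox jumps over", return_tensors="pt")["input_ids"]
segment2 = tokenizer("the lazy dog near the river bank", return_tensors="pt")["input_ids"]

with torch.no_grad():
    out1 = model(segment1)                  # first segment, no memory yet
    out2 = model(segment2, mems=out1.mems)  # cached hidden states carry context across the boundary

print(len(out2.mems), out2.last_hidden_state.shape)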

Caveats

Sentence similarity is a relatively complex phenomenon in comparison to word similarity since the meaning of a sentence not only depends on the words in it, but also on the way they are combined. Semantic similarity can have several dimensions, and sentences may be similar in one but opposite in the other.

Like all models, BERT is not a perfect solution that fits all problem areas, and multiple models may need to be evaluated for performance depending on the task. For example, if there is a population of words in a specific domain that significantly alters or drives the overall meaning of the text, then word-based models like GloVe may capture these nuances better than BERT.

Similarly, models such as Google’s USE (Universal Sentence Encoder), which were designed specifically for sentence embeddings and similarity, may in some instances give better results than BERT. It is therefore advisable to run preliminary tests on sample data and empirically evaluate model performance to determine the direction and type of implementation required.

References

[1] https://mccormickml.com/2019/07/22/BERT-fine-tuning/

[2] https://yashuseth.blog/author/yashuseth/

[3] http://nlp.town/blog/sentence-similarity/

[4] https://github.com/hanxiao/bert-as-service
