Siamese BERT-Networks

Richer Sentence Embeddings using Sentence-BERT — Part I

Using naive sentence embeddings from BERT or other transformer models leads to underperformance. To get around this, we can fine-tune BERT in a siamese fashion. The result is a rapid generation of rich sentence embeddings.

Laksh Aithani
Jan 1 · 7 min read

BERT revolutionized the field of NLP by achieving state-of-the-art results on several NLP benchmarks.

In many cases, it even surpassed human performance.

Fast-forward one year, and several improved variants of BERT have appeared, with new ones being released by large tech companies seemingly every month.

We will first briefly review BERT (a more in-depth review is ), and then explain how to efficiently generate rich sentence embeddings using BERT.

Specifically, we will discuss a recent paper from UKP (Ubiquitous Knowledge Processing Lab): Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [9]

In part II of this post, we will implement an MVP of this strategy in PyTorch.

1. Preliminaries: BERT is trained to give rich word embeddings

BERT is very good at generating word embeddings (word vectors) that are rich in semantics and depend heavily on context.

The sentences “I ate an apple” and “Apple acquired a startup” will have completely different word embeddings for “apple” generated by BERT, due to the context of the words.

Older systems like Word2vec [10] and GloVe [11] had poorer performance because their word embeddings didn’t change dynamically based on the surrounding words. In other words, each word had one fixed vector regardless of context.

BERT is trained using a denoising objective (masked language modeling), where it aims to reconstruct a noisy version of a sentence back into its original version. The concept is similar to .

BERT reconstructs partially masked sentences using the context of the surrounding sentence (Masked language modeling)
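As a rough illustration of the "noisy version" idea, the sketch below replaces a random subset of tokens with a [MASK] placeholder and records the originals the model would have to predict. It is a deliberately simplified assumption-laden sketch: the helper name `mask_tokens` is hypothetical, the 15% default follows the masking rate reported in the BERT paper, and the real recipe also sometimes swaps in a random token or leaves the chosen token unchanged.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (noisy_tokens, targets).

    targets maps each masked position to the original token at that
    position, i.e. what the model is asked to reconstruct.
    """
    rng = random.Random(seed)
    noisy, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            noisy.append(MASK_TOKEN)   # hide the token from the model
            targets[position] = token  # remember the answer
        else:
            noisy.append(token)        # leave the token visible
    return noisy, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
# A higher rate than 15% is used here only so this short sentence is likely to get masks.
noisy, targets = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(noisy)
print(targets)
```

During training, the loss would only be computed at the positions listed in `targets`.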

The original BERT also uses a next-sentence prediction objective, but it was shown in the RoBERTa paper [8] that this training objective doesn’t help that much.

In this way, BERT is trained on gigabytes of data from various sources (e.g., much of Wikipedia) in an unsupervised fashion.

2. (Old) Sentence Embedding Methods are not Rich

For many NLP tasks, we need sentence embeddings. This includes, but is not limited to, , within documents and .

At Genei, we make use of sentence embeddings to cluster sentences in documents, which aids in the automatic extraction of key information from large bodies of text.

Two main methods for generating sentence embeddings from BERT are given below:

2a. Averaging Method

The most common BERT-based method generates a sentence embedding by simply averaging the word embeddings of all words in the sentence:

We average the word embeddings in a sentence to get the sentence embedding
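The averaging step can be sketched in a few lines of plain Python. The token vectors here are made-up 3-dimensional toy values standing in for BERT's output, and `mean_pool` is a hypothetical helper name, not a library function.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors, dimension by dimension."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


# Toy stand-ins for the contextual word embeddings of a 3-token sentence.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]

sentence_embedding = mean_pool(token_vectors)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

In a batched setting, padding tokens would normally be excluded from the average (for example by using the attention mask); that detail is omitted here.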

2b. Using the [CLS] vector as the sentence embedding

Alternatively, we can use the embedding for the [CLS] special token that appears at the start of the sentence.

The [CLS] token (shown in orange) is used as a sentence embedding in a paper that uses BERT for extractive summarization [12]
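In code, the [CLS] approach amounts to picking out one vector rather than combining all of them. The sketch below assumes the token vectors are ordered like the input tokens, with [CLS] at position 0; the values are toy numbers and `cls_pool` is a hypothetical helper name.

```python
def cls_pool(token_vectors):
    """Return a copy of the vector at position 0, where the [CLS] token sits."""
    return list(token_vectors[0])


tokens = ["[CLS]", "i", "ate", "an", "apple", "[SEP]"]
# One toy 2-dimensional vector per token, in the same order as `tokens`.
token_vectors = [
    [0.5, -0.5],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.0, 0.1],
    [0.9, 0.7],
    [0.2, 0.2],
]

sentence_embedding = cls_pool(token_vectors)
print(sentence_embedding)  # [0.5, -0.5]
```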

It turns out that the sentence embeddings generated by these methods aren’t that good.

UKP researchers [9] showed that on textual similarity (STS) tasks, using either the averaging or [CLS] method for sentence embeddings using BERT gives poor results. Even GloVe vectors [11] significantly outperform naive BERT sentence embeddings.

Results from the Sentence-BERT paper [9]

3. SentenceBERT: Fine-tuning BERT to give good Sentence Embeddings

The idea is to fine-tune BERT sentence embeddings on a dataset that rewards models which generate sentence embeddings with the following property:

When the cosine similarity of the pair of sentence embeddings is computed, we want it to accurately represent the semantic similarity of the two sentences.

If we obtain a model that does this, we can generate the embedding for each sentence once (each forward pass through BERT is computationally expensive) and then compute a cosine similarity for each pair (rapid and cheap). The expensive step, the BERT forward pass, therefore scales as O(n) in the number of sentences.

This is orders of magnitude better than passing every pair of sentences through BERT together. That pairwise approach is the current state of the art, but it is very computationally expensive and scales as O(n²).
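To make the gap concrete, the small calculation below counts expensive BERT forward passes under each strategy for a collection of 10,000 sentences (an illustrative size). The function names are hypothetical and the counts ignore batching.

```python
def passes_when_embedding_once(n_sentences):
    """One forward pass per sentence; pairs are then compared with cheap cosine similarity."""
    return n_sentences


def passes_when_scoring_pairs(n_sentences):
    """One forward pass per unordered pair of sentences: n * (n - 1) / 2."""
    return n_sentences * (n_sentences - 1) // 2


n = 10_000
print(passes_when_embedding_once(n))  # 10000
print(passes_when_scoring_pairs(n))   # 49995000
```

The embedding approach still performs one cosine similarity per pair, but each of those is a handful of multiplications rather than a full pass through a transformer.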

3a. (Pre)-Training Strategy: Siamese Neural Network

The general idea introduced in [9] is to pass 2 sentences through BERT, in a siamese fashion. A good diagrammatic summary is below:

SentenceBERT employs Siamese BERT networks to pretrain on the SNLI dataset

The idea is simple enough to state. We obtain sentence embeddings u and v for a pair of sentences, then concatenate them with their element-wise absolute difference to form (u, v, |u−v|). This vector is multiplied by a trainable weight matrix W ∈ ℝ^(3N × K), where N is the sentence embedding dimension and K is the number of labels, and a softmax over the result gives the predicted label.
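The classification step can be sketched in plain Python as follows. This is a toy illustration, not the paper's code: the helper names are hypothetical, the difference term is taken element-wise (which is what the 3N × K shape of W implies), the softmax follows the objective described in [9], and the weights are fixed toy values rather than trained ones.

```python
import math


def concat_features(u, v):
    """Build the 3N-dimensional vector (u, v, |u - v|) with an element-wise difference."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(u, v, W):
    """W has 3N rows and K columns; returns K label probabilities."""
    features = concat_features(u, v)
    num_labels = len(W[0])
    logits = [
        sum(features[i] * W[i][k] for i in range(len(features)))
        for k in range(num_labels)
    ]
    return softmax(logits)


# Toy example with N = 2 and K = 3 (e.g. entailment, contradiction, neutral).
u = [1.0, 2.0]
v = [0.0, 4.0]
W = [[0.1, 0.1, 0.1] for _ in range(6)]  # untrained placeholder weights

print(concat_features(u, v))  # [1.0, 2.0, 0.0, 4.0, 1.0, 2.0]
print(classify(u, v, W))
```

During training, a cross-entropy loss on these probabilities would update both W and the shared BERT weights.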

3b. Ablation study for pooling and concatenation strategies

The pooling operation is flexible, although the researchers found that a mean aggregation worked best (compared to a max or CLS aggregation strategy).

Several concatenation strategies were tried as well; (u, v, |u−v|), with an element-wise absolute difference, worked the best.

Ablation results from the paper are shown below:

Ablation studies for different pooling and concatenation strategies

3c. Inference at test time

At inference, we compute an embedding for each sentence and then take the cosine similarity of each pair of embeddings as the semantic textual similarity of the corresponding sentences:

Inference for SBERT on STS benchmarks
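The comparison step is just a cosine similarity between two vectors. Here is a minimal sketch in plain Python, with toy 2-dimensional vectors standing in for real sentence embeddings; in practice one would likely use a vectorized library routine instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy embeddings: computed once per sentence, then reused for every pair.
embeddings = {
    "sentence_a": [1.0, 0.0],
    "sentence_b": [2.0, 0.0],  # same direction as sentence_a
    "sentence_c": [0.0, 3.0],  # orthogonal to sentence_a
}

print(cosine_similarity(embeddings["sentence_a"], embeddings["sentence_b"]))  # 1.0
print(cosine_similarity(embeddings["sentence_a"], embeddings["sentence_c"]))  # 0.0
```

Because cosine similarity ignores vector length, sentence_a and sentence_b score as identical even though their magnitudes differ.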

Interestingly enough, training on the SNLI dataset (as in fig 1) but doing inference on the STS datasets results in pretty good metrics, even though no STS-specific training has been done.

3d. Quick Look at the Datasets: STS Benchmarks and SNLI dataset

The output of the siamese network was trained to match the labels of a group of labeled datasets: the STS benchmarks [13]. These datasets provide labels from 0 to 5 for the semantic relatedness of a pair of sentences:

STS training example

The SNLI (Stanford Natural Language Inference) dataset contains 570k human-written English sentence pairs manually labeled (by ) for balanced classification with the labels: entailment, contradiction, neutral.

Training examples from SNLI dataset

3e. Final Results of the Paper

We’ll quickly take a look at the final results the paper obtains:

Final results…

Clearly, fine-tuning on both NLI + STS results in the best models. Interestingly enough, using RoBERTa [8] doesn’t seem to help that much over BERT…

Finally, note the improvement we get over using the average BERT embeddings (line 2 of the table). The result is a step change in performance.

4. Looking Forward — Part II

In part II of this blog post, we’ll look at an implementation of the Siamese BERT Network in PyTorch!


Genei is an Ed-tech startup working on improving the productivity of students and academics by harnessing the power of NLP.


[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT

[2] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

[3] John Pavlus. Machines Beat Humans on a Reading Test. But Do They Understand? Quanta Magazine

[4] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

[5] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

[6] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

[7] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

[8] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar S. Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019.

[9] Reimers, N., and Gurevych, I. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, EMNLP.

[10] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. Accepted to NIPS 2013.

[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.

[12] Yang Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318, 2019.

[13] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

