Richer Sentence Embeddings using Sentence-BERT — Part I
Using naive sentence embeddings from BERT or other transformer models leads to underwhelming performance. To get around this, we can fine-tune BERT in a siamese fashion. The result is the rapid generation of rich sentence embeddings.
In many cases, it has even surpassed human performance.
We will first briefly review BERT (a more in-depth review is here), and then explain how to efficiently generate rich sentence embeddings using BERT.
Specifically, we will discuss a recent paper from UKP (Ubiquitous Knowledge Processing Lab): Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
In part II of this post, we will implement an MVP of this strategy in PyTorch.
1. Preliminaries: BERT is trained to give rich word embeddings
BERT is very good at generating word embeddings (word vectors) that are rich in semantics and depend heavily on context.
The sentences “I ate an apple” and “Apple acquired a startup” will have completely different word embeddings for “apple” generated by BERT, due to the context of the words.
Older systems like Word2vec and GloVe performed worse because their word embeddings didn't change based on the surrounding context. In other words, they were fixed.
BERT is trained using a denoising objective (masked language modeling): words in a sentence are masked out, and the model learns to reconstruct the original sentence from the corrupted version. The concept is similar to autoencoders.
The original BERT also uses a next-sentence prediction objective, but it was shown in the RoBERTa paper  that this training objective doesn’t help that much.
In this way, BERT is trained on gigabytes of data from various sources (e.g., much of Wikipedia) in an unsupervised fashion.
2. (Old) Sentence Embedding Methods are not Rich
For many NLP tasks, we need sentence embeddings. This includes, but is not limited to, semantic similarity comparison, sentence clustering within documents and information retrieval via semantic search.
At Genei, we make use of sentence embeddings to cluster sentences in documents, which aids in the automatic extraction of key information from large bodies of text.
Two main methods for generating sentence embeddings from BERT are given below:
2a. Averaging Method
The most common BERT-based method is to generate sentence embeddings by simply averaging the word embeddings of all words in a sentence:
2b. Using the [CLS] vector as the sentence embedding
Alternatively, we can use the embedding for the [CLS] special token that appears at the start of the sentence.
It turns out that the sentence embeddings generated by these methods aren’t that good.
The UKP researchers showed that on semantic textual similarity (STS) tasks, using either the averaging or the [CLS] method for BERT sentence embeddings gives poor results. Even GloVe vectors significantly outperform naive BERT sentence embeddings.
3. Sentence-BERT: Fine-tuning BERT to give good Sentence Embeddings
The idea is to fine-tune BERT on a dataset that rewards models which generate sentence embeddings with the following property: the cosine similarity between two sentence embeddings should reflect the semantic similarity of the two sentences.
If we obtain a model that does this, we can generate sentence embeddings for each sentence once (each forward-pass through BERT is computationally expensive), and then compute a cosine similarity for each pair (computationally rapid and cheap). This method effectively scales as O(n).
This is orders of magnitude better than having to pass each pair of sentences through BERT. That pairwise approach is the current state of the art in accuracy, but it is very computationally expensive and scales as O(n²).
3a. (Pre)-Training Strategy: Siamese Neural Network
The general idea introduced in the Sentence-BERT paper is to pass two sentences through BERT in a siamese fashion. A good diagrammatic summary is below:
The idea is simple enough to state. We obtain sentence embeddings u and v for a pair of sentences. We then concatenate the embeddings as (u, v, ‖u-v‖), multiply by a trainable weight matrix W ∈ ℝ³ᴺˣᴷ (where N is the sentence embedding dimension and K is the number of labels), and apply a softmax to obtain label probabilities.
3b. Ablation study for pooling and concatenation strategies
The pooling operation is flexible, although the researchers found that a mean aggregation worked best (compared to a max or [CLS] aggregation strategy).
Several concatenation strategies were tried as well; (u, v, ‖u-v‖) worked the best.
Ablation results from the paper are shown below:
3c. Inference at test time
At inference time, we compute an embedding for each sentence, then take the cosine similarity of each pair of sentences whose semantic textual similarity we want to measure:
Interestingly enough, training (as in fig. 1) on the SNLI dataset but doing inference on the STS datasets results in pretty good metrics, even though no specific training has been done on STS.
3d. Quick Look at the Datasets: STS Benchmarks and SNLI dataset
The output of the siamese network was trained to match the labels of a group of labeled datasets: the STS benchmarks. These datasets provide labels from 0 to 5 for the semantic relatedness of a pair of sentences:
The SNLI (Stanford Natural Language Inference) dataset contains 570k human-written English sentence pairs manually labeled (by Amazon Mechanical Turk Workers) for balanced classification with the labels: entailment, contradiction, neutral.
3e. Final Results of the Paper
We’ll quickly take a look at the final results the paper obtains:
Clearly, fine-tuning on both NLI + STS results in the best models. Interestingly enough, using RoBERTa doesn't seem to help that much over BERT…
Finally, note the improvement we get over using the average BERT embeddings (line 2 of the table). The result is a step-change improvement.
4. Looking Forward — Part II
In part II of this blog post, we’ll look at an implementation of the Siamese BERT Network in PyTorch!
Until then, keep up to date with Genei’s progress:
Website and Demo: https://genei.io/
Facebook link: here
Genei is an Ed-tech startup working on improving the productivity of students and academics by harnessing the power of NLP.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.
John Pavlus. Machines Beat Humans on a Reading Test. But Do They Understand? Quanta Magazine.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv e-prints.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv preprint arXiv:1905.03197.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, EMNLP.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
Yang Liu. 2019. Fine-tune BERT for Extractive Summarization. arXiv preprint arXiv:1903.10318.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. arXiv preprint arXiv:1708.00055.