Transformer-based Sentence Embeddings

Deep learning NLP tutorial on analyzing collections of documents with Extractive Text Summarization, utilizing Transformer-based sentence embeddings derived from SOTA language models

Haaya Naushan
The Startup
8 min read · Dec 22, 2020


Photo by Brett Jordan on Unsplash

Natural language processing (NLP) is a diverse field; the approaches and techniques are as varied as the textual samples available for analysis (e.g. blogs, tweets, reviews, policy documents, news articles, journal publications, etc.). Choosing a good approach requires an understanding of the questions being asked of the data and the suitability of the available data. This tutorial, which includes a code walkthrough, aims to highlight how sentence embeddings can be leveraged to derive useful information from text data as an important part of exploratory data analysis (EDA).

Case study: CoronaNet Research Project

I make use of the CoronaNet Research Project to conduct an NLP-guided assessment of government responses to the current pandemic. The details of the construction of the CoronaNet database are outlined in a paper published in Nature, and I accessed the database through the frequently updated GitHub repository. The data is available by country, and links to the original sources are provided. Using the Python library newspaper3k, I scraped the original policy-related news articles from several MENA region countries, including Iran, the United Arab Emirates, Egypt, Palestine and Yemen. Preprocessing relies on the Transformer-based sentence embeddings released last year, produced and maintained by the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt. These deep learning embeddings build on the advancements made by the SOTA (state-of-the-art) language models BERT and RoBERTa.
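For reference, scraping with newspaper3k follows a simple download-then-parse pattern. The snippet below is a minimal sketch, not the exact scraping script used for this project; the URL is a hypothetical placeholder standing in for one of the CoronaNet source links.

```python
from newspaper import Article

def scrape_article(url):
    """Download and parse a news article, returning its plain text."""
    article = Article(url)
    article.download()
    article.parse()
    return article.text

# Hypothetical URL standing in for a CoronaNet source link
document_text = scrape_article("https://example.com/covid-policy-announcement")
```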

What is Extractive Text Summarization?

The goal of text summarization is to compress text information into a compact and coherent format, the incentive being gains in efficiency and accessibility. Extractive summarization selects the subset of sentences that best represents a document; unlike abstractive summarization, it does not create new sentences, so the resulting summary is composed entirely of sentences extracted from the original document. Traditionally, sentences are scored with a chosen similarity metric and then ranked in order of importance, as determined by that similarity score or by a centrality measurement.

Example extractive text summary of a document related to the Egyptian government’s response to Covid-19. Image by Author.

What are sentence embeddings?

Sentence embeddings can be described as a document processing method that maps sentences to vectors, representing text with real numbers in a form suitable for machine learning. Similarity measurements such as cosine similarity or Manhattan/Euclidean distance evaluate semantic textual similarity, and these scores can be exploited for a variety of helpful NLP tasks, including information retrieval, paraphrase identification and text summarization.
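To make these similarity measurements concrete, the toy example below compares two made-up three-dimensional vectors with cosine similarity and Euclidean distance; real sentence embeddings are much higher-dimensional (e.g. 768 dimensions for a BERT-base model).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1 indicate similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional "sentence embeddings", purely for illustration
u = np.array([0.2, 0.8, 0.1])
v = np.array([0.25, 0.75, 0.05])

print(cosine_similarity(u, v))   # close to 1.0 -> semantically similar
print(np.linalg.norm(u - v))     # Euclidean distance: smaller -> more similar
```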

Why deep learning?

Currently, as measured by benchmarks, the best-performing methods for nearly every NLP task rely on deep learning. The past few years have produced remarkable deep learning advancements, most notably the Transformer architecture, and this has translated into SOTA NLP scores that are significantly superior to older methods such as GloVe embeddings. This article has two parts: an introduction to Transformer-based sentence embeddings, followed by a practical code example of extractive text summarization using Sentence-BERT (S-BERT).

Transformers and Pre-trained Language models

The revolutionary Transformer model was introduced in 2017 by Vaswani et al. The attention-only architecture was extremely impactful in the field of NLP, and its influence is now being felt in other fields such as computer vision and graph neural networks. For a detailed look at Transformers, Harvard NLP’s Annotated Transformer is a wonderful tutorial. In 2018, building on this architecture, Devlin et al. created BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that set SOTA records on various NLP tasks, including the Semantic Textual Similarity (STS) benchmark (Cer et al. 2017).

Following BERT, RoBERTa was released by Liu et al. in 2019; this model uses a robustly optimized pre-training approach to improve upon the original BERT model. These pre-trained language models are powerful tools, and this paradigm has extended to other languages. In an introduction to Arabic NLP, I write about the application of Transformers to the Arabic language and provide code examples.

Despite the remarkable achievements of Transformer-based language models, the construction of BERT makes it unsuitable for semantic similarity search, as well as for unsupervised tasks like clustering. This is because using BERT to find the most similar pair in a collection of 10,000 sentences would require around 50 million inference computations (n∙(n-1)/2), which would take roughly 65 hours. Furthermore, this computational overhead translates into expense, since running BERT for 65 hours on a 16 GB GPU (an Nvidia Tesla SXM V100 at ~$10,600 USD) would require renting an excessive amount of time on a cloud GPU for a single computation.

Last year, researchers from the UKP Lab released S-BERT (Sentence-BERT), which modifies the pre-trained BERT network with siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity (Reimers and Gurevych, 2019). Amazingly, with S-BERT, finding the most similar pair in 10,000 sentences goes from 65 hours to ~5 seconds.

Fine-tuning Sentence-BERT

In the past, neural sentence embedding methods started training from a random initialization. Instead of this, S-BERT uses pre-trained BERT and RoBERTa networks and then fine-tunes them to produce fixed-sized sentence embeddings. This is achieved using a siamese network structure as seen in the diagram below.

S-BERT architecture with a classification objective function for fine-tuning on a labeled dataset. The two BERT networks have tied weights. Image from original paper.

In this siamese network, BERT or RoBERTa can be used as the pre-trained language model; the default pooling strategy is to compute the mean of all output vectors, and u and v are the resulting sentence embeddings. Before the softmax classifier, the two embeddings are concatenated as (u, v, |u-v|), which was determined to be the optimal combination for fine-tuning. The classification objective function is optimized with cross-entropy loss and can be represented by the following equation:

Classification objective function, softmax(Wt(u, v, |u-v|)), where Wt is the trainable weight of the softmax classifier applied on top of the two weight-tied models linked in a siamese network. Image by Author.
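As a rough illustration of this objective, the PyTorch sketch below assumes u and v are already mean-pooled 768-dimensional BERT outputs and that there are three NLI-style labels; it is a simplified stand-in for the sentence-transformers training code, not a reproduction of it.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 3                      # assumed: BERT-base, NLI-style labels
classifier = nn.Linear(3 * hidden_size, num_labels)   # the trainable weight Wt
loss_fn = nn.CrossEntropyLoss()                       # applies softmax internally

def classification_objective(u, v, labels):
    """u, v: mean-pooled sentence embeddings from the two weight-tied BERT networks."""
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # the (u, v, |u-v|) concatenation
    logits = classifier(features)
    return loss_fn(logits, labels)

# Toy batch of two sentence pairs with made-up labels
u, v = torch.randn(2, hidden_size), torch.randn(2, hidden_size)
loss = classification_objective(u, v, torch.tensor([0, 2]))
```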

Pooling Strategy

Importantly, S-BERT adds a pooling operation to the output of a BERT/RoBERTa model to create a fixed-size sentence embedding. As mentioned, the default is a MEAN pooling strategy, since this was determined to be superior to using the output of the [CLS]-token or a MAX pooling strategy. A fixed-size sentence embedding is the key to producing embeddings that can be used efficiently in downstream tasks, such as inferring semantic textual similarity with cosine similarity scores.

Inference with Sentence-BERT

Once trained, S-BERT uses a regressive objective function for inference within a siamese network similar to the one used for fine-tuning. As seen in the diagram below, the cosine similarity between two sentence embeddings (u and v) is computed as a score in the range [-1, 1].

A regressive objective function is used with the S-BERT architecture for inference. Image from original paper.

The regressive objective function is optimized with mean-squared-error loss, and concatenation is not required before calculating the cosine similarity of the sentence embeddings.
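A rough sketch of this regression objective, under the same assumptions as the classification sketch above (768-dimensional pooled embeddings, gold similarity scores already scaled to the cosine range):

```python
import torch
import torch.nn as nn

cos = nn.CosineSimilarity(dim=-1)
mse = nn.MSELoss()

def regression_objective(u, v, gold_scores):
    """Regress the cosine similarity of two pooled embeddings onto gold similarity scores."""
    predicted = cos(u, v)              # values in [-1, 1]
    return mse(predicted, gold_scores)

# At inference time only the cosine similarity itself is needed
u, v = torch.randn(2, 768), torch.randn(2, 768)
similarity_scores = cos(u, v)
```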

Implementation

There are two main options for producing S-BERT or S-RoBERTa sentence embeddings: the Python library Huggingface transformers, or the Python library maintained by the UKP Lab, sentence-transformers. Both libraries rely on PyTorch, and the main difference is that using Huggingface requires defining a function for the pooling strategy, as seen in the code snippet below.
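The gist originally embedded here is not reproduced; the snippet below is a minimal sketch of such a mean-pooling function, following the commonly documented recipe of averaging token embeddings while masking out padding tokens.

```python
import torch

def mean_pooling(model_output, attention_mask):
    """Average the token embeddings, ignoring padding positions via the attention mask."""
    token_embeddings = model_output[0]  # first element holds the token-level embeddings
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts
```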

To create S-BERT sentence embeddings with Huggingface, simply import AutoTokenizer and AutoModel to tokenize and create a model from the pre-trained S-BERT model (a BERT-base model that was fine-tuned on a natural language inference dataset). As seen in the code snippet below, PyTorch is used to compute the embeddings, and the previously defined MEAN pooling function is applied.
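A minimal sketch of that step, assuming the NLI-fine-tuned checkpoint is 'sentence-transformers/bert-base-nli-mean-tokens' (the exact model name from the original gist is not shown here) and reusing the mean_pooling function defined above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/bert-base-nli-mean-tokens"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["The ministry announced a nationwide curfew.",
             "A curfew was imposed across the country."]

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded)

# Apply the previously defined mean pooling to get one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded["attention_mask"])
```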

Extractive text summarization with sentence-transformers

If using sentence-transformers, there are several pre-trained S-BERT and S-RoBERTa models available for sentence embeddings. For the task of extractive text summarization, I prefer to use a distilled version of S-RoBERTa fine-tuned on a paraphrase identification dataset. As seen in the code snippet below, with sentence-transformers it is simple to create a model and embeddings, and then calculate the cosine similarity.
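A minimal sketch with sentence-transformers, assuming the distilled paraphrase model is 'paraphrase-distilroberta-base-v1' and that `sentences` holds the sentence-tokenized text of one scraped document:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")  # assumed checkpoint

sentences = ["The ministry announced a nationwide curfew.",
             "Schools will remain closed until further notice.",
             "A curfew was imposed across the country."]

embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)  # pairwise similarity matrix
```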

After calculating cosine similarity, I use code adapted from the Python library LexRank to find the most central sentences in the document, as calculated by degree centrality. Lastly, to produce a summary I select the top five highest ranked sentences.
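The sketch below is a simplified stand-in for the LexRank-derived code, not a reproduction of it: it thresholds the cosine similarity matrix, scores each sentence by its degree, and keeps the top five sentences in their original order. The 0.3 threshold is an arbitrary assumption.

```python
import numpy as np

def degree_centrality_summary(cosine_scores, sentences, threshold=0.3, top_k=5):
    """Rank sentences by how many other sentences they are similar to (degree centrality)."""
    sim = cosine_scores.cpu().numpy()
    adjacency = (sim >= threshold).astype(float)    # connect sentence pairs above the threshold
    np.fill_diagonal(adjacency, 0)                  # ignore self-similarity
    centrality = adjacency.sum(axis=1)              # degree of each sentence node
    top_idx = np.argsort(-centrality)[:top_k]
    return [sentences[i] for i in sorted(top_idx)]  # preserve original sentence order

summary = degree_centrality_summary(cosine_scores, sentences)
print("\n".join(summary))
```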

Example summaries

For this article, I started with documents from Iran and Egypt, randomly sampling from collections of 47 and 113 documents respectively. The example summaries produced for Iran were coherent and informative. In the first example below, it is clear that the topic is about Iran’s interactions with the IMF.

Example extractive text summary of document related to Iran’s response to Covid-19.

In this second example, the summary mentions Iran’s actions to control information about Covid-19. However, the absence of dates makes it difficult to situate this information.

Second example extractive text summary of policy document related to Iran’s response to Covid-19.

Unfortunately, the samples from the Egypt dataset provided less helpful summaries, since key pieces of information were missing. In the example below, a person is referred to without identification, there are no dates mentioned, and the reference to “centers” is ambiguous.

Example extractive text summary of document related to Egypt’s response to Covid-19.

Final thoughts

Once sentence embeddings have been produced with S-BERT, extractive text summarization is only one of many NLP options available. Regarding next steps, I intend to try other techniques with Transformer-based sentence embeddings. For example, semantic textual similarity could be used to build a query-based information retrieval system in which documents are ranked by relevance to a specific query. I welcome all feedback, so please feel free to connect with me on LinkedIn.
