Sentence BERT — The basics

Akash Shetgar
4 min read · Jul 16, 2023


This article will give you a basic understanding of Sentence BERT (SBERT). We will take a step-by-step approach, which will help you grasp how these models developed over time.

First, we will start by understanding what Transformers do. Next, we will talk about how transformers are used in BERT and the concept of masking. Finally, we’ll have a look at SBERT.

Photo by Goran Ivos on Unsplash

Let’s start with Transformers

In the world of natural language processing (NLP), transformers have emerged as a powerful tool. These models excel at processing and understanding text by leveraging the context of the entire input: because they attend to every word in the prompt at once, the whole context stays in view, which makes transformers highly effective. Transformers first tokenize the text and convert the tokens into word embeddings.

Word embeddings are numerical vectors that represent words (tokens) so they can be processed efficiently. The size of each vector is a fixed property of the model (its embedding dimension), and the model also has a maximum sequence length, the largest number of tokens it can process at once. To ensure a fixed-size input, transformers tokenize the words, convert them into word embeddings, and truncate or pad the sequence: if the input text is shorter than the maximum length, padding is applied to maintain consistency.
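As a minimal sketch of what this looks like in practice (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the sentences and length are just examples), here is tokenization with padding to a fixed length:

```python
from transformers import AutoTokenizer

# Load the BERT tokenizer (assumes the `transformers` library is installed)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pad/truncate every input to the same fixed length so the model
# always receives tensors of identical shape.
batch = tokenizer(
    ["Transformers process the whole input at once.", "Short sentence."],
    padding="max_length",   # pad shorter texts with [PAD] tokens
    truncation=True,        # cut off texts longer than max_length
    max_length=16,
    return_tensors="pt",
)

print(batch["input_ids"].shape)    # torch.Size([2, 16])
print(batch["attention_mask"][1])  # 1s for real tokens, 0s for padding
```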

Words with similar meanings are closer. Source — https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/figures/fig5.png

Transformers are also more efficient than RNNs and LSTMs, which rely on recurrent connections to carry information forward one step at a time.

BERT: Bidirectional Encoding

BERT (Bidirectional Encoder Representations from Transformers) is a model inspired by transformers but introduces two key differences. The first is the masked language model, which involves hiding a portion of text within a sentence and requiring the model to predict the masked text using the surrounding words. This bidirectional approach contributes to BERT’s name and enhances its contextual understanding.

The second difference is the next sentence prediction mechanism. BERT learns the relationships between sentences by training on a combination of 50% correct sentence pairs and 50% randomly paired sentences, and it has to classify whether the second sentence actually follows the first. This training teaches BERT how consecutive sentences relate to each other.

https://editor.analyticsvidhya.com/uploads/13216BERT_MLM.png
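To make the masked-language idea concrete, here is a small sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint) that asks BERT to fill in a hidden word:

```python
from transformers import pipeline

# Masked-language-model pipeline backed by pretrained BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden token from the words on BOTH sides of it
for prediction in fill_mask("The capital of France is [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```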

While finding similarities between words is a feasible task with BERT, it is not practical for whole sentences. To compare sentences, BERT takes the two sentences together, separated by a separator ([SEP]) token, runs the pair through the entire network, and produces a similarity score, and doing this for every pair takes a lot of time.

To compare about a thousand sentences, BERT would take a long time, since it has to run inference on every possible sentence pair (roughly n(n-1)/2 combinations). The same problem applies to semantic search, the task of finding the sentence in a dataset that is most similar to a given query.
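A quick back-of-the-envelope sketch shows how fast the number of pairs grows:

```python
# Number of sentence pairs a cross-encoding BERT would have to score
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} sentences -> {pair_count(n):>13,} pairs to compare")
# 1,000 sentences already means ~500,000 forward passes;
# 10,000 sentences means ~50,000,000.
```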

SBERT

SBERT is a modification of BERT made to find similarities between sentences. It uses a Siamese architecture: a Siamese model consists of two or more copies of the same submodel (here, BERT) that share the same parameters/weights. This kind of model is used for computing similarities between two different inputs.
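As an illustrative sketch of how this plays out in practice (assuming the sentence-transformers library and the publicly available all-MiniLM-L6-v2 checkpoint; any SBERT-style checkpoint would do), each sentence is encoded independently and the embeddings are compared with cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained SBERT-style model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

# Each sentence is encoded independently into a fixed-size embedding
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the embeddings gives the sentence similarity
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the first pair should score much higher than the second
```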

Just like BERT, this network takes two sentences at once, but it does not have to train on every possible pairing of the sentences; instead, it uses triplet loss.

The triplet loss function in SBERT is used during the training process. It requires triplets of sentences as input, where each triplet consists of an anchor sentence, a positive sentence, and a negative sentence. Triplet loss is based on the difference between the anchor-positive distance and the anchor-negative distance: the loss is that difference plus a margin, clipped at zero, so the model is penalised whenever the positive is not at least a margin closer to the anchor than the negative.

The anchor sentence serves as a reference point, and the goal of the triplet loss is to ensure that the positive sentence (a sentence similar to the anchor) is closer to the anchor in the embedding space compared to the negative sentence (a dissimilar sentence).
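A minimal sketch of this objective (using PyTorch's built-in TripletMarginLoss; the embeddings below are random placeholders standing in for real SBERT outputs):

```python
import torch
import torch.nn as nn

# Margin-based triplet loss: the positive must be at least `margin`
# closer to the anchor than the negative, otherwise a penalty is incurred.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

# Placeholder embeddings standing in for SBERT sentence embeddings
# (384-dimensional here, as in many small SBERT checkpoints)
anchor   = torch.randn(8, 384)  # a batch of 8 anchor sentences
positive = torch.randn(8, 384)  # sentences similar to the anchors
negative = torch.randn(8, 384)  # sentences dissimilar to the anchors

loss = triplet_loss(anchor, positive, negative)
print(loss.item())  # training minimises this value, pulling positives closer to their anchors
```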

As NLP continues to evolve, these models will play a crucial role in various applications, from machine translation and sentiment analysis to chatbots and information retrieval systems. By harnessing the capabilities of transformers, we unlock new possibilities in understanding and processing human language.
