Two minutes NLP — Sentence Transformers cheat sheet

Sentence Embeddings, Text Similarity, Semantic Search, and Image Search

Fabio Chiusano
NLPlanet
4 min read · Jan 10, 2022


Use-cases of the SentenceTransformers library. Image by the author.

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. Embeddings can be computed for 100+ languages and they can be easily used for common tasks like semantic text similarity, semantic search, and paraphrase mining.

The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.

Read the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks for a deep dive into how the models have been trained. In this article, we’ll see code examples for some of the possible use-cases of the library. Model training will be covered in a later article.

Library installation

Before delving into code, install the SentenceTransformers library with pip.
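
pip install -U sentence-transformers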

Get sentence embeddings

The first example shows how to obtain sentence embeddings. SentenceTransformers makes it as easy as pie: import the library, load a model, and call its encode method on the sentences you want to embed.
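
Here is a minimal sketch (all-MiniLM-L6-v2 is one of the library's general-purpose pre-trained models; any other pre-trained model name works the same way):

from sentence_transformers import SentenceTransformer

# Load a pre-trained model (all-MiniLM-L6-v2 is a small general-purpose model)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sits on the mat.",
    "A dog is playing in the garden.",
]

# encode returns one embedding per sentence (a NumPy array by default)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384) for this model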

Semantic Textual Similarity

Once we have the embeddings of our sentences, we can compute their cosine similarity using the cos_sim function from the util module.
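
For example (passing convert_to_tensor=True so that encode returns PyTorch tensors, which cos_sim consumes directly):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb1 = model.encode("The cat sits on the mat.", convert_to_tensor=True)
emb2 = model.encode("A feline is resting on a rug.", convert_to_tensor=True)

# cos_sim returns a matrix of pairwise cosine similarities
score = util.cos_sim(emb1, emb2)
print(score)  # values close to 1 indicate very similar meaning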

Semantic Search

Semantic search seeks to improve search accuracy by understanding the content of the search query instead of relying on lexical matching only. This is done by leveraging similarities between embeddings.

The idea behind semantic search is to embed all entries in your corpus into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found.

Example of Semantic Search in a vector space. Image from https://www.sbert.net.

Semantic Search can be performed using the semantic_search function of the util module, which works on the embeddings of the documents in a corpus and on the embeddings of the queries.
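
A minimal sketch with a toy corpus (the sentences are invented for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("A wild animal chases its dinner.",
                               convert_to_tensor=True)

# For each query, returns the top_k closest corpus entries by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])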

In order to get the best out of semantic search, you must distinguish between symmetric and asymmetric semantic search, as this heavily influences the choice of model. In symmetric semantic search, the query and the corpus entries have roughly the same length and form (e.g., finding questions similar to a given question); in asymmetric semantic search, a short query is matched against longer documents (e.g., a question against the paragraphs that answer it).

Paraphrase Mining

Paraphrase mining is the task of finding paraphrases, i.e. texts with very similar meaning, in a large corpus of sentences.

This can be achieved using the paraphrase_mining function of the util module.
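
A minimal sketch; note that paraphrase_mining takes the model itself rather than pre-computed embeddings, and returns scored sentence pairs:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sits outside.",
    "A man is playing guitar.",
    "The feline is sitting outdoors.",
]

# Returns a list of [score, i, j] triplets, sorted by decreasing similarity
paraphrases = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrases[:3]:
    print(f"{sentences[i]} <-> {sentences[j]}: {score:.2f}")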

Image Search

SentenceTransformers provides models that embed images and text into the same vector space: this makes it possible to find similar images as well as to implement Image Search, i.e. using text to search for images and vice versa.

Example of texts and images in the same vector space. Image from https://www.sbert.net.

To perform Image Search, you need to load a multimodal model like CLIP and use its encode method to encode both images and texts.
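
A minimal sketch (the image file names are placeholders for your own files):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load CLIP, which embeds images and texts into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

# encode accepts PIL images as well as strings
# ("dog.jpg" and "cat.jpg" are placeholder file names)
img_embeddings = model.encode([Image.open("dog.jpg"), Image.open("cat.jpg")])
text_embedding = model.encode("a photo of a dog")

# Rank the images by their similarity to the text query
scores = util.cos_sim(text_embedding, img_embeddings)
print(scores)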

The embeddings produced by multimodal models enable tasks like image-to-image similarity as well: comparing two image embeddings with cos_sim works exactly like comparing two sentence embeddings.

Other tasks

  • For complex search tasks like question answering retrieval, semantic search can be significantly improved by using a Retrieve & Re-Rank pipeline (a sketch follows this list).
The architecture of a Retrieve & Re-Rank pipeline. Image from https://www.sbert.net.
  • SentenceTransformers can be used in different ways to cluster small or large sets of sentences (a second sketch follows this list).
Example of topic modeling from document embeddings. Image from https://www.sbert.net.
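
Here is a minimal sketch of a Retrieve & Re-Rank pipeline, assuming the pre-trained models all-MiniLM-L6-v2 (bi-encoder) and cross-encoder/ms-marco-MiniLM-L-6-v2 (cross-encoder) and a toy corpus:

from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "Python is a programming language.",
    "The Eiffel Tower is in Paris.",
    "Pandas is a Python library for data analysis.",
]
query = "Which library can I use for data analysis in Python?"

# Retrieve: a fast bi-encoder narrows the corpus down to candidate passages
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

# Re-rank: a slower but more accurate cross-encoder re-scores each candidate
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (_, passage), score in zip(pairs, scores):
    print(f"{score:.2f}  {passage}")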
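
And a minimal clustering sketch using the library's community_detection utility (the threshold and minimum community size are illustrative values; which clusters emerge depends on them):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sits outside.",
    "The feline rests in the garden.",
    "I love pasta.",
    "Italian food is delicious.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Groups sentences whose pairwise cosine similarity exceeds the threshold
clusters = util.community_detection(embeddings, min_community_size=2, threshold=0.6)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}: {[sentences[idx] for idx in cluster]}")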
