Simple Sentence Similarity Search with SentenceBERT

Viktor Karlsson
Mar 31 · 4 min read

SentenceBERT was built to enable applications such as semantic search in large datasets to utilise pre-trained BERT models. Let’s explore this use-case on the Quora questions dataset.


SentenceBERT who? 🤔

If you’re not familiar with this derivative of BERT, you’re in luck. I’ve written an article with the intention of summarising what motivated the development of this model, its architecture, and its performance on benchmark datasets. There, I also present its related work to give a better picture of where this sibling fits into the BERT family tree. Please go have a look! 🤖

Simple semantic search example

To get a feel for how SBERT performs in a search application, I have created a super simple search app that allows you to experiment. The code can be found here. I’ve dedicated the rest of this article to showing how the components of this app work, which should make it simple for you to test it out on your own dataset. Let’s get started.

Install the required packages in a new environment. If you are using conda, that can be achieved as shown below. Note that Python 3.6 is required because some of the requirements are not compatible with later versions.

conda create --name sentence-similarity python=3.6
conda activate sentence-similarity
pip install -r requirements.txt

I’ve included a subset of the data from the Quora Questions dataset. The full dataset can be downloaded as a .csv from Kaggle, but it has to be converted to work with my data loader. The required format is a .txt file where each row is considered one document. A file in this format can be loaded with my dataset class.

from dataset import Dataset
data = Dataset('data/quora/quora_example.txt')

This dataset class is not especially impressive unless it’s used together with the sentence similarity measurer found in sentence_similarity.py. Simply initialise this class with the dataset instance.

from sentence_similarity import SentenceSimilarity
sentence_sim = SentenceSimilarity(data)

This will split each document into sentences, which are then encoded using a pre-trained SentenceBERT model. These sentence embeddings are kept in memory for fast inference at query time. To find the Quora question in our dataset most similar to the query What is machine learning?, simply do
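The precompute-then-query pattern described above can be sketched independently of any particular model. In the sketch below, `encode` is any function that maps a list of sentences to embedding vectors; in the real app this would be a pre-trained SentenceBERT model's encode method, and the naive punctuation-based sentence splitter stands in for whatever tokenizer the repository actually uses — both are my assumptions, not the repository's code.

```python
import re

class SentenceSimilarity:
    """Sketch: split documents into sentences, embed once, rank at query time."""

    def __init__(self, documents, encode):
        self.encode = encode
        self.sentences = []
        for doc in documents:
            # Naive sentence splitting on terminal punctuation.
            for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
                if sent:
                    self.sentences.append(sent)
        # Embeddings are computed once and kept in memory.
        self.embeddings = encode(self.sentences)

    @staticmethod
    def _cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(y * y for y in b) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 1.0
        return 1.0 - dot / (norm_a * norm_b)

    def get_most_similar(self, query, top_k=10):
        query_emb = self.encode([query])[0]
        # Rank all precomputed sentence embeddings by distance to the query.
        ranked = sorted(
            range(len(self.sentences)),
            key=lambda i: self._cosine_distance(query_emb, self.embeddings[i]),
        )
        return [self.sentences[i] for i in ranked[:top_k]]
```

Swapping the toy encoder for an SBERT model is the only change needed to make this semantically meaningful; the caching and ranking logic stays the same.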

most_similar = sentence_sim.get_most_similar("What is machine learning?")

This will split the query into sentences (if you’ve entered more than one) and encode each of these using the same SentenceBERT model as previously. Then, each of these sentence embeddings will be compared to every sentence in the previously embedded dataset using the cosine distance metric. The closer two embeddings are, the more similar their sentences should be. It’s therefore natural to rank the search results by this metric.
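Concretely, the cosine distance between two embeddings is 1 minus the cosine of the angle between them: 0 for vectors pointing the same way, 1 for orthogonal ones. A quick illustration in plain Python, no SBERT required:

```python
def cosine_distance(a, b):
    # 1 - (a·b) / (|a||b|): 0 for parallel vectors, 1 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

This is why the distances in the log output below start near 0.3 for the best match and grow for worse ones.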

The example snippets shown above are collected in an example script which, when executed, should return something like this:

$ python
INFO:sentence_similarity:It took 9.319 s to embedd 552 sentences.
INFO:sentence_similarity:Extracted 1 sentences from query
INFO:sentence_similarity:Distance for top documents: [0.316, 0.335, 0.374, 0.381, 0.41, 0.45, 0.459, 0.468, 0.472, 0.478]
What is the alternative to machine learning?
What math does a complete newbie need to understand algorithms for computer programming? What books on algorithms are suitable for a complete beginner?
What's the best way to start learning robotics?
How do I learn a computer language like java?
What is Java programming? How To Learn Java Programming Language ?
What is the best material for understanding algorithmic analysis by a newbie?
How can I learn computer security?
How can you make physics easy to learn?
What are some mind-blowing computer tools that exist that most people don't know about?
What can make Physics easy to learn?

What we find is a relevant set of questions that are in many ways similar to our query sentence! Remember, all we did was compare numerical representations of these sentences; we never actually searched for any of the query terms. Thinking of it like that, at least for me, indicates the potential of this method! 🎉

Search application

I’ve integrated the same search functionality shown above into a simple Flask app that provides a super-basic UI for interacting with it. To play around with this yourself, simply run the Flask application’s entry script with python. This will load the same data as in the example above, but let you perform your own queries.
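The Flask layer itself can be very thin. A hypothetical minimal version is sketched below; the route name, query parameter, and the keyword-overlap placeholder ranking are all my assumptions, not the repository's actual code — in the real app the ranking would come from the SentenceSimilarity instance.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in corpus; the real app loads the Quora questions via Dataset.
DOCUMENTS = [
    "What is machine learning?",
    "How do I learn a computer language like Java?",
]

def rank(query):
    # Placeholder ranking by keyword overlap instead of SBERT embeddings.
    q = set(query.lower().split())
    return sorted(DOCUMENTS, key=lambda d: -len(q & set(d.lower().split())))

@app.route("/search")
def search():
    query = request.args.get("q", "")
    return jsonify(results=rank(query))

if __name__ == "__main__":
    app.run(debug=True)
```

Replacing `rank` with a call to `sentence_sim.get_most_similar` turns this stub into the semantic search app.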


Hope this inspired you to try this semantic search application based on embeddings from Sentence BERT on a dataset of your own!

Viktor Karlsson

Written by

Learning to write and writing to learn. Staying on top of current NLP research through sharing what I find interesting 🤖

Democratizing Artificial Intelligence Research, Education, Technologies
