A beginner’s guide to measuring sentence similarity

Digitate
10 min read · Apr 19, 2023


By Anupriya Saraswat

In a previous blog, my colleague Pushpam Punjabi explained the concept of word embeddings and how they can be leveraged to help computers understand the semantics of a word. In this blog, I will take one more step into the world of text similarity and discuss sentence similarity!

Have you ever wondered why, when you type “who won the FIFA world cup this year,” search engines do not just return a page with that answer, but hundreds of other pages on related subjects, such as the history of FIFA world cup winners, Messi and Mbappe, or Argentina’s tryst with the FIFA world cup? Sentence similarity has a cool role to play behind such scenarios!

Let me describe how to measure the similarity of sentences or paragraphs. Read on!

What are sentence embeddings?

I’ll start this discussion with the concept of sentence embeddings. Similar to a word embedding, a sentence embedding represents a sentence as a vector of numbers. As you may recall from Pushpam’s article, each dimension in a word embedding corresponds to a particular feature or aspect of the word. A sentence embedding is based on a similar concept: its dimensions collectively capture different aspects of the words used in the sentence, the grammatical structure of the sentence, and perhaps some more underlying information.

There are various ways in which a sentence embedding can be created. Once we have each sentence represented as a vector of numbers, then the problem of finding sentence similarity translates to the problem of finding similarity between these numeric vectors. In this blog, I will discuss a couple of statistical techniques to create numeric representations of sentences and briefly explore an idea of how you can utilize the previously calculated word embeddings for the same task. I will also discuss how you can compute similarity between sentence embeddings.

Bag of words

I will start the discussion with a simple solution to create sentence embeddings. The basic idea is to find out which words are present in a sentence and assess the importance of a word based on how many times it occurs in a sentence. I can then create a vector using the same information.

I will explain this step by step with an example:

Step 1: I need some sentences to work with. I’ll use the following examples from the fictional Dothraki royal family:

Step 2: I’ll create a dictionary of all the words present in these sentences. While making this dictionary, I’ll remove the commonly used words such as “a”, “are”, “the”, “is”, etc. These words are called stop words. In creating a vector, you remove the stop words so that you can focus on the important words instead. The dictionary looks something like the following:

Step 3: Next, let’s create an embedding for one sentence. If there are k words in the dictionary, I will create a k-value vector, where each value in the vector represents one word in the dictionary. For each word, I will store how many times it occurs in the sentence.

  • Let’s understand this with an example of sentence 2 –
  • After removing stop words, this sentence contains four words, and their frequencies are as follows:
  • Given that there are 16 unique words in the dictionary, I’ll create a 16-number vector and fill in, for each word, the number of times it appears in sentence 2. This vector looks like the following:
  • Similarly, I can calculate embeddings for all seven sentences:
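
To make the steps above concrete, here is a minimal Python sketch of the bag of words procedure. Since the example sentences appear as images in this post, the two sentences, the tiny stop word list, and the helper names below are stand-ins of my own, not the exact values used here.

```python
from collections import Counter

# Stand-in sentences (the original Dothraki examples appear as images in the post).
sentences = [
    "Khal Drogo rides a black horse",
    "Khaleesi loves her dragon Drogon",
]

# A tiny illustrative stop word list; real pipelines use a much longer one.
stop_words = {"a", "an", "the", "is", "are", "her", "his"}

def tokenize(sentence):
    """Lowercase the sentence, split it into words, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in stop_words]

# Step 2: build the dictionary (vocabulary) of all remaining words.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

# Step 3: for each sentence, count how many times each dictionary word occurs.
def bag_of_words_vector(sentence):
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

embeddings = [bag_of_words_vector(s) for s in sentences]
print(vocabulary)   # the dictionary of words
print(embeddings)   # one count vector per sentence
```

Each embedding is simply a list of word counts aligned to the dictionary, so any two sentences can be compared position by position.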

Step 4: Now that I have represented a sentence as a vector of numbers, let’s measure the similarity of two vectors. One of the most commonly used ways of calculating the similarity of two arrays of numbers is cosine similarity. Let me take a short detour and discuss the basic concept of cosine similarity.

Consider that two n-dimensional arrays are plotted as two vectors in an n-dimensional space. Cosine similarity measures the cosine of the angle between these two vectors and returns a value between -1 and 1. Mathematically, given two vectors A and B, cosine similarity is calculated as follows:

similarity(A, B) = (A · B) / (|A| × |B|)

where,

A · B = the dot product of the two vectors, calculated by adding the products of the corresponding vector values.

|A|, |B| = the magnitudes of the two vectors. The magnitude of a vector is the square root of the sum of the squares of all its values.
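
Here is a minimal Python sketch of this formula. The function name cosine_similarity and the choice to return 0 for a zero vector are my own, purely for illustration; the three calls at the end preview the three cases discussed next.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))        # A · B
    mag_a = math.sqrt(sum(x * x for x in a))      # |A|
    mag_b = math.sqrt(sum(y * y for y in b))      # |B|
    if mag_a == 0 or mag_b == 0:
        return 0.0                                # convention for zero vectors
    return dot / (mag_a * mag_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))    # 1.0  -> same direction
print(cosine_similarity([1, 0], [0, 1]))          # 0.0  -> orthogonal
print(cosine_similarity([1, 2], [-1, -2]))        # -1.0 -> opposite directions
```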

How does one interpret the values of cosine similarity?

The value of cosine similarity ranges from -1 to 1. The higher the value, the more similar the sentences.

  • similarity = 1 means that the vectors point in the same direction and have the highest order of similarity. The vectors in the first image have a similarity value close to 1.
  • similarity = 0 means that the vectors are orthogonal, i.e., not related. This case can be visualized from the second image above.
  • similarity = -1 means that the vectors point in opposite directions and have the highest order of dissimilarity. This usually means that they are related in a contrasting way, much like antonyms. The third image depicts this.

Let’s compute the cosine similarity for sentences 6 and 7 from our example:

That means that sentences 6 and 7 are 68% similar. The presence of terms like “Khaleesi”, “dragon”, and “Drogon” in both sentences gives us this high value of similarity. The remaining 32% comes from the fact that “Khal” and “beautiful” appear in only one of the two sentences. Similarly, sentences 1 and 2 are merely 28% similar because the only word they have in common is “Khal.”

Note: Since I used the bag of words technique to calculate these embeddings, the presence or absence of exact terms in the sentences directly affects the similarity value.

TF-IDF

The bag of words approach gives equal weight to all words. A more sophisticated alternative is TF-IDF, which stands for Term Frequency-Inverse Document Frequency. This approach is based on the rationale that the most common words are usually the least significant ones. While the bag of words approach simply removes stop words, TF-IDF automatically gives less weight to frequent words.

Before I explain this approach, note that any piece of text that is vectorized using TF-IDF is referred to as a document here. The piece of text can be a sentence, paragraph, or a document (literally).

TF-IDF is made up of the following two components:

  • Term Frequency (TF): The number of times a word appears in the document. This is the same count I used in the bag of words technique.
  • Inverse Document Frequency (IDF): A measure of a word’s importance across the whole corpus. This component is what makes TF-IDF advantageous over the bag of words technique. IDF is calculated by taking the logarithm of the ratio of the total number of documents to the number of documents containing the word (its document frequency). The more frequently a word appears across the corpus, the lower its IDF, making it less important; the rarer the word, the higher its IDF. Mathematically,

IDF(word) = log(total number of documents / number of documents containing the word)

The TF-IDF value for a word in a document is then calculated as follows:

TF-IDF(word, document) = TF(word, document) × IDF(word)
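
As a quick illustration of these formulas (using base-10 logarithms, since the post does not state which base it uses): in a corpus of 7 sentences, a word that appears in 6 of them gets an IDF of log(7/6) ≈ 0.07, so even a high term frequency contributes little, whereas a word that appears in only 1 sentence gets an IDF of log(7/1) ≈ 0.85.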

Let’s understand this step by step with an example:

Step 1: We need a corpus. Consider the same set of sentences about the fictional Dothraki royal family that I used for the bag of words approach.

Step 2: We create a dictionary of words present in the corpus. This step is also the same as the bag of words approach. The resulting dictionary after removing the stop words will look like the following:

Step 3: Next, I need to create an embedding for each sentence. If there are k words in the dictionary, then I’ll first create a k-value vector, where each value in the vector represents one word in the dictionary. Here, for each word, instead of just taking the occurrence count, I’ll use its TF-IDF value.

  • Let’s understand this with an example of sentence 2 –
  • I’ll compute the TF and IDF values for all words in the dictionary in the context of sentence 2:
  • The TF-IDF vector for sentence 2 will be: [0, 0, 0, 0, 0, 0, 0, 0.2, 0, 0, 1, 0, 0, 0.5, 0.8, 0]
  • Note that the IDF value for common words such as “Khal” and “Khaleesi” is as low as 0.1, whereas for rarer words such as “Dothraki” and “horse” it is as high as 0.8.

Step 4: Similarly, I’ll calculate embeddings for all seven sentences as follows:

Step 5: Similar to the bag of words approach, I can now compute similarity between sentences by computing cosine similarity between these sentence embeddings.
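
Putting steps 1 through 5 together, here is a rough Python sketch that builds TF-IDF vectors following the TF and IDF definitions above. The corpus, the stop word list, and the base-10 logarithm are assumptions for illustration; the exact values in the post (such as the 0.1 and 0.8 IDF values) depend on the actual sentences and the log base used.

```python
import math
from collections import Counter

stop_words = {"a", "an", "the", "is", "are", "her", "his"}

def tokenize(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

def tfidf_embeddings(documents):
    """Build one TF-IDF vector per document, following the TF and IDF definitions above."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    n_docs = len(documents)

    # IDF: log of (total documents / documents containing the word).
    idf = {
        w: math.log10(n_docs / sum(1 for doc in tokenized if w in doc))
        for w in vocabulary
    }

    vectors = []
    for doc in tokenized:
        tf = Counter(doc)                       # TF: raw count of the word in this document
        vectors.append([tf[w] * idf[w] for w in vocabulary])
    return vocabulary, vectors

# Stand-in corpus (the real sentences appear as images in the post).
corpus = [
    "Khal Drogo rides a black horse",
    "Khal Drogo is strong",
    "Khaleesi loves her dragon Drogon",
]
vocab, vectors = tfidf_embeddings(corpus)
print(vocab)
print(vectors)
```

Feeding any two of these vectors to the cosine similarity function from earlier gives the TF-IDF-based similarity score.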

Recall that the similarity score for sentences 1 and 2 using the bag of words approach was 28%. Why, then, do the two methods give such different scores for the same pair of sentences? Note that the two sentences have only one word in common, which is “Khal”. This word has a very low IDF value of 0.1, which directly indicates that it is so common across the corpus that it has lost its significance.

Note: The way I calculate similarity (cosine similarity) does not change with how the embeddings are calculated, but the embedding method does change how we interpret the similarity value.

Sentence embeddings from word embeddings

While statistical techniques are a good starting point when it comes to sentence embeddings, they are still not good at capturing the semantics of the sentences. For example, consider sentences 2 and 3 in the example:

They don’t have any words in common, but they are similar because both are talking about the key personality traits of the two leaders of the Dothraki. If I use the embeddings created by either of the statistical techniques, the similarity value will be zero, meaning the sentences are not related at all, which should not be the case. Let’s see how to tackle this!

Here is a thought! Sentences are essentially combinations of words. So, what if you could somehow put the word embeddings (calculated in the previous blog) together to get sentence embeddings? The word embeddings blog represented the semantics of individual words as individual arrays. For sentence embeddings, you can aggregate those individual arrays to represent a sentence.

Let’s understand this step by step:

Step 1: Let’s take sentences 2 and 3 from the set of examples.

Step 2: After removing the stop words, we are left with the following words for the two sentences:

Step 3: According to the previous blog, these words can be represented in a two-dimensional vector as follows:

Step 4: I’ll create a sentence vector by taking the average of the 2D vectors of each word in the sentence. Each sentence is then represented as a 2D vector.

Note that I took the average of all the values because it is an easy and intuitive way of representing a set of values and it does not get biased by multiple occurrences of the same word.

Step 5: Using these embeddings, I can compute the cosine similarity between these two sentences, which comes out to -0.8. This implies that the two sentences are strongly related but they are talking about contrasting characteristics. This right here is the utility of capturing semantics and not just statistics from the sentences.
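
As a rough sketch of the averaging in step 4, the snippet below uses made-up 2-D word vectors and stand-in word lists; the real values from the previous blog are shown as an image, so none of these numbers are the actual ones.

```python
def average_embedding(words, word_vectors):
    """Average the word vectors of the words in a sentence to get one sentence vector."""
    dims = len(next(iter(word_vectors.values())))
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

# Made-up 2-D word embeddings, purely illustrative.
word_vectors = {
    "khal":     [0.9, -0.7],
    "strong":   [0.8, -0.9],
    "khaleesi": [0.9,  0.7],
    "kind":     [-0.6, 0.9],
}

sentence_a = ["khal", "strong"]        # stand-in for a sentence about Khal
sentence_b = ["khaleesi", "kind"]      # stand-in for a sentence about Khaleesi
vector_a = average_embedding(sentence_a, word_vectors)
vector_b = average_embedding(sentence_b, word_vectors)
# Passing vector_a and vector_b to the cosine similarity function from earlier
# then quantifies how related the two sentences are, even with no words in common.
```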

Note that I’m presenting some simplistic approaches to address the problem of sentence similarity. There exist more sophisticated solutions in this space, which will be a subject of our next blog!

Applications

As discussed in Pushpam’s previous blog, document embeddings provide a powerful lever in areas such as search engines, document classification, document translation, and conversational intelligence. ignio also uses embeddings for extracting important contextual information from descriptions present in events, incidents, and service requests.

Conclusion

In this blog, I discussed the concept of sentence embeddings and some simple approaches for calculating them. Simply put, sentence embeddings are how you would explain a sentence to a machine. Once the machine understands a sentence, you can use this understanding to perform several tasks. I focused primarily on sentence similarity here and discussed cosine similarity for quantifying it. Sentence similarity is a versatile and powerful tool that is currently fueling a huge range of applications in the real world.

About the author

Anupriya Saraswat is a Machine Learning Engineer at Digitate who specializes in the space of text analytics.

