How to Measure Text Similarity: A Comprehensive Guide

Ahmet Münir Kocaman
Oct 7, 2023


Text similarity is a critical concept in applications such as search engines, chatbots, plagiarism detectors, and recommendation systems. Being able to quantify how similar two pieces of text are lets these systems rank results, flag duplicates, and surface related content. In this article, we’ll look at several methods for measuring text similarity.

1. Introduction

Text similarity is the task of determining how ‘close’ two pieces of text are, either in (1) meaning or (2) surface form. The former relates to semantics (for example, “How old are you?” and “What is your age?” mean nearly the same thing despite sharing few words), while the latter is about the character or token sequences themselves.

2. Methods to Measure Text Similarity

2.1. Cosine Similarity with TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) turns a document into a numerical vector. It weighs how often a word appears in a document against how common that word is across the whole corpus, so frequent-but-generic words (like “the”) get low weights and distinctive words get high weights.
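
A common formulation (libraries differ in the exact variant and smoothing they apply) is:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is how many times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.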

Cosine similarity then measures the cosine of the angle between two non-zero vectors. Applied to text, it compares the TF-IDF vectors of two documents: identical directions give a score of 1, and documents with no terms in common give 0.
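
Concretely, for two vectors A and B:

cosine similarity(A, B) = (A · B) / (||A|| ||B||)

i.e., the dot product divided by the product of the vector lengths.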

Advantages:

  • Takes into account term frequency and importance.
  • Good for large documents.

Example with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["Your first text here", "Your second text here"]

# Fit a TF-IDF vectorizer on both texts and turn them into vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Cosine similarity between the two TF-IDF vectors
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
print(similarity)

2.2. Jaccard Similarity

Jaccard similarity measures the overlap between two texts: the number of tokens they share (the intersection) divided by the total number of distinct tokens across both (the union).

Jaccard similarity = |A ∩ B| / |A ∪ B|, where A and B are the sets of tokens in the two texts.

Advantages:

  • Simple and intuitive.
  • Good for small texts.

def jaccard_similarity(str1, str2):
    # Tokenize by whitespace and compare the resulting sets of words
    set1 = set(str1.split())
    set2 = set(str2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Share of distinct words that appear in both texts
    return len(intersection) / len(union)

str1 = "Your first text here"
str2 = "Your second text here"
print(jaccard_similarity(str1, str2))

2.3. Levenshtein Distance (Edit Distance)

It is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. Unlike the measures above, this is a distance: lower values mean more similar strings. It can be turned into a similarity score, e.g., 1 − distance / length of the longer string.

Advantages:

  • Works at the character level, so it catches typos and small edits.
  • Useful for spell checking and DNA sequence alignment.

import numpy as np

def levenshtein_distance(s1, s2):
    # Dynamic programming table: dp[i][j] is the edit distance between
    # the first i characters of s1 and the first j characters of s2
    len_s1, len_s2 = len(s1) + 1, len(s2) + 1
    dp = np.zeros((len_s1, len_s2), dtype=int)
    for i in range(len_s1):
        dp[i][0] = i  # deleting i characters from s1
    for j in range(len_s2):
        dp[0][j] = j  # inserting j characters into s1
    for i in range(1, len_s1):
        for j in range(1, len_s2):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,       # deletion
                           dp[i][j-1] + 1,       # insertion
                           dp[i-1][j-1] + cost)  # substitution (or match)
    return int(dp[-1][-1])

str1 = "Your first text here"
str2 = "Your second text here"
print(levenshtein_distance(str1, str2))

2.4. Word Embeddings (Word2Vec, GloVe, FastText)

Word embeddings map words to dense vectors in which semantically similar words end up close together. A simple way to compare two texts is to average the word vectors of each text and then compute the cosine similarity between the two averaged vectors.

Advantages:

  • Captures semantic meaning.
  • Can understand synonyms and context.

Here is an example using a pre-trained word2vec model loaded via gensim:

import numpy as np
import gensim.downloader as api
from scipy.spatial.distance import cosine

# Load pre-trained word2vec vectors (large download; fetched on first use)
model = api.load('word2vec-google-news-300')

def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    # Average the embeddings of all tokens; unknown words are either
    # skipped (zero vector) or replaced with a random vector
    if len(tokens_list) < 1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def cosine_distance_wordembedding_method(s1, s2):
    vector_1 = get_average_word2vec(s1.split(), model)
    vector_2 = get_average_word2vec(s2.split(), model)
    # scipy's cosine() returns the cosine *distance*; 1 - distance is the similarity
    cosine_distance = cosine(vector_1, vector_2)
    return 1 - cosine_distance

str1 = "Your first text here"
str2 = "Your second text here"
print(cosine_distance_wordembedding_method(str1, str2))

2.5. BERT and Transformers

Recent advancements in NLP, such as BERT and other transformer models, provide contextual embeddings. These models can generate sentence or paragraph embeddings to measure text similarity.

Advantages:

  • Captures deep and contextual semantics.
  • State-of-the-art performance in many NLP tasks.

Here is an example using the Hugging Face transformers library:

import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's final hidden state as the sentence embedding (a 768-dim vector)
    return outputs.last_hidden_state[0, 0, :].numpy()

def cosine_similarity_bert(s1, s2):
    emb1 = get_bert_embedding(s1)
    emb2 = get_bert_embedding(s2)
    # scipy's cosine() is a distance; subtract from 1 to get similarity
    similarity = 1 - cosine(emb1, emb2)
    return similarity

str1 = "Your first text here"
str2 = "Your second text here"
print(cosine_similarity_bert(str1, str2))

3. Practical Considerations

  • Size of Texts: Some methods are better suited to longer documents (TF-IDF), while others work well with shorter texts (Jaccard).
  • Speed vs. Accuracy: While deep learning methods like BERT provide excellent results, they can be slower than traditional methods.
  • Domain-specificity: In some cases, training domain-specific embeddings or models can yield better results.
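
To get a feel for how the methods differ, it can help to score the same sentence pair with each of them. Below is a minimal sketch, assuming the jaccard_similarity, levenshtein_distance, and cosine_similarity_bert functions defined earlier in this article are already in scope; the sentence pair is just an illustrative example.

s1 = "The weather is nice today"
s2 = "It is a nice day outside"

print("Jaccard:    ", jaccard_similarity(s1, s2))    # token overlap, between 0 and 1
print("Levenshtein:", levenshtein_distance(s1, s2))  # edit distance, lower = closer
print("BERT cosine:", cosine_similarity_bert(s1, s2))  # contextual similarity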

4. Conclusion

Measuring text similarity is a multifaceted challenge. Which method to choose depends on the nature of the texts, the application, and the computational resources available. By understanding the advantages and trade-offs of each method, you can select the most appropriate one for your needs.
