Exploring Contextual Text Similarity: A Dive into Machine Learning Techniques

Swarup Tripathy
8 min read · Jan 5, 2024



In today’s data-driven world, the ability to discern similarities between texts has become paramount. From information retrieval to recommendation systems, the understanding of contextual text similarity forms the bedrock of numerous applications. With the advent of machine learning, particularly natural language processing (NLP), the methods for gauging textual resemblance have evolved, allowing for more nuanced and accurate comparisons.

Contextual text similarity refers to the measurement of how closely related or similar two pieces of text are in meaning, despite variations in structure, wording, or length.

Traditional methods often relied on static techniques like cosine similarity or Jaccard index, which treat texts as bags of words, disregarding the context in which words appear. However, deep learning has unlocked the potential to capture contextual nuances, revolutionizing the approach to text similarity.

There are two broad ways to measure similarity between texts:

  • Lexical Similarity: measures how similar two texts are based on the overlap of their word sets, i.e. surface-level matching of words
  • Semantic Similarity: quantifies the likeness in meaning between two words, phrases, or sentences, regardless of the exact words used

Popular Models for Text Similarity

There are several approaches to measuring text similarity, each aiming to capture how closely two texts resemble each other. Some of the popular models include:

  • Lexical Text Similarity models: lexical similarity can be evaluated in several ways, such as Cosine Similarity, Jaccard Similarity, the Sørensen–Dice coefficient, and Levenshtein Distance
  • Semantic Text Similarity models: semantic similarity can be evaluated using algorithms such as word/sentence embeddings, contextual language models, and sentence transformers

Let's take an example to understand this further. Consider the following two sentences:

  • My house is empty today
  • Nobody's at my home

In the above example, both sentences convey the same meaning but use different words. Let's compare these two sentences using various text similarity metrics.

Lexical Text Similarity using Python

To test our examples we need to convert the sentences to numeric embedding vectors so that we can use them for computation. There are several ways to do this:

  • Bag of Words (BoW) is a collection of classical methods to extract features from texts and convert them into numeric embedding vectors. We then compare these embedding vectors by computing the cosine similarity between them. There are two popular ways of using the bag of words approach: Count Vectorizer and TFIDF Vectorizer.
  • Count Vectorizer: Count Vectorizer maps each unique word in the entire text corpus to a unique vector index. The vector values for each document are the number of times each specific word appears in that text. While Count Vectorizer is simple to understand and implement, its major drawback is a bias towards frequent words: it assigns higher importance to words that appear often in documents, regardless of how informative they are.

To understand count vectors further, let us take these two sentences:

  • The cat ate the mouse
  • The mouse ate the cat food

Count vectors for the above sentences will look something like this:

Fig. Count Vectorizer Example. Image by Author.
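
As a rough illustration (not the exact code behind the figure), the count vectors above can be reproduced with scikit-learn's CountVectorizer (scikit-learn >= 1.0 assumed for get_feature_names_out):

```python
# Minimal sketch: build raw count vectors for the two sentences with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat ate the mouse",
    "The mouse ate the cat food",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

# Columns are the corpus vocabulary, rows are sentences, values are raw counts,
# e.g. vocabulary ['ate' 'cat' 'food' 'mouse' 'the'] with rows [1 1 0 1 2] and [1 1 1 1 2].
print(vectorizer.get_feature_names_out())
print(counts.toarray())
```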

TFIDF Vectorizer: To overcome the drawback of the Count Vectorizer, we can use the TFIDF vectorizer. This algorithm also maps each unique word in the entire text corpus to a unique vector index. But instead of a simple count, the values of the vector for each document are the product of two values: Term Frequency (TF) and Inverse Document Frequency (IDF).

Here’s a breakdown of TF-IDF:

Term Frequency (TF): This component measures how frequently a term appears in a document. It’s calculated as the ratio of the number of times a term occurs in a document to the total number of terms in that document. It aims to reflect the importance of a term within a specific document.

Inverse Document Frequency (IDF): IDF evaluates the importance of a term across the entire corpus by penalizing terms that occur frequently across many documents. It’s calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term, often with smoothing to prevent division by zero.

Fig. TF-IDF formulas. Image by Author.
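
For reference, the formulas shown in the figure are commonly written as follows (one standard formulation; specific libraries such as scikit-learn add their own smoothing constants):

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\!\left(\frac{N}{1 + n_t}\right), \qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```

Here f_{t,d} is the raw count of term t in document d, N is the total number of documents in the corpus, and n_t is the number of documents that contain t.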

The higher the TF-IDF score for a term within a document, the more important or relevant that term is to that document compared to the rest of the corpus. Terms with high TF-IDF scores are often significant in representing the content of a document. Most implementations of TF-IDF also normalize the resulting vectors (for example, to unit length) so that longer documents do not dominate the calculation.

Implementing TF-IDF with Cosine similarity in Python
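
The original implementation is embedded as an image; a minimal equivalent sketch with scikit-learn, applied to the two example sentences, could look like this:

```python
# Minimal sketch: TF-IDF vectors plus cosine similarity with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "My house is empty today",
    "Nobody's at my home",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)

# The two sentences share almost no words, so the score comes out low
# even though they mean the same thing.
score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
print(f"TF-IDF cosine similarity: {score:.3f}")
```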

In the above example we can clearly see that the text similarity score is fairly low, since there are not many matching words, even though the two sentences mean the same thing.

Testing text Similarity with other lexical methods
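
Other lexical metrics mentioned earlier, such as Jaccard similarity and the Sørensen–Dice coefficient, can be computed directly over token sets. The snippet below is an illustrative reimplementation rather than the code behind the post's figures:

```python
# Sketch: Jaccard similarity and the Sørensen–Dice coefficient over lowercase token sets.
def tokenize(text):
    return set(text.lower().split())

def jaccard(a, b):
    x, y = tokenize(a), tokenize(b)
    return len(x & y) / len(x | y)

def sorensen_dice(a, b):
    x, y = tokenize(a), tokenize(b)
    return 2 * len(x & y) / (len(x) + len(y))

s1 = "My house is empty today"
s2 = "Nobody's at my home"

# Only "my" overlaps, so both scores are low for this pair.
print(f"Jaccard similarity:        {jaccard(s1, s2):.3f}")
print(f"Sørensen–Dice coefficient: {sorensen_dice(s1, s2):.3f}")
```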

In all the above cases the similarity is calculated from matching words (lexical text similarity methods). The scores are therefore quite low, because the two sentences use different words even though they have similar meaning.

To address these limitations, approaches like word embeddings, contextual language models, and more advanced techniques in natural language processing have been developed. These methods capture semantic meaning, context, and relationships between words or phrases, thereby providing more accurate and nuanced measures of text similarity. They offer a better understanding of language semantics and context, overcoming many of the shortcomings of lexical text similarity methods.

Semantic Text Similarity using Python

Semantic text similarity refers to the measurement of how closely related or similar two pieces of text are in meaning, context, or semantics. Unlike lexical similarity that focuses on word matching or surface-level comparisons, semantic similarity aims to capture the actual meaning and context of the text.

Several methods are used to assess semantic text similarity:

  1. Word Embeddings: Techniques like Word2Vec, GloVe, and fastText create dense vector representations for words in a continuous space, capturing semantic relationships. Similar words are closer together in this space, allowing for measuring word similarity beyond lexical overlap (see the short sketch after this list).

For example, the figure below shows the word embeddings for “king”, “queen”, “man”, and “woman” in a 3-dimensional space:

Fig. Word embeddings in a 3-dimensional space
  2. Sentence Embeddings: These methods generate fixed-length vectors representing entire sentences or phrases, considering the meaning and context of the text. Techniques like Universal Sentence Encoder and InferSent create representations that capture the semantics of the text.
  3. Contextual Language Models: Pre-trained models like BERT, GPT (Generative Pre-trained Transformer), and their variants learn contextual representations of words, sentences, or documents by considering the surrounding context. They excel in understanding nuances and context-specific meanings, enabling accurate text similarity measurements.
  4. Semantic Graph Models: Graph-based approaches, such as knowledge graphs or semantic networks, represent words or concepts as nodes connected by edges denoting relationships. Similarity is measured based on graph structures or distances between nodes.
  5. Sentence Transformers: Models specifically designed to encode sentences into fixed-length vectors, leveraging transformer architectures to capture semantic information. These models enhance text similarity tasks by understanding the semantic content of sentences.
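
As a quick sketch of the word-embedding idea, a small pretrained model can be loaded through gensim's downloader (the checkpoint name "glove-wiki-gigaword-50" is an assumption here; any pretrained word-vector model illustrates the same point):

```python
# Sketch: word-embedding similarity and the classic king/queen analogy with gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small GloVe model, downloads on first use

# Related words sit close together in the embedding space.
print(vectors.similarity("house", "home"))

# king - man + woman lands near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```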

Semantic text similarity approaches address many limitations of lexical methods by focusing on understanding meaning, context, and relationships between words.

Implementation of BERT with Cosine similarity using Python

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT Base Uncased is a model pretrained on English text using a masked language modeling (MLM) objective. It was introduced in the paper cited above and first released in the accompanying code repository. The model is uncased, which means it does not distinguish between upper and lower case; for example, "english" and "English" are treated the same.
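
The post's BERT implementation appears as an image; a minimal, roughly equivalent sketch using the Hugging Face transformers library is shown below (mean pooling over the last hidden states is an assumption about the pooling strategy, chosen for simplicity):

```python
# Minimal sketch: sentence similarity with bert-base-uncased via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector.
    return outputs.last_hidden_state.mean(dim=1)

emb1 = embed("My house is empty today")
emb2 = embed("Nobody's at my home")

# The contextual embeddings capture meaning, so this score comes out much
# higher than the lexical metrics did for the same pair of sentences.
score = torch.nn.functional.cosine_similarity(emb1, emb2).item()
print(f"BERT cosine similarity: {score:.3f}")
```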

DistilBERT implementation using Python

The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

Fig. DistilBERT implementation using Python
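
Again, the figure stands in for the code; a comparable sketch (with the same mean-pooling assumption as the BERT example) only swaps the checkpoint name:

```python
# Sketch: sentence similarity with distilbert-base-uncased via transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings into a single sentence vector.
    return outputs.last_hidden_state.mean(dim=1)

score = torch.nn.functional.cosine_similarity(
    embed("My house is empty today"), embed("Nobody's at my home")
).item()
print(f"DistilBERT cosine similarity: {score:.3f}")
```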

We can clearly see that semantic models like BERT give us a better score and are able to understand the context behind the sentences.

Wrap up

We evaluated the performance of various lexical models, such as TF-IDF (Term Frequency-Inverse Document Frequency) with cosine similarity, Jaccard, and Sørensen–Dice, alongside semantic models like BERT and DistilBERT. The results clearly show that semantic models like BERT give better scores and are able to understand the context behind the sentences, whereas lexical models give good results only when keywords match.

Fig. Lexical and Semantic Similarity Result. Image by Author.

The integration of machine learning techniques in contextual text similarity has revolutionized the understanding of textual relationships. Leveraging sophisticated models and embeddings has not only improved accuracy but also opened doors to myriad applications across diverse domains. As advancements continue, the quest for more robust and context-aware text similarity methods persists, promising a future where machines comprehend textual meaning akin to human cognition.
