Mastering Text Similarity: combining embedding techniques and distance metrics

Lavinia Guadagnolo
Published in Eni digiTALKS
May 20, 2024

“Are you paying attention?” “Are you focusing?” Do these sentences mean the same thing? Read the article and find the algorithm’s answer!

Figure 1 — Source Adobe Stock

Comparing two texts is a more pervasive and relevant task than is commonly perceived. In customer service, AI systems should be able to understand queries with synonymous meanings and generate responses that mirror the fluidity of human conversations. For instance, questions like “how can I recover my password?” and “I forgot the password, how can I log back in?” share the same meaning and warrant identical responses. Furthermore, it is often necessary to check whether an agent accurately conveys information about contracts or offers during customer interactions. Additionally, in search engines or platforms such as Stack Overflow, there is the need to understand whether a question has been asked before. In essence, the ability to quickly compute text similarity becomes a cornerstone in enhancing efficiency and improving customer relationships.

While human beings understand semantic and syntactic relationships innately and effortlessly, machines face a more complex challenge to accomplish the same task.

To begin, let’s define more specifically what similarity is: similarity is the measure of how different or alike two data objects are. If the distance is small, the objects are said to have a high degree of similarity and vice versa.

Text similarity refers to how close two pieces of text are lexically (the words used) and semantically (the words’ meaning). For instance, the sentences “the bottle is empty” and “there is nothing in the bottle” are semantically identical but different from a lexical point of view.

To allow machines to compute text similarity, we first have to speak the machine’s language, that is, the language of numbers.

So, to evaluate text similarity, one of the pivotal steps is the conversion of texts into vectors, numeric elements representing both magnitude and orientation in space. This process is known as text vectorization or text encoding. The subsequent step in evaluating similarity is measuring the distance between these vectors, and the metric chosen to calculate this distance is crucial.

In this article we will comprehensively explore the advantages and disadvantages of different techniques to embed text and compute distances, giving you an extensive guide to finding the best combination for your needs.

Before delving into details let’s have a look at the following picture which can be considered as a high-level index for this article.

Figure 2 — Text representation and text distance approaches

Text representation can be clustered into 4 macro-methods: string-based, semantic text matching, corpus-based (where corpus refers to a collection of text documents used for linguistic analysis), and graph structure.

Text distance can be divided into length distance, distribution distance and semantic distance.

We will only cover the topics bolded in the figure above, starting from the easiest to the most complex approaches.

String based text representation

Jaccard similarity

Jaccard similarity, also known as intersection over union, measures similarity as the ratio between the number of words common to two texts and the total number of distinct words appearing in them.

It is described by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are the sets of words of the two texts.

Let’s consider two sentences mentioned before:

  • A: The bottle is empty
  • B: There is nothing in the bottle
Figure 3 — Jaccard similarity example

According to Jaccard similarity, sentences A and B have different meanings, as the comparison is made at a purely literal level.

Despite being easy to calculate, Jaccard similarity is not able to capture semantic relationships, so it is not highly recommended unless the goal is merely to compare the words used in the texts.
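
As a minimal illustration, here is a quick Python sketch (a toy helper written for this article, not part of any library) that reproduces the computation:

```python
# Minimal sketch: Jaccard similarity between two texts,
# treating each text as a set of lowercase words.
def jaccard_similarity(text_a: str, text_b: str) -> float:
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard_similarity("The bottle is empty",
                         "There is nothing in the bottle"))  # 3/7 ≈ 0.43
```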

Corpus based text representation

As mentioned before, the way a word is numerically represented is crucial for evaluating similarity. We will now explore different techniques for text representation.

Bag of words

Bag of words-based techniques represent a document as a combination of its words, regardless of their order. The most straightforward approach in this family is one-hot encoding.

According to this approach, a vector is created whose size equals the total number of unique words in the vocabulary. Each word is represented by a vector with a 1 at the index corresponding to that word and 0 everywhere else.

It is generally used in situations where there is not much verbal data diversity and there is no need to represent the semantic and statistical relationships between the data. For large documents, it might lead to huge and sparse vectors.
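
Below is a minimal sketch of one-hot encoding with NumPy, using a hypothetical toy vocabulary built from the two example sentences:

```python
# Minimal sketch: one-hot encoding each word of a toy vocabulary.
import numpy as np

vocabulary = ["the", "bottle", "is", "empty", "there", "nothing", "in"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word: str) -> np.ndarray:
    # Vector of vocabulary size: 1 at the word's index, 0 elsewhere
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("bottle"))  # [0. 1. 0. 0. 0. 0. 0.]
```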

A slightly more sophisticated approach is TF-IDF (Term Frequency — Inverse Document Frequency). It is based on the idea that words occurring frequently across many documents bear minimal meaning or significance.

Mathematically speaking:

TF = number of times a word appears in a document / the total number of words in the document

IDF = log(N/n)

Where N is the total number of documents and n is the number of documents in which the target term occurs.

To avoid zero weights when a word is present in all documents, 1 is added to the IDF score; as a result, the 0s in the final vectors indicate the absence of a word.

Let’s understand this with an example. Consider these sentences:

  • He is Walter
  • He is William
  • He isn’t Peter or September

The TF vectors are:

  • [0.33, 0.33, 0.33]
  • [0.33, 0.33, 0.33]
  • [0.20, 0.20, 0.20, 0.20, 0.20]

Now let’s compute the IDF score

  • “He”: log(3/3) = 0
  • “is”: log(3/2) = 0.1761
  • “Walter”, “Peter”, “or”, … (words appearing in only one document): log(3/1) = 0.4771

The resulting vectors are:

  • [1.0, 1.1761, 1.4771, 0, 0, 0, 0, 0]
  • [1.0, 1.1761, 0, 1.4771, 0, 0, 0, 0]
  • [1.0, 0, 0, 0, 1.4771, 1.4771, 1.4771, 1.4771]
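
A minimal sketch of this computation with scikit-learn is shown below. Note that scikit-learn’s TfidfVectorizer uses the natural logarithm, a smoothed IDF and L2 normalization by default, so its exact numbers differ from the hand-computed example above:

```python
# Minimal sketch: TF-IDF vectors for the three example sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "He is Walter",
    "He is William",
    "He isn't Peter or September",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(4))
```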

This approach values statistical relationships, but it still does not represent semantic relationships, and it isn’t suitable for long documents, as it leads to high-dimensional vectors.

In general, most of the bag of words approaches do not value semantic relationships and might lead to data sparsity for large documents. So, let’s deep dive into a more complex technique for text representation, which can be classified among the word embedding ones.

Window-based methods

Word2Vec

W2V is a pre-trained two-layer neural network. The W2V approach uses a trick commonly used in machine learning: a neural network with a single hidden layer is trained to perform a certain task, but it is not used for that final task; the goal is just to learn the weights of the hidden layer, which become the word embeddings.

W2V has two pre-training models: Continuous Bag Of Words (CBOW) and Skip-gram.

Figure 4 — W2V: CBOW and Skip-Gram description [4]

As can be seen in the picture, in CBOW the words adjacent to the target word are given as input and the task is to predict the target word itself, while in Skip-gram the target word is given as input and the neighboring words need to be predicted as output. The number of words considered as neighbors is called the “window size” and is a parameter of the algorithm (a typical value might be 5).

To understand how embeddings are created, let’s focus on Skip-gram.

The first step for the training task is to encode the words in the document; this is usually done with the one-hot encoding approach.

Figure 5 — Focus on Skip Gram architecture [6]

The output is a single vector whose length equals the number of unique words in the corpus, where each element represents the probability that the corresponding word is a neighbor of the input word.

If two words share highly comparable contexts, that is, the same words are likely to surround them, the model must generate similar outputs for them. One way for the network to achieve this is to make their word vectors similar. Therefore, when two words exhibit analogous contexts, the network is incentivized to learn comparable word vectors for them.

Coming back to the comparison of the two types of networks: CBOW excels in learning syntactic relationships, while Skip-gram is better at grasping semantic relationships. For instance, CBOW focuses on morphologically similar words like plurals, while Skip-gram considers morphologically different but semantically related words. Additionally, Skip-gram is less sensitive to overfitting frequent words, as it relies on single-word input, making it more efficient in terms of document requirements for optimal performance.

W2V embeddings are implemented in libraries such as spaCy and Gensim.
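
As an example, here is a minimal sketch of training a Skip-gram model with Gensim on a hypothetical toy corpus (real applications require much larger amounts of text):

```python
# Minimal sketch: training a Skip-gram Word2Vec model with Gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "bottle", "is", "empty"],
    ["there", "is", "nothing", "in", "the", "bottle"],
]

# sg=1 selects Skip-gram; window=5 is the neighborhood size discussed above
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

vector = model.wv["bottle"]                # 50-dimensional embedding
similar = model.wv.most_similar("bottle")  # nearest words in embedding space
print(vector.shape, similar[:3])
```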

This approach addresses high-dimensionality problems and captures semantic and syntactic relationships; however, it presents a limitation, as it does not account for a word’s context and therefore performs poorly in cases of polysemy. For instance, the word “current” has two different meanings in the following sentences:

  • It’s a current affairs program
  • The current runs quickly under the bridge

According to the W2V approach, the word “current” would have just one representation.

Contextual models address this issue, taking into account the sequence of all words in the document to embed a target word.

Let’s explore one of the most prominent algorithms in the world of NLP: BERT.

BERT

We will provide a concise overview of the algorithm, emphasizing the crucial aspects of architecture and training necessary to comprehend the achievement of contextual embedding. For a more in-depth exploration of BERT, we recommend delving into additional resources, linked at the conclusion of this article [3].

BERT stands for Bidirectional Encoder Representations from Transformers. As its name suggests, the BERT architecture is based on transformers: it uses bidirectional transformer encoders for language representation.

BERT is pretrained for two different tasks: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).

Let’s start with the first one. 15% of the words present in the corpus are considered to be “masked”. Among them:

  • 80% of the words are replaced with the mask token [MASK]
  • 10% are replaced with random words
  • 10% are left unchanged.

The task is to predict the masked words. In particular each way of masking the tokens is crucial for the model:

  • Replacing a word with the [MASK] token and predicting what it should be makes the model able to infer a token from its surrounding text alone.
  • Replacing a sampled word with a random word from the text and predicting what it should be enhances the model’s resilience against incorrect tokens: the model learns to handle unexpected or out-of-context words effectively.
  • Leaving a sampled word unchanged and predicting it makes the model preserve the original semantics and syntactic structure of the text, and makes it resilient against erroneous contexts.

The combination of these three ways of masking words contributes to the robustness and versatility of the model in various NLP tasks.

The second task, NSP, aims at learning relationships between sentences: for 50% of the sentence pairs in the corpus the second sentence actually is the next sentence; for the remaining pairs, the second sentence is a random one. The first case is labeled “isNext”, the second “NotNext”. The task is to predict the right label, and it allows BERT to learn relationships between sentences (for example, questions and answers).

During the training of the BERT model, the Masked Language Model (MLM) and Next Sentence Prediction (NSP) components are trained together, aiming to minimize the combined loss function arising from these two strategies.

The power of BERT derives from how it models the input for accomplishing these tasks.

Each input embedding is a combination of 3 embeddings:

  1. Position Embeddings: Embeddings used to express the position of words in a sentence. These elements are introduced to address the limitation of the Transformer, which, unlike an RNN, lacks the ability to capture information related to sequences.
  2. Segment Embeddings: Embedding which identifies sentence pairs. BERT learns a unique embedding for the first and the second sentences to help the model distinguish between them. In the picture below, all the tokens marked as EA belong to sentence A and similarly for EB.
  3. Token Embeddings: the embeddings of the individual tokens; a [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.

Each input representation is obtained by summing the 3 embeddings mentioned above.

Figure 6 — Bert architecture [3]

After being pre-trained, the model can be fine-tuned on a specific corpus.

The pre-trained version of BERT can be used for specific tasks such as classification (e.g., sentiment analysis), question answering, and Named Entity Recognition. However, we mentioned this algorithm among the techniques for embedding, so we will briefly explain how to use it for this purpose. The idea behind this application is similar to the one mentioned for W2V: we don’t use the model for the task for which it was pre-trained. The BERT base model employs 12 layers of transformer encoders, and the output for each token from each layer can be used as a word embedding. Based on empirical findings, the authors determined that a highly effective approach is to sum the outputs from the last 4 layers.

Embedding with BERT can easily be implemented in Python, as the Hugging Face open-source library gives the possibility to use BERT in PyTorch or TensorFlow.
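
A minimal sketch with the Hugging Face Transformers library is shown below; the model name and the sum-of-the-last-4-layers pooling are illustrative choices, and other pooling strategies (such as averaging layers) are also common:

```python
# Minimal sketch: contextual token embeddings from BERT by summing
# the hidden states of the last 4 encoder layers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The current runs quickly under the bridge", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (embedding layer + 12 encoder layers),
# each of shape (batch, sequence_length, hidden_size)
hidden_states = torch.stack(outputs.hidden_states)          # (13, 1, seq_len, 768)
token_embeddings = hidden_states[-4:].sum(dim=0).squeeze(0)  # sum of last 4 layers

print(token_embeddings.shape)  # (seq_len, 768): one contextual vector per token
```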

Distance metrics

Up to this point, our exploration has focused on only one method for measuring text similarity (Jaccard similarity) and on several techniques for text vectorization. As stated in the introduction, transforming words into vectors is just the preliminary step in assessing similarity; the computation of a distance metric is also needed. The subsequent sections will delve into a comprehensive examination of these distance metrics.

Length distance metrics

According to length distance metrics, distance is measured using the numerical characteristics of the text vectors. The most popular metric among them is certainly Euclidean distance.

Euclidean distance

Euclidean distance uses the Pythagorean theorem to calculate the distance between two points.

Let’s consider two vectors x and y of length n; the Euclidean distance is described by the following formula:

d(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)² )

Formula 1 — Euclidean distance

and can be visualized in the following picture:

Figure 7 — Euclidean distance [6]

The larger the distance d between two vectors, the lower the similarity score and vice versa.
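
As a minimal sketch, the Euclidean distance between two toy vectors can be computed with NumPy:

```python
# Minimal sketch: Euclidean distance between two toy vectors.
import numpy as np

a = np.array([1.0, 1.1761, 1.4771, 0.0])
b = np.array([1.0, 1.1761, 0.0, 1.4771])

d = np.linalg.norm(a - b)  # square root of the sum of squared differences
print(d)  # ≈ 2.09
```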

Euclidean distance has some limitations. Firstly, the value obtained is hard to interpret unless there is something to compare it to. To address this issue, distances might be normalized, but the most commonly known formula, standardization,

z = (x − μ) / σ

Formula 2 — Standardization (where μ is the mean and σ the standard deviation)

is highly sensitive to the presence of outliers, since both the mean and the standard deviation are affected by them.

Secondly, Euclidean distance is strongly affected by text magnitude, therefore it doesn’t work well with sparse vectors (such as the ones deriving from one hot encoding).

Cosine similarity

Cosine similarity computes the similarity of two vectors measuring the cosine of the angle between the vectors. Instead of measuring the distance between two points, it verifies if two vectors are pointing in the same direction. Therefore, it is not affected by the magnitude of the vectors.

Cosine similarity is computed through the following formula:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Formula 3 — Cosine similarity
Figure 8 — Cosine distance [6]
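
Using the same two toy vectors as in the Euclidean sketch above, cosine similarity can be computed as follows:

```python
# Minimal sketch: cosine similarity between two toy vectors.
import numpy as np

a = np.array([1.0, 1.1761, 1.4771, 0.0])
b = np.array([1.0, 1.1761, 0.0, 1.4771])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # ≈ 0.52 (1 = same direction, 0 = orthogonal vectors)
```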

To compare these two metrics, let’s try to better understand the implications of being affected by vector magnitude.

Suppose we compare two papers, and we would like to find which one is related to politics and which one is related to sports. If the word “baseball” occurs more often in paper 1 than in paper 2, then according to Euclidean distance paper 1 is more related to sports. However, it might simply be that paper 1 is longer than paper 2, while paper 2 is actually more related to sports than paper 1.

Most text similarity use cases aren’t length-sensitive; therefore, in general, cosine similarity is preferred over Euclidean distance. A use case in which Euclidean distance might be preferred is plagiarism detection, as it is length-sensitive.

Semantic distance

In situations where there aren’t many common words between texts, the similarity derived from distance measures reliant on length or distribution may appear relatively small. In such cases, it is advisable to choose semantic distance calculations.

The primary method employed for this purpose is Word Mover’s Distance, which facilitates the determination of semantic proximity between texts.

Word mover’s distance

Word mover’s distance defines the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to travel to reach the embedded words of the other document. Therefore, measuring similarity becomes a transportation problem: minimize the cost of transporting text 1 to text 2.

Word Mover’s Distance originates from an optimization problem called Earth Mover’s Distance, which measures similarity between two probability distributions, considering the cost of transforming one distribution into another.

The first step for computing WMD is embedding the words and representing each document as a normalized bag of words. Secondly, the Euclidean distance between the word embeddings is computed, and finally the optimization problem is solved. The objective function of the optimization problem is to minimize the distance needed to travel from one document to the other, with the constraint that the total amount of “mass” must be conserved; in other words, the constraints ensure that the entire content of both documents is considered and that no word is duplicated or lost in the process of travelling from one document to the other.
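
A minimal sketch of WMD in practice uses Gensim’s wmdistance method on pre-trained word vectors; the embedding model named below is illustrative, and WMD support requires an optional optimal-transport dependency:

```python
# Minimal sketch: Word Mover's Distance with Gensim and pre-trained GloVe vectors.
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-50")  # downloads pre-trained embeddings

sentence_a = "the bottle is empty".split()
sentence_b = "there is nothing in the bottle".split()

distance = word_vectors.wmdistance(sentence_a, sentence_b)
print(distance)  # lower distance -> more semantically similar documents
```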

WMD is expensive to calculate, especially for long documents. And as a method that uses every word’s presence, regardless of ordering, it still isn’t strong in cases where tiny grammatical changes, such as the addition of a “not” in the right place, completely change the meaning. However, it is very effective for documents with only a few words in common, as it considers words’ similarities in the embedding space.

Conclusion

Measuring semantic similarity is one of the most complex challenges in the world of natural language processing. Our journey through the landscape of text similarity measurement has unveiled a diverse array of methods, each with its unique strengths and limitations.

String-based methods are simple and easy to implement; however, they do not address semantic relationships. On the other hand, corpus-based methods, despite being more complex to implement, value semantic and statistical relationships and are versatile across multiple languages.

Concerning distance metrics, some of them are length-sensitive, while others rely on the distribution or focus on the semantics. The possible combinations of these metrics and text representations are endless.

In this evolving landscape, where the quest for the perfect model persists, one truth remains clear: the choice of the best method is inherently tied to the unique demands of each individual use case. As we chart the course forward, the synergy between representation learning and distance calculation paves the way for ever-improving text vectors, heralding a new era in our understanding of semantic similarity.

References

[1] Qiu, Xipeng, et al. “Pre-trained models for natural language processing: A survey.” Science China Technological Sciences 63.10 (2020): 1872–1897.

[2] Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).

[3] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[4] Wang, Jiapeng, and Yihong Dong. “Measurement of text similarity: a survey.” Information 11.9 (2020): 421.

[5] Kusner, Matt, et al. “From word embeddings to document distances.” International conference on machine learning. PMLR, 2015.

[6] [Online]. Available: https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python.

[7] McCormick, Chris. “Word2vec tutorial - the skip-gram model.” Apr-2016. [Online]. Available: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model (2016).
