NLP and Machine Learning for Document Similarity and Recommendations

How natural language processing can help find and recommend similar texts

Thomas Wood
Fast Data Science
4 min read · Oct 16, 2023


Finding Similar Documents with Natural Language Processing

NLP and ML for finding similar documents

Have you ever wanted to find other documents in a database that are most similar to a given document? This is referred to as the document similarity problem, or the semantic similarity problem.

Imagine the following scenarios:

These are just a few examples, but the possibilities are endless when you use natural language processing to solve your document similarity problem.

Before you dive into the technical aspect, it’s crucial to define the problem and identify what you need your document similarity model to achieve.

What Does Your Document Similarity Model Need to Achieve?

This is one question that needs to be answered before you can build an NLP model to calculate document similarity. Typically, there won’t be a pre-existing dataset showing which documents are similar to others.

Before carrying out any data science work, it would be prudent to generate some data that you can use later to test and evaluate your model.

In some cases, it might be impossible to build a dataset that can evaluate your document similarity model. Subjectivity plays a part here, but it’s still possible to present some basic model recommendations to the stakeholders for evaluation.

Appraising a Document Similarity Model

There are several metrics you can use to judge a document similarity model. One well-known example is the Mean Average Precision (MAP). It can evaluate a search engine's recommendation quality, and it penalises models that rank relevant documents near the bottom of the list.

To get started, try using the mean average precision to evaluate your models on your gold standard dataset.
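As a concrete starting point, here is a minimal sketch of mean average precision in plain Python. The gold-standard data at the bottom is hypothetical, purely for illustration: for each query document, a human has judged which database documents are actually similar.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k taken
    at each rank k where a relevant document appears."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(rankings, relevance):
    """MAP over several queries: `rankings` and `relevance` are
    parallel lists (one ranked list and one relevant set per query)."""
    return sum(
        average_precision(r, rel) for r, rel in zip(rankings, relevance)
    ) / len(rankings)

# Hypothetical gold standard: two queries, one relevant document each.
rankings = [["d3", "d1", "d2"], ["d2", "d3", "d1"]]
relevance = [{"d3"}, {"d1"}]
print(mean_average_precision(rankings, relevance))
```

A model that puts the relevant document first scores 1.0 for that query; burying it at rank 3 drops the score to 1/3, which is exactly the penalty for late-ranked relevant documents described above.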

Bag of Words Approach to Document Similarity

Illustration of the Jaccard document similarity index calculation as a Venn diagram

The ‘bag of words’ model is perhaps the simplest way to compare two documents: just calculate the word overlap. The name ‘bag of words’ comes from the fact that the words are collected together in a ‘bag’, losing their sentence context.

One instance would be comparing these two sentences:

“India is one of the epicentres of the global diabetes mellitus pandemic.”

“Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed.”

Here, you can compute the Jaccard similarity index. First, remove stopwords like ‘the’, ‘and’ etc. Then divide the number of words that appear in both documents (the intersection) by the number of distinct words that appear in either document (the union). This gives you the Jaccard index.
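The calculation can be sketched in a few lines of Python. The stopword list here is a small illustrative one, not a full NLP stopword list, and the tokeniser is deliberately naive:

```python
# A small, illustrative stopword list (a real system would use a fuller one).
STOPWORDS = {"is", "one", "of", "the", "and", "in", "a", "an"}

def tokenise(text: str) -> set:
    """Lowercase, strip basic punctuation, drop stopwords, return a word set."""
    words = text.lower().replace(".", "").replace(",", "").split()
    return {w for w in words if w not in STOPWORDS}

def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard index: intersection over union of the two word sets."""
    a, b = tokenise(doc_a), tokenise(doc_b)
    return len(a & b) / len(a | b)

sentence_1 = "India is one of the epicentres of the global diabetes mellitus pandemic."
sentence_2 = "Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed."
print(jaccard(sentence_1, sentence_2))
```

For the two sentences above, the shared words after stopword removal are ‘diabetes’ and ‘mellitus’, out of twelve distinct words in total, giving a Jaccard index of 2/12.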

Despite their straightforwardness, bag-of-words models such as the Jaccard index and the related cosine similarity are powerful because of their speed and simplicity.

N-gram Document Similarity

The disadvantage of a bag-of-words approach is that it throws away contextual information and will treat ‘diabetes’ and ‘mellitus’ as independent terms. We can address this with the N-gram approach. In this strategy, all two-word or three-word sequences (bigrams or trigrams) are indexed, and we calculate the Jaccard similarity index on word groups instead of individual words.
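Adapting the earlier sketch to bigrams only takes a small change: tokenise into consecutive word pairs instead of single words, then apply the same intersection-over-union formula. (Stopword removal is omitted here to keep the word sequences intact.)

```python
def ngrams(text: str, n: int = 2) -> set:
    """Return the set of n-word sequences (n-grams) in a text."""
    words = text.lower().replace(".", "").replace(",", "").split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_jaccard(doc_a: str, doc_b: str, n: int = 2) -> float:
    """Jaccard index computed over n-grams instead of single words."""
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "India is one of the epicentres of the global diabetes mellitus pandemic."
s2 = "Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed."
print(ngram_jaccard(s1, s2))
```

With bigrams, the phrase ‘diabetes mellitus’ survives as a single unit shared by both sentences, so the model now knows the two sentences share a two-word term rather than two unrelated words.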

Doc2Vec: Represent Documents as Vectors

Moving upwards in complexity and performance, we can use document vector embeddings. In this method, each document is represented as a vector. The distance between these vectors gives a measure of the similarity between the documents.

However, implementing this approach can be complex. It requires deep knowledge of Natural Language Processing and Machine Learning fundamentals.

The simplest way to use document vectors for similarity is to use an off-the-shelf model such as Sentence-BERT (available via Hugging Face), or OpenAI’s embeddings API. You can read more about how we achieved this in the Harmony project here.
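Whichever embedding model you choose, the comparison step is the same: compute the cosine similarity between the two document vectors. Below is a minimal sketch; the commented-out `sentence_transformers` usage is illustrative (the model name `all-MiniLM-L6-v2` is an example, not necessarily what was used in Harmony), while the cosine function itself is plain Python.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors:
    dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical usage with an off-the-shelf embedding model:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name
#   vec_a, vec_b = model.encode([doc_a, doc_b])
#   similarity = cosine_similarity(vec_a, vec_b)

# Toy vectors pointing in the same direction have similarity 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Documents with similar meaning receive vectors pointing in similar directions, so a cosine similarity close to 1 indicates semantically similar texts even when they share few exact words.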

For more information about finding similar documents using Natural Language Processing and Machine Learning, visit the Fast Data Science website.
