How we reduced our text similarity runtime by 99.96%

Omri Mendels
Data Science at Microsoft
7 min read · Mar 23, 2021

In many NLP pipelines, we wish to compare a query to a set of text documents. The process usually involves some encoding or embedding of the query, followed by a similarity measurement (for example, cosine similarity) between the query and each document.
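As a minimal illustration (with made-up vectors standing in for real embeddings), cosine similarity is just a normalized dot product between the query vector and each document vector:

```python
import numpy as np

# Toy example: one query embedding and three document embeddings (made-up numbers).
query = np.array([0.2, 0.7, 0.1])
documents = np.array([
    [0.1, 0.8, 0.2],
    [0.9, 0.1, 0.0],
    [0.3, 0.3, 0.4],
])

# Cosine similarity = dot product of L2-normalized vectors.
query_norm = query / np.linalg.norm(query)
docs_norm = documents / np.linalg.norm(documents, axis=1, keepdims=True)
similarities = docs_norm @ query_norm
print(similarities)  # one score per document, in [-1, 1]
```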

This article describes one aspect of a full system developed by the Commercial Software Engineering (CSE) team at Microsoft, together with developers and researchers from University College London’s EPPI Centre. The UCL EPPI Centre develops intelligent tools for Systematic Literature Reviews, used by researchers who conduct analyses in many public health-related domains to come up with evidence-based suggestions for policymakers. Our main objective in this project was to keep reviews up to date by proposing new scientific papers to review owners. For more information on the project, see this blog post.

Problem formulation

In our setting, each review holds a set of papers already collected, and we wish to propose relevant new papers that the researchers might be interested in reviewing and including in their reviews. More formally, the problem is the following:

Let’s say we have three reviews and a stream of new papers. We want to compare each new paper to each review to come up with a similarity measure and a recommendation:

In our production setting, we are likely to get between 100,000 and 1.5 million new papers to score. The requirement is that this process take no more than a couple of hours.

Modeling approach

Our dataset comprised a title and abstract for each paper. One of the most accurate approaches we looked into was TF/IDF at the paper level within each review. In this setting, we compare the new paper’s TF/IDF vector to the TF/IDF vectors of all or some of the papers in each review. We then aggregate the paper-to-paper similarities to get a similarity metric between the new paper and an entire review; for example, the mean similarity to the 10 closest neighbors. This approach turned out to be more accurate than methods like Universal Sentence Encoder, SciBERT, SIF, FastText and others.
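As an illustrative sketch (the name and structure are ours, not the production code), the aggregation step can be as simple as taking the mean of the k largest paper-to-paper similarities:

```python
import numpy as np

def review_score(similarities: np.ndarray, k: int = 10) -> float:
    """Mean similarity to the k closest papers in a review (illustrative helper)."""
    k = min(k, similarities.size)
    return float(np.sort(similarities)[-k:].mean())

# e.g. review_score(np.array([0.1, 0.9, 0.4, 0.8]))  -> mean of all four when k > 4
```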

For pre-processing, we used spaCy to perform actions like tokenization, removal of stop words and special characters, and lemmatization. See this gist for more details.
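The gist isn’t reproduced here, but a minimal sketch of that kind of spaCy pre-processing (assuming the small English model, en_core_web_sm, is installed) might look like this:

```python
import spacy

# Disable pipeline components we don't need; tokenization, stop words,
# and lemmas are enough for TF/IDF.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text: str) -> str:
    """Lowercase, lemmatize, and drop stop words, punctuation, and non-alphabetic tokens."""
    doc = nlp(text.lower())
    tokens = [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]
    return " ".join(tokens)

print(preprocess("Systematic reviews of public-health interventions were analysed."))
```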

Inference runtime improvements

The pseudo-code for inference is the following:

Inference phase pseudo-code
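The original pseudo-code was embedded as an image; roughly, the inference phase amounts to the following sketch (toy data and our own naming, not the production code):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real data: a few reviews, each holding its papers' text,
# plus a stream of new candidate papers.
reviews = {
    "review_a": ["first paper title and abstract", "second paper title and abstract"],
    "review_b": ["a paper about something else", "another unrelated paper"],
}
new_papers = ["a new candidate paper title and abstract"]

# Fit the vectorizer once, then build one TF/IDF matrix per review.
vectorizer = TfidfVectorizer().fit([t for texts in reviews.values() for t in texts])
review_matrices = {name: vectorizer.transform(texts) for name, texts in reviews.items()}

for paper in new_papers:
    vec = vectorizer.transform([paper])
    for name, matrix in review_matrices.items():
        sims = cosine_similarity(vec, matrix).ravel()
        k = min(10, sims.size)
        score = float(np.sort(sims)[-k:].mean())  # mean similarity to the k closest papers
        print(f"{name}: {score:.3f}")
```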

When the number of reviews is large, and the number of papers in each review is also large, this process can take a lot of time. In fact, the first time we ran the model as it was developed during research, it took 16 seconds on average to compare one test sample to all reviews, which translates to two weeks of processing for the number of test papers we have.

We had to come up with a way to improve it. We first highlighted our options:

  1. Faster pre-processing
  2. Improvements to TF/IDF vectorization
  3. Sampling papers from each review
  4. Reducing the number of candidate test papers by using a rule-based logic that detects irrelevant papers (for instance, using other paper attributes like field of study)
  5. Cosine distance calculation improvements (implementation and input representation)
  6. Distributed inference (using Spark)

As it turned out, the runtime could be improved significantly. After evaluating and benchmarking our system, we concluded that the biggest impact would come from representing the input matrices differently and selecting the right cosine distance implementation (point 5). Here’s how!

Cosine distance implementation

We looked at two main implementations: scikit-learn’s cosine_similarity and SciPy’s cdist. There are others, but these two are interesting from two main perspectives:

  1. Data representation: In sklearn, the cosine_similarity function can accept a sparse matrix (CSR matrix) composed of a set of input vectors. Sparsity should be an advantage when the dataset is large. cdist, on the other hand, accepts ndarrays only.
  2. Implementation: The two packages implement the computation differently: cdist is implemented entirely in C, while cosine_similarity uses NumPy operations.

Which one is better? The answer isn’t obvious and depends on input size and representation.
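To make the trade-off concrete, here is a small sketch (with random data at illustrative sizes) of the two calls side by side; note that cdist returns a distance, so similarity is one minus that value:

```python
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse TF/IDF-like data at illustrative (small) sizes.
docs = scipy.sparse.random(2_000, 10_000, density=0.01, format="csr", random_state=0)
query = scipy.sparse.random(1, 10_000, density=0.01, format="csr", random_state=1)

# scikit-learn: operates directly on the sparse CSR matrices.
sims_sklearn = cosine_similarity(query, docs)                    # shape (1, 2000)

# scipy: cdist needs dense ndarrays, so the data must be densified first.
dists = cdist(query.toarray(), docs.toarray(), metric="cosine")  # shape (1, 2000)
sims_scipy = 1.0 - dists
```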

Input representation

Cosine similarity is essentially a normalized dot product. So one question is how each input matrix is represented. In our setting, there are three main options:

  1. Compare each input vector (test paper TF/IDF) to each vector of each paper in each review.
  2. Compare each input vector to a matrix comprised of one review’s papers vectors.
  3. Compare each input vector to a global matrix comprised of all papers in all reviews.

This figure outlines the difference among the three configurations:

Three different ways to calculate cosine distance
A new test paper could be compared to each paper from each review (1), to all papers in each review (2), or to all papers in all reviews (3). In each configuration, the cosine distance calculation takes place a different number of times on different matrix sizes.

In order to examine which configuration would be best, we created a short Python script that fits the TF/IDF vectorizer on the full training set and performs inference on a relatively small number of test samples. We then ran it on about 20 different configurations and calculated the average time it took to process one test sample.
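The harness itself can be very simple; this is a sketch of the kind of helper we mean (names and structure are illustrative, not the actual script):

```python
import time
import numpy as np

def average_runtime(score_fn, test_samples, repeats=3):
    """Average per-sample runtime of one configuration's scoring function."""
    per_sample = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in test_samples:
            score_fn(sample)
        per_sample.append((time.perf_counter() - start) / len(test_samples))
    return float(np.mean(per_sample))

# Usage (hypothetical):
# results = {name: average_runtime(fn, test_samples) for name, fn in configurations.items()}
```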

The following graph shows the average runtime for different configurations:

Runtime for different configurations

It turns out that searching for the optimal configuration is not a linear process. Here’s a detailed description per development iteration:

  1. We started with the sklearn implementation, with pre-processing repeated for each class, as this is how the model was initially developed.
  2. In iteration 2 we made pre-processing run only once per sample.
  3. In iteration 3 we switched the cosine calculation to scipy, bringing the average runtime down from about 12 seconds to about two.
  4. In iteration 4 we created one matrix per review, which improved the runtime to about 1.5 seconds, but that wasn’t good enough.
  5. The same configuration with the sklearn implementation (iteration 5) was slower.
  6. In iteration 6 we moved to a global matrix (all training samples in one big matrix), still using cdist as the similarity function.
  7. In iteration 7 we returned to sklearn’s implementation, even though it had been slower in every previous configuration. With the global matrix it turned out to be extremely efficient compared to previous iterations.
  8. In iteration 8 we introduced rule-based pre-filtering on field-of-study values to reduce the number of candidates.

In conclusion, from step 1 to step 8 we reduced the runtime per sample by 99.96%, even before leveraging parallel processing. Using a global matrix made little difference for scipy (iterations 4 and 6), but was much faster for sklearn (0.059 seconds per request, iteration 7). In the global matrix configuration (iterations 6, 7, 8), there is only one vector-by-matrix multiplication for each test sample.
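A sketch of that global-matrix configuration (again with our own naming, building on the per-review matrices from the earlier sketch): stack every review’s TF/IDF rows into one sparse matrix, remember which rows belong to which review, and run a single sparse vector-by-matrix computation per test sample:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.metrics.pairwise import cosine_similarity

# review_matrices: dict of review name -> sparse TF/IDF matrix, as in the earlier sketch.
review_names = list(review_matrices)
global_matrix = vstack([review_matrices[name] for name in review_names]).tocsr()
row_counts = [review_matrices[name].shape[0] for name in review_names]
boundaries = np.cumsum(row_counts)

def score_against_all_reviews(vec, k=10):
    """One cosine computation against the global matrix, then split the result per review."""
    sims = cosine_similarity(vec, global_matrix).ravel()
    scores, start = {}, 0
    for name, end in zip(review_names, boundaries):
        review_sims = sims[start:end]
        kk = min(k, review_sims.size)
        scores[name] = float(np.sort(review_sims)[-kk:].mean())
        start = end
    return scores
```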

Our analysis shows that the sparse representation was responsible for the major runtime improvement in iteration 7, when calculating cosine similarity between a test vector and a large matrix comprised of all samples in all classes. When the matrix was smaller (because of a smaller dataset or the matrix-per-review configuration), the overhead of handling sparsity made this implementation slower than the scipy option.

Additional changes we looked into were parameters to the TfidfVectorizer (like max_features to reduce the vector size), the sample size from each class, and whether to apply rule-based pre-filtering.
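For example, capping the vocabulary size is a one-parameter change (the value below is illustrative, not the one we used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the most frequent terms to shrink the TF/IDF vectors;
# the right cap is a runtime/accuracy trade-off.
vectorizer = TfidfVectorizer(max_features=20_000)
```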

Tools

Profiling

In addition to running each configuration multiple times, profiling is an important aspect of runtime improvement. We used SnakeViz, an awesome Python profiling visualizer.
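A typical workflow (score_all_samples is a hypothetical entry point, not code from the project) is to dump cProfile stats to a file and open them with SnakeViz:

```python
import cProfile
import pstats

# Profile one end-to-end scoring run and write the stats to disk.
cProfile.run("score_all_samples(test_samples)", "inference.prof")

# Quick textual summary; for the interactive view, run `snakeviz inference.prof`
# from the command line.
pstats.Stats("inference.prof").sort_stats("cumulative").print_stats(20)
```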

Measuring time

We found a simple context manager to be a neat way of measuring time:
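The gist isn’t embedded here; a minimal equivalent looks roughly like this:

```python
import time
from contextlib import contextmanager

@contextmanager
def measure_time(label: str):
    """Print how long the wrapped block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f} s")

# Usage:
# with measure_time("cosine similarity"):
#     sims = cosine_similarity(vec, global_matrix)
```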

Conclusions

Performance improvements are always tricky. The recipe of creating a small, representative data sample and evaluating multiple configurations worked for us. Because these improvement steps are usually non-linear, changing one thing at a time might make you miss the optimal configuration.

Because we’re dealing with probabilistic methods, the accuracy of the model in hand should also be taken into account as there are obvious trade-offs between runtime and classification accuracy. In our experiments we always verified that the model’s accuracy was not reduced due to the changes in the model’s logic.

These performance improvements, in addition to leveraging Spark for parallel processing, reduced the runtime of the model on 500,000 test samples from days to less than two hours. Goal reached!

Special thanks to James Thomas, Paolo Tenti, Katya Mustafina, Nava Vaisman Levy and Ehsan Zare Borzeshi.

Want to know more about how you can set up your Machine Learning projects for success? Read our recent article:

Omri Mendels is on LinkedIn.
