Even faster string comparison!

Stéphane Collot
inganalytics.com/inganalytics
9 min read · Jun 18, 2021

New version of sparse_dot_topn 0.3.1 released

TL;DR

ING Wholesale Banking Advanced Analytics developed and open-sourced an algorithm — called sparse_dot_topn — that helps perform large-scale string comparison via optimised sparse matrix multiplication. We use this package internally for name matching, to join different datasets.
We are excited to announce the new version 0.3.1 of the package!

In summary, this version:

  • is 1.5 to 4.0 times as fast as before (see details below)
  • uses considerably less memory (~38% less)
  • computes the top-n value necessary to capture all results above a given minimum threshold of the scalar product

Thanks to these optimisations, we can run name matching faster and with less memory, which allows us to make the matching robust to spelling mistakes.

Why do we need this at ING?

At ING we are trying to compare and match a lot of names — a process called string matching. In general, string matching is useful when you’re trying to join two datasets, and there is no exact key to join on. For example, at ING we are obliged to monitor the names on all transactions to see if these match with international watch lists. Or, and that’s how we use string matching here, we look at names on transactions to non-ING accounts, and try to match these with known (Dutch) organisations, to get a holistic view of our clients.

Our “ground truth” is a list of about 10 million unique company names that we are trying to match other names against. The list of “names to match” contains about 20 million unique names, each of which we try to match to a single name in the ground truth. String matching is therefore a quadratic process, resulting in 10M x 20M = 2.0 x 10^14 name comparisons. That’s A LOT of comparisons: if one comparison took a second, going through the full set would take more than 6 million years. Clearly, the vast majority of these comparisons are irrelevant, but ideally one scans through all of them to find the best match.

This is where sparse_dot_topn comes in: it helps us do string comparisons very quickly and save only the best matches. When names are split into words and compared word by word, sparse_dot_topn can do 2.0 x 10^14 name comparisons in only a couple of hours on our distributed Spark cluster!

sparse_dot_topn was developed by ING’s Wholesale Banking Advanced Analytics department to help solve this type of large-scale name matching problem at the bank, and was open-sourced as a Python package a couple of years ago. A detailed technical description of how sparse_dot_topn works can be found in a previous Medium post.

So why do we need this optimised version, and what limitation does it solve? Let us explain below.

What is sparse_dot_topn?

sparse_dot_topn is a Python package that computes the product of two large sparse matrices and keeps only the top-n highest values per row of the result.

It is a very helpful computational tool in Natural Language Processing for finding similar strings, based on the cosine similarity of their vectorised representations. Remember that cosine similarity is simply the dot product of the normalised vectors:

The link between cosine similarity and the dot product
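In formula form (a standard identity, written out here for reference rather than taken from the figure): for two name vectors a and b,

cos_sim(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \hat{a} \cdot \hat{b}, \quad \text{where } \hat{a} = \frac{a}{\lVert a \rVert} \text{ and } \hat{b} = \frac{b}{\lVert b \rVert}.

So if the rows of A and B are L2-normalised, the full matrix of pairwise cosine similarities is simply the matrix product of A with the transpose of B.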

For example, A could be a list of names that you want to match with another list of names B. You want to find, for each name in A, the top-n most similar names in B. First, we vectorise all names in both lists via a classical preprocessing pipeline: word or n-gram tokenisation, followed by TF-IDF. Each name is then represented by a vector of length K, the number of distinct tokens in the full dataset. Since each name contains only a few tokens (i.e. a few words or a few n-grams), the resulting matrix is very sparse.
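A minimal sketch of such a vectorisation step, using scikit-learn's TfidfVectorizer (an illustration only; the exact preprocessing pipeline used at ING may differ):

from sklearn.feature_extraction.text import TfidfVectorizer

names_a = ["ing bank nv", "albert heijn bv", "koninklijke philips nv"]  # names to match
names_b = ["ing groep nv", "ahold delhaize", "philips electronics"]     # ground truth

# Word tokenisation with TF-IDF weighting; rows are L2-normalised by default,
# so dot products between rows are cosine similarities.
vectorizer = TfidfVectorizer(analyzer="word").fit(names_a + names_b)
A = vectorizer.transform(names_a)   # sparse CSR matrix of shape (M, K)
B = vectorizer.transform(names_b)   # sparse CSR matrix of shape (N, K)

# For character 2-gram tokenisation one would instead use, for example:
# vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 2))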

So we have two sparse matrices: A of size (M, K) and B of size (N, K), where K is the dimension of the feature space. The product of A with the transpose of B is a very large matrix C of size (M, N). Notice that this naively requires a quadratic number of vector multiplications, M x N of them. But we are interested only in the top-n most similar dot products per row, and that is exactly what sparse_dot_topn computes. It applies two crucial optimisations, implemented directly in C++ (a usage sketch follows the list below):

  • First, we do the computation only for indices that have a value in both vectors. This means that the sparser the two matrices are, the faster sparse_dot_topn gets: it scales with the number of non-zero elements in the matrices.
  • Second, we keep only the top-n, so throughout the execution we hold at most M x n elements in memory, each consisting of an index into matrix B and the corresponding scalar-product value.
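In code, using the package looks roughly as follows (a minimal sketch, continuing from the vectorisation example above; awesome_cossim_topn with its ntop and lower_bound arguments is the 0.3.x API, but do check the package README for the exact signature):

from sparse_dot_topn import awesome_cossim_topn

# A is the (M, K) matrix of names to match, B the (N, K) ground-truth matrix.
# awesome_cossim_topn expects the second operand with shape (K, N), hence B.T.
# It keeps at most `ntop` results per row, and only values above `lower_bound`.
C = awesome_cossim_topn(A, B.T.tocsr(), ntop=10, lower_bound=0.8)

# C is a sparse (M, N) matrix: C[i, j] is the cosine similarity between
# name i of A and name j of B, stored only for the best matches.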

How big is the matrix in our case?

The size of the matrix depends on the tokeniser.

We can tokenise company names per “word”, or we can tokenise them into pieces of n characters, so-called n-grams. The latter is particularly useful to protect against typos in names: whole words will no longer match up, but most of the n-grams still will, allowing us to find the correctly matching name pairs after all.
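As a toy illustration (not the exact tokeniser used at ING), here is how 2-grams keep two spellings comparable even when a word contains a typo:

def char_ngrams(name: str, n: int = 2) -> set:
    """Return the set of character n-grams of a lower-cased name."""
    name = name.lower().replace(" ", "")
    return {name[i:i + n] for i in range(len(name) - n + 1)}

a = char_ngrams("Limited Liability")
b = char_ngrams("Limited Liabilty")   # typo: missing 'i'

# Word-level comparison fails on the second word, but most 2-grams still overlap:
overlap = len(a & b) / len(a | b)     # Jaccard overlap of the 2-gram sets
print(round(overlap, 2))              # ~0.92, i.e. still a strong match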

Example of string comparison between two company names for word-based and character-based tokenisation

In practice, word-based and character-based tokenisation are complementary techniques, and ideally you want to use both to find matching names. Word-based matching works fast but will not match names with typos in them; for those we prefer 2-gram-based name matching. However, 2-grams result in matrices that are no longer sparse enough, leading to unworkable computing times and memory consumption.

Below is a summary of one of our use cases for different tokenisers. As you can see, going from word to character tokenisation really increases the memory footprint. That is how we got the idea to use a smaller float type and to make sparse_dot_topn work natively with float32.

Characteristics of matrix B for different tokenisers

Under the hood of the upgrade

The latest version of sparse_dot_topn, version 0.3.1, is at least twice as fast and uses considerably less memory. This comes from using a smaller data type for floating-point numbers and better memory allocation. We also make more efficient use of the multiple cores available in most CPUs these days.

Even though sparse_dot_topn is a Python module, it is in reality a wrapper around a compiled function written in C++ which does all the computations. We use Cython to bridge the gap between Python and C++. This enables users to seamlessly incorporate sparse_dot_topn into their Python programs, while also taking advantage of the execution speed of C++, which is often several orders of magnitude faster than pure Python.

In much more technical detail

We actually released two new versions:

Version 0.3.0 contains this pull request:

  • We removed the restriction in the previous version that the available memory before matrix multiplication should at least be able to fit nnz_max elements of the result-matrix, where nnz_max = ntop × number of rows of the result-matrix. For very large and very sparse matrices, this value of nnz_max tended to be a gross overestimate of the size of the result, leading to an out-of-memory exception (MemoryError). This exception preempted the matrix multiplication from executing even though there may have been enough memory available to produce the result. On the other hand, in some other cases, removing this restriction posed the risk of not allocating enough memory to store the result-matrix.
  • To circumvent this risk, we introduced a Cython extension (array_wrappers.pyx/.pxd) that enables C++ to reallocate memory for the output during the matrix multiplication whenever Python’s memory allocation size estimate before the multiplication turns out to be too small.
  • We further auto-parallelized some previously serial code-sections in the multithreaded functions — in particular, the construction of the compressed sparse row-format (CSR) output matrix.
  • We defragmented the memory used during the threaded computation by the temporary STL vector of vectors std::vector<std::vector<candidate>>. Previously, its length was equal to the number of rows of the result-matrix. Now, its length is the (generally much smaller) number of threads.
  • We made some prudent STL vector reservations std::vector::reserve in the threaded version to reduce the incidence of memory reallocations.
  • We can compute, if requested, best_ntop: the minimum value of ntop needed to capture all results above the threshold lower_bound.
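The last point can be used as follows (a hedged sketch: use_threads, n_jobs and return_best_ntop are the keyword names introduced in the 0.3.x pull requests, so check the README if they differ; A and B are sparse CSR matrices as in the earlier examples):

from sparse_dot_topn import awesome_cossim_topn

# Multi-threaded top-n multiplication. With return_best_ntop=True the function
# also returns the smallest value of ntop that would have captured every
# result above lower_bound, which is handy for tuning ntop on a data sample.
C, best_ntop = awesome_cossim_topn(
    A, B.T.tocsr(),
    ntop=4000,
    lower_bound=0.01,
    use_threads=True,
    n_jobs=7,
    return_best_ntop=True,
)
print(best_ntop)   # if this is well below 4000, ntop can safely be lowered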

Version 0.3.1 contains this pull request:

  • We provide the option to use the smaller float data type, numpy.float32 (C/C++ float), instead of the default numpy.float64 (C/C++ double), using C++ templating. Tests have shown that, despite the expected loss of internal numerical precision, using float32 instead of float64 in various NLP applications employing sparse_dot_topn does not produce any significant change in the final results. But as can be seen below, the memory usage drops significantly and the matrix calculation is much faster.
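Taking advantage of this is a one-line change: cast the input matrices to float32 before multiplying (a sketch; A and B are the sparse TF-IDF matrices from the earlier examples):

import numpy as np

# 32-bit values roughly halve the memory of the data arrays and let the
# C++ kernel run in single precision; the output matrix is then float32 too.
A32 = A.astype(np.float32)
B32 = B.astype(np.float32)

C32 = awesome_cossim_topn(A32, B32.T.tocsr(), ntop=10, lower_bound=0.8)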

Benchmarking

We benchmarked the different versions, measuring memory footprint and execution time in the two sections below, with 16 random matrix-pairs A and B with the following characteristics:

ntop = 4 000
lower_bound = 0.01
nr_vocab = 26^3 = 17 576 (the number of columns of A = number of rows of B)
density = 30/nr_vocab = 0.0017
n_samples = 1 000 000 (the number of rows of A)
n_duplicates = 4 000 (the number of columns of B)
nnz(A) = 30 000 000 (the number of nonzero components of A)
nnz(B) = 120 000 (the number of nonzero components of B)
nnz(A*B) = 188 446 798 ± 44 679 (mean and standard deviation of the total number of nonzero components of A*B above the threshold over 16 random matrix-pairs A and B)
ntop(A*B) = 391 ± 15 (mean and standard deviation of the true maximum number of nonzero components per row of A*B above the threshold over 16 random matrix-pairs A and B)
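Such random matrix-pairs can be generated with scipy, for example (a sketch of the benchmark setup rather than the exact script we used; note that matrices of this size need several GB of RAM):

import numpy as np
from scipy.sparse import random as sparse_random

nr_vocab = 26 ** 3              # 17 576
n_samples = 1_000_000           # rows of A
n_duplicates = 4_000            # columns of B
density = 30 / nr_vocab         # ~0.0017

# A is (n_samples, nr_vocab) with ~30 000 000 non-zeros,
# B is (nr_vocab, n_duplicates) with ~120 000 non-zeros.
A = sparse_random(n_samples, nr_vocab, density=density, format="csr",
                  dtype=np.float64, random_state=42)
B = sparse_random(nr_vocab, n_duplicates, density=density, format="csr",
                  dtype=np.float64, random_state=43)

# A and B can then be fed to awesome_cossim_topn as in the earlier sketches.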

Memory profiling

Below are some graphs comparing the memory used as a function of time. These plots were generated with the Python package memory_profiler.
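For reference, such time-resolved measurements can be made with memory_profiler's memory_usage function (a sketch; the actual profiling behind the plots below may have been set up differently):

from memory_profiler import memory_usage
from sparse_dot_topn import awesome_cossim_topn

# Sample the resident memory every 0.1 s while the multiplication runs;
# A and B are the random benchmark matrices generated above.
mem = memory_usage(
    (awesome_cossim_topn, (A, B, 4000, 0.01), {"use_threads": True, "n_jobs": 2}),
    interval=0.1,
)
print(f"peak memory: {max(mem):.0f} MiB, duration: ~{len(mem) * 0.1:.0f} s")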

The beginning of each graph shows the allocation of the two random matrices A and B. The brackets [ ] in different colours then mark the sparse_dot_topn computation for different numbers of threads.

Below we can see the improvement due to the better memory allocation between the old version 0.2.9 and version 0.3.0. Notice that we got rid of one allocation spike at the beginning of each run, and that the run is faster, dropping from 116 seconds down to 84 seconds:

Memory profiling version 0.2.9-float64 with 2 allocation spikes
Memory profiling version 0.3.0-float64 gives a smoother allocation and faster than above

Below we can compare memory usage when using float32. Peak usage went from 8000 MiB down to 5000 MiB in this benchmark, a drop of 38%. In addition, the run time dropped further, down to 63 seconds:

Memory profiling version 0.3.1-float32 gives a smaller memory footprint and faster than above

Timing profiling

Below is a comparison of the average execution time of the different versions over the 16 random matrix-pairs A and B.

Remark:
- Number of threads = 0 means that we are using the single-threaded source code.
- Number of threads = 1 means that we are using the multi-threaded source code, but with only 1 thread.

Here is the same data in a more visual bar chart. As one can see, in this test the new version of sparse_dot_topn is 1.5 to 4.0 times as fast as before:

Conclusion

The new version of sparse_dot_topn is considerably faster and uses far less memory, allowing for even larger scale string comparisons that are typo-proof. We invite you to try it out and are happy to hear your feedback!

https://github.com/ing-bank/sparse_dot_topn

pip install -U sparse_dot_topn

Credits

Coauthor:

Contributors to the code:

Reviewers of the code:

We are also happy to promote a package that will soon be using ours:
string_grouper (maintainer: Chris van den Berg) is a library that makes finding groups of similar strings within a single, or multiple, lists of strings easy — and fast.
