Yet even faster string comparison!

Max Baak
7 min read · May 26, 2024


Much improved version of sparse_dot_topn now released, check out v1.1.1

TL;DR

Our open-source algorithm sparse_dot_topn, which helps to do large-scale string comparison via optimised sparse matrix multiplication, has undergone a major revision. It is now yet even faster, uses less memory, and can multiply even larger matrices. We are excited to announce version 1.1.1 of the package!

This post shares the relevant updates. In summary, this version:

  • Is up to 6 times faster than before;
  • Reserves far less memory in the process;
  • Can handle the multiplication of arbitrarily large sparse matrices by chunking and zipping them.

Why do we need this?

The algorithm sparse_dot_topn forms the core of our recently open-sourced Entity Matching Model package. At ING Wholesale Banking Advanced Analytics we use these packages to do name matching when joining very large datasets.

In a nutshell, the problem these packages solve is matching (company) names between two possibly very large datasets. See the figure for an example. sparse_dot_topn allows us to calculate the cosine similarity of each external name to all names in the ground-truth dataset (grey arrows), and then select the best match (orange arrows).
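
As a minimal sketch of such a setup (the names below are made up, and the use of scikit-learn's TfidfVectorizer is our own assumption of a typical vectorisation step):

from sklearn.feature_extraction.text import TfidfVectorizer
from sparse_dot_topn import sp_matmul_topn

# Hypothetical example names; any vectorisation producing L2-normalised sparse rows works.
ground_truth = ["ING Bank N.V.", "ACME Corporation", "Example Industries Ltd."]
external = ["ING Bank", "Acme Corp"]

# Character n-grams make the match robust to small spelling differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
gt_vecs = vectorizer.fit_transform(ground_truth)  # TF-IDF rows are L2-normalised,
ext_vecs = vectorizer.transform(external)         # so dot products are cosine similarities.

# Similarity of each external name to all ground-truth names; keep only the best match.
C = sp_matmul_topn(ext_vecs, gt_vecs.T, top_n=1, threshold=0.1, sort=True)
for i, name in enumerate(external):
    if C.indptr[i] < C.indptr[i + 1]:  # guard against rows without a match
        j = C.indices[C.indptr[i]]
        print(f"{name!r} -> {ground_truth[j]!r} (score {C.data[C.indptr[i]]:.2f})")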

The Python library sparse_dot_topn was developed and open-sourced seven years ago by ING Wholesale Banking Advanced Analytics to perform large-scale string comparison via optimised sparse matrix multiplication. It is an optimised adaptation of scipy's sparse matrix multiplication code, followed by selection of the top-n multiplication results. We have posted about sparse_dot_topn before; see here for a previous speed update, and here for a detailed technical description of how sparse_dot_topn works.

The release of Python 3.12 forced us to do a major revision of the sparse_dot_topn code. We used this opportunity to increase its speed even further, reduce its memory consumption, and add support for multiplying even larger sparse matrices by chunking them. Here are the relevant details.

Speed and memory updates

sparse_dot_topn now provides a new (parallelised) sparse matrix multiplication implementation that integrates the selection of the top-n values, resulting in a significantly lower memory footprint and improved performance.

Let matrix A have shape n × k and matrix B shape k × m, such that the product C = AB has shape n × m.

v0 of sparse_dot_topn selected the top-n products for a given row in A in two steps:

  1. collect all columns where product > threshold into a vector,
  2. partial argsort of the collected values.

Because it is impossible to determine a priori the number of products that exceed the threshold, we needed to allocate a vector of size m.

Under the reasonable assumption that m >> top-n, this means we may be storing and sorting many values that are well outside the top-n.
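
In Python-like pseudocode, the v0 selection for a single row of products looked roughly like this (a sketch of the idea, not the actual C++ implementation):

import numpy as np

def topn_v0(row_products: np.ndarray, threshold: float, top_n: int) -> np.ndarray:
    # Step 1: collect all products above the threshold; in the worst case
    # this needs scratch space for all m columns.
    candidates = row_products[row_products > threshold]
    if len(candidates) <= top_n:
        return np.sort(candidates)[::-1]
    # Step 2: partial (arg)sort; only the top_n largest values are kept.
    idx = np.argpartition(candidates, -top_n)[-top_n:]
    return np.sort(candidates[idx])[::-1]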

What we need is a data structure that is:

  • fixed size,
  • cheap to replace the minimum value,
  • cheap to determine the minimum value.

Min-heap

A min-heap is a priority queue, typically implemented as a binary heap, where each node is smaller than or equal to its descendants.

It satisfies all our requirements, and with a memory footprint of only the top-n+1 values it is much more efficient.

Inserting a new value consists of two steps:

  • popping minimum value,
  • pushing the new value.

Popping the minimum value entails moving the minimum, stored at the head of the container, to the back of the container.

The position of the minimum value is then filled with the minimum of its two children, whose position is filled with the minimum of its two children, and so on, until the heap property is restored.

Pushing onto a heap follows a similar procedure: the new value is appended at the back of the container and swapped upwards until its parent is smaller. A code sketch of the combined pop-push is given below.

Reconstructing the min-heap following a pop-push has a complexity of O(log top-n), and the minimum detection happens in constant time.
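
A minimal pure-Python sketch of such a pop-push (the real implementation lives in C++; replace_min is a hypothetical helper name):

def replace_min(heap: list, new_value: float) -> None:
    # Overwrite the minimum at the head, then sift it down: swap with the
    # smallest child until both children are larger (or we reach a leaf).
    heap[0] = new_value
    i, n = 0, len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            break
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest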

The new algorithm consists of three steps for a given row in A:

  1. fill the min-heap with the lowest values representable by type T;
  2. for each product, if product > min_heap:
     a. pop the heap,
     b. push the product onto the heap;
  3. update min_heap = min(heap).

Given that the vast majority of products will be smaller than the minimum heap value, few products will have to be pushed onto the heap.
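
For illustration, here is a minimal Python sketch of this selection step for a single row, using the standard heapq module (the actual implementation is in C++ and works directly on the sparse products):

import heapq

def topn_v1(products, top_n):
    # Step 1: seed the heap with the lowest possible values, so heap[0]
    # acts as the running acceptance threshold.
    heap = [float("-inf")] * top_n
    min_heap = heap[0]
    for p in products:
        # Step 2: most products fail this single comparison and are skipped.
        if p > min_heap:
            heapq.heapreplace(heap, p)  # pop the minimum, push p
            # Step 3: update the running minimum.
            min_heap = heap[0]
    return sorted(heap, reverse=True)

print(topn_v1([0.2, 0.9, 0.1, 0.7, 0.4, 0.8], top_n=3))  # [0.9, 0.8, 0.7]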

Note that this data structure has a worse worst-case time complexity than the previous approach, but increasingly better time complexity as the ratio top-n / m becomes smaller. Additionally, the worst-case scenario is quite unusual: top-n has to be equal to m, and the columns of B have to be ordered in such a way that the products arrive in ascending order.

Performance evaluation of the two methods confirms that, in practice, the tipping point lies at a top-n between 100 and 1000 for m = 100k. The new approach is on average 1.2 times faster (median: 1.06), with a best case of 2.4 times faster (top-n = 10) and a worst case of 1.2 times slower (top-n = 1000).

Parallelism

Sparse matrix multiplication in the compressed sparse row (CSR) format is embarrassingly parallel, as the inner product of each row in A can be computed independently. In practice, however, parallelism is typically limited by memory pressure rather than a lack of compute.

The lower memory footprint of the min-heap enables more aggressive threading. The multi-threaded implementation has been rewritten using OpenMP and now avoids a costly try-except memory allocation procedure that was required by the previous approach.

As a result, performance now scales much more closely to linear with the number of cores than before.
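
In the Python API this is exposed via a thread-count argument; a small sketch (we assume here that the n_threads argument maps to the number of OpenMP threads):

import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn

rng = np.random.default_rng(42)
A = sparse.random(20_000, 193_000, density=0.001, format="csr", dtype=np.float32, random_state=rng)
B = sparse.random(20_000, 193_000, density=0.001, format="csr", dtype=np.float32, random_state=rng)

# Each thread computes the top-n products for its own block of rows of A.
# n_threads: assumed to control the number of OpenMP threads used.
C = sp_matmul_topn(A, B.T, top_n=10, threshold=0.01, sort=True, n_threads=8)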

Combined performance

On an Apple M2 Pro, multiplying two 20k × 193k TF-IDF matrices, sparse_dot_topn can be up to 6 times faster (!) than the old version (v0.3.6) when retaining the top-10 values per row and utilising 8 cores.

Distributed sparse matrix multiplication

The top-n multiplication of two large O(10M+) sparse matrices can now easily be distributed over a cluster, by breaking the sparse matrices down into smaller chunks.

For example, one may want to split the sparse matrices into matrices with just 1M rows each, and do the (top-n) multiplication of all those matrix pairs separately. Reasons to do this are to reduce the memory footprint of each pair and to employ available distributed computing power.

The pairs can be distributed and calculated over a cluster (e.g. we ourselves use a Spark cluster). The resulting matrix products are then zipped and stacked in order to reproduce the full matrix product.

Here's an example of how to do this, where we match 1000 rows in sparse matrix A against 600 rows in sparse matrix B, and both A and B are split into chunks.

import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn, zip_sp_matmul_topn

# 1a. Example matching 1000 rows in sparse matrix A against 600 rows in sparse matrix B.
rng = np.random.default_rng(42)
A = sparse.random(1000, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)
B = sparse.random(600, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)

# 1b. Reference full matrix product with top-n
C_ref = sp_matmul_topn(A, B.T, top_n=10, threshold=0.01, sort=True)

# 2a. Split the sparse matrices. Here A is split into five parts, and B into three parts.
As = [A[i*200:(i+1)*200] for i in range(5)]
Bs = [B[:100], B[100:300], B[300:]]

# 2b. Perform the top-n multiplication of all sub-matrix pairs, here in a double loop.
# E.g. all sub-matrix pairs could be distributed over a cluster and multiplied there.
Cs = [[sp_matmul_topn(Aj, Bi.T, top_n=10, threshold=0.01, sort=True) for Bi in Bs] for Aj in As]

# 2c. top-n zipping of the C-matrices, done over the index of the B sub-matrices.
Czip = [zip_sp_matmul_topn(top_n=10, C_mats=Cis) for Cis in Cs]

# 2d. stacking over zipped C-matrices, done over the index of the A sub-matrices
# The resulting matrix C equals C_ref.
C = sparse.vstack(Czip, dtype=np.float32)

Step 2a splits the matrices A and B into chunks. Step 2b performs the top-n multiplication of all sub-matrix pairs; in practice these pairs could be distributed over a cluster and multiplied there. Steps 2c and 2d zip and stack the sub-matrix products respectively, resulting in the full matrix product.

Migrating to v1

sparse_dot_topn v1 is a significant change from v0.*, with new bindings and a new API. For users of the old version, detailed instructions on how to switch to this new version are provided here.

Conclusion

This new version of sparse_dot_topn is significantly faster, uses less memory, and adds chunked sparse matrix multiplication. Together these allow for even larger-scale string comparisons. We invite you to try it out and are happy to hear your feedback!

https://github.com/ing-bank/sparse_dot_topn

pip install -U sparse_dot_topn

Credits

Coauthor: Ralph Urlus

The code updates were authored by ING Analytics Wholesale Banking, in particular by Ralph Urlus and Max Baak.

Thanks to Nikoletta Bozika for reviewing this blog.
