Vectorizing computations on pairs of elements in an nd-array

4 min read · Oct 1, 2021

Consider a very simple contrived problem below. You have an array of numbers:

import numpy as np
a = np.array([0, 10, -3, 5, 7, 20, -9])

and you want to compute the mean absolute difference between each pair of numbers. Let n be the number of elements in a. Then the number of pairs is n(n-1)/2. So a simple approach would be to run over all possible pairs, compute the absolute difference for each pair and then average them. Here is the code:
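
One straightforward way to write such a function, as a minimal sketch consistent with the calls and outputs shown below:

def mean_absolute_difference(a):
    # Loop over every pair (i, j) with i < j, accumulate |a[i] - a[j]|,
    # and average over the n(n-1)/2 pairs.
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(a[i] - a[j])
    return total / (n * (n - 1) / 2)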

If we run it on the array above, we get:

>>> mean_absolute_difference(a)
11.428571428571429

The presence of two for loops in this code reeks of inefficiency. Ideally, we should avoid Python for loops in numerical code as much as possible by vectorizing the algorithm. How do we do that?

The key challenge is to figure out a way to get every pair of elements in a without using any for loop. Can NumPy assist us with it? Yes indeed, thanks to a beautiful function: np.triu_indices.

Let’s see what it does:

>>> i, j = np.triu_indices(7, k=1)
>>> print(np.row_stack((i, j)))
[[0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5]
 [1 2 3 4 5 6 2 3 4 5 6 3 4 5 6 4 5 6 5 6 6]]

If we look carefully at the output, we see that each column gives the indices of one pair of elements in a. In total, there are 21 such pairs.

Let’s now vectorize the mean absolute difference function.
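
A vectorized sketch using triu_indices and fancy indexing:

def vectorized_mad(a):
    # Get the indices of every pair (i, j) with i < j in one shot.
    i, j = np.triu_indices(len(a), k=1)
    # a[i] and a[j] hold the first and second elements of each pair.
    return np.mean(np.abs(a[i] - a[j]))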

The for loops are gone, and we form exactly n(n-1)/2 pairs, so there is no redundant computation.

I compared the performance of the two versions on a MacBook.

a = np.random.randn(1000)
%timeit mean_absolute_difference(a)
500 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vectorized_mad(a)
8.78 ms ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can see a significant jump in performance: roughly a 57× speedup.

While it might be possible to come up with a smarter and faster solution for this specific problem, my goal here is to illustrate the technique for building pairs of elements and performing some computation on each pair.

Pairs in an array of vectors

Let’s take a more challenging example: computing pairwise Euclidean distances among an array of vectors.
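
A minimal sketch of such a function, assuming a is a (num, dim) array whose rows are the vectors:

def pairwise_distance(a):
    # Pair the rows of a via triu_indices, then take the Euclidean norm
    # of each difference vector along the coordinate axis.
    i, j = np.triu_indices(len(a), k=1)
    return np.linalg.norm(a[i] - a[j], axis=1)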

This performs the exact same computation as the pdist function in SciPy for the Euclidean metric.

a = np.random.randn(100, 3)
from scipy.spatial.distance import pdist
assert np.allclose(pdist(a, 'euclidean'), pairwise_distance(a))

The SciPy version is indeed faster, as it is written in C/C++. However, our vectorized version, written in pure Python on top of NumPy, is not bad (especially for small arrays).

What if you wanted the result of pairwise distance in the form of a matrix d such that d[i, j] denotes the distance between the i-th and j-th vector? Here it is:
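
A minimal sketch of that variant:

def pairwise_distance2(a):
    num = len(a)
    i, j = np.triu_indices(num, k=1)
    distances = np.linalg.norm(a[i] - a[j], axis=1)
    result = np.zeros((num, num))
    # Fill the upper triangle and mirror it into the lower triangle;
    # the diagonal stays 0, the distance of each vector to itself.
    result[i, j] = distances
    result[j, i] = distances
    return result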

Here is an example of its use on an array of 3 vectors, each from R⁵.

>>> a = np.arange(15).reshape((3,5))
>>> pairwise_distance2(a)
array([[ 0.        , 11.18033989, 22.36067977],
       [11.18033989,  0.        , 11.18033989],
       [22.36067977, 11.18033989,  0.        ]])

Here are some salient points of this implementation:

  • The diagonal elements of result need not be computed, as the distance of a point from itself is 0.
  • Since Euclidean distance is symmetric, we only need to compute the upper triangular part of result and replicate it in the lower triangular part.
  • The indices i, j computed at the beginning are reused later to place the vector of distances back into the result matrix.

Pairs in an array of matrices

We can go on and do computations on pairs of matrices in an array of matrices too. Here is a more sophisticated example: a reasonably complex computation over pairs of matrices.

A Grassmannian is a space that parametrizes all k-dimensional subspaces of an n-dimensional vector space. Such subspaces are also known as k-flats. Each flat can be represented by an orthonormal basis (ONB): an ONB for a k-flat is nothing but an (n, k) sized matrix with orthonormal columns. The concept of angles between lines in a plane can be generalized to angles between flats. These angles are called principal angles.

It turns out that computing the principal angles is not very hard. If you have the ONBs A and B for the two subspaces, then you just need to compute their product M = A^H B and find its singular values. Since the columns of A and B are orthonormal, the singular values of M lie in [0, 1], so the principal angles are given by the arc cosine of the singular values. In particular, the smallest principal angle between two such subspaces is given by the arc cosine of the largest singular value of M.

Now let’s say that you have an array of ONBs for a set of n subspaces and you wish to compute the smallest principal angle between each pair of subspaces. Since SVD is an expensive operation, you don’t want to do it for more than n(n-1)/2 pairs. So here is the implementation.
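
A sketch of one such implementation, assuming the ONBs are stacked in an array onbs of shape (num, n, k) (the array and function names are illustrative):

def smallest_principal_angles(onbs):
    # Pair up the ONBs with triu_indices so the SVD runs on exactly
    # n(n-1)/2 products, never on redundant pairs.
    num = len(onbs)
    i, j = np.triu_indices(num, k=1)
    # Batched products M = A^H B for every pair, shape (pairs, k, k).
    products = np.matmul(np.conjugate(np.transpose(onbs[i], (0, 2, 1))), onbs[j])
    # Singular values come back sorted in descending order; take the largest.
    largest = np.linalg.svd(products, compute_uv=False)[:, 0]
    # Clip guards against round-off pushing a value slightly above 1.
    angles = np.arccos(np.clip(largest, -1.0, 1.0))
    # Assemble a symmetric matrix; the diagonal stays 0 (a subspace makes
    # a zero angle with itself).
    result = np.zeros((num, num))
    result[i, j] = angles
    result[j, i] = angles
    return result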

Happy vectorizing!
