From 0 to Warp Speed

An Examination of Optimization of Numpy’s Linear Algebra Functionalities

Bobby Grayson
3 min read · Jul 16, 2014

An interesting problem came up recently. There was a specific operation we were running on a very large, sparse matrix. It had roughly 250 million entries, maybe 10 to 12 of which were nonzero values when it came in as input. The operation performed a dot product of a two-dimensional matrix and a one-dimensional array (a matrix-vector product) before the rest of the algorithm ran.
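To make the setup concrete, here is a scaled-down sketch of that operation. The shapes and nonzero positions are made up for illustration; the real matrix was far larger and came from our data pipeline.

```python
import numpy as np

# Scaled-down stand-in: the real matrix had ~250 million entries,
# almost all of them zero. Shapes here are hypothetical.
rows, cols = 5000, 5000
A = np.zeros((rows, cols))       # the large, sparse 2-D matrix
A[0, 0] = 1.0                    # a handful of nonzero entries
A[123, 456] = 2.0
x = np.ones(cols)                # the 1-D operand

y = A.dot(x)                     # the matrix-vector product in question
print(y.shape)                   # (5000,)
```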

When we originally ran this, it took about 750-800 milliseconds. This was acceptable for small datasets, but darnit, we're calculating this on the fly now, and the dataset being processed grows further and further as we get more data (who'da thunk?).

So we put on our optimization hats. Time to get dirty, dive into library code, and figure out why this dot product was not going as fast as we'd like. It turns out that the arrays NumPy's `dot` method was working with were C-contiguous (C-style, row-major memory layout), not F-contiguous (Fortran-style, column-major layout, which the underlying Fortran BLAS routines are built around).
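You can see this distinction for yourself: every NumPy array records its memory layout in its `flags` attribute, and freshly created arrays default to C order.

```python
import numpy as np

# NumPy arrays carry their memory layout in .flags;
# the default construction order is C (row-major).
a = np.ones((3, 4))
print(a.flags['C_CONTIGUOUS'])   # True
print(a.flags['F_CONTIGUOUS'])   # False

# Requesting Fortran (column-major) order instead:
f = np.ones((3, 4), order='F')
print(f.flags['F_CONTIGUOUS'])   # True
```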

This nugget of information seems like a very small, easily grabbed one. However, I kid you not, this next piece was found by digging through a late-'90s mailing list of SciPy contributors, and it was later discussed again in relation to the BLAS libraries NumPy is built on. The issue was reasonably documented online for dgemm (the BLAS routine for matrix-matrix products, where the operands have like dimensionality), but there were no musings on dgemv (the matrix-vector routine, for operands of unlike dimensionality) to be found whatsoever until going down this deep (none worth a damn, at least). However, the issue of C-contiguousness was one that had struck others. I had found my brothers in arms and began reading.

SO! We have found our bottleneck. But now, how in the bloody hell do we solve it? There was plenty of StackOverflow and blogosphere discussion about SciPy's dgemm method as an alternative, which is F-contiguous. However, due to the difference in dimensionality of our matrices, a transposition would be necessary, and then the result would need to go through dgemv instead of dgemm, since the operands are not alike in dimensionality; this is what we had dug into on the mailing list. Sweet, let's perform that!
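For the curious, SciPy does expose these raw BLAS routines directly under `scipy.linalg.blas`. A minimal sketch of calling dgemv (the data here is random filler, not our actual workload):

```python
import numpy as np
from scipy.linalg.blas import dgemv  # low-level BLAS matrix-vector product

rng = np.random.default_rng(0)
A = rng.random((4, 3))
x = rng.random(3)

# dgemv computes alpha * A @ x; alpha=1.0 gives a plain matrix-vector product.
y = dgemv(1.0, A, x)
print(np.allclose(y, A @ x))  # True
```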

BUT WAIT!

This transposition results in a C-contiguous array! Oh no. We have stepped backwards. The solution, it turns out, is built into NumPy quite nicely and required little more than some solid reading (who doesn't love reading terse documentation?).

The functionality in question is this:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.asfortranarray.html

If, after the transposition, we used this method to return the array to an F-contiguous state, it could be run through dgemv at blazing speed. Huzzah!
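In code, the fix is a one-liner: `np.asfortranarray` hands back the same values in column-major memory (copying only when the input isn't already Fortran-ordered).

```python
import numpy as np

a = np.ones((3, 4))               # C-contiguous (row-major) by default
print(a.flags['F_CONTIGUOUS'])    # False

# Column-major copy; a no-op if the array is already Fortran-ordered.
f = np.asfortranarray(a)
print(f.flags['F_CONTIGUOUS'])    # True
print(np.array_equal(a, f))       # True: same values, different memory layout
```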

Final benchmark: from 750-800 milliseconds to 750-800 MICROseconds.
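That figure came from our specific workload, so a generic microbenchmark won't reproduce the exact 1000x number, but here is a crude sketch of how one might compare the two layouts on the same data:

```python
import time
import numpy as np

def bench(A, x, reps=10):
    # Crude wall-clock timing; absolute numbers are machine-dependent.
    start = time.perf_counter()
    for _ in range(reps):
        A.dot(x)
    return (time.perf_counter() - start) / reps

rng = np.random.default_rng(0)
A = rng.random((2000, 2000))     # hypothetical size, dense random filler
x = rng.random(2000)

t_c = bench(A, x)                       # C-contiguous operand
t_f = bench(np.asfortranarray(A), x)    # F-contiguous copy of the same data
print(f"C order: {t_c:.6f}s, F order: {t_f:.6f}s")
```

Either layout produces the same result; only the time spent differs.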

How's that for a speedup?

Enjoy doing linear algebra 1000 times faster with a somewhat trivial optimization (if you don’t have to do all the research first, at least ;) )

Happy Hacking.
