Effective Array Computation in Haskell

Because we all want highly performant and beautiful code

Samuel Schlesinger
The Agile Monkeys’ Journey

--

The purpose of this blog post is to explain the results of my research into a linear algebra interface for Haskell. The research was done to guide us at theam in the search for high-performance linear algebra operations and to sketch the solution space in this field. It is by no means a complete investigation, and I am fully open to the idea that I may have misjudged some of the problems I identify; in fact, that is what I hope to provoke by putting this out there, since it would simply mean the problem is easier to solve than I had imagined.

It is important to recognize that past work is not necessarily perfect, but when a problem has been worked on this much, the past approaches are likely to contain pieces of the whole solution, and we can try to take each good part and reuse it in ours. In that light, what follows are the positive and negative results I found; hopefully it will be clear that there is no definitive answer here, just some preliminary observations and ideas for where to go next.

Some Past Research

When talking about efficient linear algebra algorithms, one often talks about the Basic Linear Algebra Subprograms (BLAS), and in a sense this set of complementary interfaces has become synonymous with the topic due to the number of hours that have been put into optimizing the routines and algorithms which make up the BLAS technical interface. The original idea of BLAS is that there is a small set of operations which can be heavily optimized using knowledge about modern computers, and that most other useful programs can be written in terms of these operations. The original operations are called level 1 [2] and consist of vector operations such as summation, norm, maximum, dot product, and more. These operations are optimized primarily for vector processors. As time went on and multiprocessors became very important, the level 2 [5] interface, which semantically could be implemented in terms of level 1, was created for matrix-vector operations such as matrix-vector products, updates, and solving linear systems. Finally, level 3 [6] was created, taking advantage of modern memory architectures to implement matrix-matrix operations such as matrix multiplication and updates. Any serious competitor to BLAS-based systems must take advantage of all of these optimizations, or use a BLAS implementation under the hood. One design principle which I found particularly inspiring is the idea of denotational parsimony: once you understand the nth level of operations, you can understand the (n + 1)th through its implementation in the nth level’s primitives, even though the (n + 1)th level is actually implemented in efficient ways that you don’t have to understand as a user.
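To make the idea of denotational parsimony concrete, here is a toy Haskell sketch (this is not BLAS itself, and nothing here is optimized): a level-2 style matrix-vector product written purely in terms of a level-1 style dot product, which is how the higher level can be understood even when a real implementation does something far cleverer underneath.

import qualified Data.Vector.Unboxed as U

-- Level-1 style primitive: dot product of two vectors.
dot :: U.Vector Double -> U.Vector Double -> Double
dot xs ys = U.sum (U.zipWith (*) xs ys)

-- Level-2 style operation, understood as one dot product per row.
matVec :: [U.Vector Double] -> U.Vector Double -> U.Vector Double
matVec rows x = U.fromList [ dot row x | row <- rows ]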

One design point in the implementation space of BLAS is Automatically Tuned Linear Algebra Software (ATLAS), which uses a method called Automated Empirical Optimization of Software. Essentially, a search is done upon installation for a number of performance-related parameters, the justification being that many of the optimizations one applies only work well for parameter values that depend on the user’s machine. If the software can find these parameters for itself instead of relying on the user to specify the best ones, the programmer can write software which may perform well even on hardware yet to come, as the constants will be adjusted to suit the machine the system is compiled for. The requirements to support this methodology are stated in [7] and are as follows:

  1. Isolation of performance-critical routines
  2. A method of adapting software to differing environments
  3. Robust, context-sensitive timers
  4. Appropriate search heuristic

In the subsequent section, the authors explain how, in the simplest case, where one is choosing constants for different architectures using C macros or some other form of conditional compilation, this approach automates what one might otherwise do by hand [7].
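As a toy illustration of that idea (this is not ATLAS, and the kernel and candidate block sizes below are invented for the example), one can imagine timing a performance-critical routine at several candidate block sizes on the installation machine and keeping the fastest:

import Control.Exception (evaluate)
import Data.List (minimumBy)
import Data.Ord (comparing)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Stand-in for a performance-critical routine whose best block size
-- we don't know ahead of time.
kernel :: Int -> [Double] -> Double
kernel blockSize = sum . map sum . chunksOf blockSize
  where
    chunksOf _ [] = []
    chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

-- Time the kernel once at a given block size.
timeKernel :: Int -> IO Double
timeKernel blockSize = do
  let xs = [1 .. 1000000] :: [Double]
  start <- getCurrentTime
  _ <- evaluate (kernel blockSize xs)
  end <- getCurrentTime
  pure (realToFrac (diffUTCTime end start))

main :: IO ()
main = do
  let candidates = [64, 128, 256, 512, 1024]
  timings <- mapM (\b -> (,) b <$> timeKernel b) candidates
  putStrLn ("best block size: " ++ show (fst (minimumBy (comparing snd) timings)))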

Eigen is a C++ library which allows for much better notation than BLAS does, in that its interface is abstracted away from the memory model and one can simply write code such as:

Matrix2d x = a * (b + c) - d * a;

The beauty of Eigen is that the above code will be translated into a form where the operations on the right-hand side of the assignment won’t each allocate a new matrix, even though that is their semantics. This is possible in C++ in a relatively straightforward way, but as we’ll see, in Haskell we must put a little more effort into fusion because we don’t have such a glorified notion of assignment.

Some Current Approaches

In order to talk about the current approaches to this problem, we must understand what we want to do. The purpose of this document is to explore the solution space for efficient numerical computation in Haskell, and as such we need to explore three areas:

  1. Central Processing Units (CPUs)
  2. Graphical Processing Units (GPUs)
  3. Computing Clusters

Each of the paradigms listed above has its uses, and they can be used in tandem for greater effect, but the one which offers the highest throughput for the operations we’re after is the GPU, as hundreds or thousands of cores are hard to beat with eight or sixteen. That being said, I would say the ideal situation is one in which we stay agnostic of this choice, which is why I looked into domain-specific languages.

The first solution one comes across for numerical computing in Haskell is the vector library. Vector is, in its own words, an efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimization framework. The package is structured around a generic Vector class and instantiations of that class for boxed, unboxed, storable (externally allocated), and primitive vectors. This library can be used for numerical computation and is a suitable solution for many simpler problems, especially given that the statistics library offers a lot of functionality on top of it for computing various facts about your data. Here I find the Pearson correlation coefficient between two populations of samples:

import qualified Data.Vector.Unboxed as Unboxed
import Statistics.Correlation (pearson)
import DataFetchingLib (getData) -- a stand-in for however you load your unboxed Double samples

main :: IO ()
main = do
  xs <- getData "Population1"
  ys <- getData "Population2"
  print (pearson (Unboxed.zip xs ys))

Here is another example, this time using the math-functions package, which statistics depends on.

import GHC.Exts (fromList)
import qualified Data.Vector.Unboxed as Unboxed
import Numeric.RootFinding (Root(..), ridders)
import Numeric.Polynomial (evaluatePolynomial)
import Numeric.MathFunctions.Constants (m_epsilon)

-- Did the root finder actually bracket a root?
wasFound :: Root a -> Bool
wasFound (Root _) = True
wasFound _ = False

main :: IO ()
main = do
  -- f(x) = 1 + 2x + 3x^2 + ... + 100x^99
  let coeffs :: Unboxed.Vector Double
      coeffs = fromList [1 .. 100]
      poly x = evaluatePolynomial x coeffs
      -- Look for roots in small brackets across [-1000, 1000].
      roots = [ ridders (m_epsilon * 4) (x, x + 0.2) poly
              | x <- [-1000.0, -999.8 .. 1000.0] ]
  print (filter wasFound roots)

Another solution with some clear advantages is hmatrix, which exposes a collection of BLAS/LAPACK-backed operations on dense matrices. One great advantage of this library is the Numeric.LinearAlgebra.Static module, which offers statically sized vectors and matrices so you can rule out a large class of bugs while working on these kinds of programs. This library is a clear win if you have a lot of dense linear algebra to do in your application and you can afford to do it on a CPU.
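As a minimal sketch of what the statically sized interface looks like (the particular values are made up for illustration), the dimensions live in the types, so a dimension mismatch in the product below would be a compile-time error rather than a runtime crash:

{-# LANGUAGE DataKinds #-}

import Numeric.LinearAlgebra.Static

main :: IO ()
main = do
  let a = matrix [1 .. 9]  :: Sq 3 -- a 3x3 matrix
      v = vector [1, 2, 3] :: R 3  -- a length-3 vector
  print (a #> v)                   -- matrix-vector product, dimensions checked by the types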

Another solution I reviewed, much more suitable for arbitrary data analysis, is repa, which was written as a result of [9]. The library, again in its own words, provides high performance, regular, multi-dimensional, shape polymorphic parallel arrays, ensuring that numeric data is stored unboxed. It features a parallelization framework which is essentially automatic and can therefore produce incredibly efficient code. There is a GHC plugin, repa-plugin [4], which achieves data flow fusion; this is essentially what we liked about the Eigen library, the ability to avoid allocating so much memory while still using convenient notation. repa also achieves this on its own by offering a delayed representation alongside the normal array representations, which lets you defer the construction of an array to the consumer of whatever nice array you’ve described in the delayed form. The library seems to have a specialization in image processing and has features which support that use case very well.
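A minimal sketch of how this looks in practice (the array contents are arbitrary): the two maps below build a delayed array, so no intermediate array is allocated, and computeP evaluates the fused result in parallel.

import Data.Array.Repa as Repa

main :: IO ()
main = do
  let xs = fromListUnboxed (Z :. (1000000 :: Int)) [1 .. 1000000] :: Array U DIM1 Double
      -- 'ys' is a delayed (D) array: just a function from index to value.
      ys = Repa.map (* 2) (Repa.map (+ 1) xs)
  -- Only here is memory written, and the two maps run as one parallel pass.
  result <- computeP ys :: IO (Array U DIM1 Double)
  print (result ! (Z :. 0))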

The last solution I looked at was accelerate, a domain-specific language (DSL) for high-performance computing in Haskell which aims to be well suited for GPU execution [10]. It uses many of the same techniques as repa, but it is suited for execution in a great many contexts, and accelerate-llvm offers multicore CPU as well as GPU backends. The clear advantage of accelerate is its lack of a fixed context: it is a deep embedding where repa is a shallow one. Whereas repa arrays are represented directly by their semantics, accelerate computations are kept in abstract syntax tree (AST) form, which means they can be run in a number of contexts. To me this feels like a great advantage, as one could write some data analysis code, do static analysis on it, and still run it in a very convenient way.
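For a flavor of the DSL, here is the standard dot-product example, run here with the CPU backend from accelerate-llvm-native; swapping the backend import for Data.Array.Accelerate.LLVM.PTX runs the same program on a GPU.

import Data.Array.Accelerate as A
import Data.Array.Accelerate.LLVM.Native as CPU

-- The computation is built as an AST over 'Acc' terms...
dotp :: Acc (Vector Double) -> Acc (Vector Double) -> Acc (Scalar Double)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 10) [0 ..]    :: Vector Double
      ys = A.fromList (Z :. 10) [1, 1 ..] :: Vector Double
  -- ...and only 'run' commits to a particular backend.
  print (CPU.run (dotp (A.use xs) (A.use ys)))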

Discussion

First, I looked at the old solutions in C, Fortran, and C++, the things people have been using for ages to solve these problems; then I looked at the modern Haskell solutions and tried to reason about what has changed to make these newer solutions more desirable now, and about how we might improve upon them or use them as they are. The first thing I noted is that BLAS, which still sits underneath a lot of the linear algebra you’re going to want to do on a computer, was written at a time when the constraints were very different. The original paper on BLAS was published in the 70s, when memory was measured in kilobytes or megabytes, whereas now it’s measured in gigabytes and terabytes. It’s clear that the model has changed and that there is more room to code at a higher level, but how much room is there?

To answer this question, I looked at Eigen, a classic and well documented C++ solution for numerical programming which frankly works quite well if you’re looking for something in C++. An important property of a high-level language for array computations is that its notation should closely resemble that of mathematics. One can do this quite well in Eigen, so much so that we can write the following code:

#include <iostream>
#include <Eigen/Dense>

using namespace Eigen;

int main()
{
  Matrix2d a, b, c;
  a << 1, 2,
       3, 4;
  b << 2, 3,
       1, 4;
  c = a + b - a - b + a;
  std::cout << c << std::endl;
}

This code is very readable and, according to the Eigen documentation, should compile down to code which only allocates three matrices. This is the sort of thing we need to mimic in whatever solution we choose: we must forget about memory to some extent if we want to work at a high level, but we certainly want these things to be at least somewhat optimized behind the scenes. The problem in Haskell, of course, is that there is no glorified notion of assignment as there is here; we instead deal in pure values and functions. The solution, as practiced in repa and accelerate, is to adopt some sort of DSL which hides this complexity from you. In repa, this is done via a “delayed” matrix representation, essentially a function from index to value, so that you can defer the construction of complex matrices to the consumer of the matrix. The route accelerate takes is, in my opinion, simpler and more extensible: it keeps the AST behind the scenes and uses it to perform reasonable time and space optimizations.
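To make the delayed idea concrete, here is a toy one-dimensional sketch (these are not repa’s actual types, just the shape of the trick): an array is a size plus an index function, composing operations composes functions, and nothing is allocated until the consumer forces the result.

import qualified Data.Vector.Unboxed as U

-- A "delayed" array: no storage, just a way to compute each element.
data Delayed a = Delayed { delayedSize :: Int, delayedIndex :: Int -> a }

dmap :: (a -> b) -> Delayed a -> Delayed b
dmap f (Delayed n ix) = Delayed n (f . ix)

dzipWith :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
dzipWith f (Delayed n ix) (Delayed m iy) = Delayed (min n m) (\i -> f (ix i) (iy i))

-- Only the consumer allocates: every upstream dmap/dzipWith fuses into
-- this single generate loop.
force :: U.Unbox a => Delayed a -> U.Vector a
force (Delayed n ix) = U.generate n ix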

When talking about the memory and time considerations we must take into account, it is easy to think that these constraints can be settled with experiments on one machine, but it is clear to me that this is no longer the case. Methods from algorithm design often produce a single algorithm for a single machine; ATLAS was a little different, in that it had a space of implementations known to contain decent implementations for some machines, and it was allowed to wiggle around in that space and figure out automatically what it should do [7]. There are interesting optimizations which can be done based on predictions of cache performance [8], as well as on constants available at runtime [9], and perhaps we can investigate these from a platform such as accelerate or another deeply embedded DSL like it, since in that setting we have access to the syntax representation at runtime. This could be an interesting initial application of machine learning in Haskell and a motivating case for coming investigations into machine learning interfaces.

Goals

I think that, in the short term, it is profitable to pursue accelerate and the set of backends present in accelerate-llvm. One clear benefit is that it is being developed by a member of the DataHaskell community, meaning the author is interested in expanding the library for data science use cases. This strategy will let us deploy high-performing, GPU-accelerated data analysis onto cloud platforms such as AWS and (soon enough) Google Cloud Platform.

It’s clear that in the long term we’re going to need to establish a map of this territory, and I am not claiming to understand the whole design space by any means. It would be fantastic to have comprehensive and actionable knowledge about the performance of vector, hmatrix, repa, and accelerate, as well as the other bindings, such as those to Eigen, which have been released for Haskell. That being said, accelerate and repa are the only platforms there which support arbitrary tensors. It is clear to me that at the moment accelerate will come out on top of repa when executed on a GPU, though repa has the advantage of being a shallow embedding: it lets GHC’s optimizer perform miracles on it and can therefore be mixed with arbitrary code with good results, whereas accelerate, being a deeper embedding, retains more information for source-to-source translation and compilation down to the various backends.

Something I’m keen to mention as well is that the introduction of linear types into GHC Haskell would give us great tools for eliminating some of the major problems that arise from memory complexity. This seems to be something that could happen down the road [3]. If someone wants to investigate how this might go in the meantime, I recommend exploring Rust.

Sources

0. The repa website: http://repa.ouroborus.net/
1. The accelerate github repository: https://github.com/AccelerateHS/accelerate
2. Basic Linear Algebra Subprograms for Fortran Usage. C.L. Lawson, R.J. Hanson, D.R. Kincaid, F.T. Krogh: http://www.cs.utexas.edu/users/kincaid/blas.pdf
3. GHC Linear Types description: https://ghc.haskell.org/trac/ghc/wiki/LinearTypes
4. repa-plugin: https://hackage.haskell.org/package/repa-plugin
5. An Extended Set of Fortran Basic Linear Algebra Subprograms. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, Richard J. Hanson.
6. A Set of Level 3 Basic Linear Algebra Subprograms. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, Iain Duff.
7. Automatically Tuned Linear Algebra Software. R. Clint Whaley, Jack J. Dongarra.
8. Program Optimization Based on Compile-Time Cache Performance Prediction. Wesley K. Kaplow, Boleslaw K. Szymanski.
9. Regular, Shape-polymorphic, Parallel Arrays in Haskell. Gabriele Keller, Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, Ben Lippmeier.
10. Optimising Purely Functional GPU Programs. Trevor L. McDonell.
