
Introduction
This article covers some details of R’s API to C, and in turn C’s API to Fortran, which is used to call BLAS routines. In modern use, I think the best way to make explicit use of BLAS is to work via RcppEigen or RcppArmadillo; a direct interface to BLAS is complex and error-prone, and likely offers no significant performance advantage over Eigen or Armadillo. We are doing it the hard way through C as an exploratory exercise, not as a recommended practice.
What is BLAS?
Basic Linear Algebra Subprograms (BLAS) are a set of functions for performing matrix operations ubiquitous in scientific computation. These functions are defined abstractly and have been implemented by various groups in the form of ATLAS, OpenBLAS, MKL, and more. Netlib publishes a useful summary of all the BLAS operations.
Why use BLAS?
The advantage is that we call BLAS routines in our code through the canonical interface, and we can then link different implementations to our program, each providing a different performance profile. We can go from a general public distribution to a proprietary version optimised for our hardware; we can even swap in an implementation that operates on the GPU, all without changing our code. Because BLAS is so widely used in scientific computing, its routines are among the most highly optimised in all of computing. A discussion on Stack Overflow details the performance of BLAS routines and the reasons behind it.
How do I use BLAS?
If we are just writing regular R code, chances are we are already using BLAS. Common operations like matrix multiplication using %*% and useful matrix functions like crossprod() are all implemented using BLAS algorithms, which means R code can already be very efficient without having to directly interface with BLAS.
If you really want to make sure you’re using BLAS, you should look into RcppEigen or RcppArmadillo, which transform your matrix operations into efficient BLAS calls; this is likely more efficient than manual BLAS calls. But the point of this article is to show how you would do it in base R, so we are going to call the C interface from R and the Fortran interface from C.
An Aside about Computer Memory
To understand the interface between languages and half the performance equation for calling C code, we need to talk about how computer memory works.
All data on a computer effectively sits at some address along one long 1D strip of memory, something like 0|1|0|1|1|0|1|1. This corresponds with tape storage, and we can imagine the data forming one long spiral from the edge of a CD to the centre. Matrices are then stored as consecutive columns (in R and Fortran; many other languages are row-based), so we have column1|column2|column3 representing a matrix in memory.
The machine keeps track of the data using addresses: it knows that variable x lives at address 0x0014AF40, for example, and when we ask for x, the machine reads out whatever it sees at that memory address.
R takes care of memory management: it allocates space in memory for our variables and frees that memory when we delete them. It also makes copies of our variables for our functions, so changes inside a function don’t affect the original variable. For example
x <- 10

f <- function(x) {
  x <- x + 2
  return(x * x)
}

f(x) # returns 144
will not affect the value of x, because f() operates on a copy of x. This is strictly enforced in R (with few exceptions, such as data.table) and it is useful because if we use functions written by others, we don’t need to worry about our variables changing value after running the function.
The C API in R
There are two APIs from R to C: the .C() and .Call() interfaces. The differences are summarised at the bottom of this great article; the modern advice is basically to use .Call() wherever possible, as it is safer and more efficient. The easiest way to use C code is to create a package with a src folder and useDynLib(pkgName) in the NAMESPACE file. Then we can write C functions to be called from R. I’m going to put all my functions in a file called lib.c, and we will end up with the following folder structure.
.
├── DESCRIPTION
├── NAMESPACE
├── R
│ └── main.R
├── man
└── src
└── lib.c

To make use of C functions, we need to write the function in C with special data types, then write a corresponding function in R to call the C function through the .Call() interface.
lib.c
#include <R.h>
#include <Rinternals.h>

SEXP add_c(SEXP x, SEXP y) {
  SEXP res;
  PROTECT(res = allocVector(REALSXP, 1));
  REAL(res)[0] = asReal(x) + asReal(y);
  UNPROTECT(1);
  return(res);
}
main.R
add_c <- function(x, y) {
.Call("add_c", x, y)
}

Then we can call add_c() in R and it will execute the C code to return our result. The thing to note here is that the C code looks very convoluted, and the reason is that it is doing a lot of memory management and manipulation.
The reason is that R shares data with C via memory addresses, which is more efficient than writing the data out from R and reading it back in from C. The way this works is that R tells C the memory addresses at which its variables are stored, C reads the data from that memory, it (potentially) creates an output variable, and then it communicates back to R the address of the result for R to read.
It’s actually quite expensive to allocate and write memory for large objects, so this scheme offers significant efficiency gains at the cost of more complicated code and extremely opaque errors when memory is mismanaged. A more detailed and concise explanation of the C interface can be found in this article; make sure to also follow the links to the source R documentation. Simon Urbanek provides a useful cheatsheet for the C macros.
Calling BLAS from C
Without including any external libraries, the way to call BLAS routines in C with the R headers is to use the Fortran interface. This is not particularly well documented: we can find the R wrappers for Fortran in the R source code, but there are no explanations of the variables. They do, however, match up with the canonical BLAS interface, so it’s possible to look them up there. The way to call these routines is probably best discovered in the R source code. So we can write a simple C function to multiply two matrices using BLAS via the F77_CALL() interface.
blas.c
#include <R.h>
#include <Rinternals.h>

SEXP mat_mult_blas(
  SEXP A_, SEXP a_rows_,
  SEXP B_, SEXP b_cols_, SEXP k_)
{
  char * TRANSA = "N";
  char * TRANSB = "N";

  double * A = REAL(A_);
  double * B = REAL(B_);

  int M = asInteger(a_rows_);
  int N = asInteger(b_cols_);
  int K = asInteger(k_);

  /* Leading dimensions for column-major, untransposed matrices:
     A is M x K, B is K x N, and the result is M x N. */
  int LDA = M;
  int LDB = K;

  double ALPHA = 1.0;
  double BETA = 0.0;

  SEXP result_;
  PROTECT(result_ = allocMatrix(REALSXP, M, N));
  double * result = REAL(result_);

  F77_CALL(dgemm)(
    TRANSA, TRANSB, &M, &N, &K, &ALPHA,
    A, &LDA, B, &LDB, &BETA, result, &M);

  UNPROTECT(1);
  return(result_);
}
blas.R
mat_mult_blas <- function(A, B) {
.Call(
"mat_mult_blas",
A, nrow(A),
B, ncol(B), ncol(A)
)
}

When I said simple, I meant a nightmare of keeping track of what is an R address, a C value, or a C pointer; I triggered at least two dozen segmentation faults while getting this code to work.
As with going from R to C, going from C to Fortran means passing everything through as memory addresses. Unlike going from R to C, this requires manually keeping track of what is an address and what is a value, hence the & and * syntax littered throughout. The result of all this is a function that multiplies two matrices together.
When working with BLAS, we also need to think about many things we would not consider in regular matrix multiplication, like whether the matrices need to be transposed, the size of the leading dimension (related to row or column order), and whether to pre-multiply by a scalar.
It’s Not Necessarily Faster Than Base R
For basic matrix multiplication, we saw in the source code that R already uses BLAS for matrix operations, so there’s really no reason my implementation here should make a difference.
> library(microbenchmark)
>
> A <- matrix(rnorm(50^2), nrow = 50)
> B <- matrix(rnorm(50^2), nrow = 50)
>
> microbenchmark(
+ A %*% B,
+ mat_mult_blas(A, B)
+ )
Unit: microseconds
expr ... mean median ...
A %*% B ... 17.50035 17.1635 ...
mat_mult_blas(A, B) ... 20.63844 20.2045 ...

In fact I am just a little slower than the base R implementation, as I put no effort into optimising my code. However, we can compare this to a naive implementation of matrix multiplication using for loops in R.
mat_mult_R_dumb.R
mat_mult_R_dumb <- function(A, B) {
  out <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      for (k in 1:ncol(A)) {
        out[i, j] <- out[i, j] + A[i, k] * B[k, j]
      }
    }
  }
  out
}
Now we see just how fast BLAS is
> microbenchmark(
+ A %*% B,
+ mat_mult_blas(A, B),
+ mat_mult_R_dumb(A, B)
+ )
Unit: microseconds
expr ... mean median ...
A %*% B ... 17.50035 17.1635 ...
mat_mult_blas(A, B) ... 20.63844 20.2045 ...
mat_mult_R_dumb(A, B) ... 22897.69396 22349.3155 ...

So BLAS is over 1000x faster than a naive matrix multiplication in R!
The cases where you can beat R’s basic matrix multiplication usually involve particularly unusual operations, like needing t(A) %*% t(B). In the BLAS code we can make the following change.
SEXP mat_mult_ta_tb(...)
{
char * TRANSA = "T";
char * TRANSB = "T";
...
}

Then we can compare again to base R.
> microbenchmark(
+ t(A) %*% t(B),
+ mat_mult_ta_tb(A, B)
+ )
Unit: microseconds
expr ... mean median ...
t(A) %*% t(B) ... 64.13438 62.6235 ...
mat_mult_ta_tb(A, B) ... 35.48643 29.7255 ...

Here, by the nature of how R functions, the matrices are transposed before they are multiplied, which is a relatively expensive operation. BLAS takes advantage of the fact that multiplying by a transpose only requires switching the indices used for rows and columns, saving on computation. This is a cherry-picked pathological example; the more likely use cases are covered by crossprod() and tcrossprod(), which are implemented using BLAS and highly efficient.
So What?
So all this work and we’ve gained nothing, but I think there are a few useful take-aways here.
- Many matrix operations in R are implemented as efficient BLAS routines.
- R’s use of BLAS is more efficient than naive uses of BLAS.
- BLAS can be easily a thousand times faster than naive R code.
If you want peak performance, you want to transform your problem into a matrix operation defined by BLAS, then find the R function which implements that operation. If some BLAS operation exists but is not implemented in base R, then you’re better off writing more readable matrix operations in RcppArmadillo or RcppEigen, which will generate the BLAS calls automatically.
BLAS is in fact more powerful than I have shown, because I am merely using the default BLAS library packaged with R. There are implementations like MKL that are intensively optimised for Intel CPUs, and cuBLAS, which dispatches certain BLAS operations to Nvidia GPUs, yielding even greater performance.
