High performance tensor library in Nim

Mamy André-Ratsimbazafy
14 min read · Oct 15, 2017


Toward a (smoking!) high performance tensor library in Nim

Foreword

In April I started Arraymancer, yet another tensor library™, in yet another language™, Nim.

Why, you ask?

I wanted to have fun. When I’m bored, I like to learn new things, research them “in-depth” and build things with that knowledge.

It’s a rabbit hole, really:

  • 14 years ago, it was video processing and encoding. I remember forcing my PC to a crawl while using Avisynth and the awesome FFT3DGPU to denoise videos taken with a handheld camera.
  • 13 years ago, I learned go, the board game. After a year I was almost dan level (black belt), spent my weekends on it and roamed hundreds of km per year for tournaments, including the French national championship and the European Go Congress.
  • 7 years ago, it was Linux, with Gentoo being my first distribution. Now I’m sitting on a server with my machine/deep learning stack and GPU virtualized in LXC containers, NAS storage and a security stack virtualized via KVM.
  • 6 years ago, it was programming and number theory, in Haskell: no for loops, anyone?
  • 2 years ago, still programming and algorithms: a go-playing bot using Monte Carlo, in Rust. The borrow checker and I are not friends.
  • 1 year ago, data science, I even bought a GPU just to do deep learning and warm up my home in winter.

In January I geared up for Data Science Bowl #3, a 1 million dollar competition to predict lung cancer sponsored by the US National Cancer Institute, downloaded 160GB of 3D lung scans and …

Damn, Python/Numpy/Scipy was slow: I spent more time preprocessing the images with watershed segmentation (a full night) than training my neural network (3 hours).

Lung segmentation

Image courtesy Ashish Shah, Guido Zuidhof and Arnav Jain

That meant that I couldn’t iterate as fast as I wished. I just stopped there and wanted to learn more about what goes on behind the scenes in neural networks and computer vision. I watched Stanford’s CS231n course and did the Hacker’s Guide to Neural Networks tutorial by Andrej Karpathy.

It was in JavaScript, but I wanted to learn the language I had stumbled upon after my frustrations with Rust: Nim. So here I went.

Until I reached this:

// random initial parameters
var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
// ... similarly initialize all other parameters to randoms
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;

  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // phew! End of backprop!
  // note we could have also backpropped into x,y
  // but we do not need these gradients. We only use the gradients
  // on our parameters in the parameter update, and we discard x,y

  // add the pulls from the regularization, tugging all multiplicative
  // parameters (i.e. not the biases) downward, proportional to their value
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1;
  b1 += step_size * db1;
  c1 += step_size * dc1;
  a2 += step_size * da2;
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3;
  b3 += step_size * db3;
  c3 += step_size * dc3;
  a4 += step_size * da4;
  b4 += step_size * db4;
  c4 += step_size * dc4;
  d4 += step_size * dd4;
  // wow this is tedious, please use for loops in prod.
  // we're done!
}

No way I’m coding that, so I wrote an autograd to automatically differentiate functions, available here.

Then I wanted to generalize this autograd to vectors, matrices and tensors, especially 4D tensors, so I could pass N images, C color channels (RGB), H height and W width.

The Nim ecosystem being young (Nim is 8 years old, Python 27 years old), nothing suited my needs, so I started my own library with the goals of:

  • Having fun
  • Being ergonomic
  • Being fast


Why Nim?

Ergonomics

Cognitive overload

Building something fast but too hard to use is a tragedy. Scientific computing is an exercise in fast experimentation and iteration, and the programming language should stay in the shadows so that scientists can focus on their algorithms.

That’s an obvious fail for Rust: the borrow checker brings too much cognitive overload.

Generics

Furthermore, maths and science are all about generalization and extracting overarching concepts from things.

I don’t want to write this kind of thing in 2017:

func Abs(x float32) float32 {
    switch {
    case x < 0:
        return -x
    case x == 0:
        return 0 // return correctly abs(-0)
    }
    return x
}

func Abs(x float64) float64 {
    switch {
    case x < 0:
        return -x
    case x == 0:
        return 0 // return correctly abs(-0)
    }
    return x
}

func Abs(x int32) int32 {
    switch {
    case x < 0:
        return -x
    case x == 0:
        return 0 // return correctly abs(-0)
    }
    return x
}

...

This is Go, by the way, taken from the gonum library. Just give me generics like in C++, Rust, Haskell ...

# (I know there is already an `abs` function in the "math" library)
proc abs[T: SomeSignedInt or SomeReal](x: T): T =
  if x < 0:
    return -x
  elif x == 0:
    return 0 # return correctly abs(-0)
  return x

Here I just tell the compiler that only signed integer (int32, int64, …) and floating point (float32, float64) inputs are valid.

Flexible static-typing

Static typing is often seen as an obstacle to fast iteration in science. This argument does not fly: to make Python numerical libraries robust, most functions are buried deep in input checks to make sure the arguments are correct.

def abs(x):
    if not isinstance(x, (int, float)):
        raise TypeError

    if x < 0:
        return -x
    elif x == 0:
        return 0  # return correctly abs(-0)
    return x

Furthermore, if physicists were to know about the following feature, we might have avoided airplane and space shuttle crashes:

type
  Meters = distinct float64
  Miles = distinct float64

let a = Meters(1000.0)
let b = Miles(1.0)

let c = a + b # Error: type mismatch: got (Meters, Miles)

Note: Nim provides tools to facilitate implicit conversion

# Allow multiplication of Miles by a plain float64
# borrow means: reuse the implementation of the float64 base type
proc `*`*(a: Miles, b: float64): Miles {.borrow.}

# Create the implicit conversion with a `converter` function
converter toMeters(x: Miles): Meters =
  Meters(x * 1609.344)

# Allow addition between Meters
proc `+`(a, b: Meters): Meters {.borrow.}

let c = a + b.toMeters # it works!

You probably noticed that you had to tell the compiler which operators were valid, like + and *. This is actually a feature, because multiplying Meters by Meters should give Meters2 (an area), not Meters:

type Meters2 = distinct float64

proc `*`*(a, b: Meters): Meters2 {.borrow.} # borrow means: reuse the implementation of the float64 base type

How awesome is that? Note: this is actually also useful in finance, to avoid adding two different currencies, or British pounds to pennies.
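Here is a minimal sketch of the same idea for money; the type and operator definitions are mine, purely for illustration:

type
  Pounds = distinct int
  Pence = distinct int

proc `+`(a, b: Pounds): Pounds {.borrow.}
proc `+`(a, b: Pence): Pence {.borrow.}

let total = Pounds(10) + Pounds(5)  # fine
# let oops = Pounds(10) + Pence(50) # Error: type mismatch: got (Pounds, Pence)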

Operator overloading and infix operators

In the previous section you saw * and +. Nim allows you to reuse the same operators to mean different things depending on the input: * between two float64 gives a float64, but between Meters it gives Meters2, and between a matrix and a vector it gives a vector, etc.

Also Nim allows you to create your own operator by combining any number of these symbols:

=  +  -  *  /  <  >
@  $  ~  &  %  |
!  ?  ^  .  :  \

So I could create proc * to mean matrix multiplication and .* to mean elementwise multiplication, like in Julia and Matlab. I could even use <*> to mean Haskell's applicative map, but let's not go there.
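To make that concrete, here is a minimal sketch (not Arraymancer's actual implementation) of an elementwise .* operator defined on plain seqs:

# Elementwise multiplication as a user-defined infix operator
proc `.*`[T: SomeNumber](a, b: seq[T]): seq[T] =
  assert a.len == b.len
  result = newSeq[T](a.len)
  for i in 0 ..< a.len:
    result[i] = a[i] * b[i]

let u = @[1.0, 2.0, 3.0]
let v = @[4.0, 5.0, 6.0]
echo u .* v  # @[4.0, 10.0, 18.0]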

Functional programming facilities

Functions are first-class citizens in Nim and can be passed to other functions. I’ll let my solution to Project Euler problem #5 speak for me.

# 2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder.
# What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

from sequtils import toSeq, foldr
from math import lcm

echo toSeq(1..20).foldr(lcm(a,b)) # 232792560

Metaprogramming: the boilerplate squashing tool

I’m fond of DRY (Don’t Repeat Yourself). Object-Oriented Programming, getters, setters, etc. are not for me; Java especially is awfully guilty of inflating the number of lines of code to the extreme.

I like my programs lean and fast.

So let’s say I want to implement “universal functions” in my library; in Numpy jargon that means a function that applies element-wise to all elements of a multidimensional array, like sin, cos, exp, sqrt, ln/log …

Ingredients:

  • A way to apply a function on all elements of the array (say map)
  • Exporting that function for all to use.

How boring would it be to write all of that by hand:

# Supposing that sqrt, cos, etc. exist for T

proc sqrt*[T](t: Tensor[T]): Tensor[T] =
  t.map(sqrt)

proc cos*[T](t: Tensor[T]): Tensor[T] =
  t.map(cos)

...

Instead you can use a template that will do the boilerplate for you:

template makeUniversal*(func_name: untyped) =
  proc func_name*(t: Tensor): Tensor =
    t.map(func_name)
  export func_name

makeUniversal(sqrt)
makeUniversal(cbrt)
makeUniversal(ln)
makeUniversal(log10)
makeUniversal(log2)
makeUniversal(exp)
makeUniversal(arccos)
makeUniversal(arcsin)
makeUniversal(arctan)
makeUniversal(cos)
makeUniversal(cosh)
makeUniversal(sinh)
makeUniversal(sin)
makeUniversal(tan)
...

You might say that I only saved one line per new function: useless.
Actually, this is EXTREMELY useful to wrap CUDA kernels, like so:

template cuda_assign_op(op_name, op_symbol: string) =
  {.emit: ["""
  template<typename T>
  struct """, op_name, """{
    __device__ __forceinline__ void operator()(
        T * __restrict__ dst,
        const T * __restrict__ src){
      *dst """, op_symbol, """ __ldg(src);
    }
  };
  """].}

cuda_assign_op("CopyOp", "=")
cuda_assign_op("mAddOp", "+=")
cuda_assign_op("mSubOp", "-=")
cuda_assign_op("mMulOp", "*=")
cuda_assign_op("mDivOp", "/=")

Suddenly it’s 10 lines saved per operand. Convinced?

OK, last try: thanks to Nim’s metaprogramming, I could implement this very nice slicing syntax by just bending the language to my will, like Neo in the Matrix:

## - Basic indexing - foo[2, 3]
## - Basic indexing - foo[1+1, 2*2*1]
## - Basic slicing - foo[1..2, 3]
## - Basic slicing - foo[1+1..4, 3-2..2]
## - Span slices - foo[_, 3]
## - Span slices - foo[1.._, 3]
## - Span slices - foo[_..3, 3]
## - Span slices - foo[_.._, 3]
## - Stepping - foo[1..3|2, 3]
## - Span stepping - foo[_.._|2, 3]
## - Span stepping - foo[_.._|+2, 3]
## - Span stepping - foo[1.._|1, 2..3]
## - Span stepping - foo[_..<4|2, 3]
## - Slicing until at n from the end - foo[0..^4, 3]
## - Span slicing until at n from the end - foo[_..^2, 3]
## - Stepped slicing until at n from the end - foo[1..^1|2, 3]
## - Slice from the end - foo[^1..0|-1, 3]
## - Slice from the end - expect non-negative step error - foo[^1..0, 3]
## - Slice from the end - foo[^(2*2)..2*2, 3]
## - Slice from the end - foo[^3..^2, 3]
Bending Nim
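If you want to see how that reads in practice, here is a small usage sketch; I’m recalling Arraymancer’s toTensor/reshape API from memory here, so check the docs for the exact calls:

import arraymancer, sequtils

let foo = toSeq(1..12).toTensor.reshape(3, 4)
echo foo[1..2, 0..3|2]  # rows 1 to 2, every 2nd column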

Syntax for humans

There is a saying that a dev spends 90% of their time reading code.
I read a lot, and fast. Except that when I’m reading books I don’t have to read things like this:

let my_str: Rc<RefCell<Box<MyStruct1>>> = Rc::new(RefCell::new(Box::new(MyStruct1)));

Too much visual noise.

Also, curly braces are not sexy anymore. In any case, even with curly braces, you are not excused from making your code readable with indentation for logical blocks (if/else, for loops, function definitions …).
Why use curly braces to tell the computer that it’s a logical block and indentation for humans, when you can do both with just indentation?

Performance

Nim has very high performance by default, and allows you to tune code at a very low level.

Numerical computing is, contrary to expectations, more dependent on memory speed than on your CPU’s compute muscle. Today’s CPUs are so fast that it’s a challenge to feed them enough data each computation cycle.

So I needed a language that lets you tinker with memory layout, which rules out traditional high-level languages like Python, Julia, Java …

I needed to manipulate pointers and interact with C and C++ for Nvidia Cuda.

I needed a good data parallelism library, because Python’s Global Interpreter Lock (GIL) frustrated me to no end. Basically, due to the GIL, to get multiprocessing in Python you have to:

  • Either copy the data to each of your cores; good luck with an 8 GB batch of images.
  • Or code in C or Cython and tell Python not to lock.

What’s the result?

On exact integer matrix multiplication, which seems to be the abandoned child of tensor libraries since everyone is optimising floating point, Arraymancer is 10x faster than Julia and 22x faster than Numpy.

That means a computation that would have taken 1 day now only takes about an hour.

Archlinux, E3-1230v5 (Skylake quad-core 3.4 GHz, turbo 3.8)
Input 1500x1500 random large int64 matrix
Arraymancer 0.2.90 (master branch 2017-10-10)

Language                                          Speed   Memory
Nim 0.17.3 (devel) + OpenMP                       0.36s   55.5 MB
Julia v0.6.0                                      3.11s   207.6 MB
Python 3.6.2 + Numpy 1.12 compiled from source    8.03s   58.9 MB

How? Low-level memory management in Nim

With careful tuning based on this high performance computing course (in C, C++ and assembler), and by going down to a very low level:

# 99% of matrix multiplication time is spent here
# I tell the compiler:
# - that my data is aligned on 64-byte boundaries and can use intrinsics
# - via restrict, that only a single variable/pointer can access the memory location at any time (no aliasing)
# - I unroll a doubly nested 4x4 loop into 16 assignment statements for optimization (MRNR = 16)

{.pragma: align64, codegenDecl: "$# $# __attribute__((aligned(64)))".}
var AB{.align64.}: array[MRNR, T]
var voffA = offA
var voffB = offB

{.pragma: restrict, codegenDecl: "$# __restrict__ $#".}
var a {.restrict.} = assume_aligned(A.data, FORCE_ALIGN)
var b {.restrict.} = assume_aligned(B.data, FORCE_ALIGN)

## Compute A*B
for _ in 0 ..< kc:
  # for j in 0 ..< NR:
  #   for i in 0 ..< MR:
  #     AB[i + j*MR] += A[i + voffA] * B[j + voffB]
  unroll_ukernel(AB, a, voffA, b, voffB, MR, NR)
  voffA += MR
  voffB += NR

How? Data parallelism in Nim

All map, fold and reduction functions in Arraymancer are parallelized with OpenMP (thank you @edubart). Nim’s OpenMP interface is easy; here is an example from Rosetta Code:

# Serial version
const str = ["Enjoy", "Rosetta", "Code"]

for i in 0..2:
  echo str[i]

# OpenMP parallel version
const str = ["Enjoy", "Rosetta", "Code"]

for i in 0||2:
  echo str[i]
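One practical note: if I remember correctly, the || iterator only emits the OpenMP pragma; you still have to hand the OpenMP flag to the backend C compiler, e.g. something like nim c -d:release --passC:-fopenmp --passL:-fopenmp myprogram.nim with GCC.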

How? Controlling overhead

Overhead comes in various forms, most notably memory allocation and function calls.

There are two types of allocations:

  • the stack, very fast, but small (1MB to 8MB).
  • the heap, slow, as big as your RAM.

In many high-level languages everything is on the heap.

Because it was easy, I used to allocate the tensor metadata (of a 2D 10x100 tensor, for example) on the heap.
In this benchmark I was doing 10000 * 100 perceptron computations in a loop, and the program was spending more time allocating and freeing memory than doing actual computation.
I moved everything to stack-allocated objects and cut the time down from 45+ seconds to 13 seconds.
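To illustrate the difference, here is a simplified sketch; these are not Arraymancer’s actual types, and MAXRANK and the field names are made up for the example:

const MAXRANK = 8

type
  HeapMeta = ref object                  # lives on the heap, managed by the GC
    shape, strides: seq[int]             # each seq is yet another heap allocation

  StackMeta = object                     # plain value type: lives on the stack
    rank: int
    shape, strides: array[MAXRANK, int]  # fixed size, zero heap allocations

proc initStackMeta(shape: openArray[int]): StackMeta =
  # no allocation happens here, the result is just copied around
  result.rank = shape.len
  for i, s in shape:
    result.shape[i] = s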

Also, while functional programming is nice, it is also costly.

Doing toSeq(1..1_000_000).map(square) to square a sequence of 1,000,000 integers forces you to make 1,000,000 function calls.

For each function call, the CPU saves its working area (the registers) to the stack, calls the function and then must retrieve the data from its working area again, one million times. Remember what I said about memory being THE bottleneck?

Nim allows you to inline almost anything, especially trivial computations like that.

You can rewrite map so that it inlines its operation, using a template:

template map_inline[T: SomeNumber](s: seq[T], operation: untyped): untyped =
  var result: seq[T] = @[]
  for x {.inject.} in items(s):
    result.add(operation)
  result

{.inject.} is a special notation that exposes x to the caller’s scope; it can be used like this:

toSeq(1..1_000_000).map_inline(x * x)

And that code will be transformed transparently into:

var result: seq[type(x)] = @[]
for x in items(toSeq(1..1_000_000)):
  result.add(x * x)
result
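By the way, the standard library ships something in the same spirit: sequtils.mapIt is a template that inlines its expression with an injected variable named it, so toSeq(1..1_000_000).mapIt(it * it) avoids the per-element function call too.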

Portability

Compiles to C, C++, Objective-C and Javascript

I almost forgot: Nim compiles to optimized C, C++, Objective-C and JavaScript.

Yes, JavaScript (and soon WebAssembly) is one of its targets. Yes, JavaScript! If writing a NES console emulator (with Super Mario Bros) in Nim, playable in the browser, is possible, anything is possible.

So with Nim you can target anything from PCs to embedded devices, phones to browsers, and obviously graphics cards through CUDA, OpenCL or OpenGL.

Calling C, C++, Obj-C and JavaScript libraries is easy

Calling C from Nim is really easy. At one point I wanted the log1p function, which computes ln(1+x) without failing badly due to rounding when x is near 0 (very useful in machine learning, because loss functions are often in log space and you want to minimize the error). Importing C functions works like this:

proc ln1p*(x: float32): float32 {.importc: "log1pf", header: "<math.h>".}
proc ln1p*(x: float64): float64 {.importc: "log1p", header: "<math.h>".}
## Computes ln(1+x) and avoids catastrophic cancellation if x << 1,
## i.e. if x << 1, ln(1+x) ~= x but normal float rounding would give ln(1) = 0 instead.

Nim provides the same import facilities for C++, Obj-C and JavaScript (including C++ templates).
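As a rough illustration of the C++ side (not from the article), wrapping std::sqrt with importcpp could look like this; it requires compiling with the C++ backend (nim cpp):

proc cppSqrt(x: cdouble): cdouble {.importcpp: "std::sqrt(@)", header: "<cmath>".}

echo cppSqrt(2.0)  # 1.4142135623730951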

Arraymancer

Wow, already so many lines; let’s wrap up.

I started Arraymancer 6 months ago; it now has most of the building blocks ready to provide great ergonomics with blazing fast performance, on CPU and CUDA.

Actually, for CPU, Arraymancer 0.2.0 was already faster than Facebook’s Torch, which is written completely in C. The Arraymancer master branch is probably about 2x faster than Torch now, thanks to parallel reductions and huge optimizations in memory allocation.

Next steps: building all the tools for Convolutional Neural Networks.

Next next steps: leveraging Nim’s convenient syntax for model research, its robustness and scalability for production, and its JS backend for interoperability (REST API, yeah!), all in one neat tool.

The end

I hope I gave you a good overview of Nim’s capabilities for high-performance computing; next time you consider C or Fortran for your HPC needs, seriously consider Nim instead.

Want to discover, discuss, try or contribute? Arraymancer’s GitHub is here.

Originally published at Marie & Mamy’s Insights.


Mamy André-Ratsimbazafy

Data Scientist, Ethereum & Opensource dev, Go player, ex-finance and non-profits| @m_ratsim @GoAndStrat