Deep Learning in Rust: baby steps

Meta Edit 2/2/2016: I wrote this post when deeplearn-rs was only a week old. This is more of a journal post where I reflect on things. To see the latest deeplearn-rs, the readme on the github repo is probably the best place to look.

Last semester[1] I flailed around trying to get into deep learning. I felt kind of guilty for jumping onto the hype train[2]. Deep Learning is a guilty pleasure. I went through tutorials to relearn error backpropagation[3]. This course in particular helped me a lot.

After gaining a concrete understanding of the mechanics of deep learning, I wanted to experiment with real data and build real networks. I read lots of papers, but struggled to get even a few lines of code out. This was frustrating, because I’m the type of guy that churns code out. I was also kind of burning myself out, because my main project at the time was a ZFS driver for Redox. Dividing my brain between the behemoth that is ZFS and the still-young field of Deep Learning was exhausting and unproductive. So I shelved Deep Learning for a bit.

I was excited for winter break. I told myself I was going to get stuff done. I got some things done. I merged a massive PR for ZFS on Redox that I had been working on for about a month. My friend and I made the beginnings of a crappy stop motion fighting game. I played around with TensorFlow, went through some of their tutorials, and extended the language model tutorial code, but I wanted more. I struggled because I had never designed and carried out a full deep learning experiment with a modern framework. Nothing truly satisfying got done until the very end of winter break.

I found a nice compromise in a project that involved both deep learning and systems programming — a deep learning framework in Rust that uses OpenCL to do all the heavy lifting. Luckily, I had already written an OpenCL matrix math library[4]!

deeplearn-rs

I started hacking on this a week ago, and I’m pretty satisfied with how far it’s come. You can construct and (manually) train a deep computation graph, and it all runs on the GPU via OpenCL. Take a gander:

Not as easy on the eyes as Theano or Keras or TensorFlow, I know. It’s hard to beat Python in the readability department[5], though. Anyway, this is the lowest layer. I intend to write a `GraphBuilder` that makes constructing these networks a little more cushy and a little less error prone.

I’ll just throw the deeplearn-rs github here.

If you don’t know about deep learning and computation graphs, I’ll try to elaborate. It turns out that artificial neural networks can be represented in a more general framework — computation graphs. Conveniently, artificial neural networks can even be represented by much more compact (in terms of node and edge count) graphs of matrix operations. For example, fully connected layers use matrix multiplication to multiply the weights and inputs and sum the products for each node. For each layer, all the inputs are represented as one matrix, and all the weights are represented as another. So convenient! Praise the math gods! It turns out that this makes it easy for us to implement super fast deep neural networks by running our (highly parallelizable) matrix operations on (highly parallel) GPUs.
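To make the matrix view concrete, here is a minimal CPU sketch of a fully connected layer's forward pass as one matrix product. The function name `fully_connected` and the flat row-major layout are assumptions made for illustration; deeplearn-rs itself runs this kind of operation on the GPU through OpenCL, and bias and activation are left out.

```rust
/// Forward pass of a fully connected layer, without bias or activation.
/// `inputs` is a row-major (batch x n_in) matrix and `weights` is a
/// row-major (n_in x n_out) matrix; the result is (batch x n_out).
/// Hypothetical helper, not part of deeplearn-rs.
fn fully_connected(
    inputs: &[f32],
    weights: &[f32],
    batch: usize,
    n_in: usize,
    n_out: usize,
) -> Vec<f32> {
    assert_eq!(inputs.len(), batch * n_in, "inputs must be batch x n_in");
    assert_eq!(weights.len(), n_in * n_out, "weights must be n_in x n_out");

    let mut out = vec![0.0_f32; batch * n_out];
    for b in 0..batch {
        for j in 0..n_out {
            // Output unit j for sample b: the sum of input * weight products.
            let mut sum = 0.0_f32;
            for k in 0..n_in {
                sum += inputs[b * n_in + k] * weights[k * n_out + j];
            }
            out[b * n_out + j] = sum;
        }
    }
    out
}
```

Every `(b, j)` entry is independent of the others, which is why the same computation maps so well onto a GPU.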

Internals

Rust has strict ownership and borrowing rules. To make a long story short, you can’t just dole out pointers to things like you can in most other languages[6]. So, instead I use indices into arrays. This way I don’t have to deal with all of the headaches brought by ubiquitous use of `Rc<RefCell<T>>`[7]. Let’s see what a node looks like:

Every matrix[8] in a computation graph is represented as a variable. Variables are accessed using a corresponding `VarIndex`. All variables are stored in the graph’s `VarStore`. VarStore uses `RefCell` to hide variable mutability so that multiple variables can be borrowed simultaneously, regardless of mutability. There is never a reason to borrow the same variable more than once at a time for our purposes.
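As a rough illustration of the index-plus-`RefCell` idea, here is a minimal sketch of a variable store. Only the names `VarStore` and `VarIndex` come from this post; the methods `add`, `get`, and `get_mut`, the `add_into` helper, and the use of `Vec<f32>` in place of a GPU matrix are assumptions, so treat this as a sketch and not the actual deeplearn-rs code.

```rust
use std::cell::{Ref, RefCell, RefMut};

/// Handle to a variable: just an index into the store.
#[derive(Debug, Clone, Copy, PartialEq)]
struct VarIndex(usize);

/// Stand-in for a GPU matrix (assumption: a plain vector of floats).
type Var = Vec<f32>;

/// Owns every variable; hands out run-time checked borrows by index.
struct VarStore {
    vars: Vec<RefCell<Var>>,
}

impl VarStore {
    fn new() -> VarStore {
        VarStore { vars: Vec::new() }
    }

    /// Store a variable and return its index.
    fn add(&mut self, v: Var) -> VarIndex {
        self.vars.push(RefCell::new(v));
        VarIndex(self.vars.len() - 1)
    }

    /// Shared borrow of one variable.
    fn get(&self, i: VarIndex) -> Ref<'_, Var> {
        self.vars[i.0].borrow()
    }

    /// Mutable borrow of one variable through a shared `&self`.
    fn get_mut(&self, i: VarIndex) -> RefMut<'_, Var> {
        self.vars[i.0].borrow_mut()
    }
}

/// Element-wise `out = a + b`: two shared borrows and one mutable borrow
/// are alive at the same time, all taken from the same `&VarStore`.
fn add_into(store: &VarStore, a: VarIndex, b: VarIndex, out: VarIndex) {
    let a = store.get(a);
    let b = store.get(b);
    let mut out = store.get_mut(out);
    for ((o, x), y) in out.iter_mut().zip(a.iter()).zip(b.iter()) {
        *o = x + y;
    }
}
```

With this layout, passing the same index as both an input and the output of `add_into` would panic at run time, which is the price of moving the borrow check out of the compiler.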

Note that the `out_grad` field is a list of `OutGrad`. `OutGrad` is a sort of CoW (Copy on Write) `VarIndex` to a gradient variable. When calculating gradients during the backward pass of the graph, a forked variable (one that is used as input to multiple operations) receives multiple gradients. If there are multiple gradients, they need to be summed to produce the final gradient, and a variable must be made to store that sum. If there is only one gradient, we can just use that gradient variable. `OutGrad` will only allocate a new variable if it needs to.
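The lazy allocation described above could look something like the following sketch. The field names, the `fork`, `get`, and `accumulate` methods, and the `Vec<Vec<f32>>` variable store are all hypothetical stand-ins chosen for illustration, not the real `OutGrad` from deeplearn-rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct VarIndex(usize);

/// Hypothetical copy-on-write handle to a variable's gradient.
struct OutGrad {
    /// One gradient variable per operation that consumes the variable.
    gradients: Vec<VarIndex>,
    /// Allocated only once a second gradient shows up.
    sum: Option<VarIndex>,
}

impl OutGrad {
    fn new() -> OutGrad {
        OutGrad { gradients: Vec::new(), sum: None }
    }

    /// Register another incoming gradient. `alloc` creates a fresh
    /// variable and is called only when a sum variable becomes necessary.
    fn fork(&mut self, grad: VarIndex, alloc: impl FnOnce() -> VarIndex) {
        self.gradients.push(grad);
        if self.gradients.len() > 1 && self.sum.is_none() {
            self.sum = Some(alloc());
        }
    }

    /// The variable that holds the final gradient, if there is one.
    fn get(&self) -> Option<VarIndex> {
        match self.gradients.len() {
            0 => None,
            1 => Some(self.gradients[0]), // single gradient: reuse it
            _ => self.sum,                // several: read the sum
        }
    }

    /// Write the element-wise sum of all gradients into the sum variable.
    /// Does nothing when there is at most one gradient.
    fn accumulate(&self, vars: &mut [Vec<f32>]) {
        if let Some(sum) = self.sum {
            let mut total = vec![0.0_f32; vars[sum.0].len()];
            for g in &self.gradients {
                for (t, x) in total.iter_mut().zip(&vars[g.0]) {
                    *t += *x;
                }
            }
            vars[sum.0] = total;
        }
    }
}
```

The point of the design is that an unforked variable costs nothing extra: `get` returns the single gradient directly and no sum buffer is ever created.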

Let’s look at the code for `Graph`, which ties everything together.

The most interesting function here is `Graph::add_node`. You give it an operation and a list of inputs and it’ll handle the rest. It creates the output variables and creates a new gradient for each of the inputs.
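To give a feel for what that bookkeeping involves, here is a heavily simplified sketch of an `add_node` style function. Everything in it is an assumption made for illustration: the `Operation` trait with `output_shapes`, the `Node` fields, and a graph that tracks only variable shapes instead of GPU buffers. The real `Graph::add_node` in deeplearn-rs may differ in every detail.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct VarIndex(usize);

#[derive(Debug, Clone, Copy, PartialEq)]
struct NodeIndex(usize);

type Shape = (usize, usize); // (rows, cols)

/// Hypothetical operation interface: given the input shapes,
/// report the shapes of the outputs the graph must allocate.
trait Operation {
    fn output_shapes(&self, inputs: &[Shape]) -> Vec<Shape>;
}

struct MatMul;

impl Operation for MatMul {
    fn output_shapes(&self, inputs: &[Shape]) -> Vec<Shape> {
        assert_eq!(inputs.len(), 2, "MatMul takes two inputs");
        assert_eq!(inputs[0].1, inputs[1].0, "inner dimensions must match");
        vec![(inputs[0].0, inputs[1].1)]
    }
}

struct Node {
    op: Box<dyn Operation>,
    inputs: Vec<VarIndex>,
    outputs: Vec<VarIndex>,
    in_grad: Vec<VarIndex>, // one gradient variable per input
}

struct Graph {
    nodes: Vec<Node>,
    shapes: Vec<Shape>, // one entry per variable
}

impl Graph {
    fn new() -> Graph {
        Graph { nodes: Vec::new(), shapes: Vec::new() }
    }

    fn add_variable(&mut self, shape: Shape) -> VarIndex {
        self.shapes.push(shape);
        VarIndex(self.shapes.len() - 1)
    }

    /// Create output variables and one gradient variable per input,
    /// then record the node. Everything is referred to by index.
    fn add_node(&mut self, op: Box<dyn Operation>, inputs: Vec<VarIndex>) -> NodeIndex {
        let in_shapes: Vec<Shape> = inputs.iter().map(|v| self.shapes[v.0]).collect();
        let outputs: Vec<VarIndex> = op
            .output_shapes(&in_shapes)
            .into_iter()
            .map(|s| self.add_variable(s))
            .collect();
        // A gradient has the same shape as the input it belongs to.
        let in_grad: Vec<VarIndex> =
            in_shapes.iter().map(|&s| self.add_variable(s)).collect();
        self.nodes.push(Node { op, inputs, outputs, in_grad });
        NodeIndex(self.nodes.len() - 1)
    }
}
```

Because nodes and variables only ever refer to each other by index, the graph can own all of them in flat vectors without fighting the borrow checker.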

Lastly, let’s look at the implementation of matrix multiplication:

The boilerplate variable fetching is a bummer.
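For reference, the math behind a matrix multiplication node is small. For `C = A * B` with upstream gradient `dC`, the input gradients are `dA = dC * B^T` and `dB = A^T * dC`. The sketch below shows this on the CPU with a hypothetical `Matrix` type; the names `matmul_forward` and `matmul_backward` are assumptions, and it says nothing about how the OpenCL version in deeplearn-rs is structured.

```rust
/// Dense row-major matrix (hypothetical helper for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Matrix {
        assert_eq!(data.len(), rows * cols, "data must hold rows * cols values");
        Matrix { rows, cols, data }
    }

    fn transpose(&self) -> Matrix {
        let mut data = vec![0.0_f32; self.data.len()];
        for i in 0..self.rows {
            for j in 0..self.cols {
                data[j * self.rows + i] = self.data[i * self.cols + j];
            }
        }
        Matrix::new(self.cols, self.rows, data)
    }

    /// Naive matrix product `self * rhs`.
    fn matmul(&self, rhs: &Matrix) -> Matrix {
        assert_eq!(self.cols, rhs.rows, "inner dimensions must match");
        let mut data = vec![0.0_f32; self.rows * rhs.cols];
        for i in 0..self.rows {
            for j in 0..rhs.cols {
                for k in 0..self.cols {
                    data[i * rhs.cols + j] +=
                        self.data[i * self.cols + k] * rhs.data[k * rhs.cols + j];
                }
            }
        }
        Matrix::new(self.rows, rhs.cols, data)
    }
}

/// Forward pass: C = A * B.
fn matmul_forward(a: &Matrix, b: &Matrix) -> Matrix {
    a.matmul(b)
}

/// Backward pass: given dC, return (dA, dB) = (dC * B^T, A^T * dC).
fn matmul_backward(a: &Matrix, b: &Matrix, d_c: &Matrix) -> (Matrix, Matrix) {
    let d_a = d_c.matmul(&b.transpose());
    let d_b = a.transpose().matmul(d_c);
    (d_a, d_b)
}
```

Note that the backward pass needs the original inputs `a` and `b`, which is why a graph node has to keep hold of its input variables after the forward pass.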

Conclusion

Overall, I like the design so far. The purpose of deeplearn-rs is to provide a framework to build trainable matrix computation graphs that are configurable at runtime.

Moving forward, some things I’d like to do:

  • Implement some loss functions
  • Train a fuzzy xor network
  • Automate the training process, and provide several different trainers such as vanilla SGD, AdaGrad, etc.

[1] Fall 2015
[2] But then less guilty because I was into artificial neural networks back in high school, and deep learning is a generalization of those ideas
[3] Error backpropagation never really rolled off the tongue (so many syllables!)
[4] The matrix math library also got some sweet upgrades
[5] Nigh impossible
[6] In safe Rust you can’t, but in unsafe Rust, anything is possible…
[7] `Rc<T>` is a reference counted pointer, but it only hands out shared (immutable) references to the inner value. `RefCell` moves the borrow checks from compile time to run time: it lets you mutate through a shared reference and panics if a mutable borrow would overlap any other borrow. Any number of immutable borrows of a value can exist at any given time, but a mutable borrow must be the only borrow for the duration of its lifetime.
[8] Whether an input, a weight, a gradient, etc.
