Jim Fleming - Medium

Running TensorFlow (with GPU) on Kubernetes

Jim Fleming — Fri, 24 Mar 2017 22:50:02 GMT

While GPUs are a staple of deep learning, deploying on GPUs makes everything more complicated, including your Kubernetes cluster. This quick guide will walk through adding basic single-GPU support to Kubernetes.

The guide assumes that Kubernetes is already running on Ubuntu. An LTS release is preferable, ideally 14.04, which NVIDIA recommends for driver hosts. Warning: Ubuntu 14.04 is not well supported by Kubernetes, so feel free to use a different distro. This guide also assumes that the proper GPU drivers and CUDA version have been installed. Plenty of other guides cover those topics.

TL;DR: start with nvidia-docker, then whittle away its functionality so that just plain docker remains. Then add that functionality to Kubernetes.

Working without nvidia-docker

A common way to run containerized GPU applications is to use nvidia-docker. Here is an example of running TensorFlow with full GPU support inside a container.

https://medium.com/media/e00e0a7edbb9cc6e01dc73929dfb4f25/href

Simple! If all goes well the output should look something like this:

https://medium.com/media/74ea3bbe84e4f164815b9e55c650e789/href

Unfortunately it’s not currently possible to use nvidia-docker directly from Kubernetes. Additionally, Kubernetes does not support the nvidia-docker-plugin since Kubernetes does not use Docker’s volume mechanism.

The goal is to manually replicate the functionality provided by nvidia-docker (and its plugin). For demonstration, query the nvidia-docker-plugin REST API to obtain the command line arguments:

https://medium.com/media/fd936ae4933d3f8b769c724fc0cd5321/href

Which will feed into docker, running the same python command:

https://medium.com/media/d14ed8674a73ef2b3fee62a2c45aee68/href

If all goes well, TensorFlow should find everything correctly and you should see the same output as before.

Finally, remove the dependency on nvidia-docker-plugin by manually specifying the driver path and manually mounting the devices and CUDA volumes.

https://medium.com/media/b9a25c0d5c8fb2e23244e5158ad596ba/href

Note that this still uses nvidia-docker’s driver volume for discovery. While Kubernetes cannot call the plugin directly we can use the filesystem.

Enabling GPU devices

With the knowledge of what Docker needs to run a GPU-enabled container, it is straightforward to add this to Kubernetes. The first step is to enable an experimental flag on all of the GPU nodes. In the Kubelet options (found in /etc/default/kubelet if you use upstart for services), add --experimental-nvidia-gpus=1. This does two things. First, it exposes GPU resources on the node for use by the scheduler. Second, when a GPU resource is requested, it adds the appropriate device flags to the docker command. This post describes in a little more detail what this flag does and why it exists:

http://blog.clarifai.com/how-to-scale-your-gpu-cloud-infrastructure-with-kubernetes

The full GPU proposal, including the existing flag and future steps can be found here:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md

Pod Spec

With the device flags added by the experimental GPU flag the final step requires adding the necessary volumes to the pod spec. A sample pod spec is provided below:

https://medium.com/media/8b5a3131123caae80e6d75acb7f863f4/href

If set up correctly, the output should match that of the nvidia-docker container run at the beginning:

https://medium.com/media/74ea3bbe84e4f164815b9e55c650e789/href

Conclusion

Hopefully this guide helps someone wade through these undocumented features to make use of GPUs in their cluster.

Follow me on Twitter for more posts like these. We also do applied research to solve machine learning challenges.

Running TensorFlow (with GPU) on Kubernetes was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Notes on Hierarchical Multiscale Recurrent Neural Networks

Jim Fleming — Fri, 24 Mar 2017 22:49:53 GMT

Introduces a novel update mechanism to learn latent hierarchical representations from data.

Introduction

State-of-the-art on PTB, Text8 and IAM On-Line Handwriting DB. Tied for SotA on Hutter Wikipedia.

Lots of prior work with hierarchy (hierarchical RNN / stacked RNN) and multi-scale (LSTM, clockwork RNN) but they all rely on pre-defined boundaries, pre-defined scales, or soft non-hierarchical boundaries.

Two benefits of discrete hierarchical representations:

Helps vanishing gradient since information is held at higher levels for more steps.
More computationally efficient in the discrete case since higher layers update less frequently.

Model

Uses parameterized binary boundary detectors at each layer. Avoids “soft” gating which leads to “curse of updating every timestep”.

Boundary detectors determine operations for modifying RNN state: COPY, FLUSH, UPDATE:

UPDATE: similar to LSTM but sparse, according to boundary detector.
COPY: copies cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (recurrent generalization of stochastic depth) which uses Bernoulli distribution to copy hidden state across timesteps.
FLUSH: sends summary to next layer and re-initializes current layer’s state.
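
As a rough sketch of how the three operations could be selected and applied, following my reading of the paper's update rule: the function and argument names below are hypothetical, and f, i and g stand for the usual LSTM forget gate, input gate and candidate values.

```python
def select_operation(z_prev_same_layer, z_curr_lower_layer):
    """Pick the state operation from two binary boundary flags.

    z_prev_same_layer: boundary found by this layer at the previous timestep.
    z_curr_lower_layer: boundary found by the layer below at this timestep.
    (Both names are illustrative, not taken from the paper's code.)
    """
    if z_prev_same_layer == 1:
        return "FLUSH"   # this layer just ended a segment: start fresh
    if z_curr_lower_layer == 1:
        return "UPDATE"  # the layer below finished a segment: absorb its summary
    return "COPY"        # nothing new arrived: keep the state unchanged


def next_cell_state(op, c_prev, f, i, g):
    """Apply the chosen operation element-wise to the cell state."""
    if op == "UPDATE":
        return [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    if op == "FLUSH":
        return [ik * gk for ik, gk in zip(i, g)]
    return list(c_prev)  # COPY
```

Because COPY leaves the state untouched, higher layers only do real work when a boundary fires below them, which is where the efficiency benefit comes from.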

Discrete (binary) decisions are difficult to optimize due to non-smooth gradients. Uses straight-through estimator (as an alternative to REINFORCE) to learn discrete variables. The simplest variant uses a step function on the forward pass and a hard sigmoid on backward pass for gradient estimation.
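
A minimal scalar sketch of that simplest variant, assuming the hard sigmoid form max(0, min(1, (a·x + 1) / 2)) with slope a. The function names are made up for illustration and this is not the paper's implementation.

```python
def hard_sigmoid(x, slope=1.0):
    """Piecewise-linear sigmoid: clip((slope * x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (slope * x + 1.0) / 2.0))


def boundary_forward(x, slope=1.0):
    """Forward pass: threshold the hard sigmoid to get a binary decision."""
    return 1.0 if hard_sigmoid(x, slope) > 0.5 else 0.0


def boundary_backward(x, upstream_grad, slope=1.0):
    """Backward pass: pretend the forward pass was the hard sigmoid.

    Its derivative is slope / 2 inside the linear region and 0 where it
    saturates, so the step function's zero gradient is never used.
    """
    inside = -1.0 / slope < x < 1.0 / slope
    local_grad = slope / 2.0 if inside else 0.0
    return upstream_grad * local_grad
```

Raising the slope during training narrows the linear region, which makes the surrogate gradient look more like the true step function.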

The slope annealing trick on the hard sigmoid compensates for the biased estimator, but experimental results show only minimal improvement. It also introduces more hyperparameters.

Implemented as a variant of LSTM (HM-LSTM) with custom operations above. No experimental results for variant with regular RNN (HM-RNN).

Results

Learns useful boundary detectors, visualized in the paper.

Latent representations possibly imperfect, or at least, not human: spaces, tree breaks, some bigrams, some prefix delineation (“dur”: during, duration, durable).

Only results on character-level compression tasks and handwriting, no explicit NLP tasks, e.g. machine translation, question-answering, or named entity recognition.

Conclusion

Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.

Notes on the Numerai ML Competition

Jim Fleming — Mon, 19 Sep 2016 16:24:44 GMT

Last week I spent some time diving into the Numerai machine learning competition. Below are my notes on the competition: things I tried, what worked and what didn’t. First an introduction to Numerai and the competition…

Numerai is a hedge fund which uses the competition to source predictions for a large ensemble that they use internally to make trades. Another detail that makes the competition unique is that the provided data has been encrypted in a way that still allows it to be used for predictions. Each week, Numerai releases a new dataset and the competition resets. After briefly controlling 1st-2nd place in both score and originality, by the end of the week I was still “controlling capital” with a log loss of 0.68714. In all this earned about $8.17 USD worth of Bitcoin.

Here’s a sample of the training data:

https://medium.com/media/6dae8db6b89a14004443bfb44201e62d/href

Validation

My first step in the competition was to generate a validation set so that I could run models locally and get a sense for how the models would do on the leaderboard. Using a simple stratified split that maintains the target distribution turned out not to be representative of the leaderboard so I turned to “adversarial validation”. This clever idea was introduced by @fastml in a blog post here. Basically:

Train a classifier to identify whether data comes from the train or test set.
Sort the training data by its probability of being in the test set.
Select the training data most similar to the test data as your validation set.
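
The three steps above can be sketched with NumPy as follows. This is an illustration under assumptions, not the code used in the competition: fit_logistic is a tiny stand-in for whatever classifier you prefer, all names are hypothetical, and in practice out-of-fold predictions would be a safer way to score the training rows.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, steps=500):
    """Tiny gradient-descent logistic regression (stand-in for any classifier)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Probability that each row of X belongs to the positive class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def adversarial_split(X_train, X_test, val_fraction=0.2):
    """Return (train_idx, val_idx); val_idx holds the most test-like rows."""
    # 1. Label rows by origin (0 = train, 1 = test) and fit a classifier.
    X_all = np.vstack([X_train, X_test])
    origin = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    w = fit_logistic(X_all, origin)
    # 2. Sort training rows by predicted probability of being test data.
    p_test = predict_proba(w, X_train)
    order = np.argsort(p_test)  # ascending: least test-like first
    # 3. Keep the most test-like rows as the validation set.
    n_val = max(1, int(len(X_train) * val_fraction))
    return order[:-n_val], order[-n_val:]
```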

This was much more representative with a validation loss corresponding to within ~0.001 log loss on the public leaderboard. Interestingly, the only reason this works is that the test data is dissimilar from much of the training data which violates IID.

Baseline Model

Now that I had a good validation set I wanted to get a baseline model trained, validated and uploaded. As a starting point I used logistic regression with default settings and no feature engineering. This gets about 0.69290 validation loss and 0.69162 on the public leaderboard. It’s not great but now I know what a simple model can do. For comparison, first place is currently 0.64669, so the baseline is only about 6.5% off. This means any improvements are going to be really small. We can push this a little further with L2 regularization at 1e-2 which gets to 0.69286 (-0.006% from baseline).
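
For reference, the percentages quoted here are plain relative changes. A small sketch of the arithmetic, using the scores from the paragraph above (the helper name is just for illustration):

```python
def relative_change(new, reference):
    """Signed percentage change of `new` relative to `reference`."""
    return 100.0 * (new - reference) / reference


baseline_val = 0.69290  # logistic regression, validation loss
baseline_lb = 0.69162   # logistic regression, public leaderboard
first_place = 0.64669   # leaderboard leader at the time
l2_val = 0.69286        # logistic regression with L2 at 1e-2

gap_to_first = relative_change(first_place, baseline_lb)  # roughly -6.5
l2_gain = relative_change(l2_val, baseline_val)           # roughly -0.006
```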

Neural Networks

I took a quick detour into neural networks before beginning feature engineering. Ideally, the networks would learn their own features with enough data. Unfortunately, none of the architectures I tried improved much over simple logistic regression. Additionally, deep neural networks can have far more learned parameters than logistic regression so I needed to regularize the parameters heavily with L2 and batch normalization (which can act as a regularizer per the paper). Dropout sometimes helped too depending on the architecture.

One interesting architecture that worked okay was using a single very wide hidden layer (2048 parameters) with very high dropout (0.9) and then leaving its initialized parameters fixed during training. This creates an ensemble of many random discriminators. While this worked pretty well (with a logloss around 0.689) the model hurt the final ensemble so it was removed. In the end neural networks did not yield enough improvement to continue their use here and would still rely on feature engineering which defeated my intentions.

Data Analysis & Feature Engineering

Now I need to dig into the data, starting with a simple plot of each of the feature distributions:

Violin plot of the distributions for each feature.

The distributions are pretty similar for each feature and target. How about correlations between features:

Correlation matrix showing feature interactions.

Okay, so many of the features are strongly correlated. We can make use of this in our model by including polynomial features (e.g. PolynomialFeatures(degree=2) from scikit-learn). Adding these brings our validation loss down to 0.69256 (-0.05% from baseline).
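
To make the expansion concrete, here is a dependency-free sketch of the terms a degree-2 polynomial expansion produces for one sample: a bias, the original features, and every product x_i · x_j with i ≤ j. It is only meant to show what gets added, not to replace scikit-learn's PolynomialFeatures, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def degree2_features(row):
    """Expand one sample into bias, linear, squared and pairwise-product terms."""
    expanded = [1.0]      # bias term
    expanded.extend(row)  # original features
    for a, b in combinations_with_replacement(row, 2):
        expanded.append(a * b)  # x_i * x_j for i <= j
    return expanded
```

With 21 input features this yields 1 + 21 + 231 = 253 columns, so the feature count grows quickly.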

Now dimensionality reduction. I take the features and run principal component analysis (a linear method) to reduce the original features down to two dimensions for visualization:

PCA dimensionality reduction over original features.

This does not contain much useful information. How about with the polynomial features:

PCA dimensionality reduction over polynomial features.

The polynomial PCA produces a slightly better result by pulling many of the target “1” values towards the edges and many of the target “0” values towards the center. Still not great so I opted to omit PCA for now.

Instead I’ll use a fancier dimensionality reduction method called t-SNE or “t-Distributed Stochastic Neighbor Embedding”. t-SNE is often used for visualization of high-dimensional data but it has a useful property not found in PCA: t-SNE is non-linear and works on the probability of two points being selected as neighbors.

t-SNE embedding over the features; clusters colored using DBSCAN.

Here t-SNE captured really good features for visualization (e.g. local clusters), and incidentally for classification too! I add in these 2D features to the model to get the best validation loss so far: 0.68947 (-0.5% from baseline). I suspect the reason this helps is that there are actually many local features that logistic regression cannot pull out but are useful in classifying for the target. By running an unsupervised method specifically designed to align the data by pairwise similarities the model is able to use that information.

Since t-SNE is stochastic, multiple runs will produce different embeddings. To exploit this I’ll run t-SNE 5 or 6 times at different perplexities and dimensions (2D and 3D) then incorporate these extra features. Now the validation loss is 0.68839 (-0.65% from baseline).

Note, some implementations of t-SNE do not work correctly in 3D. Plot them to make sure you’re seeing a blob, not a pyramid shape.

Additional Embeddings

Since t-SNE worked so well, I implemented several other embedding methods including autoencoders, denoising autoencoders, and generative adversarial networks. The autoencoders learned excellent reconstructions with >95% accuracy, even with noise, but their learned embeddings did not improve the model. The GAN, including a semi-supervised variant, did not outperform logistic regression. I also briefly experimented with kernel PCA and Isomap (also non-linear dimensionality reduction methods). Both improved the validation loss slightly but took significantly longer to run, reducing my ability to iterate quickly, so they were ultimately discarded. I never tried LargeVis or parametric t-SNE but they might be worth exploring. Parametric t-SNE would be particularly interesting since it allows fitting on a test holdout, rather than learning an embedding of all of the samples at once.

Isomap embedding of the original features.

Pairwise Interactions

One of the models that made it into the final ensemble was to explicitly model pairwise interactions. Basically, given features from two samples predict which of the two had a greater probability of being classified as “1”. This provides significantly more data since you’re modeling interactions between samples, rather than individual samples. It also hopefully learns useful features for classifying by the intended target. To make predictions for the target classification I take the average of each sample’s prediction against all other samples. (It’s probably worth exploring more sophisticated averaging techniques.) This performed similarly to logistic regression and produced different enough results to add to the ensemble.

Hyperparameter Search

Now that we have useful features and a few models that perform well I wanted to run a hyperparameter search and see if it could outperform the existing models. Since scikit-learn’s GridSearchCV and RandomizedSearchCV only explore hyperparameters, not entire architectures, I opted to use TPOT which searches over both. This discovered that using randomized PCA would outperform PCA and that L1 regularization (sparsity) slightly outperformed L2 regularization (smoothing), especially when paired with randomized PCA. Unfortunately neither of the discovered interactions made it into the final ensemble: hand engineering won out.

Ensemble

With a few models complete it’s time to ensemble their predictions. There are a number of methods for doing this covered here but I opted for a simple average using the geometric mean.
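
A geometric-mean average of per-model probabilities can be sketched like this. The function name and the eps guard are illustrative choices rather than the code from the repo.

```python
import math


def geometric_mean_ensemble(model_probs, eps=1e-15):
    """Combine per-model probabilities for each sample with a geometric mean.

    model_probs: list of lists, one inner list of P(target = 1) per model,
    all in the same sample order.
    """
    n_models = len(model_probs)
    combined = []
    for sample_probs in zip(*model_probs):
        # Average in log space; eps guards against log(0).
        log_sum = sum(math.log(max(p, eps)) for p in sample_probs)
        combined.append(math.exp(log_sum / n_models))
    return combined
```

Compared with an arithmetic mean, the geometric mean is pulled more strongly toward the lowest prediction, so a single confident "0" from one model counts for more.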

The final ensemble consisted of 4 models: logistic regression, gradient boosted trees, factorization machines and the pairwise model described above. I used the same features for each model, consisting of the original 21 features and five runs of t-SNE in 2D at perplexities of 5.0, 10.0, 15.0, 30.0, and 50.0 and one run of t-SNE in 3D at a perplexity of 30 (I only included a single run because it takes significantly longer in 3D). These features were combined with polynomial interactions and run through the models to produce the final log loss of 0.68714 on the leaderboard.

Conclusion

Overall it was an interesting competition—very different from something like Kaggle. I especially enjoyed experimenting with the encrypted data which was a first for me. While the payouts and “originality” bonuses are interesting mechanics, it’s often better to look at the rewards as points, more than currency, as this made the competition overall more fun. On the other hand, now I have my first bitcoin… :)

Code: https://github.com/jimfleming/numerai

Before AlphaGo there was TD-Gammon

Jim Fleming — Mon, 04 Apr 2016 15:35:05 GMT

Théodore Rombouts — The Backgammon Players

TL;DR Introduces temporal difference learning, TD-Lambda / TD-Gammon, and eligibility traces. Check out the Github repo for an implementation of TD-Gammon with TensorFlow.

A few weeks ago AlphaGo won a historic tournament playing the game of Go against Lee Sedol, one of the top Go players in the world. Many people have compared AlphaGo to Deep Blue, which won a series of famous chess matches against Garry Kasparov, but a different comparison may be made for the game of backgammon.

Before DeepMind tackled playing Atari games or built AlphaGo there was TD-Gammon, the first algorithm to reach an expert level of play in backgammon. Gerald Tesauro published his paper in 1992 describing TD-Gammon as a neural network trained with reinforcement learning. It is referenced in both Atari and AlphaGo research papers and helped set the groundwork for many of the advancements made in the last few years.

Temporal-Difference Learning

TD-Gammon consists of a simple three-layer neural network trained using a reinforcement learning technique known as TD-Lambda or temporal-difference learning with a trace decay parameter lambda (λ). The neural network acts as a “value function” which predicts the value, or reward, of a particular state of the game for the current player.

During training, the neural network iterates over all valid moves for the current player, evaluates each one, and selects the move with the highest value. Because the network evaluates moves for both players, it’s effectively playing against itself. Using TD-Lambda we want to improve the neural network so that it can reasonably predict the most likely outcome of a game from a given board state. It does this by learning to reduce the difference between the value for the next state and the current state.

Let’s start with a loss function, which describes how well the network is performing for any state at time t:

Loss function: mean squared error of the difference between our neural network’s output for the next state and the output for the current state. The variable α is a small scalar to control the learning rate.

Here we want to minimize the mean squared error of the difference between the next prediction and the current prediction. Basically, we want our predictions about the present to match our predictions about the future. This in itself isn’t very useful until we know how the game ends so for the final step of the game we modify the loss function:

Same as above, but z represents the true outcome of the game.

Where z is the actual outcome of the game. Together these two loss functions work okay but the network will converge slowly and never reach a strong level of play.

Temporal Credit Assignment

To make our predictions more useful we need to solve the problem of temporal credit assignment. Basically, which actions did the player take in the past that resulted in the desired outcome in the future. Right now the loss only incorporates two consecutive steps and we want to stretch that out.

With the loss function above our parameter updates will look something like this:

Parameter updates for the loss function L.

Where θ is the network’s parameters (weights), α is the learning rate and δ is the difference we defined above:

Definition of δ for intermediate and end-game states where f is the final time-step of the game.

Now rather than include a single gradient we want to include all past gradients while paying more attention to the most recent. This is accomplished by keeping a history of gradients then decaying each by increasing amounts of λ that reflect how old the gradient has become:

The full definition for TD-Lambda includes a sum over all previous gradients, decayed by λ.
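
Written out in the standard TD-Lambda form, the update described by the captions above looks roughly like this. Here V_t is an assumed symbol for the network's output at time step t; θ, α, δ, λ, z and f are as defined in the text.

```latex
\theta_{t+1} - \theta_t
  = \alpha \, \delta_t \sum_{k=1}^{t} \lambda^{\,t-k} \, \nabla_\theta V_k,
\qquad
\delta_t =
\begin{cases}
  V_{t+1} - V_t & \text{if } t < f \\
  z - V_t       & \text{if } t = f
\end{cases}
```

With λ = 0 only the most recent gradient survives, which recovers the two-step update from the previous section; values closer to 1 spread credit further back in the game.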

Eligibility Traces

Keeping a running history of gradients can become memory intensive depending on the size of the network and the length of the game. An elegant solution to this problem is to use something called an “eligibility trace”. Eligibility traces replace the gradient sum of the parameter update with a single moving gradient. The eligibility trace is defined as:

Definition of an eligibility trace decayed by λ.

Basically, we decay our eligibility trace by λ then add the new gradient. With this, our parameter update becomes:

New parameter update for TD-Lambda, using an eligibility trace in place of the gradient.

This effectively allows our parameter updates to take into account decisions made in the past. Now when we backpropagate the end game state, we take into account the gradients from earlier states in the game while we avoid keeping a complete history of gradients.
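
A small sketch of why the trace is equivalent to the explicit sum, using plain Python lists for the parameter vectors. The function names are hypothetical and this is not the code from the repo.

```python
def decayed_gradient_sum(gradients, lam):
    """Explicit TD-Lambda sum: every past gradient, decayed by its age."""
    t = len(gradients) - 1
    size = len(gradients[0])
    total = [0.0] * size
    for k, grad in enumerate(gradients):
        weight = lam ** (t - k)  # older gradients get smaller weights
        for j in range(size):
            total[j] += weight * grad[j]
    return total


def update_trace(trace, grad, lam):
    """Eligibility trace: decay the running trace by lambda, add the new gradient."""
    return [lam * e + g for e, g in zip(trace, grad)]


def td_lambda_step(theta, trace, delta, alpha):
    """Parameter update: theta <- theta + alpha * delta * trace."""
    return [w + alpha * delta * e for w, e in zip(theta, trace)]
```

Feeding the same gradients through update_trace one step at a time gives the same vector as decayed_gradient_sum, but only a single vector the size of the parameters has to be kept in memory.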

Results

At the start of training, each game can take hundreds or thousands of turns to complete, effectively taking a random strategy. As the network learns, games require only around 50–100 turns and will outperform an opponent making random moves after around 1000 games (about an hour of training).

The average loss for a game can never really reach zero because there’s more uncertainty at the beginning of a game but it can be useful to visualize convergence:

Average loss for each of 5,000 games.

Conclusion

Hopefully, this post shed some light on a small part of the history of recent deep reinforcement learning papers and the temporal-difference learning algorithm. If you’re interested in learning more about reinforcement learning definitely check out Richard Sutton’s book on the topic. You can also download the code for this implementation of TD-Gammon and play against the pre-trained network included in the repo.

An LSTM Odyssey

Jim Fleming — Tue, 26 Jan 2016 16:31:19 GMT

This week I read LSTM: A Search Space Odyssey. It’s an excellent paper that systematically evaluates the different internal mechanisms of an LSTM (long short-term memory) block by disabling each mechanism in turn and comparing their performance. We’re going to implement each of the variants in TensorFlow and evaluate their performance on the Penn Tree Bank (PTB) dataset. This will obviously not be as thorough as the original paper but it allows us to see, and try out, the impact of each variant for ourselves.

TL;DR Check out the Github repo for results and variant definitions.

Vanilla LSTM

We’ll start with a setup similar to TensorFlow’s RNN tutorial. The primary difference is that we’re going to use a very simple re-implementation for the LSTM cell defined as follows:

LSTM equations from section 2.

This corresponds to the “vanilla” LSTM from the paper. Each equation defines a particular component of the block: block input (z), input gate (i), forget gate (f), cell state (c), output gate (o) and block output (y). Both g and h represent the hyperbolic tangent function and sigma represents the sigmoid activation function. The circle dot represents element-wise multiplication.

Here’s the same thing in code:

https://medium.com/media/4ab196d3a391dbb882f9fce7e3507248/href

Be sure to check out the full source for the rest of the cell definition. Mostly we create a new class inheriting from RNNCell and use the above code as the body of __call__. The nice part about this setup is that we can utilize MultiRNNCell to stack the LSTMs into multiple layers.

Notice that we initialize all of our parameters using get_variable. This is necessary so that we can reuse these variables for each time step rather than creating new parameters at each step. Also, all parameters are transposed from the paper’s definitions to avoid additional graph operations.

Then we define each equation as operations in the graph. Many of the operations have reversed inputs from the equations so that the matrix multiplications produce the correct dimensionality. Other than these details we’re directly translating the equations.
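
For readers without the repo at hand, here is a NumPy sketch of a single step of the vanilla (peephole) LSTM as I understand the paper's equations. It uses NumPy rather than TensorFlow, and the parameter names and init_params helper are assumptions for illustration. As in the post, the weights are stored transposed so inputs multiply from the left.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, y_prev, c_prev, p):
    """One vanilla LSTM step for a batch of row vectors.

    x: [batch, n_in]; y_prev, c_prev: [batch, n_units].
    p: dict with W_* [n_in, n_units], R_* [n_units, n_units],
       peepholes p_* [n_units] and biases b_* [n_units].
    """
    # Block input
    z = np.tanh(x @ p["W_z"] + y_prev @ p["R_z"] + p["b_z"])
    # Input and forget gates peek at the previous cell state
    i = sigmoid(x @ p["W_i"] + y_prev @ p["R_i"] + c_prev * p["p_i"] + p["b_i"])
    f = sigmoid(x @ p["W_f"] + y_prev @ p["R_f"] + c_prev * p["p_f"] + p["b_f"])
    # Cell state
    c = i * z + f * c_prev
    # Output gate peeks at the new cell state
    o = sigmoid(x @ p["W_o"] + y_prev @ p["R_o"] + c * p["p_o"] + p["b_o"])
    # Block output
    y = o * np.tanh(c)
    return y, c


def init_params(n_in, n_units, seed=0):
    """Small random parameters, for demonstration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in "zifo":
        p["W_" + name] = rng.normal(0.0, 0.1, (n_in, n_units))
        p["R_" + name] = rng.normal(0.0, 0.1, (n_units, n_units))
        p["b_" + name] = np.zeros(n_units)
    for name in "ifo":
        p["p_" + name] = rng.normal(0.0, 0.1, n_units)
    return p
```

Keeping each gate on its own line makes it easy to see where a variant would change things, at the cost of the fused matrix multiplication a production cell would use.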

Note that from a performance perspective, this is a naïve implementation. If you look at the source for TensorFlow’s LSTMCell you’ll see that all of the cell inputs and states are concatenated together before doing any matrix multiplication. This is to improve performance, however, since we’re more interested in taking the LSTM apart, we’ll keep things simple.

Running this vanilla LSTM on the included notebook we obtain a test perplexity (e^cost) of less than 100. So far so good. This will serve as our baseline to compare to the other variants. Below is the cost (average negative log probability of the target words) on the validation set after each epoch:

Vanilla cost on the validation set

Variants

The most helpful bits for implementing each of the variants can be found in appendix A3 of the paper. The gate omission variants such as no input gate (NIG), no forget gate (NFG), and no output gate (NOG) simply set their respective gates to 1 (be sure to use floats, not integers, here):

NIG sets i to 1, NFG sets f to 1 and NOG sets o to 1.

The no input activation function (NIAF) and no output activation function (NOAF) variants remove their input or output activation functions, respectively:

NIAF removes the g(x) activation function, while NOAF removes the h(x) activation function.

The no peepholes (NP) variant removes peepholes from all three gates:

For all three gates remove the peepholes.

The coupled input-forget gate (CIFG) variant ties the forget gate to the input gate, setting f to 1 − i.

The final variant, full gate recurrence (FGR), is the most complex, essentially allowing each gate’s previous state to interact with each gate’s next state:

Recurrent connections are added for each of the gates.

In many of the variants, we can remove parameters no longer needed to compute the cell. The FGR variant, however, adds significantly more parameters (9 additional square matrices) which also increases training time.

To implement each, we’ll simply duplicate our vanilla LSTM cell implementation and make the necessary modifications for the variant. There are too many to show here but you can view the full source for each variant on Github. To train each, we’ll use the same hyperparameters from the vanilla LSTM trial. This probably isn’t fair and a more thorough analysis (as performed in the paper) would try to find the best hyperparameters for each variant.

Training progress of model variants.

Results

The NFG and NOG variants fail to converge to anything useful while the NIAF variant diverges significantly after around the 8th epoch. (This divergence could probably be fixed with learning rate decay which I omitted for simplicity.)

Diverging variants

In contrast, the NIG, CIFG, NP and FGR variants all converge. The NIG and FGR variants do not produce great results while the NP and CIFG variants perform similarly to the vanilla LSTM.

Converging variants

Finally the NOAF variant. Its poor performance is likely due to the lack of clamping from the output activation function so its cost explodes:

Here are the test perplexities for each variant:

Conclusion

Overall it’s been fun dissecting the LSTM. Feel free to try out the code yourself and if you’re interested in taking this further I recommend running comparisons with GRUs, looking at fANOVA or extending what’s here with more thorough analysis.

Highway Networks with TensorFlow

Jim Fleming — Tue, 29 Dec 2015 17:58:31 GMT

This week I implemented highway networks to get an intuition for how they work. Highway networks, inspired by LSTMs, are a method of constructing networks with hundreds, even thousands, of layers. Let’s see how we construct them using TensorFlow.

TL;DR Fully-connected highway repo and convolutional highway repo.

Implementation

For comparison, let’s start with a standard fully-connected (or “dense”) layer. We need a weight matrix and a bias vector then we’ll compute the following for the layer output:

Computing the output of a dense layer. (Bias omitted for simplicity and to match the paper.)

https://medium.com/media/6a7f893e3428ecdeae5ec8558f617d16/href

Here’s what a dense layer looks like as a graph in TensorBoard:

A dense layer in TensorBoard.

For the highway layer what we want are two “gates” that control the flow of information. The “transform” gate controls how much of the activation we pass through and the “carry” gate controls how much of the unmodified input we pass through. Otherwise, the layer largely resembles a dense layer with a few additions:

Computing the highway layer output. (Bias omitted for simplicity and to match the paper.)

An extra set of weights and biases to be learned for the gates.
The transform gate operation (T).
The carry gate operation (C or just 1 - T).
The layer output (y) with the new gates.

What happens is that when the transform gate is 1, we pass through our activation (H) and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input (x), while the activation is suppressed.
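
A minimal NumPy sketch of that computation, y = H(x)·T(x) + x·(1 − T(x)), for a batch of row vectors. The ReLU activation and the argument names are assumptions made for illustration; the original post uses TensorFlow.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def highway_layer(x, W_h, b_h, W_t, b_t):
    """Highway layer: blend the activation with the unmodified input."""
    H = np.maximum(0.0, x @ W_h + b_h)  # activation (ReLU chosen for illustration)
    T = sigmoid(x @ W_t + b_t)          # transform gate
    C = 1.0 - T                         # carry gate
    return H * T + x * C
```

Because x is added back element-wise, W_h and W_t have to be square, which is why consecutive highway layers must be the same size. A strongly negative b_t pushes T toward 0, so the layer starts out close to the identity.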

https://medium.com/media/69b56f2c29f3742ffecb589ca6e81cfd/href

Here’s what the highway layer graph looks like in TensorBoard:

A highway layer in TensorBoard.

Using a highway layer in a network is also straightforward. One detail to keep in mind is that consecutive highway layers must be the same size but you can use fully-connected layers to change dimensionality. This becomes especially complicated in convolutional layers where each layer can change the output dimensions. We can use padding (‘SAME’) to maintain each layer’s dimensionality.

Otherwise, by simply using hyperparameters from the TensorFlow docs (i.e. no hyperparameter search) the fully-connected highway network performed much better than a fully-connected network. Using MNIST as my simple trial:

20 fully-connected layers fail to achieve more than 15% accuracy.
18 highway layers (with two fully-connected layers to transform the input and output) achieve ~95% accuracy, which is also much better than a shallow network, which only reaches 91%.

Now that we have a highway network, I wanted to answer a few questions that came up for me while reading the paper. For instance, how deep will the network converge? The paper briefly mentions 1000 layers:

In pilot experiments, SGD did not stall for networks with more than 1000 layers. (2.2)

Can we train with 1000 layers on MNIST?

Yes, also reaching around 95% accuracy. Try it out with a carry bias around -20.0 for MNIST (from the paper the network will only utilize ~15 layers anyway). The network can probably go even deeper since it’s just learning to carry the last 980 layers or so. We can’t do much useful at or past 1000 layers so that seems sufficient for now.

What happens if you set very low or very high carry biases?

In either extreme the network simply fails to converge in a reasonable amount of time. In the case of low biases (more positive), the network starts as if the carry gates aren’t present at all. In the case of high biases (more negative), we’re putting more emphasis on carrying and the network can take a long time to overcome that. Otherwise, the biases don’t seem to need to be exact, at least on this simple example. When in doubt, start with high biases (more negative) since it’s easier to learn to overcome carrying than to train without carry gates (which is just a plain network).
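To get a feel for the scale of these biases, here is a rough worked example. It assumes, as in the paper, that the transform gate is a sigmoid, and that the bias discussed here is the transform gate's bias b_T (the symbols W_T and b_T are my notation). At initialization, when the weighted input is small, the gate value is roughly the sigmoid of the bias alone:

```latex
T(x) = \sigma(W_T x + b_T) \approx \sigma(b_T), \qquad \sigma(b) = \frac{1}{1 + e^{-b}}

\sigma(-1) \approx 0.27, \qquad
\sigma(-3) \approx 0.047, \qquad
\sigma(-20) \approx 2.1 \times 10^{-9}
```

Under those assumptions, a bias of -20 starts each layer almost entirely in carry mode, which is consistent with the network taking a long time to move away from carrying when the bias is pushed too far negative.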

Conclusion

Overall I was happy with how easy highway networks were to implement. They’re fully differentiable with only a single additional hyperparameter for the initial carry bias. One downside is that highway layers do require additional parameters for the transform weights and biases. However, since we can go deeper, the layers do not need to be as wide which can compensate.

Here are the complete notebooks if you want to play with the code: fully-connected highway repo and convolutional highway repo.

Follow me on Twitter for more posts like these. We also do applied research to solve machine learning challenges.

Highway Networks with TensorFlow was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Loading TensorFlow graphs from Node.js

Jim Fleming — Fri, 04 Dec 2015 18:04:36 GMT

Check out the related post: Loading a TensorFlow graph with the C++ API.

Even though the full C API for TensorFlow is not yet available, we can still use it to load TensorFlow graphs and evaluate them from other languages. This is incredibly useful for embedding pre-trained models in other applications. Embedding is one of the most interesting use cases for TensorFlow as it cannot be accomplished as easily with Theano.

Note that while all of the examples here will use Node.js, the steps are nearly identical in any language with C FFI support (e.g. Rust, Go, C#, etc.).

Requirements

Install Bazel: Google’s build tool used to compile things for TensorFlow.
Clone the TensorFlow repo.

git clone --recursive https://github.com/tensorflow/tensorflow

Compiling a shared library

We’ll start by compiling a shared library from TensorFlow using Bazel.

UPDATE: The following build rule for creating a shared library is now part of TensorFlow: https://github.com/tensorflow/tensorflow/pull/695

Create a new folder in the TensorFlow repo at tensorflow/tensorflow/libtensorflow/.
Inside this folder we’re going to create a new BUILD file which will contain a single call to cc_binary with the linkshared option set to 1 so that we get a .so from the build. The name of the binary must end in .so or it will not work.

Here’s the final directory structure:

tensorflow/tensorflow/libtensorflow/
tensorflow/tensorflow/libtensorflow/BUILD

Below is the complete BUILD file:

https://medium.com/media/1b4a46704d499d219f682b5dca1d164f/href

From the root of the repository, run ./configure.
Compile the shared library with bazel build :libtensorflow.so and locate the generated file from the repo’s root: bazel-bin/tensorflow/libtensorflow/libtensorflow.so

Now that we have our shared library, create a new folder for the host language. Since this is for Node.js I’ll name it tensorflowjs/. This folder can exist outside of the TensorFlow repo since we now have everything needed in the shared library. Copy libtensorflow.so into the new folder.

If you’re on OS X and using Node.js you’ll need to rename the shared library from libtensorflow.so to libtensorflow.dylib. TensorFlow produces an .so however the standard on OS X is dylib. The Node FFI library doesn’t look for .so, only .dylib; however it can read both formats, so we just rename it.

Creating the graph

Just like with the previous C++ tutorial we’re going to create a minimal graph and write it to a protobuf file. (Be sure to name your variables and operations.)

https://medium.com/media/256c36ac384aa51eebef5844af26b285/href

Creating the bindings

Now we can go through the TensorFlow C API header, almost line by line, and write the appropriate binding. Most of the time this is fairly direct, simply copying the signature of the function. I also created variables for many of the common types so they were more legible. For example, any structs which map to void* I declared as variables named after the struct. We can also use the ref-array Node module which provides helpers for types like long long* (essentially an array of long long types) so we’ll define a LongLongArray type to correspond. Otherwise, we just copy the signature:

https://medium.com/media/b9f40ea7881902040d9b007e4bab1cac/href

I also defined a few helper functions to eliminate some of the boilerplate when working with the TensorFlow interface. The first is TF_Destructor, a default tensor destructor for TF_NewTensor. This comment in the TensorFlow source makes it sound like it’s optional but it’s not:

Clients can provide a custom deallocator function so they can pass in memory managed by something like numpy.

Additionally, many TensorFlow functions return a TF_Status struct and checking the status can get tedious. So I defined a function called TF_CheckOK that simply checks if the status code is TF_OK using TF_GetCode. If it’s not, we throw an error using TF_Message to hopefully get a useful error message. (This function loosely corresponds to TF_CHECK_OK in the TensorFlow source.)

And finally, reading a tensor with TF_TensorData only returns a pointer, but to actually read the data we need to extend the returned Buffer to the appropriate length. Creating a Buffer with the correct size is a few lines of boilerplate so I wrapped TF_TensorData to create TF_ReadTensorData which handles that boilerplate for us. Here are the helpers:

https://medium.com/media/0b1969f862d7e06d2d6954743cb07521/href

Now that we’ve defined our interface the steps for loading the graph are the same as with C++:

Initialize a TensorFlow session.
Read in the graph we exported above.
Add the graph to the session.
Setup our inputs and outputs.
Run the graph, populating the outputs.
Read values from the outputs.
Close the session to release resources.

https://medium.com/media/3d0e0c01aa5564c2a835c3fdcd972a41/href

We can load and execute TensorFlow graphs from Node.js! I’ve put the whole thing together into a repo here (you’ll need to provide graph.pb and libtensorflow.dylib since they’re kinda large): https://github.com/jimfleming/tensorflowjs

Follow me on Twitter for more posts like these. We also do applied research to solve machine learning challenges.

Loading TensorFlow graphs from Node.js was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Loading a TensorFlow graph with the C++ API

Jim Fleming — Sat, 21 Nov 2015 20:07:02 GMT

Check out the related post: Loading TensorFlow graphs from Node.js (using the C API).

The current documentation around loading a graph with C++ is pretty sparse so I spent some time setting up a barebones example. In the TensorFlow repo there are more involved examples, such as building a graph in C++. However, the C++ API for constructing graphs is not as complete as the Python API. Many features (including automatic gradient computation) are not available from C++ yet. Another example in the repo demonstrates defining your own operations but most users will never need this. I imagine the most common use case for the C++ API is for loading pre-trained graphs to be standalone or embedded in other applications.

Be aware, there are some caveats to this approach that I’ll cover at the end.

Requirements

Install Bazel: Google’s build tool used to compile things for TensorFlow.
Clone the TensorFlow repo. Be sure to include submodules using the recursive flag (thanks to @kristophergiesing for catching this):

git clone --recursive https://github.com/tensorflow/tensorflow

Creating the graph

Let’s start by creating a minimal TensorFlow graph and writing it out as a protobuf file. Make sure to assign names to your inputs and operations so they’re easier to reference when we execute the graph later. The nodes do have default names but they aren’t very useful: Variable_1 or Mul_3. Here’s an example created with Jupyter:

https://medium.com/media/256c36ac384aa51eebef5844af26b285/href

Creating a simple binary or shared library

Let’s create a new folder like tensorflow/tensorflow/ for your binary or library to live. I’m going to call the project loader since it will be loading a graph.

Inside this project folder we’ll create a new file called .cc (e.g. loader.cc). If you’re curious, the .cc extension is essentially the same as .cpp but is preferred by Google’s code guidelines.

Inside loader.cc we’re going to do a few things:

Initialize a TensorFlow session.
Read in the graph we exported above.
Add the graph to the session.
Setup our inputs and outputs.
Run the graph, populating the outputs.
Read values from the outputs.
Close the session to release resources.

https://medium.com/media/363f120546fba9e1303772b43cfde666/href

Now we create a BUILD file for our project. This tells Bazel what to compile. Inside we want to define a cc_binary for our program. You can also use the linkshared option on the binary to produce a shared library or the cc_library rule if you’re going to link it using Bazel.

https://medium.com/media/a2e2ec31a200fea40d00fa601a204d47/href

Here’s the final directory structure:

tensorflow/tensorflow/loader/
tensorflow/tensorflow/loader/loader.cc
tensorflow/tensorflow/loader/BUILD

Compile & Run

From the root of the tensorflow repo, run ./configure
From inside the project folder call bazel build :loader
From the repository root, go into bazel-bin/tensorflow/loader
Copy the graph protobuf to models/graph.pb
Then run ./loader and check the output!

You could also call bazel run :loader to run the executable directly, however the working directory for bazel run is buried in a temporary folder and ReadBinaryProto looks in the current working directory for relative paths.

And that should be all we need to do to compile and run C++ code for TensorFlow.

The last things to cover are the caveats I mentioned:

The build is huge, coming in at 103MB, even for this simple example. Much of this is for TensorFlow, CUDA support and numerous dependencies we never use. This is especially true since the C++ API doesn’t support much functionality right now, as a large portion of the TensorFlow API is Python-only. There is probably a better way of linking to TensorFlow (e.g. shared library) but I haven’t gotten it working yet.
There doesn’t seem to be a straightforward way of building this outside of the TensorFlow repo because of Bazel (many of the modules needed to link to are marked as internal). Again, there is probably a solution to this, it’s just non-obvious.

Conclusion

Hopefully someone can shed some light on these last points so we can begin to embed TensorFlow graphs in applications. If you are that person, message me on Twitter or email. We also do applied research to solve machine learning challenges.

Loading a TensorFlow graph with the C++ API was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Complex types with Rust’s FFI

Jim Fleming — Thu, 09 Jul 2015 21:25:20 GMT

Interop with object methods, structs, and arrays

When I wrote about calling Rust functions from Unity3D, it was my first time working with a foreign function interface (FFI) and there was a lot I didn’t understand beyond calling simple functions with primitives.

How do I call methods? How do I pass arrays? How do I pass structs back and forth? Here’s what I’ve come up with…

Note that all of the examples below use Node.js. The principles are the same in Unity3D, C#, and other languages.

A quick note about usize

Often, marshaling between types is pretty straightforward: f64 to double, u64 to ulong, or simply i32 to int. Rust’s usize, however, turned out to be the most varied, and most ambiguous, type-mapping amongst host languages. The usize type represents an unsigned number the width of a pointer (like 32-bit or 64-bit). This varies by target platform, so while you could use a ulong or uint32 on your machine it might break elsewhere. Since Rust uses usize quite often for ranges and indices, always make sure to use a type that represents a platform-specific width. In Node.js you’ll want size_t and in C# (or Unity3D) UIntPtr seems to do the trick.
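As a small illustration of why the host-side type matters, here is a sketch of two exported functions that take or return usize. The function names element_at and usize_width are hypothetical and not part of the original code; a host could call the second one to sanity-check that its binding uses the right width.

```rust
use std::mem::size_of;

/// Hypothetical example: returns the element at `index`, or 0 when the
/// index is out of range. Because `index` is a `usize`, the host has to
/// pass a pointer-width integer (size_t in C or Node.js, UIntPtr in C#).
/// Note: the 2024 edition spells the attribute `#[unsafe(no_mangle)]`.
#[no_mangle]
pub extern "C" fn element_at(index: usize) -> u32 {
    const VALUES: [u32; 4] = [10, 20, 30, 40];
    VALUES.get(index).copied().unwrap_or(0)
}

/// Hypothetical helper: reports the width of `usize` in bytes for the
/// target this library was compiled for (4 on 32-bit, 8 on 64-bit).
#[no_mangle]
pub extern "C" fn usize_width() -> usize {
    size_of::<usize>()
}
```

If the host declares the argument as a fixed 32-bit integer, this will appear to work on a 32-bit build and silently misread arguments on a 64-bit one, which is the failure mode described above.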

Working with methods

Since we’re effectively passing memory references around, the notion of an object with methods doesn’t really exist across the FFI boundary. To work around this limitation, we can define static functions that operate on pointers that we reinterpret as the original object. The host then holds this pointer and uses it when calling these functions.

Here’s a simple counter struct with increment and decrement methods that we’ll use as the basis for our examples:

https://medium.com/media/501ff3d905716ac9fd26e22435829af3/href

Now let’s add our FFI. At a minimum we need to provide:

A constructor — the constructor instantiates an object in memory and returns a pointer to it.
A destructor for the instantiated objects. We’re responsible for cleaning up memory allocated by the foreign language.
A function to act as a proxy for each method on the object that we want to call.

Here’s what that looks like:

https://medium.com/media/6ec27d545c15d715fb571517da78d0ae/href

We utilize Rust for memory allocation to create our counter on the heap, using Box, then transmute this box into a raw pointer. This trickery avoids having to manually allocate the memory and seems to be the most canonical way to allocate the counter. Our destructor works similarly by transmuting the counter’s pointer back into a Box then letting it automatically drop.

Finally, each function acting as a proxy takes a pointer as its first argument. The function converts this pointer to the original type and calls the desired method, passing through any arguments and finally returning the result (if any). Unlike our destructor, we don’t want to transmute these pointers back into a box until we’re ready to destroy the counter.
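For readers without access to the embedded code, here is a minimal sketch of the constructor, proxy and destructor pattern. It is not the original listing: the function names are hypothetical, the counter uses an i32, and it swaps in Box::into_raw and Box::from_raw from the standard library where the text above uses transmute. The ownership story is the same.

```rust
pub struct Counter {
    value: i32,
}

impl Counter {
    fn new(value: i32) -> Counter {
        Counter { value }
    }
    fn increment(&mut self) -> i32 {
        self.value = self.value.wrapping_add(1);
        self.value
    }
    fn decrement(&mut self) -> i32 {
        self.value = self.value.wrapping_sub(1);
        self.value
    }
}

/// Constructor: allocate the counter on the heap and give the host an
/// owning raw pointer. (The 2024 edition spells the attribute below as
/// `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub extern "C" fn counter_new(value: i32) -> *mut Counter {
    Box::into_raw(Box::new(Counter::new(value)))
}

/// Proxy: borrow the counter behind the pointer without taking ownership.
/// Safety: `ptr` must come from `counter_new` and not have been freed.
#[no_mangle]
pub unsafe extern "C" fn counter_increment(ptr: *mut Counter) -> i32 {
    let counter = unsafe { &mut *ptr };
    counter.increment()
}

/// Proxy for the decrement method, same contract as `counter_increment`.
#[no_mangle]
pub unsafe extern "C" fn counter_decrement(ptr: *mut Counter) -> i32 {
    let counter = unsafe { &mut *ptr };
    counter.decrement()
}

/// Destructor: rebuild the Box so Rust drops the counter and frees it.
/// Safety: `ptr` must come from `counter_new` and must not be used again.
#[no_mangle]
pub unsafe extern "C" fn counter_free(ptr: *mut Counter) {
    if ptr.is_null() {
        return;
    }
    drop(unsafe { Box::from_raw(ptr) });
}
```

Only the destructor turns the pointer back into a Box. The proxies borrow through the pointer, so the counter stays alive between calls.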

Calling the FFI is pretty straightforward, relying on the host language’s pointer type:

https://medium.com/media/524799f8b75f765039a92808d9d602d5/href

Working with structs

Sometimes functions may require a number of arguments. To avoid a complicated function signature, we can use configuration structs to group related arguments. Structs work well for this task because they can be described linearly in memory with a flat structure (matching the C struct definition) so passing a struct in and out of Rust is pretty straightforward. Classes, on the other hand, involve more indirection and, therefore, cannot be easily passed.

The main concern for the host language is the memory layout of the struct properties. Dynamic languages like Node.js provide tools for defining structs with the appropriate layout. In C# you can use the StructLayout attribute with LayoutKind.Sequential.

In this example, the counter is modified to accept a configuration struct containing the initial value and the amount to increment and decrement by:

https://medium.com/media/05d180992dc7c78db077d6405bd5e8c7/href

With the FFI, Rust handles the struct conversion directly so we don’t need to do anything special:

https://medium.com/media/e8ff6aa31e896cb9fce0f2e10f384ea2/href
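As a rough sketch of what passing a configuration struct by value can look like: the field and function names below (init, by, counter_with_args and so on) are assumptions for illustration, not the original listing. The one piece that is not optional is #[repr(C)], which is what makes the Rust layout match the C struct definition the host describes.

```rust
/// Configuration passed by value across the FFI boundary.
/// `#[repr(C)]` fixes the field order and padding to the C layout.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Args {
    pub init: u32,
    pub by: u32,
}

pub struct Counter {
    value: u32,
    by: u32,
}

impl Counter {
    fn new(args: Args) -> Counter {
        Counter { value: args.init, by: args.by }
    }
    fn increment(&mut self) -> u32 {
        self.value = self.value.wrapping_add(self.by);
        self.value
    }
}

/// Hypothetical constructor taking the struct by value.
/// (The 2024 edition spells the attribute `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub extern "C" fn counter_with_args(args: Args) -> *mut Counter {
    Box::into_raw(Box::new(Counter::new(args)))
}

/// Hypothetical proxy: increments by the configured step.
/// Safety: `ptr` must come from `counter_with_args` and still be live.
#[no_mangle]
pub unsafe extern "C" fn counter_step(ptr: *mut Counter) -> u32 {
    let counter = unsafe { &mut *ptr };
    counter.increment()
}

/// Hypothetical destructor for counters made by `counter_with_args`.
/// Safety: `ptr` must not be used after this call.
#[no_mangle]
pub unsafe extern "C" fn counter_release(ptr: *mut Counter) {
    if !ptr.is_null() {
        drop(unsafe { Box::from_raw(ptr) });
    }
}
```

The Counter itself stays opaque to the host, so only Args needs a C-compatible layout.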

In Node, we define a matching struct type for Args and use it in our interface specification:

https://medium.com/media/d4ce1636b2ac19359fe7eae35fe12f0d/href

Working with arrays

Passing an array turns out to be the least straightforward of the three techniques since we cannot simply pass the array back and forth like we can with pointers or structs. An array can most generally be represented by a pointer to the first element in the array and a length so that’s what we’ll use.

Another issue is ownership: who owns the array’s memory? The safest option is to let the host be responsible for the memory since it has the most information about how the memory should be freed. You pass an array in, manipulate it in place and then, instead of returning the array, the caller can simply read its contents when the function is complete.

The array type in Rust must have a known length at compile time so we need to use a slice, or a “view” into an array, which we’ll sum into our counter:

https://medium.com/media/b740b44e73af4eead7f2d77e258bd753/href

In the FFI we need a pointer to the first value in the slice and its length. Then we can use std::slice::from_raw_parts to reassemble the slice (or std::vec::Vec::from_raw_parts to create a vector).

https://medium.com/media/f811477cabfd6867e2bd195969abc6c3/href
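Here is a minimal sketch of the pointer-plus-length approach using std::slice::from_raw_parts. The names (add_all, counter_add_all) are hypothetical, and the null and zero-length guard is my addition, since from_raw_parts requires a non-null, aligned pointer even for an empty slice.

```rust
use std::slice;

pub struct Counter {
    pub value: u32,
}

impl Counter {
    /// Adds every element of `values` to the counter and returns the total.
    /// Wrapping arithmetic keeps an overflow from panicking inside an
    /// `extern "C"` function.
    pub fn add_all(&mut self, values: &[u32]) -> u32 {
        self.value = values.iter().fold(self.value, |acc, v| acc.wrapping_add(*v));
        self.value
    }
}

/// Hypothetical FFI proxy: the host owns the array and passes a pointer to
/// its first element plus the element count.
/// Safety: `counter` must point to a live Counter, and `values` must point
/// to `len` readable u32 values for the duration of the call.
/// (The 2024 edition spells the attribute `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub unsafe extern "C" fn counter_add_all(
    counter: *mut Counter,
    values: *const u32,
    len: usize,
) -> u32 {
    if counter.is_null() {
        return 0;
    }
    // Borrow the host's memory as a slice; nothing is copied or freed here.
    let values: &[u32] = if values.is_null() || len == 0 {
        &[]
    } else {
        unsafe { slice::from_raw_parts(values, len) }
    };
    let counter = unsafe { &mut *counter };
    counter.add_all(values)
}
```

Because the slice only borrows the host's buffer, Rust never frees it, which fits the ownership rule above: the host allocates and releases the array.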

From the host language we can simply specify an array type as the argument:

https://medium.com/media/d809de5717a73791e27a4ed7088e2cfd/href

A better interface

To make things even cleaner, let’s wrap up our host FFI into a class that exposes a more natural interface. Most importantly we can hide the use of the pointer since the caller should not need to worry about it (and misuse of the pointer can cause errors or unexpected behavior).

https://medium.com/media/0468d0437d7d7bb8efc6c3befb03bf46/href
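The wrapper above is written for the Node.js host. The same idea, hiding the raw pointer behind a type that owns it, can be sketched in Rust for a Rust-side caller. Everything below is hypothetical: the raw_counter_* functions are stand-ins for the exported FFI functions, and CounterHandle is a name I made up for the wrapper.

```rust
pub struct Counter {
    value: i32,
}

// Stand-ins for the raw FFI surface: constructor, proxy and destructor.
pub fn raw_counter_new(value: i32) -> *mut Counter {
    Box::into_raw(Box::new(Counter { value }))
}

/// Safety: `ptr` must come from `raw_counter_new` and still be live.
pub unsafe fn raw_counter_increment(ptr: *mut Counter) -> i32 {
    let counter = unsafe { &mut *ptr };
    counter.value = counter.value.wrapping_add(1);
    counter.value
}

/// Safety: `ptr` must come from `raw_counter_new` and not be used again.
pub unsafe fn raw_counter_free(ptr: *mut Counter) {
    drop(unsafe { Box::from_raw(ptr) });
}

/// Wrapper that owns the pointer so callers never touch it directly.
pub struct CounterHandle {
    ptr: *mut Counter,
}

impl CounterHandle {
    pub fn new(value: i32) -> CounterHandle {
        CounterHandle { ptr: raw_counter_new(value) }
    }

    pub fn increment(&mut self) -> i32 {
        // The pointer is private and valid for the lifetime of the handle.
        unsafe { raw_counter_increment(self.ptr) }
    }
}

impl Drop for CounterHandle {
    // The destructor runs exactly once, when the handle goes out of scope.
    fn drop(&mut self) {
        unsafe { raw_counter_free(self.ptr) }
    }
}
```

The design choice is the same as in the host class: the pointer is a private field, so it cannot be freed twice or used after free through the public interface.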

Conclusion

And that’s it! You can play around with the code samples on Github.

Below are some of the resources I used when researching how to do the things in this post.

If I got anything wrong or if you have any questions please let me know via Twitter or email.

References

Complex types with Rust’s FFI was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Rust(lang) in Unity3D

Jim Fleming — Mon, 08 Jun 2015 23:52:58 GMT

How to use Unity’s Native Plugin interface to call fast, safe code in Rust

Lots of people are excited about Rust for its applications to game development. Writing native plugins in Unity3D usually means C, C++ or Objective-C and no real memory safety within the underlying code. Now that Rust has hit 1.0 I looked into calling Rust from Unity3D and it turns out to be surprisingly simple.

I should note that this guide targets OS X, not Windows. The process should be similar, likely substituting “dll” for each “dylib”. Check the referenced Unity Native Plugin guide for Windows-specifics.

From Rust

Let’s start with two simple Rust functions that return their integer inputs doubled and tripled, respectively:

https://medium.com/media/73f4fe4b56c94f85b2d547138266c339/href
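For reference, a minimal sketch of what two such exported functions can look like. The names double_input and triple_input are assumptions, not necessarily the ones used in the original listing.

```rust
/// Returns the input doubled. `extern "C"` selects the C calling convention
/// and `#[no_mangle]` keeps the symbol name so the host can look it up.
/// (The 2024 edition spells the attribute `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub extern "C" fn double_input(input: i32) -> i32 {
    // Wrapping arithmetic avoids a panic on overflow in debug builds.
    input.wrapping_mul(2)
}

/// Returns the input tripled.
#[no_mangle]
pub extern "C" fn triple_input(input: i32) -> i32 {
    input.wrapping_mul(3)
}
```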

Rust uses a simple config file to define the build output and a command-line tool called Cargo to perform the actual builds. Since Unity loads libraries dynamically we want to specify “dylib” as our “crate-type” in our Cargo manifest:

https://medium.com/media/76382b410828be7e68e8e7c236ebb022/href

Next, we run cargo build and we’re done. Pretty much the most straightforward compile process I can imagine.

Inside our target/debug/ (or target/release/ for release builds) folder should be a file named lib.dylib where name corresponds to the lib name property in the config file above.

From Unity3D

We want to copy this library into our Unity project’s Assets/Plugins folder. On OS X, Unity expects native plugins to have a “.bundle” extension. We can simply rename our compiled lib’s extension from “.dylib” to “.bundle”. This works because the underlying command that loads the library understands both formats.

In Unity, we create a MonoBehaviour (a regular class works too), add a static extern function interface and tag it with the DllImport attribute pointing to the name of our library in Assets/Plugins (omitting the extension):

https://medium.com/media/1df8acc2761db4194a9165bb44ed3acc/href

If all goes well, playing Unity3D should produce the following output:

https://medium.com/media/b514e1c65f556dfcd31f4fa212d09de5/href

Conclusion

I’ve avoided writing native plugins for my games in the past due to the inherent complications around memory handling in production code. Rust makes it easier to write fast, safe code to be run within Unity, giving us an alternative to C, C++ or Objective-C for areas of high-performance code.

If you’re interested in working with more complex types via Rust’s FFI then I’ve written a follow up on my experiences that covers object methods, structs and arrays.

I’m happy to answer any questions on Twitter: @jimmfleming

References

Complete project for Unity5: https://github.com/jimfleming/unity-to-rust
Rust Once, Run Everywhere: http://blog.rust-lang.org/2015/04/24/Rust-Once-Run-Everywhere.html
Rust FFI documentation: https://doc.rust-lang.org/book/ffi.html
Rust FFI examples for other languages (Node, Python, Ruby, C, etc.): https://github.com/alexcrichton/rust-ffi-examples
Cargo manifest documentation for dynamic and static libraries: http://doc.crates.io/manifest.html#building-dynamic-or-static-libraries
Unity Native Plugins documentation: http://docs.unity3d.com/Manual/NativePlugins.html

Rust(lang) in Unity3D was originally published in Jim Fleming on Medium, where people are continuing the conversation by highlighting and responding to this story.