Neural Net in 10 Frameworks (Lessons Learned)

Ilia Karmanov
7 min read · Sep 6, 2017


GitHub Project Link — Deep Learning Frameworks

Beyond all the technical elements, what I found most interesting about this project was the amazing contribution it gathered from the open-source community. The pull-requests and issues raised by the community helped immensely to bring all the frameworks into line on both accuracy and training-time. It was amazing to see contributions from FAIR researchers, original creators of frameworks (such as Yangqing) and other GitHub users land on a Microsoft employee's repo; this wouldn't have been possible without them. Not only were code suggestions offered, but whole notebooks for different frameworks were supplied!

You can see the original state of the repo here, before the contributions.

The issue

A search for Tensorflow + MNIST produces this complicated-looking tutorial, which avoids higher-level APIs (tf.layers or tf.nn) and doesn't seem detached enough from the input data for one to feel comfortable replacing MNIST with, say, CIFAR. Some tutorials, to avoid the verbose loading of MNIST, use a custom wrapper like framework.datasets.mnist; however, I have two issues with this:

  1. For a beginner it may not be obvious how to re-run on their own data
  2. Comparing this to another framework may be more tricky (does pre-processing differ?)

Other tutorials save MNIST to disk as a text-file (or even a custom database) and then load it again, this time using some TextReaderDataLoader. The idea is to show what would happen if the user had a huge data-set that was too big to load into RAM and required lots of on-the-fly transformations. For a beginner this may be misleading and intimidating, and I often hear the question: “Why do I need to save it, I have it right here as an array!”

The goal

The goal was to show how to write the same neural-net (on a common, custom data-set) using the 8 most popular frameworks - a Rosetta Stone of deep-learning frameworks, to allow data-scientists to easily leverage their expertise from one framework to another (by translating, rather than learning from scratch). A consequence of having the same model in different frameworks is that the frameworks become more transparent in terms of training-time and default-options and we can even compare certain elements.

Being able to quickly translate your model to another framework means you can swap hats if another framework has a layer you would otherwise need to write from scratch, deals with your data-source in a more efficient manner, or is more suited to the platform you are running on (e.g. Android).

For these tutorials, I try to use the highest-level API possible conditional on being able to override conflicting default options, to allow an easier comparison between frameworks. This means that the notebooks are not specifically written for speed.

The notebooks demonstrate that the code structure becomes very similar once higher-level APIs are used and can be roughly represented as the following steps (a code sketch follows the list):

  • Load data into RAM; x_train, x_test, y_train, y_test = cifar_for_library(channel_first=?, one_hot=?)
  • Generate CNN symbol (usually no activation on final dense-layer)
  • Specify loss (cross-entropy often bundled with softmax), optimiser and initialise weights + maybe sessions
  • Train on mini-batches from train-set using custom iterator (common data-source for all frameworks)
  • Predict on fresh mini-batches from test-set, perhaps specifying test-flag for layers like drop-out
  • Evaluate accuracy
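
To make that concrete, here is a rough sketch of the shared structure in Python. The `cifar_for_library` call is the common loading function referred to above; `create_symbol`, `init_model`, `train_on_batch` and `predict` are placeholders standing in for whatever each framework's notebook defines, so treat this as an illustration rather than the exact helpers used.

```python
import numpy as np

EPOCHS, BATCHSIZE = 10, 64

def yield_mb(X, y, batchsize, shuffle=True):
    # Common mini-batch generator shared by every framework notebook
    if shuffle:
        idx = np.random.permutation(len(X))
        X, y = X[idx], y[idx]
    for i in range(len(X) // batchsize):
        sl = slice(i * batchsize, (i + 1) * batchsize)
        yield X[sl], y[sl]

# 1. Load data into RAM (channel ordering / one-hot depend on the framework)
x_train, x_test, y_train, y_test = cifar_for_library(channel_first=True, one_hot=False)

# 2. + 3. Generate the CNN symbol, then attach loss/optimiser and initialise weights
sym = create_symbol()        # no activation on the final dense layer
model = init_model(sym)      # softmax bundled with cross-entropy, SGD-momentum, etc.

# 4. Train on mini-batches from the common generator
for epoch in range(EPOCHS):
    for data, label in yield_mb(x_train, y_train, BATCHSIZE, shuffle=True):
        train_on_batch(model, data, label)   # framework-specific training step

# 5. + 6. Predict (remember the test-flag for drop-out) and evaluate accuracy
y_guess = predict(model, x_test)
print("Accuracy:", float(np.mean(y_guess == y_test)))
```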

Some Caveats

Since we are essentially comparing a series of deterministic mathematical operations (albeit with a random initialization), it does not make sense to compare accuracy across frameworks; instead, the accuracies are reported as checks we want to match, to make sure we are comparing the same model architecture.

The reasons a speed comparison doesn't make much sense either are:

  • Using native data-loaders would likely shave off only a few seconds, since the shuffling would be performed asynchronously. However, for a proper project your data is unlikely to fit into RAM and may also require lots of pre-processing and manipulation (data augmentation). This is what data-loaders are for (a short example follows this list). Yangqing mentions:

We’ve experienced I/O being the main bottleneck for several of our in-production networks, so it would be nice to notify people that when one cares about top performances, using asynchronous I/O would help a lot

  • Only a few layers are used in this example (conv2d, max_pool2d, dropout, fully-connected). For a proper project you may have 3D convolutions, GRUs, LSTMs, etc.
  • The ease of adding your own custom layers (or perhaps the availability of layers such as k-max-pooling or hierarchical softmax), along with the speed at which they run, can make or break your choice of framework. Being able to write a custom layer in Python code and having it execute quickly is vital for research projects.
  • In reality you want to make use of advanced logging (such as tensorboard) to see if your model is converging and also to assist with hyper-parameter tuning. In this example we take all that as exogenous.
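
As a minimal illustration of what a native, asynchronous data-loader buys you, here is roughly what it looks like in PyTorch; the CIFAR-10 download path and transform are illustrative choices, not something taken from the notebooks.

```python
import torch
from torchvision import datasets, transforms

# Dataset read from disk with a per-sample transform (augmentation would go here)
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())

# num_workers > 0 shuffles and pre-fetches batches in background processes,
# so I/O and pre-processing overlap with GPU compute (Yangqing's point above)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                           shuffle=True, num_workers=4)

for data, label in train_loader:
    pass  # training step goes here
```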

Lessons Learned (matching accuracy/time)

Below are some insights I gained while trying to match test-accuracy across frameworks, along with lessons from all the GitHub issues/PRs raised.

  1. The above examples (except for Keras), for ease of comparison, try to use the same level of API and so all use the same generator-function. For MXNet and CNTK I have experimented with a higher-level API, where I use the framework's training generator function. The speed improvement is negligible in this example because the whole dataset is loaded as a NumPy array in RAM and the only processing done each epoch is a shuffle. I suspect the frameworks' generators perform the shuffle asynchronously. Curiously, it seems that the frameworks shuffle on a batch level, rather than on an observation level, which ever so slightly decreases test-accuracy (at least after 10 epochs). For scenarios where we have IO activity, and perhaps pre-processing and data-augmentation on the fly, custom generators would have a much bigger impact on performance.
  2. Enabling CuDNN's auto-tune/exhaustive-search parameter (which selects the most efficient CNN algorithm for images of fixed size) gives a huge performance boost. This had to be manually enabled for Caffe2, PyTorch and Theano (see the code sketch after this list); it appears CNTK, MXNet and Tensorflow have it enabled by default, and I'm not sure about Chainer. Yangqing mentions that the performance boost between cudnnGet (default) and cudnnFind is, however, much smaller on the Titan X GPU; it seems that the K80 + new cudnn makes the problem more prominent in this case. Running cudnnFind for every combination of size in object detection causes serious performance regressions, however, so exhaustive_search should be disabled for object detection.
  3. When using Keras it's important to choose the [NCHW] ordering that matches the back-end framework. CNTK operates with channels first, and by mistake I had Keras configured to expect channels last. It must then have changed the order at each batch, which degraded performance severely.
  4. Tensorflow, PyTorch, Caffe2 and Theano required a boolean supplied to the drop-out layer indicating whether we were training or not (this had a huge impact on test-accuracy: 72% vs 77%).
  5. Tensorflow was a bit annoying and required two more changes: speed was improved a lot by enabling TF_ENABLE_WINOGRAD_NONFUSED and also by changing the dimension ordering supplied to channels-first rather than last (data_format='channels_first'). Enabling WINOGRAD for convolutions also, naturally, improved Keras with TF as a backend.
  6. Softmax is usually bundled with cross_entropy_loss() in most frameworks, so it's worth checking whether you need an activation on your final fully-connected layer, to save time applying it twice.
  7. The kernel initializer can vary across frameworks (I've found this to have a +/- 1% effect on accuracy), and I try to specify xavier/glorot uniform whenever possible and not too verbose.
  8. The type of momentum implemented for SGD-momentum differs; I had to turn off unit_gain (which was on by default in CNTK) to match the other frameworks' implementations.
  9. Caffe2 has an extra optimisation for the first layer of a network (no_gradient_to_input=1) that produces a small speed-boost by not computing gradients for the input. It's possible that Tensorflow and MXNet already enable this by default. Computing this gradient could be useful for research purposes and for networks like deep-dream.
  10. Applying the ReLU activation after max-pooling (instead of before) means you perform the calculation after dimensionality reduction and thus shave off a few seconds. This helped reduce MXNet time by 3 seconds.
  11. Some further checks which may be useful:
  • does specifying the kernel as (3) become a symmetric tuple (3, 3) or a 1D convolution (3, 1)?
  • are strides (for max-pooling) (1, 1) by default, or equal to the kernel size (as in Keras)?
  • default padding is usually off ((0, 0) / 'valid'), but it's worth checking it's not on / 'same'
  • is the default activation on a convolutional layer 'None' or 'ReLU' (as in Lasagne)?
  • the bias initializer may vary (sometimes no bias is included)
  • gradient clipping and the treatment of infinity/NaNs may differ across frameworks
  • some frameworks support sparse labels instead of one-hot (which I use where available, e.g. Tensorflow has tf.nn.sparse_softmax_cross_entropy_with_logits)
  • data-type assumptions may differ: I try to use float32 and int32 for X and y, but torch, for example, needs y as a 64-bit integer tensor (coerced via torch.LongTensor(y).cuda())
  • if the framework has a slightly lower-level API, make sure during testing that you don't compute the gradient, by setting something like training=False
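
For reference, a few of the toggles mentioned above look roughly like this in code. These are the API spellings as I recall them for the framework versions I used (PyTorch, Keras, TensorFlow 1.x, CNTK), so double-check them against your own versions; the one-layer model below is just a stand-in for the real CNN.

```python
import os
import torch

# (2) cuDNN auto-tune / exhaustive search: one line in PyTorch
torch.backends.cudnn.benchmark = True

# (3) Keras channel ordering: make it match the back-end (CNTK wants channels first)
import keras.backend as K
K.set_image_data_format("channels_first")

# (4) Train/test flag for drop-out: in PyTorch, flip the whole model
model = torch.nn.Sequential(torch.nn.Dropout(0.5))  # stand-in for the real CNN
model.eval()    # before predicting on the test-set
model.train()   # back to training mode

# (5) TensorFlow: Winograd kernels + channels-first data layout
os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "1"
# e.g. tf.layers.conv2d(x, 50, (3, 3), data_format="channels_first")

# (8) CNTK momentum: turn off unit_gain to match the other frameworks
# learner = cntk.momentum_sgd(cntk_model.parameters, lr_schedule, momentum, unit_gain=False)
```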
