Porting a model to TensorFlow
TensorFlow is a recent addition to a constellation of frameworks designed to accelerate the process of building deep models. Of these, I’ve only previously had time to learn Theano — one of the more venerable frameworks in this fairly young field. Just like TensorFlow, Theano uses Python as a meta-programming language to describe a model symbolically — add these two tensors together; take the average of this over that axis — and then differentiate it to ensure that parameter optimisation is tractable and efficient. Unlike TensorFlow, however, Theano requires end-users to re-implement lots of the standard components that often crop up in models — especially optimisers, cost calculations and things like dropout (often used in combination with softmax). Both libraries also allow a mix of GPU and CPU code generation. But aside from style, which framework helps you get results fast? Which is easier to program? Which produces models that train quickly? TensorFlow offers a nice API, lots of built-in functionality, a robust device model and integrated cluster awareness, whereas Theano offers more aggressive compute graph optimisation and lots of tuneable parameters. So which, on balance, provides the best blend of performance and productivity?
Introducing Dracula
To find out, I decided to port a reasonably simple natural language model — called Dracula — from Theano to TensorFlow. Dracula is an LSTM-based sequence tagger (the initial application is part-of-speech tagging) that operates at the level of individual characters. This keeps its models small and reduces the need for preprocessing steps like thresholding rare words, building dictionaries of proper nouns or filtering emoticons, steps that are unlikely to help performance outside the validation environment. That makes it well suited to less-structured sources like Twitter.
Dracula has a handful of main steps:
- Generating character embeddings from the input;
- Applying zero or more LSTM layers per word, at the character level;
- Mean-pooling the character-level representations into word-level representations;
- Applying zero or more LSTM layers per sentence, at the word-level;
- Applying softmax to generate the final per-word labels.
This makes it just tractable enough for a reasonably short port, but just complicated enough to hit some unusual TensorFlow API corners.
Going out of the box
Theano needs a fairly standard set of tools and is easily installed via pip. TensorFlow is just as straightforward: follow the instructions. However, when I first evaluated TensorFlow several months ago, I quickly found out that they’d hard-coded a dependency on the CUDA 7.0 SDK into their build files, making it very difficult to run on a system using a later version of CUDA. In the last few weeks, the TensorFlow team have released version 0.8, which corrects that problem. It also now supports Python 3 — which I’m gradually switching to — but for the rest of this post I’m using 2.7.
Once installed, I tested it out using a small script based on one of Google’s introductory examples. A reconstruction of it (rather than a verbatim copy) is below:
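```python
import numpy as np
import tensorflow as tf

# Make 100 phony data points: z = 0.1 * x + 0.2 * y
xy_data = np.random.rand(2, 100).astype(np.float32)
z_data = np.dot([0.100, 0.200], xy_data).astype(np.float32)

# Construct a linear model: W and b are learned (b should converge to zero)
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
z = tf.matmul(W, xy_data) + b

# Minimise the mean squared error with plain gradient descent
loss = tf.reduce_mean(tf.square(z - z_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Launch the graph and fit the plane
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for step in xrange(201):
        sess.run(train)
        if step % 20 == 0:
            print step, sess.run(W), sess.run(b)
```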
Here’s a rough equivalent, sketched out in Theano:
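```python
import numpy as np
import theano
import theano.tensor as T

# The same phony data: z = 0.1 * x + 0.2 * y
xy_data = np.random.rand(2, 100)
z_data = np.dot([0.100, 0.200], xy_data)

# Symbolic inputs have to be declared explicitly
X = T.matrix("X")
Z = T.vector("Z")

# Parameters live in shared variables
W = theano.shared(np.random.uniform(-1.0, 1.0, (1, 2)), name="W")
b = theano.shared(0., name="b")
z = T.dot(W, X) + b

# Minimise the mean squared error
loss = T.mean(T.sqr(z - Z))

# Compute the gradients and spell out the update rule by hand
gW, gb = T.grad(loss, [W, b])
updates = [(W, W - 0.5 * gW), (b, b - 0.5 * gb)]

# Compile a training function which applies the updates
train = theano.function(inputs=[X, Z], outputs=loss, updates=updates)

for step in xrange(201):
    train(xy_data, z_data)
    if step % 20 == 0:
        print step, W.get_value(), b.get_value()
```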
These two little scripts actually encapsulate a great deal of the difference between the two frameworks. Each starts off by generating phony numpy arrays describing the input and the target output (mathematically, z = 0.1 * x + 0.2 * y). Next comes model initialisation:
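Pulling the relevant lines out of the two sketches:

```python
# TensorFlow: only the parameters need declaring; the numpy inputs are used directly
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
z = tf.matmul(W, xy_data) + b

# Theano: the inputs must also be declared as symbolic variables
X = T.matrix("X")
Z = T.vector("Z")
W = theano.shared(np.random.uniform(-1.0, 1.0, (1, 2)), name="W")
b = theano.shared(0., name="b")
z = T.dot(W, X) + b
```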
Note that TensorFlow doesn’t require any special treatment of the input data: the numpy arrays are used in their native form. Theano, meanwhile, requires some additional plumbing to declare the inputs as symbolic variables that will be fed to the compiled function. The syntax defining what the b and W variables look like is also, in my opinion, a little nicer.
Next up is the way of actually learning: gradient descent.
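Again, pulling out the relevant lines (the numbered markers match the notes below):

```python
# TensorFlow
loss = tf.reduce_mean(tf.square(z - z_data))        # (1) mean squared error
optimizer = tf.train.GradientDescentOptimizer(0.5)  # (2) a built-in optimiser computes
train = optimizer.minimize(loss)                    #     the gradients and (3) gives us
                                                    #     an op to run at each step
# Theano
loss = T.mean(T.sqr(z - Z))                                            # (1)
gW, gb = T.grad(loss, [W, b])                                          # (2) gradients and
updates = [(W, W - 0.5 * gW), (b, b - 0.5 * gb)]                       #     updates by hand
train = theano.function(inputs=[X, Z], outputs=loss, updates=updates)  # (3)
```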
- The lines marked (1) specify what we’re minimising, in this case mean squared error. The syntax is similar in both frameworks.
- Lines marked (2) are about setting up the code to compute the gradients with respect to each of the variables we’re learning, and changing the weights in response to a new gradient. TensorFlow gives access to lots of good optimisers out of the box (including gradient descent and Adadelta). Theano makes me do all the hard work. This is both a good and bad thing: it offers ultimate control, but also affects code comprehension and increases verification effort.
- Finally (3), we form a training function (or, in TensorFlow’s case, a training operation) that ties together everything defined previously so the parameters can be updated.
Finally, we get to the body of the training itself:
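```python
# TensorFlow: execution happens inside a Session
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for step in xrange(201):
        sess.run(train)

# Theano: the compiled function is simply called with the data
for step in xrange(201):
    train(xy_data, z_data)
```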
Again, the function of these fragments is essentially identical, but TensorFlow’s philosophy of encapsulating graph execution in the Session object does feel conceptually cleaner than Theano’s approach.
Porting the embeddings matrix
Step 1 of Dracula is to take two arrays per batch as input: one represents the character indices for each tweet, and the other describes how the characters are laid out into words that we want to classify. Here’s roughly what the embeddings matrix code looks like in Theano (a simplified sketch):
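```python
def embeddings_layer(x, Wemb):
    # x:    (max word, max char, batch) tensor of character indices
    # Wemb: (vocabulary size, embedding size) embeddings matrix
    dims = [x.shape[0], x.shape[1], x.shape[2], Wemb.shape[1]]
    return Wemb[x.flatten()].reshape(dims)
```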
And here’s how it looks in TensorFlow (again as a sketch):
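```python
import tensorflow as tf

def embeddings_layer(x, Wemb):
    # x:    (max word, max char, batch) tensor of character indices
    # Wemb: (vocabulary size, embedding size) embeddings matrix
    # gather slices Wemb along its first dimension, one slice per index in x,
    # returning a (max word, max char, batch, embedding size) tensor -- no flatten needed
    return tf.gather(Wemb, x)
```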
TensorFlow doesn’t have a flatten operation like Theano does (for those looking: pass -1 into the reshape function’s shape parameter), but that turned out not to be necessary: gather does the equivalent operation directly, slicing the embeddings matrix Wemb according to the matrix of indices x. TensorFlow’s written documentation is generally excellent, if a bit confusing in places, and its error reporting is just as good: whilst I was groping around for gather, the error messages it produced helped me figure out exactly what was going on. Since I first ported some of the code, 0.8 has introduced a new function which improves the handling of very large embedding matrices distributed across multiple machines.
Aside: testing TensorFlow
As I port each layer of Dracula over to TensorFlow, I’m adding and revising tests as I go along. I use Python’s unittest module because it’s convenient and built-in, but it’s more interesting to discuss how TensorFlow fits in. Here’s a sketch of the sort of test I’m writing for the above:
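```python
import unittest
import numpy as np
import tensorflow as tf

# embeddings_layer is the TensorFlow version defined above

class EmbeddingsLayerTest(unittest.TestCase):

    def test_embeddings_layer(self):
        # One word of two characters, one example in the batch, tiny embeddings matrix
        idxs = np.array([[[0], [2]]], dtype=np.int32)                   # (1, 2, 1)
        emb = np.array([[1., 0.], [0., 1.], [0.5, 0.5]], dtype=np.float32)
        expected = np.array([[[[1., 0.]], [[0.5, 0.5]]]], dtype=np.float32)

        Wemb = tf.Variable(emb)   # learned in the real model
        x = tf.constant(idxs)     # the input indices are fixed for the test

        with tf.Session() as sess:
            sess.run(tf.initialize_all_variables())
            result = embeddings_layer(x, Wemb).eval(session=sess)

        self.assertTrue(np.allclose(result, expected))

if __name__ == "__main__":
    unittest.main()
```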
Straightforward stuff: set up some numpy arrays, declare the embeddings matrix as a tf.Variable (since in the model it’s learned), and declare the input index array as a constant for evaluation. We then run through the session boilerplate, execute the embeddings_layer function to get a Tensor, and use eval to turn it into a numpy array for the assertion.
Per-word averaging / max pooling
Per-word averaging is a pooling operation that takes the embeddings or letter-level LSTM output for each character and turns it into a single word embedding that can then be fed to the higher recurrent layers. Getting this bit right was the hardest part of creating the first version of the model, and the code went through several revisions to improve performance. The final version takes the (max word index, max char index, batch size, embedding size)-sized tensor from the embeddings layer and averages out the second dimension, leaving a 3D (max word index, batch size, embedding size) tensor. The mask communicates which bits of the input are defined (i.e. covered by actual letters): the input is masked off via element-wise multiplication and summed to produce the per-dimension totals, whilst the mask itself is summed to produce the divider. The divider is normalised so that it contains no zeros (wherever a zero would appear on the bottom, there’s already a zero on top), and the totals are then divided element-wise by the result. Theano’s code is relatively straightforward (sketched here with simplified names):
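```python
import theano.tensor as T

def average_layer(emb, mask):
    # emb:  (max word, max char, batch, embedding size) output of the embeddings layer
    # mask: (max word, max char, batch), 1.0 where a real character exists
    masked = emb * mask.dimshuffle(0, 1, 2, 'x')          # zero out undefined positions
    totals = masked.sum(axis=1)                           # (max word, batch, embedding size)
    divider = mask.sum(axis=1).dimshuffle(0, 1, 'x')      # number of characters in each word
    divider = T.switch(T.eq(divider, 0.0), 1.0, divider)  # no zeros on the bottom
    return totals / divider
```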
The dimshuffle calls aren’t actually necessary, something I picked up during the port (they’ll be removed in a future version). Note that in Theano, operations like sum are often attached to the tensors themselves, indicating that the result is derived from the previous step in the computational graph. Operations which bring multiple parts of the compute graph together (like eq) are still defined within Theano’s main namespace. TensorFlow’s version looks very similar, but moves everything into the main namespace.
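Sketched out, with tf.maximum standing in for the eq-based guard above:

```python
import tensorflow as tf

def average_layer(emb, mask):
    # emb:  (max word, max char, batch, embedding size); mask: (max word, max char, batch)
    masked = emb * tf.expand_dims(mask, 3)               # zero out undefined positions
    totals = tf.reduce_sum(masked, 1)                    # (max word, batch, embedding size)
    divider = tf.expand_dims(tf.reduce_sum(mask, 1), 2)  # number of characters in each word
    divider = tf.maximum(divider, 1.0)                   # no zeros on the bottom
    return totals / divider
```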
Softmax and dropout
The softmax layer is used to produce a probability distribution across the possible labels. Both frameworks assume that the input to softmax is 2D, so we have to use the frameworks’ iteration constructs to scan across each 2D slice of the 3D output of Dracula. Both final softmax layers use dropout during training to try to fight overfitting.
Again, there’s an issue with undefined input, which should always be given a particular label. The way Dracula accomplishes this in Theano is to create a default tensor holding the probability distribution for the undefined label, and then set the defined areas to the output of softmax. Theano’s scan function is used to step through the (max word index, batch size, embedding size) input one (batch size, embedding size) slice at a time.
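A sketch of the shape of it (dropout is left out for brevity, and the names are my own):

```python
import theano
import theano.tensor as T

def softmax_layer(proj, word_mask, U, b, n_labels):
    # proj:      (max word, batch, dim) word-level representations
    # word_mask: (max word, batch), 1 where a word actually exists
    # Default distribution: all of the probability mass on the "undefined" label 0
    default = T.zeros((proj.shape[1], n_labels))
    default = T.set_subtensor(default[:, 0], 1.0)

    def _step(p, m):
        probs = T.nnet.softmax(T.dot(p, U) + b)                # (batch, n_labels)
        # Keep the softmax output for defined words, the default elsewhere
        return T.switch(m.dimshuffle(0, 'x'), probs, default)

    out, _ = theano.scan(_step, sequences=[proj, word_mask])
    return out
```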
TensorFlow’s implementation is a little more complicated because I opted to use the same routine to produce both the probability distributions and the raw activations needed to compute the cost. The latter are needed because of the cost function I chose, which for some inexplicable reason does softmax internally (“for efficiency”).
Overall it’s very straightforward: TensorFlow has a built-in dropout function, and map_fn takes care of creating the 2D slices. Because TensorFlow’s indexing capability is limited, it’s not possible to mask off the undefined regions of the input as in Theano, but in this case it’s possible to work around that by choosing the right cost function.
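Here’s a sketch of my TensorFlow version; the activations_only switch is one way of getting both the probabilities and the raw activations out of a single routine, rather than the model’s exact code:

```python
import tensorflow as tf

def softmax_layer(avg_per_word, U, b, keep_prob, activations_only=False):
    # avg_per_word: (max word, batch, dim) word-level representations
    # U: (dim, n_labels) weights; b: (n_labels,) bias; keep_prob: dropout keep probability
    def _step(slice_2d):
        # slice_2d is one (batch, dim) slice of the 3D input
        dropped = tf.nn.dropout(slice_2d, keep_prob)
        logits = tf.matmul(dropped, U) + b
        if activations_only:
            return logits                 # raw activations, fed to the cost function
        return tf.nn.softmax(logits)      # per-word probability distributions
    return tf.map_fn(_step, avg_per_word)
```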
Gluing it together with LSTM layers
Dracula supports a variable number of LSTM layers at both the word and character levels to perform sequence tagging. This — more than anywhere else — is where the port differs from Theano. The Theano implementation applies the modified form of the LSTM presented in the sentiment analysis demo multiple times, and supports both 2D and 3D versions (thanks to broadcasting). The forward and backward iterations are summed to produce the final output.
The TensorFlow version has only a single activation weight and bias, because stacking multiple layers is supported natively within the framework. TensorFlow requires two 2D scans (one per-word, one per-letter) to compute the output.
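For reference, this is roughly what stacking LSTM cells with the framework’s built-in support looks like (not Dracula’s exact code, and the sizes below are arbitrary):

```python
import tensorflow as tf

dim_proj, n_layers = 16, 2
cell = tf.nn.rnn_cell.BasicLSTMCell(dim_proj)
stacked = tf.nn.rnn_cell.MultiRNNCell([cell] * n_layers)

# inputs: a Python list with one (batch, dim_proj) tensor per time step
inputs = [tf.zeros([8, dim_proj]) for _ in range(5)]   # dummy data for illustration
outputs, state = tf.nn.rnn(stacked, inputs, dtype=tf.float32)
```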
Cost function
This was, surprisingly, one of the more difficult pieces to port. In Theano, the cost is computed via mean squared error based on the negative log probabilities, and looks roughly like this (a sketch; the epsilon constant and the names below are mine):
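```python
import theano
import theano.tensor as T

def cost_fn(pred, y):
    # pred: (max word, batch, n_labels) per-word probability distributions
    # y:    (max word, batch) gold label indices
    def _step(probs, labels):
        # negative log probability assigned to the correct label for each example
        return -T.log(probs[T.arange(labels.shape[0]), labels] + 1e-8)
    nll, _ = theano.scan(_step, sequences=[pred, y])
    return T.mean(nll ** 2)   # squared, then averaged
```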
TensorFlow’s got a function which does much the same thing, and (guess what!) it only operates on 2D tensors. So to port this code, it’s necessary to iterate in much the same way, with the caveat that indexing via lists, TensorFlow Variable objects and other symbolic tensors is not supported. As a result, it’s not possible to do a map_fn on a tf.range and access the labels and computed probabilities from within the scanning function like in Theano, and it’s not possible to pass more than one tensor into map_fn either.
The solution was to combine the tensors representing the reference labels (y) and the unscaled activations (pred_logits) into a single tensor, and then unpack them within the map routine. This is a little gross, but it works.
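Sketched out, it looks something like this (I’ve used sparse_softmax_cross_entropy_with_logits here, and the packing helper is a sketch rather than the exact code; the call signature also depends on the TensorFlow version):

```python
import tensorflow as tf

def cost_fn(pred_logits, y):
    # pred_logits: (max word, batch, n_labels) unscaled activations
    # y:           (max word, batch) gold label indices
    # Pack the labels in alongside the logits so map_fn only has to carry one tensor
    packed = tf.concat(2, [tf.expand_dims(tf.to_float(y), 2), pred_logits])

    def _step(slice_2d):
        labels = tf.to_int32(tf.squeeze(tf.slice(slice_2d, [0, 0], [-1, 1]), [1]))
        logits = tf.slice(slice_2d, [0, 1], [-1, -1])
        # the cost function that applies softmax internally "for efficiency"
        return tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels)

    return tf.reduce_mean(tf.map_fn(_step, packed))
```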
Supporting code and infrastructure
Whereas Theano needs numpy to save its parameters, TensorFlow has the ability to save tensor values built in, which is great. The only catch is that all variables must be specified in advance (otherwise TensorFlow will complain when model parameters are reloaded), meaning that hyper-parameter searches where layers are added or removed have to be planned for up front.
Saving the state of the graph is as easy as creating a tf.train.Saver object with some tf.Variables, including it in the session and then calling its save method. The Saver object also supports incremental checkpointing, making it a good fit for environments where training can stop and start (like Amazon EC2 Spot Instances).
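A minimal sketch (the checkpoint path and the stand-in variable are illustrative):

```python
import tensorflow as tf

W = tf.Variable(tf.zeros([10, 10]), name="W")   # stands in for the model's parameters
saver = tf.train.Saver(tf.all_variables())      # or pass an explicit list of Variables

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for epoch in xrange(10):
        # ... run the training ops for this epoch ...
        saver.save(sess, "dracula.ckpt", global_step=epoch)   # incremental checkpoints

# Later, or on a fresh spot instance:
#   saver.restore(sess, "dracula.ckpt-9")
```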
Performance
TensorFlow has acquired a reputation for being sluggish, and my testing certainly bears that out. For the initial pre-training stage (with no LSTM layers, just embeddings, averaging and softmax), TensorFlow takes 163 seconds per epoch to train on my single NVidia GTX 980 consumer-class GPU. With letter-level LSTM layers, it takes 5503 seconds per epoch. In Theano, the first stage takes 95 seconds per epoch and the second requires 1709, despite doing more work in the LSTM layers. Whilst there is likely some optimisation work still to do, a significant amount of this difference is probably down to Theano’s more aggressive compilation and optimisation strategy, which can result in long startup times for the model. Without that optimisation enabled, the model starts up as quickly under Theano as it does under TensorFlow. The good news is that TensorFlow’s performance is improving, and for certain workloads it’s quite competitive with other specialised frameworks. But for single-machine workloads, it’s quite hard to recommend it at this stage.
Conclusion
The big unknown is whether Theano can turn into TensorFlow (e.g. by adding multi-GPU support) faster than TensorFlow can turn into Theano. I wouldn’t count on it: with a fully managed cloud platform already available, its growing adoption inside Google, custom hardware, Google’s significant developer resources and mind-share, and the promise to bring the core engine to other platforms, there’s no chance of TensorFlow going anywhere in the near future. Right now, assuming the compute time involved in training is cheap, there may be small productivity improvements under TensorFlow — it has built-in optimisers, more built-in neural network capabilities, and tightly integrated checkpointing — but its performance and its potential just haven’t arrived yet. Whilst I’m not officially releasing or supporting the code behind this article (which definitely bears the scars of battle), I’m going to keep it around and update it for TensorFlow 0.9 (hopefully due soon).