We Need to Go Deeper: A Practical Guide to Tensorflow and Inception
I’ve nursed a side interest in machine learning and computer vision since my time in graduate school. When Google released its Tensorflow framework and Inception architecture, I decided to do a deep dive into both technologies in my spare time.
The Inception model is particularly exciting because it’s been battle-tested, delivering world-class results in the widely-acknowledged ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It’s also designed to be computationally efficient, using 12x fewer parameters than other competitors, allowing Inception to be used on less-powerful systems.
I wrote this series because I couldn’t easily bridge the gap between Tensorflow’s tutorials and doing something practical with Inception. Inspired by Inception’s own origins, this series will “go deeper,” presenting a soup-to-nuts walkthrough that uses Inception to train an MNIST (hand-written digits) classifier. While the goal isn’t to get to world-class performance, we’ll get a model that performs at >99% accuracy.
Who Should Use this Tutorial Series?
This is a practical introduction, so it’s not focused on the theories that underlie neural networks, computer vision, or deep learning models (though there will be a few remarks about the general motivation behind Inception in Part 2).
Instead, this tutorial is aimed at folks who have done the basic Tensorflow tutorials and want to “go a little deeper” to apply Inception to their own projects. Think graduate student embarking on a project, or a software engineer who’s been asked to build a training and inference pipeline.
The tutorial is roughly divided into 4 parts:
- Part 1: Using Slim to Build Deep Architectures
Deep architectures are complicated beasts. Slim is a library that can help you tame the complexity.
- Part 2: Introduction to Inception
How is Inception put together? What is its rough architecture?
- Part 3: Training Inception on a Novel Dataset
How do I apply Inception to my dataset? How do I use Tensorboard to monitor the training?
- Part 4: Using Inception for Inference
I’ve trained. How do I predict new cases?
Obviously, you’ll need a workstation with Tensorflow installed. One of the easiest ways I’ve found is to use Docker and just grab the latest image. This works well for smaller projects and experimentation.
Training deep learning models like Inception is so computationally intensive that running it on your laptop is impractical. Instead, you’ll need access to a GPU. A few options:
- Build your own: I’ve done most of my training on Amazon using a p2.xlarge GPU instance that I built from scratch. For that, you’ll need to build Tensorflow with Nvidia’s drivers. Directions on how to do that are here.
- Use a pre-built AMI: Amazon has a pre-built AMI with a variety of pre-packaged frameworks (MXNet, Caffe, Tensorflow, Theano, Torch and CNTK).
- Paperspace: For those who don’t want to futz with building their own box or the hassle of running a server, Paperspace offers a GPU-enabled Linux desktop box in the cloud that is purpose-built for machine learning. Signup is fast and easy and gives you access to a desktop computing environment through your browser.
I’ve also created a Github repository with code samples that we’ll use in the series. Check it out here.
tensorflow-tutorial - Code and examples for Initialized's Introduction to Tensorflow Tutorial
Using Slim to Build Deep Architectures
At the core of Tensorflow is the notion of a computational graph. Operations in our neural network (e.g., convolution, bias adding, dropout) are all modeled as nodes and edges in this graph. Defining an architecture for a learning task is tantamount to defining this graph.
Tensorflow provides many primitives for defining these graphs and if you’ve run through the introductory tutorials you’ve undoubtedly encountered them when constructing simple neural net architectures. However, as the complexity of our architecture grows, these simple primitives become cumbersome.
Slim is a Tensorflow library that bundles commonly used building blocks like convolution and max pooling. Access it by simply importing it:
import tensorflow.contrib.slim as slim
Using slim makes it simple to chain multiple building blocks together. For instance, the following code will create a neural network with two layers, each with 256 hidden units:
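def my_neural_network(input):
    # Two fully-connected layers of 256 hidden units each; Slim creates the
    # weight and bias variables under the given scope names.
    net = slim.fully_connected(input, 256, scope='layer1-256-fc')
    net = slim.fully_connected(net, 256, scope='layer2-256-fc')
    return net

input = load_data()
output = my_neural_network(input)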
Slim will do all of the heavy lifting; it defines the appropriate weight and bias variables and links them in the appropriate way. Even more conveniently, Slim does all of this under a named scope that you provide, allowing you to navigate your architecture in Tensorboard.
NielsenNet: A Guided Example
As a simple example of how Slim builds more complicated architectures, consider the MNIST classifier that is presented in Chapter 6 of Michael Nielsen’s wonderful textbook “Neural Networks and Deep Learning.” The neural network, which I’ve christened “NielsenNet” consists of:
- A 28x28 input representing a monochrome image of a handwritten digit
- A convolution layer with 20 kernels, stride=1, size=5, followed by 2x2 max pooling. The convolution layer is padded to maintain the spatial dimensions.
- Another convolution layer with 40 kernels, stride=1, size=5, again followed by 2x2 max pooling. This time the input is not padded to maintain dimensions.
- A fully-connected layer of 1000 hidden units with dropout
- Another fully-connected layer of 1000 hidden units, again with dropout
- An output layer of 10, corresponding with the 10 output classes for the MNIST classification problem.
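To sanity-check the description above, it can help to trace how the spatial dimensions change from layer to layer. The following is a minimal sketch in plain Python (no Tensorflow required). The helper names `conv_out`, `pool_out`, and `nielsen_net_shapes` are my own, made up for illustration, and are not part of Slim. It assumes "same" padding for the first convolution, no padding for the second, and non-overlapping 2x2 pooling, as the list describes.

```python
def conv_out(size, kernel, stride=1, padded=False):
    """Spatial size after a square convolution.

    padded=True models 'same' padding (output size = ceil(size / stride));
    padded=False models an unpadded ('valid') convolution.
    """
    if padded:
        return -(-size // stride)  # integer ceiling division
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def nielsen_net_shapes(input_size=28):
    """Return (layer name, output shape) pairs for the network described above."""
    shapes = [("input", (input_size, input_size, 1))]
    s = conv_out(input_size, kernel=5, padded=True)   # padded convolution
    shapes.append(("conv1", (s, s, 20)))
    s = pool_out(s)
    shapes.append(("pool1", (s, s, 20)))
    s = conv_out(s, kernel=5, padded=False)           # unpadded convolution
    shapes.append(("conv2", (s, s, 40)))
    s = pool_out(s)
    shapes.append(("pool2", (s, s, 40)))
    shapes.append(("fc1", (1000,)))
    shapes.append(("fc2", (1000,)))
    shapes.append(("output", (10,)))
    return shapes


for name, shape in nielsen_net_shapes():
    print(name, shape)
```

Under these assumptions the second pooling layer produces a 5x5x40 volume, so 1000 values are flattened and fed into the first fully-connected layer.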
This architecture is implemented here using slim:
To see how this code is used, I’ve created a Jupyter notebook that trains the NielsenNet against the included MNIST dataset for 100k steps. It ends with an accuracy of approximately 99.51%. Not bad for such a simple network!
Using Slim to Build Model Architectures — Jupyter notebook that demonstrates how to use slim to build deep learning architectures
Using Tensorboard, we can even visualize the graph that’s created, giving you an overview of your architecture and how all of the major pieces connect.
One nice feature of using slim is that your basic building blocks are automatically associated within a named scope, which makes it easy to visualize the overall structure and connectivity of your neural network architecture.
This will become important as the complexity of the models we tackle grows.
Introduction to Inception
Inception was developed at Google to provide state of the art performance on the ImageNet Large-Scale Visual Recognition Challenge and to be more computationally efficient than its competitor architectures. However, what makes Inception exciting is that its architecture can be applied to a whole host of other learning problems in computer vision.
This tutorial focuses on retraining Inception on our old friend, the MNIST dataset. The goal is not to attain world-class performance on digit classification (in fact, using Inception is probably overkill) but to get experience on a known problem. In other words, it’s a toy problem, but running through the effort will give us good experience for tackling other problems.
High Level Overview of the Inception Architecture
Below is an overview of the Inception architecture that I’ve liberated from the README. I’ve added some annotations which will allow you to get your bearings when you’re looking at Tensorboard histograms or digging into the code.
From 10,000 feet, Inception is basically:
- A 299x299x3 input representing a visual field of 299 pixels and 3 color (RGB) channels
- Five vanilla convolution layers, with a few interspersed max-pooling operations
- Successive stacks of “Inception Modules”
- A softmax output layer at the end (logits) and at an intermediate output layer (aux_logits) just after the
It’s the repeated stacking of the Inception modules that makes this architecture “deep.” At first glance, this architecture doesn’t seem very complicated — after all, we’re just stacking these modules together, right? However, this simple description belies the complexity and nuance hidden inside each Inception module.
While stacking Inception modules leads to depth, each module is also “wide” and architected to recognize features at multiple length scales. In the language of convolutional neural networks, that means introducing convolutions with several filter sizes; in Inception, that means including 3x3 and 5x5 convolutions in each stacked module:
The downside, of course, is that these convolutions are expensive, especially when repeatedly stacked in a deep learning architecture! To combat this problem, Inception’s architects stacked 1x1 convolutions in front of the expensive 3x3 and 5x5 convolutions to reduce the dimensionality before each convolution:
While I won’t get into the details, the idea of using 1x1 convolutions (and factoring the 3x3 and 5x5 convolutions into smaller convolutions) to reduce the computational cost is further developed in this article (specifically: “Factorizing Convolutions with Large Filter Size”).
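To get a feel for why the 1x1 “bottleneck” helps, here is a back-of-the-envelope calculation in plain Python. The feature-map size and channel counts (28x28, 192 in, 16 after the 1x1 reduction, 32 out) are illustrative values that I picked for the example; they are not taken from this article or read out of the Inception code. The function `conv_macs` is a hypothetical helper that counts multiply-accumulate operations for a stride-1, padded convolution.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulates for a stride-1 convolution that preserves H x W."""
    return height * width * kernel * kernel * in_channels * out_channels


# Illustrative sizes (assumptions, not measured from the real model).
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# Option 1: apply the 5x5 convolution directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# Option 2: shrink the channels with a 1x1 convolution first, then apply 5x5.
bottleneck = conv_macs(H, W, 1, C_IN, C_REDUCED) + conv_macs(H, W, 5, C_REDUCED, C_OUT)

print("direct 5x5:      ", direct)
print("1x1 then 5x5:    ", bottleneck)
print("reduction factor: %.1fx" % (direct / bottleneck))

# Factorizing one 5x5 filter into two stacked 3x3 filters covers the same
# 5x5 receptive field with fewer weights per input/output channel pair.
weights_5x5 = 5 * 5
weights_two_3x3 = 2 * (3 * 3)
print("weights per channel pair: %d vs %d" % (weights_5x5, weights_two_3x3))
```

With these made-up numbers, the bottleneck version needs roughly a tenth of the multiply-accumulates of the direct 5x5 convolution, and the two stacked 3x3 filters use 18 weights where a single 5x5 filter uses 25.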
Overall, these approaches drastically reduce the computational cost of Inception relative to other architectures — “only” 5M parameters (compared to AlexNet and VGGNet which use 12X and 36X more parameters, respectively). In turn, this reduction opens up the possibility of employing Inception on platforms with lower available resources.
Training Inception on a Novel Dataset
Now that we’ve taken a quick introduction, let’s tackle a well-known problem: classifying hand-written digits from the MNIST dataset. Luckily, Tensorflow includes this dataset (55k training, 5k validation, and 10k test) as numpy arrays representing the images and labels:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
mnist.train.images # Shape: 55000 x 784
mnist.train.labels # Shape: 55000 x 10
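Each image arrives as a flat vector of 784 pixel values and each label as a one-hot vector of length 10. As a rough illustration of what those shapes mean, here is a small plain-Python sketch that works on stand-in lists rather than the real arrays. `to_rows` and `one_hot_to_digit` are hypothetical helper names of my own. With the actual numpy arrays, `reshape` and `argmax` would do the same job.

```python
def to_rows(flat_pixels, width=28):
    """Split a flat list of pixels into rows of `width` pixels each."""
    if len(flat_pixels) % width != 0:
        raise ValueError("pixel count must be a multiple of the row width")
    return [flat_pixels[i:i + width] for i in range(0, len(flat_pixels), width)]


def one_hot_to_digit(label):
    """Return the index of the largest entry, i.e. the encoded digit."""
    return max(range(len(label)), key=lambda i: label[i])


# Stand-in data with the same layout as one MNIST example (not real pixels).
flat_image = [0.0] * 784
flat_image[28 * 3 + 5] = 1.0      # light up the pixel at row 3, column 5
label = [0.0] * 10
label[7] = 1.0                    # one-hot encoding of the digit 7

image = to_rows(flat_image)
print(len(image), "rows x", len(image[0]), "columns")
print("digit:", one_hot_to_digit(label))
```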
Note: This tutorial is computationally intensive so it’s probably not practical to run the training on your laptop. If you need a GPU, here are some directions on building one on Amazon’s EC2.
Training an Inception Model on the MNIST Dataset
Training Inception on your dataset is relatively straightforward since most of the tasks for building the dataset and training are already codified as Bazel tasks, which I won’t repeat in this tutorial:
Caveats: If your new dataset has a different number of classes than the base ImageNet case, you’ll have to edit imagenet_data.py to account for this change since that parameter is hardcoded:
"""Returns the number of classes in the data set."""
return 10 # Change this line to 10 for the MNIST dataset
Also, Inception expects square inputs of 299x299, so keep that in mind when creating your dataset.
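To make the size mismatch concrete, here is a minimal sketch of stretching a 28x28 grayscale digit to 299x299 with three identical color channels. It uses nearest-neighbor sampling in plain Python purely for illustration; `resize_nearest` and `gray_to_rgb` are hypothetical helpers of my own. I’m not claiming this is how the Inception pipeline preprocesses images, and in practice you would lean on an image library for this.

```python
def resize_nearest(image, out_size):
    """Resize a 2D list of pixels to out_size x out_size (nearest neighbor)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_size][x * in_w // out_size] for x in range(out_size)]
        for y in range(out_size)
    ]


def gray_to_rgb(image):
    """Copy each grayscale value into three identical (R, G, B) channels."""
    return [[(p, p, p) for p in row] for row in image]


# Stand-in 28x28 "digit" with a single bright pixel in the top-left corner.
digit = [[0.0] * 28 for _ in range(28)]
digit[0][0] = 1.0

inception_input = gray_to_rgb(resize_nearest(digit, 299))
print(len(inception_input), "x", len(inception_input[0]), "x", len(inception_input[0][0]))
```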
Using Tensorboard to Monitor Training
One of the most powerful features of Tensorflow is that it can collect and export data about your training; it also includes Tensorboard, a tool that allows you to visualize this exported data. Since the Inception model is implemented using Slim, every single layer of the architecture is instrumented with a variety of scalar, distribution, and histogram charts.
The annotations that I’ve made on the Inception architecture overview will help you navigate the Tensorboard output and allow you to examine how training is progressing in specific parts of the Inception architecture.
While I won’t give a deep tutorial on how to read and interpret these graphs, there are a few things I’ll point out. You’re probably most interested in the total_loss graph, as this directly tracks how well your training is progressing and includes regularization losses:
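As a rough picture of what “includes regularization losses” means, the sketch below adds a penalty on the weights to a classification loss. It is a simplified illustration in plain Python and not the Inception training code: I’m assuming a cross-entropy data loss and an L2 weight-decay penalty, and the function names and numbers are made up for the example.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


def l2_penalty(weights, weight_decay):
    """L2 regularization term: weight_decay * sum(w^2) / 2."""
    return weight_decay * sum(w * w for w in weights) / 2.0


def total_loss(batch_probabilities, batch_labels, weights, weight_decay):
    """Mean cross-entropy over the batch plus the regularization penalty."""
    data_loss = sum(
        cross_entropy(p, y) for p, y in zip(batch_probabilities, batch_labels)
    ) / len(batch_labels)
    return data_loss + l2_penalty(weights, weight_decay)


# Hypothetical two-example batch with three classes and a tiny weight vector.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
weights = [0.5, -1.0, 2.0]

loss = total_loss(probs, labels, weights, weight_decay=0.01)
print("total_loss = %.4f" % loss)
```

The practical upshot is that a falling total_loss can reflect shrinking weights as well as better predictions, so it is worth reading alongside your accuracy numbers.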
Beyond scalar graphs, Tensorboard makes available histograms (and distributions) that allow you to visualize in great detail what is happening within individual Inception modules. Such histograms can reveal how your weights and biases are shifting over time, and how your model is converging.
Using Inception for Inference
Most people, understandably, are interested in applying their trained models to new test cases. While there is a Bazel task included to evaluate a trained Inception model, I found it to be cumbersome since it relies on processed TFRecords. Instead, what most people are probably interested in is a simple way to classify test cases represented as raw
Evaluating a Trained Inception Model — Jupyter Notebook that demonstrates prediction of MNIST test cases into a trained Inception model
To aid understanding, I’ve gone ahead and extracted the relevant pieces into a separate Jupyter notebook. The notebook creates an instance of an Inception model whose weights and biases have been loaded from a checkpoint file. To evaluate the accuracy of the training, batches of MNIST test data are fed to the instantiated Inception model for prediction. These predictions are then compared against the annotated data and incorrectly predicted cases are collected and visualized for analysis.
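The comparison step described above is simple enough to sketch without Tensorflow. The example below assumes you already have one list of class scores per test image (however your model produced them) and the true digit for each. The `evaluate` function and the sample scores are hypothetical and are not lifted from the notebook.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def evaluate(predictions, labels):
    """Compare predicted classes with true labels.

    Returns (accuracy, missed), where missed is a list of
    (example index, predicted class, true class) tuples.
    """
    missed = []
    for index, (scores, actual) in enumerate(zip(predictions, labels)):
        predicted = argmax(scores)
        if predicted != actual:
            missed.append((index, predicted, actual))
    accuracy = 1 - len(missed) / len(labels)
    return accuracy, missed


# Made-up scores for four test cases over three classes.
predictions = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],   # predicted 0, but the true class is 1
]
labels = [1, 0, 2, 1]

accuracy, missed = evaluate(predictions, labels)
print("accuracy:", accuracy)
print("misclassified (index, predicted, actual):", missed)
```

Keeping the index of each miss is what makes it possible to pull up the offending images afterwards and look at them.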
The code in the notebook could easily be adapted for use in a production environment, e.g., serving Inception predictions from a Flask webserver or accepting input from a camera.
How Well Did the Model Perform?
Though the goal of this tutorial was not to achieve world-class accuracy in classifying hand-written digits (as mentioned previously, Inception is overkill), it’s nice to get a feel for how well it does. I trained Inception against the 55k training images in the MNIST dataset for approximately 87 epochs (36 hours on an EC2 p2.xlarge GPU instance). This yielded an accuracy of 99.05%, which is not world-class, but a good indication that we’re on the right track!
While 99.05% is not too shabby, it’s certainly a far cry from the 99.51% accuracy we achieved in Part 1 using NielsenNet. This result is even more remarkable when you consider that the NielsenNet architecture is much simpler and requires far fewer resources: training on your laptop is feasible in only a few hours or so. What accounts for the discrepancy?
It is instructive to examine the cases where the model failed to correctly predict the assigned label. A few of the images seem to be genuine edge-cases — for some of them, I’m sticking with the machine’s conclusions!
However, the trained model genuinely seems to have problems classifying “loopy” 2’s, classifying many of them as 6’s instead. Inception also seems to stumble when presented with 5’s that look like “S” (classifying them as 2's).
At first, I was a bit miffed: a model architected by the best minds at Google and 36 hours of GPU training can’t outsmart a model that my laptop can run in a few hours flat? While it’s true that I didn’t run the training to convergence, I can’t help but be a bit disappointed. What’s going on here?
It never hurts to go “even deeper” and delve into the source code. As it turns out, one of the tactics that the training pipeline uses to increase the number of training samples is to transform the training image to create additional training data. Flipping the image horizontally is one of these distortions and neatly explains why Inception “fails” on these test cases. Flipping these mischaracterized images reveals why the model missed:
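For reference, a horizontal flip just reverses the pixels in every row. The toy example below shows the operation on a tiny hand-drawn glyph in plain Python; `flip_horizontal` is my own illustrative helper and is not the distortion code used by the training pipeline.

```python
def flip_horizontal(image):
    """Mirror an image left-to-right by reversing each row."""
    return [row[::-1] for row in image]


def render(image):
    """Draw a binary image as text, one '#' per lit pixel."""
    return "\n".join("".join("#" if p else "." for p in row) for row in image)


# A tiny asymmetric glyph (not a real MNIST digit).
glyph = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]

flipped = flip_horizontal(glyph)
print(render(glyph))
print()
print(render(flipped))
```

A network that sees both versions during training is being told they carry the same label, which is harmless for photos of cats but not for digits whose mirror images resemble other digits.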
Examining the 95 images that the model missed, I estimate that 25 of these misclassifications can be explained by this quirk, leaving us with a final accuracy of ~99.3%. By comparison, the best classifiers on the MNIST dataset achieve accuracies of >99.6%.
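The arithmetic behind those two figures is quick to check. This snippet uses only numbers already quoted in this post: the 10k MNIST test images, the 95 misses, and my estimate of 25 flip-related misses.

```python
TEST_IMAGES = 10000       # size of the MNIST test split
missed = 95               # images the trained model got wrong
explained_by_flip = 25    # my estimate of flip-related misses

raw_accuracy = (TEST_IMAGES - missed) / TEST_IMAGES
adjusted_accuracy = (TEST_IMAGES - (missed - explained_by_flip)) / TEST_IMAGES

print("raw accuracy:      %.2f%%" % (raw_accuracy * 100))
print("adjusted accuracy: %.2f%%" % (adjusted_accuracy * 100))
```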
Thanks for sticking it through to the end. By now, I hope you have learned a thing or two about how deep architectures are architected and enough working knowledge about Inception to understand how to train it on a novel dataset, and how to apply it to test instances.
While this tutorial wasn’t meant to provide any theory, I expect many readers to go even deeper (that’s not going to get old any time soon). Here are some resources:
- Stanford’s CS231n (Convolutional Neural Networks for Visual Recognition): The online exercises and course notes are an excellent introduction to neural networks and convolutional neural networks. To get the most out of it, spend the time to implement the basic algorithms.
- “Neural Networks and Deep Learning”: Another wonderful online textbook by Michael Nielsen.
- “Going Deeper with Convolutions”: The paper that introduces the Inception architecture and the broad principles that underlie its architecture.
- “Rethinking the Inception Architecture for Computer Vision”: Refines and develops the approaches presented in the first Inception paper.
- “Gradient-based learning applied to document recognition”: A classic paper by Yann LeCun, one of the pioneers in applying neural networks to hand-written digit recognition.
- “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”: Inception makes extensive use of batch normalization.
One of the cool and interesting things about deep learning is that the field is very young. Many of the state-of-the-art architectures and methods are no more than 2–3 years old, and new papers are coming out each month. It’s a very exciting time and I expect more and more clever techniques and approaches to emerge in the next few years.
Update (4/27/2017): Wilkins Chung graciously pointed out an inconsistency between my NielsenNet description and implementation. It has been corrected.