We Need to Go Deeper: A Practical Guide to Tensorflow and Inception

I’ve nursed a side interest in machine learning and computer vision since my time in graduate school. When Google released its Tensorflow framework and Inception architecture, I decided to do a deep dive into both technologies in my spare time.

The Inception model is particularly exciting because it’s been battle-tested, delivering world-class results in the widely-acknowledged ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It’s also designed to be computationally efficient, using roughly 12x fewer parameters than competitors such as AlexNet, which allows Inception to be used on less-powerful systems.

I wrote this series because I couldn’t easily bridge the gap between Tensorflow’s tutorials and doing something practical with Inception. Inspired by Inception’s own origins, this series will “go deeper,” presenting a soup-to-nuts walkthrough of using Inception to train an MNIST (hand-written digits) classifier. While the goal isn’t to get to world-class performance, we’ll get a model that performs at >99% accuracy.

Who Should Use this Tutorial Series?

This is a practical introduction, so it’s not focused on the theories that underlie neural networks, computer vision, or deep learning models (though there will be a few remarks about the general motivation behind Inception in Part 2).

Instead, this tutorial is aimed at folks who have done the basic Tensorflow tutorials and want to “go a little deeper” to apply Inception to their own projects. Think graduate student embarking on a project, or a software engineer who’s been asked to build a training and inference pipeline.

The tutorial is roughly divided into 4 parts:

Prerequisites

Obviously, you’ll need a workstation with Tensorflow installed. One of the easiest ways I’ve found is to use Docker and just grab the latest image. This works well for smaller projects and experimentation.

Training deep learning models like Inception is so computationally intensive that running it on your laptop is impractical. Instead, you’ll need access to a GPU. A few options:

  • Build your own: I’ve done most of my training on Amazon using a p2.xlarge GPU instance that I built from scratch. For that, you’ll need to build Tensorflow with Nvidia’s drivers. Directions on how to do that are here.
  • Use a pre-built AMI: Amazon has a pre-built AMI with a variety of pre-packaged frameworks (MXNet, Caffe, Tensorflow, Theano, Torch and CNTK).
  • Paperspace: For those who don’t want to futz with building their own box or the hassle of running a server, Paperspace offers a GPU-enabled Linux desktop box in the cloud that is purpose-built for machine learning. Signup is fast and easy and gives you access to a desktop computing environment through your browser.

I’ve also created a Github repository with code samples that we’ll use in the series. Check it out here.

Using Slim to Build Deep Architectures

At the core of Tensorflow is the notion of a computational graph. Operations in our neural network (e.g., convolution, bias adding, dropout, etc.) are all modeled as nodes and edges in this graph. Defining an architecture for a learning task is tantamount to defining this graph.

Tensorflow provides many primitives for defining these graphs and if you’ve run through the introductory tutorials you’ve undoubtedly encountered them when constructing simple neural net architectures. However, as the complexity of our architecture grows, these simple primitives become cumbersome.

Slim is a Tensorflow library that bundles commonly used building blocks like convolution and max pooling. Access it by simply importing it:

import tensorflow.contrib.slim as slim

Using slim makes it simple to chain multiple building blocks together. For instance, the following code will create a neural network with two layers, each with 256 hidden units:

def my_neural_network(inputs):
    # Two fully-connected layers of 256 hidden units each.
    net = slim.fully_connected(inputs, 256, scope='layer1-256-fc')
    net = slim.fully_connected(net, 256, scope='layer2-256-fc')
    return net

inputs = load_data()  # stand-in for your own data-loading code
output = my_neural_network(inputs)

Slim will do all of the heavy lifting; it defines the appropriate weight and bias variables and links them in the appropriate way. Even more conveniently, Slim does all of this under a named scope that you provide, allowing you to navigate your architecture in Tensorboard.
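To make the “heavy lifting” concrete, here is a rough NumPy sketch of what two stacked fully-connected layers compute. It is not Slim’s implementation: the helper name, the small random initialization, and the batch of four inputs are all assumptions chosen for illustration. The ReLU activation mirrors the default of slim.fully_connected.

```python
import numpy as np

def fully_connected(x, weights, biases):
    """One dense layer: an affine transform followed by ReLU."""
    return np.maximum(x @ weights + biases, 0.0)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 784))  # 4 hypothetical flattened inputs

# One weight matrix and one bias vector per layer (illustrative init only).
w1, b1 = rng.normal(scale=0.01, size=(784, 256)), np.zeros(256)
w2, b2 = rng.normal(scale=0.01, size=(256, 256)), np.zeros(256)

hidden = fully_connected(batch, w1, b1)
output = fully_connected(hidden, w2, b2)
print(output.shape)  # (4, 256)
```

Each call needs its own weight matrix and bias vector, which is exactly the bookkeeping Slim creates and names for you.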

NielsenNet: A Guided Example

As a simple example of how Slim builds more complicated architectures, consider the MNIST classifier that is presented in Chapter 6 of Michael Nielsen’s wonderful textbook “Neural Networks and Deep Learning.” The neural network, which I’ve christened “NielsenNet,” consists of:

  • A 28x28 input representing a monochrome image of a handwritten digit
  • A convolution layer with 20 kernels, stride=1, size=5, followed by 2x2 max pooling. The convolution layer is padded to maintain the spatial dimensions.
  • Another convolution layer with 40 kernels, stride=1, size=5, again followed by 2x2 max pooling. This time the input is not padded to maintain dimensions.
  • A fully-connected layer of 1000 hidden units with dropout
  • Another fully-connected layer of 1000 hidden units, again with dropout
  • An output layer of 10 units, corresponding to the 10 output classes for the MNIST classification problem.
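The shapes implied by the list above can be traced with a little arithmetic. This is only a sketch: the helper names are hypothetical, and it assumes the 2x2 max pooling is non-overlapping (stride 2), which the description does not state explicitly.

```python
def conv_output_size(size, kernel, stride=1, padded=True):
    """Spatial size after a convolution; padding keeps the size at stride 1."""
    if padded:
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1

def pool_output_size(size, window=2):
    """Spatial size after non-overlapping max pooling (assumed stride == window)."""
    return size // window

size = 28
# First convolution is padded: 28 -> 28, then pooled: 28 -> 14.
size = pool_output_size(conv_output_size(size, kernel=5, padded=True))
conv1_shape = (size, size, 20)
# Second convolution is not padded: 14 -> 10, then pooled: 10 -> 5.
size = pool_output_size(conv_output_size(size, kernel=5, padded=False))
conv2_shape = (size, size, 40)

flattened = conv2_shape[0] * conv2_shape[1] * conv2_shape[2]
print(conv1_shape, conv2_shape, flattened)  # (14, 14, 20) (5, 5, 40) 1000
```

Under these assumptions the second pooling layer hands 5x5x40 = 1000 values to the first fully-connected layer.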

This architecture is implemented here using slim:

The NielsenNet architecture implemented in python using slim

To see how this code is used, I’ve created a Jupyter notebook that trains the NielsenNet against the included MNIST dataset for 100k steps. It ends with an accuracy of approximately 99.51%. Not bad for such a simple network!

NielsenNet architecture in Tensorboard

Using Tensorboard, we can even visualize the graph that’s created, giving you an overview of your architecture and how all of the major pieces connect.

One nice feature of using slim is that your basic building blocks are automatically associated within a named scope, which makes it easy to visualize the overall structure and connectivity of your neural network architecture.

This will become important as the complexity of the models we tackle grows.

Introduction to Inception

Inception was developed at Google to provide state of the art performance on the ImageNet Large-Scale Visual Recognition Challenge and to be more computationally efficient than its competitor architectures. However, what makes Inception exciting is that its architecture can be applied to a whole host of other learning problems in computer vision.

This tutorial focuses on retraining Inception on our old friend, the MNIST dataset. The goal is not to attain world-class performance on digit classification (in fact, using Inception is probably overkill) but to get experience on a known problem. In other words, it’s a toy problem, but working through it will give us good experience for tackling other problems.

High Level Overview of the Inception Architecture

Below is an overview of the Inception architecture that I’ve liberated from the README. I’ve added some annotations which will allow you to get your bearings when you’re looking at Tensorboard histograms or digging into the code.

Overview of the Inception Architecture. Annotations mine. (source)

From 10,000 feet, Inception is basically:

  • A 299x299x3 input representing a visual field of 299x299 pixels and 3 color (RGB) channels
  • Five vanilla convolution layers, with a few interspersed max-pooling operations
  • Successive stacks of “Inception Modules”
  • Softmax output layers at the end (logits) and at an intermediate point (aux_logits) just after the mixed 17x17x768e layer

It’s the repeated stacking of the Inception modules that makes this architecture “deep.” At first glance, this architecture doesn’t seem very complicated — after all, we’re just stacking these modules together, right? However, this simple description belies the complexity and nuance hidden inside each Inception module.

While stacking Inception modules leads to depth, each module is also “wide” and architected to recognize features at multiple length scales. In the language of convolutional neural networks, that means introducing convolutions with several filter sizes; in Inception, that means including 3x3 and 5x5 convolutions in each stacked module:

A “naive” Inception module (source)

The downside, of course, is that these convolutions are expensive, especially when repeatedly stacked in a deep learning architecture! To combat this problem, Inception’s architects stacked 1x1 convolutions in front of the expensive 3x3 and 5x5 convolutions to reduce the dimensionality before each convolution:

Using 1x1 convolutions to reduce dimensionality in an Inception Module (source)
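A back-of-the-envelope count shows why the 1x1 “bottleneck” helps. The layer sizes below (a 28x28 feature map with 192 input channels, reduced to 16 channels before a 5x5 convolution with 32 outputs) are illustrative assumptions, not figures taken from this tutorial, and the helper name is hypothetical.

```python
def conv_multiply_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds for a stride-1 convolution that preserves spatial size."""
    return height * width * kernel * kernel * in_channels * out_channels

H = W = 28
IN_CHANNELS, REDUCED, OUT_CHANNELS = 192, 16, 32  # assumed sizes

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_multiply_adds(H, W, 5, IN_CHANNELS, OUT_CHANNELS)

# 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_multiply_adds(H, W, 1, IN_CHANNELS, REDUCED)
              + conv_multiply_adds(H, W, 5, REDUCED, OUT_CHANNELS))

print(direct, bottleneck, round(direct / bottleneck, 1))
```

With these particular numbers, the bottlenecked branch needs a little under a tenth of the multiply-adds of the direct 5x5 convolution.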

While I won’t get into the details, the idea of using 1x1 convolutions (and factoring the 3x3 and 5x5 convolutions into smaller convolutions) to reduce the computational cost is further developed in this article (specifically: “Factorizing Convolutions with Large Filter Size”).
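The saving from factoring large filters can be counted in weights per input/output channel pair. This sketch assumes the channel count stays the same through the stacked convolutions; the helper name is hypothetical.

```python
def kernel_weights(shapes):
    """Weights per input/output channel pair for a stack of kernels."""
    return sum(h * w for h, w in shapes)

five_by_five = kernel_weights([(5, 5)])                 # 25
two_three_by_three = kernel_weights([(3, 3), (3, 3)])   # 18, same 5x5 receptive field
three_by_three = kernel_weights([(3, 3)])               # 9
asymmetric = kernel_weights([(1, 3), (3, 1)])           # 6, same 3x3 receptive field

saving_5x5 = 1 - two_three_by_three / five_by_five      # about 28%
saving_3x3 = 1 - asymmetric / three_by_three            # about 33%
print(f"{saving_5x5:.0%} {saving_3x3:.0%}")
```

Two stacked 3x3 convolutions cover the same 5x5 receptive field with 18 weights instead of 25, and a 1x3 followed by a 3x1 covers a 3x3 field with 6 instead of 9.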

Overall, these approaches drastically reduce the computational cost of Inception relative to other architectures — “only” 5M parameters (compared to AlexNet and VGGNet which use 12X and 36X more parameters, respectively). In turn, this reduction opens up the possibility of employing Inception on platforms with lower available resources.

Training Inception on a Novel Dataset

Now that we’ve had a quick introduction, let’s tackle a well-known problem: classifying hand-written digits from the MNIST dataset. Luckily, Tensorflow includes this dataset (55k training, 5k validation, and 10k test) as numpy arrays representing the images and labels:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
mnist.train.images  # Shape: 55000 x 784
mnist.train.labels  # Shape: 55000 x 10

Sample images from Tensorflow’s included MNIST dataset
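Each row of the image array is a flattened 28x28 picture and each label row is one-hot encoded. A small sketch of undoing both, using stand-in arrays with the same layout rather than the real dataset (the three example digits are made up):

```python
import numpy as np

# Stand-ins with the same layout as mnist.train.images / mnist.train.labels.
images = np.zeros((3, 784), dtype=np.float32)
labels = np.eye(10, dtype=np.float32)[[7, 2, 1]]  # one-hot rows for digits 7, 2, 1

pictures = images.reshape(-1, 28, 28)  # back to 28x28 images
digits = labels.argmax(axis=1)         # back to integer class labels

print(pictures.shape, digits)  # (3, 28, 28) [7 2 1]
```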

Note: This tutorial is computationally intensive so it’s probably not practical to run the training on your laptop. If you need a GPU, here are some directions on building one on Amazon’s EC2.

Training an Inception Model on the MNIST Dataset

Training Inception on your dataset is relatively straightforward since most of the tasks for building the dataset and training are already codified as Bazel tasks, which I won’t repeat in this tutorial:

  1. Create a training set (see here)
  2. Run the training task (see here)

Caveats: If your new dataset has a different number of classes than the base ImageNet case, you’ll have to edit imagenet_data.py to account for this change since that parameter is hardcoded:

def num_classes(self):
    """Returns the number of classes in the data set."""
    return 10  # Change this line to 10 for the MNIST dataset

Also, Inception expects square inputs of 299x299, so keep that in mind when creating your dataset.
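One possible way to get from a 28x28 grayscale digit to a 299x299x3 input is a nearest-neighbour upscale with the single channel copied three times. This is an assumption for illustration only, not necessarily how the dataset for this tutorial was prepared; the function name is hypothetical and it assumes a square input.

```python
import numpy as np

def to_inception_input(image, size=299):
    """Nearest-neighbour upscale of a square grayscale image, copied into 3 channels."""
    source = image.shape[0]
    idx = np.arange(size) * source // size    # source row/column for each target pixel
    upscaled = image[np.ix_(idx, idx)]        # (size, size)
    return np.stack([upscaled] * 3, axis=-1)  # (size, size, 3)

# Hypothetical 28x28 image with distinct pixel values.
digit = np.arange(784, dtype=np.float32).reshape(28, 28) / 783.0
resized = to_inception_input(digit)
print(resized.shape)  # (299, 299, 3)
```

A smoother interpolation (bilinear, for example) would likely look nicer, but the shape bookkeeping is the same.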

Using Tensorboard to Monitor Training

One of the most powerful features of Tensorflow is that it can collect and export data about your training; it also includes Tensorboard, a tool that allows you to visualize this exported data. Since the Inception model is implemented using Slim, every single layer of the architecture is instrumented with a variety of scalar, distribution, and histogram charts.

The annotations that I’ve made on the Inception architecture overview will help you navigate the Tensorboard output and allow you to examine how training is progressing in specific parts of the Inception architecture.

While I won’t give a deep tutorial on how to read and interpret these graphs, there are a few things I’ll point out. You’re probably most interested in the total_loss graph, as this directly tracks how well your training is progressing and includes regularization losses:

Watching total_loss allows you to check how your training is progressing
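As a rough illustration of what “includes regularization losses” means, the sketch below adds an L2 weight penalty to a cross-entropy data loss. The function names, the toy numbers, and the weight_decay value are assumptions; the exact terms that make up total_loss in the Inception training code may differ.

```python
import numpy as np

def cross_entropy(probabilities, one_hot_labels):
    """Mean cross-entropy of predicted class probabilities."""
    picked = (probabilities * one_hot_labels).sum(axis=1)
    return float(-np.log(picked).mean())

def l2_penalty(weights, weight_decay):
    """Sum over weight arrays of weight_decay * sum(w^2) / 2."""
    return float(sum(weight_decay * 0.5 * np.sum(w ** 2) for w in weights))

# Toy predictions, targets, and weights (all made up).
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = [np.ones((2, 2)), np.ones(3)]

data_loss = cross_entropy(probs, targets)
regularization = l2_penalty(weights, weight_decay=0.01)
total_loss = data_loss + regularization
print(data_loss, regularization, total_loss)
```

Because of the penalty term, the total can sit noticeably above the pure classification loss, which is worth remembering when reading the curve.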

Beyond scalar graphs, Tensorboard makes available histograms (and distributions) that allow you to visualize in great detail what is happening within individual Inception modules. Such histograms can reveal how your weights and biases are shifting over time, and how your model is converging.

Using Inception for Inference

Most people, understandably, are interested in applying their trained models to new test cases. While there is a Bazel task included to evaluate a trained Inception model, I found it to be cumbersome since it relies on processed TFRecords. Instead, what most people are probably interested in is a simple way to classify test cases represented as raw numpy arrays.

To aid understanding, I’ve gone ahead and extracted the relevant pieces into a separate Jupyter notebook. The notebook creates an instance of an Inception model whose weights and biases have been loaded from a checkpoint file. To evaluate the accuracy of the training, batches of MNIST test data are fed to the instantiated Inception model for prediction. These predictions are then compared against the annotated data and incorrectly predicted cases are collected and visualized for analysis.
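The comparison step described above can be sketched in a few lines of NumPy. This is not the notebook’s code: the function name and the toy predictions are hypothetical, and the class probabilities would come from the restored Inception model in practice.

```python
import numpy as np

def evaluate(probabilities, one_hot_labels):
    """Return accuracy and the indices of misclassified examples."""
    predicted = probabilities.argmax(axis=1)
    actual = one_hot_labels.argmax(axis=1)
    missed = np.flatnonzero(predicted != actual)
    accuracy = 1.0 - len(missed) / len(actual)
    return accuracy, missed

# Hypothetical model output for four test images over three classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.eye(3)[[0, 1, 2, 1]]  # true classes: 0, 1, 2, 1

accuracy, missed = evaluate(probs, labels)
print(accuracy, missed)  # 0.75 [3]
```

The returned indices are what you would use to pull out the offending images for visual inspection.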

The code in the notebook could easily be adapted for use in a production environment, e.g., serving Inception predictions from a Flask webserver or accepting input from a camera.

How Well Did the Model Perform?

Though the goal of this tutorial was not to achieve world-class accuracy in classifying hand-written digits (as mentioned previously, Inception is overkill), it’s nice to get a feel for how well it does. I trained Inception against the 55k training images in the MNIST dataset for approximately 87 epochs (36 hours on an EC2 p2.xlarge GPU instance). This yielded an accuracy of 99.05%— not world-class, but a good indication that we’re on the right track!

While 99.05% is not too shabby, it’s certainly a far cry from the 99.51% accuracy we achieved in Part 1 using NielsenNet. This result is even more remarkable when you consider that the NielsenNet architecture is much simpler and requires far fewer resources: training on your laptop is feasible in only a few hours or so. What accounts for the discrepancy?

It is instructive to examine the cases where the model failed to correctly predict the assigned label. A few of the images seem to be genuine edge-cases — for some of them, I’m sticking with the machine’s conclusions!

Miscategorized test images — do you agree with the labeling?

However, the trained model genuinely seems to have problems classifying “loopy” 2’s, classifying many of them as 6’s instead. Inception also seems to stumble when presented with 5’s that look like “S” (classifying them as 2's).

Why can’t Inception classify these images correctly?

At first, I was a bit miffed: a model architected by the best minds at Google and 36 hours of GPU training can’t outsmart a model that my laptop can train in a few hours flat? While it’s true that I didn’t run the training to convergence, I can’t help but be a bit disappointed. What’s going on here?

It never hurts to go “even deeper” and delve into the source code. As it turns out, one of the tactics that the training pipeline uses to increase the number of training samples is to transform the training image to create additional training data. Flipping the image horizontally is one of these distortions and neatly explains why Inception “fails” on these test cases. Flipping these mischaracterized images reveals why the model missed:

Flipping the mischaracterized images explains why these mis-classifications occurred.
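A toy example makes the effect easy to see. The 5x3 glyph below is a made-up stand-in for a blocky “5”; mirroring it left-to-right with NumPy produces something that reads as a “2”.

```python
import numpy as np

# A crude 5x3 glyph: 1s mark ink. Reads as a blocky "5".
glyph = np.array([[1, 1, 1],
                  [1, 0, 0],
                  [1, 1, 1],
                  [0, 0, 1],
                  [1, 1, 1]])

flipped = np.fliplr(glyph)  # mirror left-to-right; now reads as a "2"
print(flipped)
```

Horizontal flips are harmless for photographs of cats and cars, but digits are not mirror-symmetric, so a flipped digit can look like a different class.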

Examining the 95 images that the model missed, I estimate that 25 of these misclassifications can be explained by this quirk, leaving us with a final accuracy of ~99.3%. By comparison, the best classifiers on the MNIST dataset achieve accuracies of >99.6%.
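The arithmetic behind those figures, assuming the misses are counted against the 10k-image MNIST test set:

```python
TEST_IMAGES = 10_000        # size of the MNIST test set
missed = 95
explained_by_flipping = 25  # the estimate given above

raw_accuracy = 1 - missed / TEST_IMAGES
adjusted_accuracy = 1 - (missed - explained_by_flipping) / TEST_IMAGES
print(f"{raw_accuracy:.2%} -> {adjusted_accuracy:.2%}")  # 99.05% -> 99.30%
```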

Conclusions

Thanks for sticking it through to the end. By now, I hope you have learned a thing or two about how deep architectures are architected and enough working knowledge about Inception to understand how to train it on a novel dataset, and how to apply it to test instances.

While this tutorial wasn’t meant to provide any theory, I expect many readers to go even deeper (that’s not going to get old any time soon). Here are some resources:

One of the cool and interesting things about deep learning is that the field is very young. Many of the state-of-the-art architectures and methods are no more than 2–3 years old, and new papers are coming out each month. It’s a very exciting time and I expect more and more clever techniques and approaches to emerge in the next few years.

Update (4/27/2017): Wilkins Chung graciously pointed out an inconsistency between my NielsenNet description and implementation. It has been corrected.
