TensorFlow 1.0 is here. Let’s do some Deep Learning on the Amazon Cloud!

Mariusz Kierski
7 min read · Feb 20, 2017


After long development, Google released the first stable version of its Machine Learning library, TensorFlow.

The release is an important milestone in the development of a common Machine Learning toolkit. TensorFlow provides a set of primitives from which Machine Learning engineers and researchers can construct trainable models — as well as a framework to run these computations in an efficient way.

Inside the “mind” of an Artificial Neural Network. Source: https://thenewstack.io/deep-learning-neural-networks-google-deep-dream/

CPU vs. GPU

Training deep neural networks wouldn’t be practical without the advent of affordable, fast computational units. Training most kinds of machine learning models, including deep neural networks, involves a large number of matrix operations, and these operations are computationally heavy.

A modern GPU comprises thousands of processing cores, each relatively slow on its own (as opposed to a CPU, which has a few cores with strong single-core performance). That means a GPU delivers very high throughput, but only as long as we can spread our computation across its cores.

Fortunately, certain operations, such as the matrix multiplications mentioned above, are easy to distribute and are therefore a perfect fit for a GPU.
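To see why matrix multiplication distributes so easily, here is a minimal pure-Python sketch. Each row of the result depends only on one row of the left matrix and on the whole right matrix, so every row can be handed to a separate worker. The function names are made up for this illustration, and Python threads will not actually make this faster; the point is the independence of the tasks, which is what a GPU exploits across thousands of cores.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, b):
    """Compute one row of the product; it needs only `row` and `b`."""
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]


def parallel_matmul(a, b, workers=4):
    """Multiply a (n x m) by b (m x p), one independent task per output row."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, b), a))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matmul(a, b))  # [[19, 22], [43, 50]]
```

No task ever waits for the result of another one, which is exactly the property that makes this kind of work a good match for many slow cores.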

NVIDIA provides excellent GPUs thoroughly optimised for machine learning computations, as well as the CUDA framework, which is officially supported by TensorFlow. CUDA also provides a lot of helpful diagnostic utilities for debugging GPU programs (more on that later).

Source: http://www.nvidia.com/object/what-is-gpu-computing.html

How different is developing applications with TensorFlow?

Classical programs execute instructions sequentially, line by line, with control flow instructions.

With TensorFlow, there are two phases:

1. Defining the computation graph as a series of math operations on matrices. We design the graph declaratively in a high-level language such as Python.

A computation graph for a simple neural network with one hidden layer. Source: http://www.iotenthu.com/2016/01/tensorflow-post-what-is-tensorflow-part-2-programming-model/

2. Running the graph on a CPU or a GPU and feeding it with inputs. This happens inside a Session, an object that holds the state of an ongoing computation. TensorFlow then maps the operations in our graph onto low-level code for the target device, whether it’s a CPU, a GPU or an entirely different back-end.
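The two phases can be illustrated with a toy stand-in written in plain Python. This is not TensorFlow's API: the `Node` class and the `placeholder`, `add`, `mul` and `run` helpers are invented for this sketch, and it works on scalars rather than matrices. In TensorFlow 1.x the corresponding pieces are `tf.placeholder` for inputs and `Session.run` with a `feed_dict` for execution.

```python
class Node:
    """A node in a toy computation graph; nothing is computed at construction."""

    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name


def placeholder(name):
    return Node("placeholder", name=name)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, feed):
    """Phase 2: evaluate the graph, supplying placeholder values via `feed`."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right


# Phase 1: describe y = x * w + b without computing anything.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = add(mul(x, w), b)

# Phase 2: run the same graph with different inputs.
print(run(y, {"x": 2.0, "w": 3.0, "b": 1.0}))  # 7.0
print(run(y, {"x": 5.0, "w": 3.0, "b": 1.0}))  # 16.0
```

Because the graph is only a description until it is run, the framework is free to decide where and how to execute it, which is what lets the same Python code target a CPU or a GPU.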

A supercomputer at home

Most commonly used machines, especially laptops, don’t feature a GPU capable of accelerating neural network training. This seriously limits the quality of the models we can achieve.

A common method of increasing the accuracy of a model is running multiple iterations of training on the same training set. The time needed to train a model is directly proportional to the number of training iterations.

Leveraging a strong GPU can bring the training time down from weeks to hours. Yet the price of a dedicated GPU alone starts at $5,000… Ouch!

This baby will set you back about $5,000

Solution: The cloud

Fortunately, cloud providers like Amazon AWS and Google Cloud are doing their best to satisfy the rising appetite for high-volume computation.

In September of last year, Amazon AWS introduced EC2 P2, a cloud machine instance class featuring an NVIDIA Tesla K80 GPU suitable for machine learning applications. With EC2 P2, we can simply lease a machine starting at $0.90/hour (at the time of writing this article), and use it to train our models in an efficient way.

OK, enough talking… let’s get our hands dirty!

Setting up an EC2 P2 instance

Our objective will be to set up an EC2 P2 instance that we can use to run GPU computations with TensorFlow. We’re going to use the excellent Jupyter Notebook (formerly IPython Notebook) to experiment. We use this setup to provide machines for our deep learning workshops.

To use the AWS Cloud you must sign up and register your details and credit card, if you haven’t already. This process is described here.

P2 instances are not normally enabled in the EC2 management console. They’re also not available in all AWS regions.

Because P2 is a relatively new service, you need to request a service limit increase before you can create P2 instances. To do that, open a ticket with AWS Support:

Within 24–48 hours, your P2 limit will be increased and you can set up a machine. I waited about 8 hours and got a phone call from an AWS representative:

Press Next: Configure Instance Details. We need to refine some settings.

For this experimental setup, I recommend an Ubuntu 16.04 64-bit box. You’ll need to increase the default disk size from 8 GB to 16 GB or more.

Next, make sure your security group allows inbound connections on port 8888. This is required for the Jupyter Notebook to work.

The last step is to set up a key pair. You’ll need this to SSH into your newly created machine.

The machine is now running. Setting everything up is quite a pain, so I created a quick-and-dirty convenience script that you can just execute via SSH on an already running machine. It will install TensorFlow 1.0 and Anaconda with Python 3.6, and host a Jupyter Notebook server on port 8888 that you can access from your local machine.

Note: before executing the script, you need to put libcudnn, NVIDIA’s deep neural network library (cuDNN), in the /tmp directory of the machine. To obtain that library, register at the NVIDIA website first, then download the files.

Security advisory: the machine isn’t very secure. If you’re working with sensitive data, make sure that the cloud setup you’re using is better secured than the one provided here (encryption and so on).

To obtain your machine’s IP, just open up the EC2 console:

Now SSH into the machine using the .pem key file and run the script.
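As a rough sketch of that SSH step, the snippet below builds the command and tightens the key file's permissions, since ssh refuses private keys that other users can read. The function names, the key file name and the IP address are placeholders made up for this example. The `ubuntu` user is the default login on Ubuntu AMIs, which I am assuming matches the image you picked.

```python
import os
import shlex
import stat


def restrict_key_permissions(key_path):
    """Make the key readable by its owner only (the same as chmod 400)."""
    os.chmod(key_path, stat.S_IRUSR)


def build_ssh_command(key_path, host, user="ubuntu"):
    """Return the ssh invocation as an argument list."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


# Hypothetical key file name and documentation-range IP; substitute your own.
print(shlex.join(build_ssh_command("my-key.pem", "203.0.113.10")))
# ssh -i my-key.pem ubuntu@203.0.113.10
```

Call `restrict_key_permissions` once on the downloaded .pem file, then paste the printed command into a terminal or pass the list to `subprocess.run`.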

The last step is opening http://[your-machine-ip]:8888 in your browser… and voilà!

A Jupyter Notebook ready for our deep learning journey :)
Default password: sigmoidal
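If the page does not load, it may help to check from your local machine whether port 8888 is reachable at all. The helper below is a small diagnostic sketch that uses only the standard library; `port_is_open` is a name invented for this example.

```python
import socket


def port_is_open(host, port=8888, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open("[your-machine-ip]")` with your instance's address should return True once Jupyter is up. A False result usually points to the security group rule for port 8888 or to the notebook server not running.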

Training a simple deep network

Let’s train a neural network. We’re going to build a variant of a Convolutional Neural Network (CNN), a kind of deep network that is particularly well suited to, among other things, image classification: deciding which of a set of predefined classes an image belongs to.
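To give a feel for what the "convolutional" part computes, here is a minimal pure-Python sketch of the core operation: sliding a small kernel over an image and summing the elementwise products at each position. Strictly speaking this is cross-correlation, which is what deep learning frameworks usually call convolution. The image and kernel values are made up for illustration, and in a real network the kernel values are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A made-up 4x4 "image" with a vertical edge, and a made-up 2x2 edge-detecting kernel.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
print(conv2d_valid(image, kernel))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is non-zero only where the kernel sits on the edge, which is how a convolutional layer picks out local patterns regardless of where they appear in the image.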

Our network is a simplified version of the ground-breaking AlexNet, which, back in 2012, set a new standard for the quality of image classification by a computer.

I’ve chosen the popular CIFAR-10 dataset, which contains small images of vehicles and animals labelled with 10 classes. Our objective is to build a system that accurately distinguishes between these classes.

I’ve prepared a Jupyter notebook, with a deep network, to play with. To work with it, just put it in the ~/notebooks directory of your machine and open it in Jupyter. Download it here.

This will download the data and run a classification task on the sample dataset. Because of GPU acceleration we can train it with more iterations. With the default settings I’ve provided (6000 iterations with a batch size of 256), the training takes about two minutes.
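Before raising the number of iterations, it can be handy to estimate what a longer run will take and cost. The sketch below extrapolates from the figures quoted in this post (6000 iterations in about two minutes, $0.90/hour) and assumes that time scales linearly with the number of iterations. It ignores billing granularity and storage charges, so treat the numbers as rough guidance only.

```python
BASELINE_ITERATIONS = 6000   # default setting quoted in this post
BASELINE_SECONDS = 2 * 60    # "about two minutes"
PRICE_PER_HOUR_USD = 0.90    # P2 price quoted in this post


def estimate(iterations):
    """Return (seconds, cost in USD), assuming time scales linearly with iterations."""
    seconds = BASELINE_SECONDS * iterations / BASELINE_ITERATIONS
    cost = PRICE_PER_HOUR_USD * seconds / 3600
    return seconds, cost


for n in (6000, 60000, 600000):
    seconds, cost = estimate(n)
    print(f"{n:>7} iterations: ~{seconds / 60:.0f} min, ~${cost:.2f}")
```

Under these assumptions, a hundred times more iterations would take about 200 minutes of compute, costing roughly $3.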

We’re getting about 67% accuracy on the test set. It’s not bad for a network this simple; but we can do much better — current state-of-the-art (best result known in the academic world) is 96% for this dataset. Even without implementing more sophisticated methods, increasing the number of hidden nodes and training iterations should give us better results. Try it yourself!

Note how we get much better results on the later training mini-batches than on the test set. This is due to overfitting to the training data. There are multiple ways to handle this, but they’re beyond the scope of this introductory post.
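The gap between training and test accuracy is easy to measure yourself. In the sketch below, accuracy is simply the fraction of predictions that match the true labels. The label lists are made up for illustration and are not results from the notebook.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for illustration only; these are not results from the notebook.
train_pred, train_true = [3, 5, 1, 7, 7, 0, 2, 9], [3, 5, 1, 7, 7, 0, 2, 8]
test_pred, test_true = [3, 4, 1, 6, 7, 0, 2, 2], [3, 5, 1, 7, 7, 0, 2, 8]

train_acc = accuracy(train_pred, train_true)
test_acc = accuracy(test_pred, test_true)
print(f"train {train_acc:.3f}, test {test_acc:.3f}, gap {train_acc - test_acc:.3f}")
# train 0.875, test 0.625, gap 0.250
```

A gap that keeps growing as training goes on is the usual sign that the model is memorising the training set rather than learning something that generalises.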

Some predictions on a test set with 67% accuracy. Not everything quite right :(

Is it really running on a GPU?

TensorFlow applications run on a GPU by default if it’s available. If your network training takes too long, perhaps it’s worth making sure the GPU is being utilised.

To see that utilisation in real time, SSH into the machine and type:

$ watch -n 0.5 nvidia-smi

You should see something similar to this, refreshed every half a second:

Here, my instance’s GPU is being utilised at 71% (a screenshot taken while training the network from the notebook)
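If you would rather read the utilisation from a script, for example to log it during training, nvidia-smi can print just that figure in CSV form. The sketch below wraps that query. The function names are invented for this example, and `gpu_utilisation` only works on a machine where nvidia-smi is on the PATH, so the demo line parses a sample string instead.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilisation(output):
    """Turn nvidia-smi's one-number-per-line output into a list of percentages."""
    return [int(line) for line in output.splitlines() if line.strip()]


def gpu_utilisation():
    """Run nvidia-smi (must be on PATH) and return the utilisation of each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilisation(result.stdout)


# Sample text standing in for real output from a single-GPU machine.
print(parse_utilisation("71\n"))  # [71]
```

A value that stays near zero while the notebook is training suggests that the computation is running on the CPU instead.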

Don’t forget to stop the machine and wash your hands after you finish!

It’s quite costly: paying $0.90/hour for nothing is not much fun. Note that if you stop the machine, you’ll still be charged a little for the instance’s disk usage, but your notebooks and data will be retained in case you decide to use this instance again. If, however, you’re done with it, terminate the machine instead to avoid the extra cost. It will then disappear from the instance list, and you’ll have to set everything up from scratch next time.

About me

I’m open to any questions you might have about TensorFlow and machine learning, and always willing to help — just ping me at mariusz@sigmoidal.io

I’m a founder of Sigmoidal, a boutique consultancy where we tackle machine learning and AI problems and advise our customers on the right use of AI for the good of their businesses. (Or, the world, in general.)
