TensorFlow + SageMaker = ❤

Enias Cailliau
Published in Radix · 6 min read · Nov 21, 2018

Amazon SageMaker is a managed machine learning service (MLaaS). The platform lets you quickly build, train and deploy machine learning models. In this blog post, we’ll guide you through a typical development process using recent SageMaker features such as pipe input mode and automatic model tuning.

This is the second article in the series “Accelerating deep learning development using SageMaker”. In this series, we explore the added value of using SageMaker for deep learning development. Part 1 provides a general overview of what SageMaker has to offer. This second part takes a more technical view of how development lifecycles can be accelerated using the SageMaker ecosystem.

In this tutorial, we will develop a Convolutional Neural Network (CNN) to classify fashion items in the Fashion MNIST dataset from Zalando. This dataset contains 55,000 training samples and 10,000 validation samples. Each sample is a 28x28 grayscale image of a clothing item linked to a label. In total, there are ten possible labels: t-shirts, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots.
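
The labels are stored as integers from 0 to 9. A minimal sketch of a lookup table, assuming the index order published with the Fashion MNIST dataset (the names `LABEL_NAMES` and `label_name` are ours for illustration, not part of any library):

```python
# Class names in the index order published with the Fashion MNIST dataset (0 through 9).
LABEL_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_name(label_index):
    """Translate an integer label (0-9) into a human-readable class name."""
    if not 0 <= label_index < len(LABEL_NAMES):
        raise ValueError(
            f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label_index}"
        )
    return LABEL_NAMES[label_index]


print(label_name(9))
```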

This tutorial will guide you through three essential stages in machine learning development: data engineering, optimisation and deployment. We will use SageMaker to make it easier to iterate from one model to another. First, we will prepare the dataset so that we can stream data to our model during training. Then we will optimise our model using a distributed version of hyperparameter tuning. As the last step, we will deploy the TensorFlow model as a service.

This article is written in a code-along format. If you want to dig into the code yourself, you can find useful notebooks on the GitHub repository of this article.

Part 1: Prepare data for streaming

SageMaker offers two input modes for feeding datasets to your model: PIPE mode and FILE mode. FILE mode downloads the complete dataset to the training instance before training starts. PIPE mode reduces this overhead by streaming training and validation data straight from S3 storage. Using PIPE mode has three key advantages compared to downloading a file in FILE mode:

  1. Training jobs can start sooner since they don’t need to wait for the full dataset.
  2. Training instances require less storage space since the dataset is never stored on the instance's disk.
  3. Streaming data from S3 can be faster than reading from a local file since the S3 filehandles are highly optimised and multi-threaded.

More information on PIPE input mode can be found here.

Transforming dataset samples to TFRecords

We first need to transform our data into a data structure which is compatible with the PIPE input mode. Currently, PIPE supports two file formats: RecordIO encoding and TFRecord encoding. For our development, we will use TFRecord encoding.

First, go to the SageMaker console and create a fresh SageMaker Jupyter Notebook instance using an instance type of your choice. During the development of this article, I used the ml.t2.xlarge instance type.

Once your Jupyter environment is ready, you can open it and create a new notebook with the conda_tensorflow_p36 kernel. We can verify that TensorFlow is available by printing the TensorFlow version.

Now that we have verified that TensorFlow is available, we can start preparing our dataset. First, we import the dataset and observe the shapes (dimensions) of the training set and validation set.

Optional: If you want, you can have a look at some examples from the Fashion MNIST dataset.

We will write a helper function that stores a dataset in the form of TFRecords. The code for this helper function is fairly verbose, but lines 25–30 contain the essence: we store each sample (image + label) as a single record. When saving the image in a binary representation, we lose some information such as height, width and number of colour channels. Therefore, we also store these dimensions in each record.
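
A minimal sketch of such a helper, assuming a TensorFlow release in which `tf.io.TFRecordWriter` is available (older 1.x releases expose it as `tf.python_io.TFRecordWriter`) and images shaped (height, width, channels). It is a reconstruction for illustration rather than the exact helper from the notebook, so the line numbers mentioned above do not apply to it. The feature names (`image_raw`, `label`, and so on) and the function name `write_tfrecords` are our own choices.

```python
import numpy as np
import tensorflow as tf


def _bytes_feature(value):
    """Wrap a bytes object in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    """Wrap an integer in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))


def write_tfrecords(images, labels, path):
    """Write (image, label) pairs to a TFRecord file, one record per sample.

    `images` is expected to have shape (num_samples, height, width, channels).
    """
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            height, width, channels = image.shape
            example = tf.train.Example(features=tf.train.Features(feature={
                # Raw bytes carry no shape, so the dimensions are stored alongside.
                "height": _int64_feature(height),
                "width": _int64_feature(width),
                "channels": _int64_feature(channels),
                "label": _int64_feature(label),
                "image_raw": _bytes_feature(np.ascontiguousarray(image).tobytes()),
            }))
            writer.write(example.SerializeToString())
```

Fashion MNIST images are 28x28 arrays, so with this sketch they would first need a trailing channel axis, for example `images.reshape(-1, 28, 28, 1)`. When reading a record back, the stored height, width and channels are what allow the flat byte string to be reshaped into an image again.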

If you want to understand how to transform a dataset into TFRecords, you can read more about this binary format in this blog.

Storing TFRecords on S3

Let’s now transform the dataset into TFRecords using our helper function. Since the pipe input mode is based on S3 filehandles, the complete dataset must be stored on S3. We’ll save the data using the upload_data function. Once the upload completes, it returns an S3 URL which points to the uploaded data.

Part 2: Model design

Now that our dataset is available on S3, we can start implementing our CNN in TensorFlow. First, create a new notebook using the conda_tensorflow_p36 kernel. We named our notebook tune_fashion_network.ipynb. At the top of this notebook, make sure the appropriate dependencies are available.

Build a TensorFlow model that is compatible with SageMaker

SageMaker relies heavily on TensorFlow’s Estimator API. Because of this dependency, you are required to write your model according to the specifications of this API. At the bare minimum, SageMaker requires an entry_point file which includes three definitions. First, model_fn() defines the architecture of your model. The other two, train_input_fn() and eval_input_fn(), describe how data is fed into the model during training and validation.

PipeModeDataset (used in the function _input_fn()) is an implementation of the TensorFlow Dataset API which enables the use of the pipe input mode. The complete definition of our entry_point file can be found here. Don’t worry if you don’t understand everything just yet; each line will become clear as we work our way through this article.

Finally, we wrap our TensorFlow model inside a sagemaker.tensorflow.estimator.TensorFlow object to make it compatible with SageMaker’s services (more information can be found here). This wrap operation allows us to train the model on an instance of type ml.c5.2xlarge for 20,000 iterations. If you want to use a faster machine for training your model, you can do so by changing the train_instance_type parameter. Note that we explicitly set the input_mode to ‘Pipe’ to force the usage of PIPE input mode.

Enable hyperparameter tuning in SageMaker

Now that we have defined our model, we can focus on optimising its hyperparameters. SageMaker offers automatic hyperparameter tuning through the Automatic Model Tuning service. The service spins up multiple machines to search for the optimal hyperparameter configuration through Bayesian optimisation.

Using automatic model tuning involves three steps:

  1. Define the training objective. The current implementation grabs the objective metric from the logs using regular expressions. Therefore, if you want to use an alternative objective such as accuracy, you need to log this metric during training.
  2. Define hyperparameter ranges.
  3. Create a HyperparameterTuner instance which defines the number of jobs that it is allowed to use.
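
Step 1 can be illustrated without SageMaker at all. The sketch below assumes a hypothetical log format (`loss = 0.55`) and a metric definition shaped like the `Name`/`Regex` pairs used for SageMaker metric definitions. The `extract_metric` helper is ours and only mimics, in simplified form, what the service does with the training logs.

```python
import re

# Hypothetical metric definition: a name plus a regex with one capture group.
METRIC_DEFINITIONS = [{"Name": "loss", "Regex": r"loss = ([0-9\.]+)"}]

# Hypothetical log lines; the real format depends on what the training script prints.
LOG_LINES = [
    "Saving checkpoints for step 1000",
    "Evaluation at step 1000: loss = 0.55",
    "Evaluation at step 20000: loss = 0.26",
]


def extract_metric(definition, lines):
    """Return every value captured by the definition's regex, in log order."""
    pattern = re.compile(definition["Regex"])
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


losses = extract_metric(METRIC_DEFINITIONS[0], LOG_LINES)
print(losses)
```

If the regular expression does not match anything your training script actually prints, the tuner has no objective value to compare jobs with, which is why a custom objective such as accuracy has to be logged explicitly.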

More information on how to use automatic model tuning can be found through this link.

We can now trigger the hyperparameter search by calling the fit function on our HyperparameterTuner instance. Notice that we pass the training data and validation data through a dictionary. The dictionary instructs SageMaker to open two channels: ‘train’ and ‘eval’. These channels are then processed by the _input_fn() we wrote earlier (see cnn_fashion_mnist.py).

When writing this article, I let the tuner run 16 times, which reduced the loss by roughly 50% (from 0.55 to 0.26).

Putting your model into production

The last step in the production pipeline is to serve the model as a service. SageMaker facilitates this process through its sagemaker.tensorflow.estimator.TensorFlow implementation.

First, we will train our model for a few more iterations on a faster instance (in this case the ml.p2.xlarge).

After training the model, we can deploy it using the deploy function. Notice that we use a much cheaper instance_type since inference isn’t as compute-intensive as training.

To test if the service is up and running, we can send a randomly initialised image and check if the API replies.
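
A minimal sketch of building such a test image with the standard library only. The `make_random_image` helper and the nested-list layout are assumptions for illustration; the exact input format the endpoint expects depends on how the model's serving input is defined.

```python
import json
import random


def make_random_image(height=28, width=28, seed=None):
    """Build a height x width grid of random grayscale intensities in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(width)] for _ in range(height)]


image = make_random_image(seed=0)

# Nested lists of floats serialise cleanly, which makes them easy to send over HTTP.
payload = json.dumps(image)
print(len(image), len(image[0]))
```

The image would then be passed to the predictor object returned by the deploy function. A random image will not produce a meaningful prediction; the point is only to confirm that the endpoint responds.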

Optional: We can further analyse the performance of the model using the test set. For some example code on how to do that you can check out our notebook.

Conclusion

During this tutorial, we discovered the value of using SageMaker when developing deep learning systems. We’ve shown that SageMaker has excellent support for native TensorFlow implementations by offering efficient and distributed hyperparameter tuning.
