Generating pictures with a neural network on AWS SageMaker with GPU acceleration


If you deal with machine learning in your everyday work (like we do at Unit8), there comes a moment when you need to run some intensive computations to train your model. If you are lucky and have a desktop with a powerful GPU, problem solved: you can happily run the training locally. If you are less lucky and don’t have a GPU at hand, you need to run your computations somewhere else. Perhaps the cloud? I stumbled onto this problem when I tried to run a not-so-optimised training algorithm written in Keras. This article describes the approach I took with AWS to run my algorithm on a GPU.
So far, the typical workflow was to first start a VM with a provisioned GPU at a cloud provider of your choice, and then work on model development and training. In the case of AWS, your workflow could look like this:

Black brace— what you pay for, light blue arrow — what you get

As you can see, your VM acquires the GPU and holds onto it even when you are not actually using it: during all the preparation and evaluation steps the GPU simply sits idle, waiting for tasks.
This of course comes at a price. You are charged for all the time you spend fixing bugs or working on the code itself, which can add up quickly (from $1 to more than $20 per hour!). Not to mention the case when you forget to switch the VM off. Yikes.
This is why AWS came up with its own framework to handle this and many other problems that occur in everyday work on ML-related tasks: AWS SageMaker.

What is SageMaker?

Sagemaker is a set of tools offered by Amazon to handle ML problems, like:

  • data cleaning
  • model training with/without GPU support
  • hyperparameter tuning
  • model serving

Most of these problems are tackled via a provided, pre-configured Jupyter notebook with some additional SageMaker Python libraries. Those libraries are meant to help with AWS-related tasks. We can, for example, run training on a chosen GPU-backed EC2 instance and have the models stored on S3 and the logs in CloudWatch. We even get AWS-provided implementations of the most popular algorithms (e.g. K-means).

This blog post focuses on the training procedure. With AWS SageMaker it would look like this:

Prepare your training job and only then spin up separate VM with GPU

In the rest of this article, I describe my journey of taking a slow, extremely long-running training job and speeding it up with an AWS-provided GPU.

Problem — backstory

I was experimenting a bit with GANs (Generative Adversarial Networks). If you don’t know what they are, you can find a very cool introduction here. After playing a bit with training and generating simple things like the MNIST dataset, I wanted to try out something nicer and stumbled onto this implementation. I started the training procedure on my Mac and noticed:

Epoch 1 of 1
0/1000 […………………………] — ETA: 0s
1/1000 […………………………] — ETA: 1:33:12

An ETA of roughly 1.5 hours for a single iteration! This would mean that full training could last 1000 * 1.5 = 1500 hours, or about 62 days!
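The same estimate can be reproduced directly from the ETA string Keras prints. This is only a small sketch: the helper names are made up for illustration, and it assumes the ETA is always printed in h:mm:ss form, as in the log above.

```python
def eta_to_hours(eta):
    """Convert an ETA string such as '1:33:12' (h:mm:ss) to hours."""
    hours, minutes, seconds = (int(part) for part in eta.split(":"))
    return hours + minutes / 60.0 + seconds / 3600.0


def total_training_days(eta_per_iteration, iterations):
    """Total wall-clock time in days if every iteration takes the given ETA."""
    return eta_to_hours(eta_per_iteration) * iterations / 24.0


days = total_training_days("1:33:12", 1000)
print("{:.1f} days".format(days))
```

With the exact ETA from the log this gives a bit under 65 days, in line with the rounded 62-day figure above.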

Okay, the Keras training code could probably be optimised if it were written as one model in Tensorflow instead of two (discriminator + generator). However, I found this example to be a good excuse to try out AWS-provided GPUs. So here we are.

Problem — TL;DR

  • We want to reuse the ACGAN network implementation in Keras provided here
  • At some point the network can be reused for purposes other than CIFAR10
  • The network should support further training (no need to start from scratch)
  • Locally on a macOS CPU it is terribly slow: 1 iteration takes ~1.5 hours, and we need ca. 200 for reasonable results
  • We decide to utilize AWS-provided GPUs and use SageMaker, the dedicated AWS tool, to get on with model training.

To add more problems…

The SageMaker workflow I described in the previous section works only in some particular cases:

  • You use SageMaker built-in algorithms
  • You use one of the supported frameworks (raw Tensorflow, PySpark etc.; for the full list consult the FAQ)

You can still use your custom training algorithms, but for that you have to provide your own Dockerfile that includes all the necessary libraries and dependencies. AWS provides an example notebook for scikit.
The notebook is a good base to start from, but unfortunately it is not designed to run with GPU support, so some amendments have to be made.

Preparing the docker image

We are interested in using Keras with a Tensorflow backend running on a GPU, so our options are limited. Looking at the Tensorflow documentation, we see the following requirements:

  • 64-bit Linux
  • Python 2.7
  • CUDA 7.5 (CUDA 8.0 required for Pascal GPUs)
  • cuDNN v5.1 (cuDNN v6 if on TF v1.3)

It is not so easy to use GPU acceleration from inside a Docker container. For that, we need to run an enhanced daemon called nvidia-docker and pick one of the predefined images from the nvidia-docker project. Luckily, the daemon is already preinstalled on all SageMaker-supported GPU instances, so no action needs to be taken here. To fulfil the Tensorflow requirements, I decided to go with cuda:9.0-cudnn7-runtime-ubuntu16.04. We can pick the runtime version instead of the devel one to reduce the image size. Full Docker image:

# Build an image that can do training in SageMaker
# This image contains CUDA 9.0 (CUDA libs are backward-compatible), cuDNN version 7 and 64-bit Ubuntu
FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
# FROM ubuntu:16.04 - if we do not want cuda support
RUN apt-get -y update && apt-get install -y --no-install-recommends \
        wget \
        python \
        nginx \
        ca-certificates \
        python-dev \
        python-tk \
        gcc \
        g++ \
        libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
# Here we get all python packages.
RUN wget && python && \
pip install numpy scikit-learn pandas flask gevent gunicorn matplotlib tensorflow-gpu keras Pillow six
# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
# Set up the program in the image
COPY keras-nn /opt/program
WORKDIR /opt/program

Important remarks:

  • Image: nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
  • To get GPU support, remember to install tensorflow-gpu
  • I skipped the symbolic linking from the original notebook, because it caused binary incompatibilities between packages
  • DO NOT put anything in the /opt/ml directory; it is reserved for SageMaker and will be overwritten
  • The Docker image has to be pushed to ECR; SageMaker doesn’t like to cooperate with image registries from other providers
  • SageMaker starts training by running the container with the command train, so make sure a train script can be found on the PATH (here /opt/program, which is also the WORKDIR) and has execute permissions
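Before paying for a long training job, it can be worth checking that TensorFlow inside the container actually sees the GPU. A minimal sketch, assuming TensorFlow is installed as in the Dockerfile above (the function name is made up; it returns an empty list when TensorFlow is missing or no GPU is visible):

```python
def visible_gpus():
    """Return the names of the GPU devices TensorFlow can see, or [] if there are none."""
    try:
        # device_lib ships with TensorFlow itself.
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [device.name
            for device in device_lib.list_local_devices()
            if device.device_type == "GPU"]


gpus = visible_gpus()
print("GPUs visible to TensorFlow: {}".format(gpus if gpus else "none"))
```

Calling this at the very beginning of the train script puts the answer into the job logs, so a misconfigured image shows up in the first seconds rather than after hours of CPU-only training.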

Docker image — directory mapping

After we submit our job for training, we have to take into account some special rules imposed by Sagemaker. Since it has to somehow provide I/O for our training algorithms, it mounts the following paths (following the description from the original notebook):

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure

  • /opt/ml/input: we use it to provide input data and tuning parameters to our training jobs. AWS can also take care of bringing the data in from e.g. S3 and mounting it here. Since the CIFAR dataset is built into the Keras library and we do not plan to do any tuning for now, we can leave this whole directory empty.
  • /opt/ml/model: this is where we store the model resulting from our training, which is then packed into a tar.gz archive and shipped to S3.
  • /opt/ml/output/failure: this is where we can write stack traces and error messages, which will be available later in case of errors.

All of the above paths are available as normal block storage and can be written to by any standard IO library.
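We leave the input directory empty here, but if you later want to pass tuning parameters, reading them could look roughly like the sketch below. The function name and the keys epochs and save_interval are hypothetical; the defaults simply mirror the values used later in this post. One thing to keep in mind is that SageMaker delivers hyperparameter values as strings, so they need an explicit conversion.

```python
import json
import os


def load_hyperparameters(prefix="/opt/ml"):
    """Read input/config/hyperparameters.json under prefix; return {} if it does not exist."""
    path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


params = load_hyperparameters()
# Values arrive as strings, so convert them before use (key names are hypothetical).
epochs = int(params.get("epochs", 160))
save_interval = int(params.get("save_interval", 60))
print("epochs={}, save_interval={}".format(epochs, save_interval))
```

Falling back to defaults when the file is absent keeps the same script usable both on SageMaker and on a local machine.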

Train algorithm — modifications

For the following chapter, the whole file can be reviewed here. In the post I only highlight the key components.

We need to slightly change the original algorithm by applying the following modifications:

  • Rename the training script to train (or add a train script that calls your actual code)
  • Use the mounted directories to store your output:
import os

prefix = '/opt/ml/'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')

Note: here we abuse the “model” directory a bit to also store the images generated every n iterations. They will also be shipped in the resulting model.tar.gz file.

  • We add an instruction to write the stack trace to the failure file in case of an exception and (important) return a non-zero exit code:
from __future__ import print_function  # for print(..., file=...) on Python 2.7
import sys
import traceback

try:
    train()  # placeholder name for your actual training routine
except Exception as e:
    # Write out an error file. This will be returned as the failureReason in the
    # DescribeTrainingJob result.
    trc = traceback.format_exc()
    with open(os.path.join(output_path, 'failure'), 'w') as s:
        s.write('Exception during training: ' + str(e) + '\n' + trc)
    print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
    # A non-zero exit code causes the training job to be marked as Failed.
    sys.exit(255)
  • We add a save_interval variable so that we do not store the output on every iteration, which reduces the size of the resulting model.tar.gz
  • The saved discriminator was quite big (>150 MB), which is why I decided to write only the latest discriminator model, plus every 60th generator model with images:
# Assumption: generator and discriminator are the Keras models, saved under model_path
if epoch % save_interval == 0:
    images_dir = os.path.join(model_path, "images")
    if not os.path.exists(images_dir):
        os.makedirs(images_dir)
    Image.fromarray(img).save(os.path.join(images_dir, 'plot_epoch_{0:03d}_generated.png'.format(load_epoch)))
    generator.save(os.path.join(model_path, 'generator_{0:03d}.h5'.format(load_epoch)))
    discriminator.save(os.path.join(model_path, 'discriminator.h5'.format(load_epoch)))
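Put together, the failure handling from the modifications above can be tried out locally before anything is pushed to AWS. The following is only a sketch: run_training and broken_training are illustrative names, not part of the original script, and the prefix argument exists purely so that the dry run can use a scratch directory instead of /opt/ml.

```python
import os
import sys
import tempfile
import traceback


def run_training(train_fn, prefix="/opt/ml"):
    """Call train_fn(model_path); return 0 on success or 255 after recording the failure."""
    output_path = os.path.join(prefix, "output")
    model_path = os.path.join(prefix, "model")
    for path in (output_path, model_path):
        if not os.path.isdir(path):
            os.makedirs(path)
    try:
        train_fn(model_path)
        return 0
    except Exception as e:
        message = "Exception during training: " + str(e) + "\n" + traceback.format_exc()
        # SageMaker reports the content of this file as the failure reason.
        with open(os.path.join(output_path, "failure"), "w") as f:
            f.write(message)
        sys.stderr.write(message)
        return 255


# Local dry run in a scratch directory, with a training function that always fails.
def broken_training(model_path):
    raise ValueError("out of birds")


scratch = tempfile.mkdtemp()
exit_code = run_training(broken_training, prefix=scratch)
print("exit code: {}".format(exit_code))
```

In the real train script the last line would be something like sys.exit(run_training(train)), with train being your own training function, so that the non-zero code reaches SageMaker and the job is marked as Failed.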

Sagemaker upload

After the initial preparation we can start AWS SageMaker and create a new notebook instance. We then upload our files, keeping the same structure as in this directory.

Submitting the job

Following the keras_on_gpu.ipynb notebook, we build the Docker image as described and push it to the ECR registry. We then instruct SageMaker to run our training job on a new VM and store the output in S3.

import sagemaker as sage  # image, role and sess are defined earlier in the notebook
tree = sage.estimator.Estimator(image, role, 1, 'ml.p2.xlarge',
                                output_path="s3://{}/output_keras_gpu".format(sess.default_bucket()),
                                sagemaker_session=sess)

Interesting arguments:

  • image: the URI of the pushed Docker image in the ECR registry.
  • 1: the number of training instances. For now we need just 1 instance with a GPU.
  • ‘ml.p2.xlarge’: the AWS ML instance type to use, the cheapest one with a GPU. It comes with basic libraries and support for running the nvidia-docker daemon. Warning: pick your instance type carefully, some of them can cost over $20 per hour!
  • output_path: the location to store the output of the operation (you can find your model.tar.gz here)
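For reference, the image argument follows the usual ECR naming scheme of account, region, repository and tag. A small sketch of how such a URI is put together; the helper name, the account id, the region and the repository name below are placeholders, not values from my setup:

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in Amazon ECR (standard AWS regions)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


# Placeholder values, for illustration only.
image = ecr_image_uri("123456789012", "eu-west-1", "keras-gpu")
print(image)  # 123456789012.dkr.ecr.eu-west-1.amazonaws.com/keras-gpu:latest
```

The region in the URI should match the region where the training job runs.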

After submitting the job, we can simply close the notebook; there is no need to watch the output and be charged for an idling non-GPU instance. Job progress can be tracked in the SageMaker UI (tab Training/Training jobs).
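If you would rather follow the job from a script than from the console, the status can also be polled through the SageMaker API. The sketch below is not part of my notebook: the function name is made up, and it takes the client as an argument, assuming a boto3 SageMaker client whose describe_training_job call returns a TrainingJobStatus field.

```python
import time


def wait_for_training_job(client, job_name, poll_seconds=60):
    """Poll describe_training_job until the job reaches a final state; return the last response."""
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        print("{}: {}".format(job_name, status))
        if status in ("Completed", "Failed", "Stopped"):
            return description
        time.sleep(poll_seconds)


# Typical use (needs boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   result = wait_for_training_job(boto3.client("sagemaker"), "my-training-job")
#   print(result.get("FailureReason", "no failure reported"))
```

For a failed job, the FailureReason field carries what the train script wrote to the failure file.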

Accessing status and logs of a job

After following the link, we can see basic information about our job and the instance we use, and access CloudWatch metrics as well as logs of whatever is printed to the output inside the container.
I decided to run 160 iterations of my model (okay, 161) and see how far the network gets. Looking at the CloudWatch logs, we can see that the time to train 1 epoch has decreased from 1.5 hours to roughly 5 minutes.
It means that instead of waiting about 10 days (160 iterations at 1.5 hours each) I can get my model up and running in 16 hours. That is a speedup of roughly 15x, and I used the cheapest GPU available on AWS!

Obtaining the model

When we notice the Completed status in our training jobs, we can happily access whatever was produced in S3, in the form of a compressed model.tar.gz archive. As you can see, in my case it looks like this:

Content of model.tar.gz including generator and discriminator models and some sample pictures
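Unpacking the archive needs nothing beyond the Python standard library. A minimal sketch, assuming model.tar.gz has already been downloaded from the S3 output path to the local machine (the function name and the directory names are illustrative):

```python
import os
import tarfile


def extract_model(archive_path, target_dir):
    """Unpack a model.tar.gz archive and return the sorted relative paths of its files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    extracted = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            extracted.append(os.path.relpath(os.path.join(root, name), target_dir))
    return sorted(extracted)


# Typical use, with the archive in the current directory:
#   for path in extract_model("model.tar.gz", "model"):
#       print(path)
```

Only unpack archives you produced yourself this way, since extractall trusts the paths stored inside the archive.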

After unpacking the archive we can see that the network evolved all the way from iteration 1:

After iteration 1 — network already tries to generate some sky/ground

To the last iteration:

Last iteration — not really photorealistic but we can clearly spot which row shows airplanes and which horses.

Cool! So the network did learn something and we want to somehow use it. We now have two options:

  • Use SageMaker to serve the model: similar to train, we can also provide a serve program that will take care of using AWS infrastructure to expose the model to the outside world. This is, however, out of scope for this entry, and it’s quite well documented in the original scikit notebook.
  • Download the model and run it locally: this is the option I’ve chosen.

Running the model locally

Following the read_model.ipynb notebook, we can read our model as follows:

from keras.models import load_model
generator = load_model('generator.h5')

We can then use our trained generator to create CIFAR-like images. Let’s ask our network to draw some birds.

import numpy as np

latent_size = 110
generate_class = 2  # Choose an image class to generate, here 2 - birds
# 100 noise vectors, one per generated image
noise = np.random.normal(0, 0.5, (100, latent_size))
sampled_labels = np.array([
    [generate_class] * 10 for i in range(10)
]).reshape(-1, 1)
# Move the channel axis (axis 1) to the end so the images can be plotted
generated_images = generator.predict([noise, sampled_labels]).transpose(0, 2, 3, 1)
# Map values from [-1, 1] to [0, 255]
generated_images = np.asarray((generated_images * 127.5 + 127.5).astype(np.uint8))

And the outcome is:

Yes, that looks like birds

How about horses? Planes? Ships?

Horses, planes and ships. We can see that network managed to recognize characteristic features for all of them


In this blog post I’ve described how to take practically any algorithm and train it using GPUs and the AWS SageMaker framework. To sum up, you need to:

  • pick a GPU-compatible framework of your choice
  • prepare an nvidia-docker image with all the additional dependencies you need
  • prepare your training functions and data to be compatible with SageMaker requirements
  • submit the model for training
  • download the model and do whatever you want with it


All the work described here is available in this GitHub repository:

Not everything is beautiful so contributions are very welcome!
Interesting highlights:

Future work

So far I have only described how to train on a simple (and, to be honest, quite old) ML dataset.

I’m considering writing further entries about our work with ML/AWS, for example describing a splendid piece of work my colleague at Unit8 did to automate pneumonia detection, and how it can be deployed and served in the cloud.
Thanks for reading and let’s keep in touch!