Training with Keras-MXNet on Amazon SageMaker

Julien Simon
Jun 2, 2018

As previously discussed, Apache MXNet is now available as a backend for Keras 2, aka Keras-MXNet.

In this post, you will learn how to train Keras-MXNet jobs on Amazon SageMaker. I’ll show you how to:

  • build custom Docker containers for CPU and GPU training,
  • configure multi-GPU training,
  • pass parameters to a Keras script,
  • save the trained models in Keras and MXNet formats.

As usual, you’ll find my code on GitHub :)

That’s how it feels when your custom container runs without error :)

Configuring Keras for MXNet

All it really takes is setting ‘backend’ to ‘mxnet’ in .keras/keras.json, but setting ‘image_data_format’ to ‘channels_first’ as well will make MXNet training faster.

When working with image data, the input shape can either be ‘channels_first’, i.e. (number of channels, height, width), or ‘channels_last’, i.e. (height, width, number of channels). For MNIST, this would either be (1, 28, 28) or (28, 28, 1): one channel (black and white pictures), 28 pixels by 28 pixels. For ImageNet, it would be (3, 224, 224) or (224, 224, 3): three channels (red, green and blue), 224 pixels by 224 pixels.

Here’s the configuration file we’ll use for our container.
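
It should look something like this (a minimal sketch: ‘floatx’ and ‘epsilon’ are simply left at their Keras defaults):

{
    "backend": "mxnet",
    "image_data_format": "channels_first",
    "floatx": "float32",
    "epsilon": 1e-07
}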

Building custom containers

SageMaker provides a collection of built-in algorithms as well as environments for TensorFlow and MXNet… but not for Keras. Fortunately, developers have the option to build custom containers for training and prediction.

Obviously, a number of conventions need to be followed for SageMaker to successfully invoke a custom container:

  • Name of the training and prediction scripts: by default, they should be named ‘train’ and ‘serve’ respectively, be executable and have no extension. SageMaker will start training by running ‘docker run your_container train’.
  • Location of hyperparameters in the container: /opt/ml/input/config/hyperparameters.json.
  • Location of input data in the container: /opt/ml/input/data/<channel_name> (with the corresponding configuration in /opt/ml/input/config/inputdataconfig.json).

This will require some changes in our Keras script, the well-known example of learning MNIST with a simple CNN. As you will see in a moment, they are quite minor and you won’t have any trouble adding them to your own code.

Building a CPU-based Docker container

Here’s what goes into the Dockerfile.

We start from an Ubuntu 16.04 image and install:

  • Python 3 as well as native dependencies for MXNet.
  • the latest and greatest packages of MXNet and Keras-MXNet.

You don’t have to install pre-release packages. I just like to live dangerously and add extra spice to my oh-so-quiet everyday life :*)

Once this is done, we clean up various caches to shrink the container size a bit. Then, we copy:

  • the Keras script to /opt/program with the proper name (‘train’) and we make it executable.

For more flexibility, we could write a generic launcher that would fetch the actual training script from an S3 location passed as a hyperparameter. This is left as an exercise for the reader ;)

  • the Keras configuration file to /root/.keras/keras.json.

Finally, we set the directory of our script as the working directory and add it to the path.

It’s not a long file, but as usual with these things, every detail counts.
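
Here is a minimal sketch of such a Dockerfile (the apt package list and the ‘mnist_cnn.py’ script name are illustrative; refer to the repository for the real thing):

FROM ubuntu:16.04

# Python 3 plus the native libraries the MXNet pip package relies on
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip python3-setuptools \
        build-essential libopencv-dev libatlas-base-dev && \
    rm -rf /var/lib/apt/lists/*

# latest MXNet and Keras-MXNet packages (add --pre if you want the pre-releases)
RUN pip3 install --no-cache-dir mxnet keras-mxnet numpy

# Keras configuration file with the backend set to 'mxnet'
COPY keras.json /root/.keras/keras.json

# the Keras script, renamed 'train' (no extension) and made executable
COPY mnist_cnn.py /opt/program/train
RUN chmod +x /opt/program/train

# make the script directory the working directory and add it to the path
WORKDIR /opt/program
ENV PATH="/opt/program:${PATH}"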

Building a GPU-based Docker container

Now let’s build its GPU counterpart. It differs in only two ways:

  • we start from the CUDA 9.0 image, which is also based on Ubuntu 16.04. This one has all the CUDA libraries that MXNet needs (unlike the smaller 9.0-base image, so don’t bother trying that one).
  • we install the CUDA 9.0-enabled MXNet package.

Everything else is the same as before.
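
In Dockerfile terms, the delta boils down to this (the exact CUDA image tag is my assumption; any CUDA 9.0 runtime image based on Ubuntu 16.04 will do):

# start from a full CUDA 9.0 runtime image (also Ubuntu 16.04) instead of plain Ubuntu
FROM nvidia/cuda:9.0-runtime-ubuntu16.04

# ...same native dependencies, Keras configuration and training script as the CPU version...

# CUDA 9.0-enabled MXNet package instead of the CPU one
RUN pip3 install --no-cache-dir mxnet-cu90 keras-mxnet numpy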

Creating a Docker repository in Amazon ECR

SageMaker requires that the containers it fetches are hosted in Amazon ECR. Let’s create a repo and login to it.
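
Something along these lines (repository names are mine, and ‘aws ecr get-login’ was the way to authenticate Docker with ECR with the CLI of the day):

$ aws ecr create-repository --repository-name keras-mxnet-cpu
$ aws ecr create-repository --repository-name keras-mxnet-gpu
$ $(aws ecr get-login --no-include-email --region $REGION)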

Building and pushing our containers to ECR

OK, now it’s time to build both containers and push them to their repos. We’ll do this separately for the CPU and GPU versions. Strictly Docker stuff. Please refer to the notebook for details on variables.
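
For the CPU container, it looks something like this (Dockerfile name, repository name and the $ACCOUNT / $REGION variables are placeholders; repeat with the GPU Dockerfile and repository):

$ docker build -t keras-mxnet-cpu:latest -f Dockerfile.cpu .
$ docker tag keras-mxnet-cpu:latest $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/keras-mxnet-cpu:latest
$ docker push $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/keras-mxnet-cpu:latest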

Once we’re done, you should see your two containers in ECR.

The Docker part is over. Now let’s configure our training job in SageMaker.

Configuring the training job

This is actually quite underwhelming, which is great news: nothing really differs from training with a built-in algorithm!

First we need to upload the MNIST data set from our local machine to S3. We’ve done this many times before, nothing new here.
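
With the SageMaker SDK, this is a one-liner per channel (local directories and key prefixes below are placeholders):

import sagemaker

session = sagemaker.Session()

# upload the local MNIST files to the default SageMaker bucket, one prefix per channel
train_path      = session.upload_data(path='data/train', key_prefix='keras-mxnet-gpu/train')
validation_path = session.upload_data(path='data/validation', key_prefix='keras-mxnet-gpu/validation')
print(train_path, validation_path)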

Then, we configure the training job by:

  • selecting one of the containers we just built and setting the usual parameters for SageMaker estimators,
  • passing hyperparameters to the Keras script,
  • passing input data to the Keras script.
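
Here’s a sketch using the generic Estimator from the SageMaker SDK of the day (image URI, bucket, epochs and batch_size are placeholders; gpu_count is the hyperparameter our script will read):

import sagemaker
from sagemaker.estimator import Estimator

# generic estimator pointing at the custom container pushed to ECR
estimator = Estimator(
    image_name='<account>.dkr.ecr.<region>.amazonaws.com/keras-mxnet-gpu:latest',
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    output_path='s3://<bucket>/keras-mxnet-gpu/output',
    hyperparameters={'epochs': 10, 'gpu_count': 2, 'batch_size': 256})

# one S3 prefix per channel, matching the directories the script will read from
estimator.fit({'train': train_path, 'validation': validation_path})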

That’s it for training. The last part we’re missing is adapting our Keras script for SageMaker. Let’s get to it.

Adapting the Keras script for SageMaker

We need to take care of hyperparameters, input data, multi-GPU configuration, loading the data set and saving models.

Passing hyperparameters and input data configuration

As mentioned earlier, SageMaker copies hyperparameters to /opt/ml/input/config/hyperparameters.json. All we have to do is read this file, extract parameters and set default values if needed.
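
A minimal sketch (parameter names and defaults are just the ones this example needs):

import json

# SageMaker writes the hyperparameters passed to the estimator in this file,
# with every value serialized as a string
with open('/opt/ml/input/config/hyperparameters.json', 'r') as f:
    hyperparameters = json.load(f)

gpu_count  = int(hyperparameters.get('gpu_count', 0))
epochs     = int(hyperparameters.get('epochs', 10))
batch_size = int(hyperparameters.get('batch_size', 128))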

In a similar fashion, SageMaker copies the input data configuration to /opt/ml/input/config/inputdataconfig.json. We’ll handle things in exactly the same way.

In this example, I don’t need this configuration info, but this is how you’d read it if you did :)
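
For the record, a sketch of how you’d read it:

import json

# one entry per channel ('train', 'validation'): content type, input mode, etc.
with open('/opt/ml/input/config/inputdataconfig.json', 'r') as f:
    input_data_config = json.load(f)
print(input_data_config)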

Loading the training and validation set

When training in file mode (which is the case here), SageMaker automatically copies the data set to /opt/ml/input/data/<channel_name>: here, we defined the train and validation channels, so we’ll simply read the MNIST files from the corresponding directories.
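
Assuming the channels hold the MNIST images and labels as NumPy files (the file names below are my assumption), loading them boils down to:

import os
import numpy as np

train_dir      = '/opt/ml/input/data/train'
validation_dir = '/opt/ml/input/data/validation'

# file names are illustrative: use whatever you uploaded to each channel
x_train = np.load(os.path.join(train_dir, 'train_images.npy'))
y_train = np.load(os.path.join(train_dir, 'train_labels.npy'))
x_val   = np.load(os.path.join(validation_dir, 'validation_images.npy'))
y_val   = np.load(os.path.join(validation_dir, 'validation_labels.npy'))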

Configuring multi-GPU training

As explained in a previous post, Keras-MXNet makes it very easy to set up multi-GPU training. Depending on the gpu_count hyperparameter, we just need to wrap our model with a bespoke Keras API before compiling it:
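
A sketch of that wrapping, assuming keras.utils.multi_gpu_model() is the API in question (the SGD optimizer is picked just for illustration):

from keras.utils import multi_gpu_model

# wrap the model if more than one GPU was requested, then compile as usual
if gpu_count > 1:
    model = multi_gpu_model(model, gpus=gpu_count)

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])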

Ain’t life grand?

Saving models

The very last thing we need to do once training is complete is to save the model in /opt/ml/model: SageMaker will grab all artefacts present in this directory, build a file called model.tar.gz and copy it to the S3 bucket used by the training job.

In fact, we’re going to save the trained model in two different formats: the Keras format (i.e. an HDF5 file) and the native MXNet format (i.e. a JSON file and a .params file). This will allow us to use it with both libraries!
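
A sketch of both saves, assuming Keras-MXNet’s save_mxnet_model() helper for the native export (file names simply match the archive listed below):

import keras

# Keras format: a single HDF5 file
model.save('/opt/ml/model/mnist-cnn-%d.hd5' % epochs)

# native MXNet format: a -symbol.json file and a -0000.params file
keras.models.save_mxnet_model(model=model, prefix='/opt/ml/model/mnist-cnn-%d' % epochs)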

That’s it. As you can see, it’s all about interfacing your script with SageMaker input and output. The bulk of your Keras code doesn’t require any modification.

Running the script

Alright, let’s run the GPU version! We’ll train on 2 GPUs hosted in a p3.8xlarge instance.

Let’s check the S3 bucket.

$ aws s3 ls $BUCKET/keras-mxnet-gpu/output/keras-mxnet-mnist-cnn-2018-05-30-17-39-50-724/output/
2018-05-30 17:43:34 8916913 model.tar.gz
$ aws s3 cp $BUCKET/keras-mxnet-gpu/output/keras-mxnet-mnist-cnn-2018-05-30-17-39-50-724/output/model.tar.gz .
$ tar tvfz model.tar.gz
-rw-r--r-- 0/0 4822688 2018-05-30 17:43 mnist-cnn-10.hd5
-rw-r--r-- 0/0 4800092 2018-05-30 17:43 mnist-cnn-10-0000.params
-rw-r--r-- 0/0 4817 2018-05-30 17:43 mnist-cnn-10-symbol.json

Wunderbar, as they say on the other side of the Rhine ;) We can now use these models anywhere we like.

That’s it for today. Another (hopefully) nice example of using SageMaker to train your custom jobs on fully-managed infrastructure!

Happy to answer questions here or on Twitter. For more content, please feel free to check out my YouTube channel.

Time to burn… some clock cycles :)
