Custom Keras-Based Machine Learning Model Training with Amazon SageMaker

Pranay Chandekar
Jun 17, 2019



Hi! I am Pranay Chandekar. I currently work as a Machine Learning Engineer in Hyderabad, India, and today we will discuss how to train a Keras-based ML model on a GPU using Amazon SageMaker. In this article, I will share my motivation, followed by an overview of the architecture and the steps to follow. Though I have linked a detailed tutorial at the end of this article, I recommend reading this article first and only then moving on to the tutorial.

Let’s begin!

Motivation

For the last few months, I have been working on redefining the ML architecture for different ML projects in our organization. The objective was to streamline the development process, increase efficiency, ensure the availability of high-end resources, and reduce the associated cost. On my mission to fulfil this objective, I stumbled upon Amazon SageMaker, a dedicated ML cloud service provided by AWS. We realized that with a combination of AWS services like SageMaker, Lambda, etc., we could not only achieve our objective but also automate the ML pipeline.

Though official documentation is available, using SageMaker is slightly complicated: it takes a significant amount of time to comprehend the details and successfully run a model. And so, in the public interest, I want to share the knowledge I gained.

Training Architecture Overview

Amazon SageMaker Training Architecture

As the diagram above shows, along with SageMaker we will be using Amazon S3 and Amazon ECR. Let us look at the individual components.

  • SageMaker Notebook Instance: A server with a pre-installed Jupyter Notebook and Python environment. We will use this instance to build the Docker image of our algorithm, test it locally, push it to ECR, and launch a training job on a SageMaker Training Instance.
  • ECR: The Elastic Container Registry, an AWS service to which we can upload Docker images (see the sketch after this list).
  • S3: The Simple Storage Service, an AWS storage service which we will use to store our training data and the trained model.
  • SageMaker Training Instance: The server on which our model will be trained. It can be either a CPU or a GPU instance, as per our requirements.
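
As a minimal sketch of the ECR piece: before we can push our algorithm image, a repository has to exist for it. The repository name below is a hypothetical placeholder for your own algorithm's name.

import boto3

# Hypothetical repository name -- use your own algorithm's name.
ecr = boto3.client("ecr")
response = ecr.create_repository(repositoryName="keras-algorithm")

# The URI we tag the local Docker image with before pushing it.
print(response["repository"]["repositoryUri"])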

Steps to be followed

  1. Create an S3 bucket with two folders: one for the data and one for the trained model. Upload the data to the data folder. The training job will pick the data up from the data folder and store the model in the output folder after training (see the first sketch after this list).
  2. Create a SageMaker Notebook instance as per your requirements. Clone your project or start a new one. (For more details on project structure and conventions, follow the tutorial link at the end of this article.)
  3. Create a Docker image of your algorithm and test the algorithm locally on the notebook with a small number of epochs.
  4. After a successful local test run, push the algorithm image to ECR.
  5. Define the training job using the SageMaker Estimator API. This is where we define the instance type (CPU or GPU), the number of instances, the hyperparameters, and the S3 output path.
  6. Run the training job by passing the S3 data path (see the second sketch after this list). The training job will launch a SageMaker Training instance, download the data, and start the training. Once the training is done, it will upload the trained model to the S3 bucket and terminate the training instance.
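
Here is a minimal sketch of step 1 using boto3. The bucket name and the training file are hypothetical; also note that outside us-east-1, create_bucket additionally needs a CreateBucketConfiguration with a LocationConstraint.

import boto3

# Hypothetical names -- replace with your own bucket and data file.
bucket = "my-keras-sagemaker-bucket"

s3 = boto3.client("s3")
s3.create_bucket(Bucket=bucket)  # in us-east-1; other regions need CreateBucketConfiguration

# Upload the training data; the training job will read from s3://<bucket>/data/
s3.upload_file("train.csv", bucket, "data/train.csv")

# The trained model will later be written under s3://<bucket>/output/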
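And a sketch of steps 5 and 6, assuming the SageMaker Python SDK v2 running inside the notebook instance. The image URI, bucket name, instance type, and hyperparameters are placeholders to substitute with your own values.

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside a SageMaker Notebook instance

# Hypothetical values -- substitute your own account, region, image, and bucket.
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/keras-algorithm:latest"
bucket = "my-keras-sagemaker-bucket"

estimator = Estimator(
    image_uri=image_uri,                      # the algorithm image pushed to ECR
    role=role,
    instance_count=1,                         # number of training instances
    instance_type="ml.p2.xlarge",             # a GPU instance; pick an ml.m5 type for CPU
    output_path=f"s3://{bucket}/output",      # where the trained model artifact lands
    hyperparameters={"epochs": 20, "batch_size": 64},
    sagemaker_session=session,
)

# Launch the training job: SageMaker spins up the training instance, downloads
# the data from S3, runs the container, uploads the model, and terminates.
estimator.fit({"training": f"s3://{bucket}/data"})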

Congratulations! You are now familiar with Amazon SageMaker. With this knowledge, you will be able to understand the various examples provided by Amazon and others on the internet.

If you are a developer interested in implementing a Keras model in Amazon SageMaker, I recommend following my detailed tutorial on the same. Please find the tutorial here.

Interested in such topics or need some help with them?

Get in touch — https://linktr.ee/pranaychandekar

Happy Learning!
