Steering Your ML Experiments with AWS SageMaker

Aditya Khokhar
Published in TechStreet
Jan 22, 2019

AWS SageMaker has been in the news lately, owing to the re:Invent 2018 launch of multiple features and services and, more generally, the overall buzz around AI and Deep Learning. The AI industry is moving fast, with new developments on every front: the hardware layer of GPUs, the core algorithm layer where new research papers keep proposing scalable and robust ways of improving AI, the frameworks layer strengthened by incremental releases of TensorFlow, Keras, etc., and the services layer, with AWS, Google Cloud Platform (GCP) and others launching new offerings every now and then.

Needless to say, all of this has also created a fair amount of chaos. For any budding AI enthusiast, what to choose is a daunting question, the answer to which is never really found in the depths of countless blog posts spread across the web.

But what exactly is the issue? Where do people hit roadblocks when playing and experimenting with their initial ML tasks?

Any experimentation with an ML project goes through well-defined stages, namely:

  • Choosing the problem to solve and rationale for applying ML techniques
  • Framing the ML problem
  • Getting hold of data and preprocessing it
  • Choosing the right architecture
  • Training the model
  • Tuning-up the hyper-parameters
  • Deploying the model

The above steps are described at a very high level; an in-depth discussion of each is beyond the scope of this article. Every one of these activities is a deep dive in itself, but data operations and model training in particular are where many people get stuck. The primary reasons are a lack of quality data in the right format and a lack of hardware computing resources (especially if you are in a place like India).

Part of the problem is that we don’t have many options to choose from. The first is to set up your own ML rig with physical GPUs, screens, custom processors and everything else. This is by far the best and most cost-effective option (many cost comparisons are available on the web), but you need at least a basic grasp of how these hardware resources work and interact with each other. For instance, installing CUDA and all its dependencies just to get your NVIDIA GPU working properly is not a skill every ML enthusiast has. Add to that the physical maintenance of such a system and the initial investment (which can be recovered with regular use, though), and the barrier can be huge.

The next option, of course, is to go for Cloud GPUs. This always sounds pretty simple until you actually get your hands dirty. After all things are taken into consideration, the popular choice usually comes down to one of AWS, GCP, Paperspace or FloydHub. Personally speaking, Paperspace is the best option since it provides a directly accessible Ubuntu virtual machine preloaded with ML tooling. The problem with being in India is the lag: even a simple scroll lags badly, so forget about coding up neural nets there. FloydHub is good but considerably more complex to use. It also offers options built around specific models like the TensorFlow Object Detection API, which might be good for experimenting, but deploying your own custom models can be very challenging. This leaves us with the biggies: AWS and GCP.

AWS is the natural winner over Google Cloud in terms of support, documentation and the information available online. GCP is good and is catching up, but AWS’s ecosystem is so mature that newbies prefer not to risk it with GCP, or Alibaba Cloud for that matter. The AWS Free Tier also attracts a lot of customers. But AWS GPU usage isn’t as simple as getting a VM with a desktop preloaded with everything you need to run ML programs. Broadly, there are different options to train your ML models, such as using AWS SageMaker’s own pre-defined AI models or bringing your own model containers. Containerizing anything related to Deep Learning or ML is a huge task, especially for new enthusiasts, because it involves a lot of head-banging with Docker, the cloud and everything related. So many people avoid containers or deploying a full-blown custom model on AWS; the easiest way to get started is with AWS SageMaker’s preloaded AI models.

Quick Walkthrough of the SageMaker Modelling Process

The image below shows all the major offerings of SageMaker in the order of tasks that we defined above for any ML project.

To start with, we have Ground Truth, SageMaker’s service for labelling data. It’s a pretty simple service, akin to LabelImg, where you can easily annotate datasets for Object Detection, Image Classification, etc. Currently, Ground Truth supports data labelling for the tasks mentioned below (using either your own workforce or an AWS partner workforce).

Ground Truth is very promising, but it can get kind of costly; it’s usually a better choice to do the labelling with open-source options unless you have a lot of data. Also, if the data is very specific to your domain, it can be a little tricky to make the AWS partner workforce understand exactly what to label. The video here shows a detailed step-by-step guide on how to use AWS Ground Truth.

Moving on to Notebook instances, AWS SageMaker offers hosted Jupyter Notebooks where you can code up your AI program and run it. This is the best option if you have simple tasks and there aren’t many underlying dependencies in your code. Notebook instances are great for experimenting and seeing how things are going, but they may not be ideal for industry-grade model offerings.

The feature that draws many people to SageMaker is its Training Jobs. Again, you have multiple options here, ranging from bringing in your own algorithm to using SageMaker’s predefined ones. We chose to go with the standard Object Detection model provided by SageMaker. It uses the classic SSD architecture, so it’s a safe bet for many decent object detection tasks.

After clicking on ‘Create a Training Job’ there are multiple settings you need to go through. The official AWS SageMaker documentation highlights the steps, but here are a couple of things that can be tricky.

Choosing the underlying GPU can be daunting. There are many options, but the best one to start with is ml.p2.xlarge. It costs about $2.4/hr in the APAC (Mumbai) region, which is not exactly cheap but still very economical compared to the other options. Details of the different ML instances available in SageMaker can be found here. Keep in mind that this instance is still not directly available in the APAC region, so you need to choose a US region and have your S3 bucket in that same region as well. Also, if you are a starter in the Free Tier phase, this instance will not be available to you right away and you will have to contact AWS Support to get it.
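For reference, the same training job can be set up in code rather than through the console. Below is a minimal sketch using the SageMaker Python SDK (version 1.x, as it existed around the time of writing); the S3 bucket name is a placeholder and the snippet assumes it runs on a SageMaker Notebook instance where get_execution_role() works.

```python
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()          # IAM role attached to the notebook instance
sess = sagemaker.Session()

# Container image of the built-in Object Detection (SSD) algorithm for this region
training_image = get_image_uri(sess.boto_region_name, 'object-detection', repo_version='latest')

od_model = sagemaker.estimator.Estimator(
    training_image,
    role,
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',                 # single-GPU instance discussed above
    train_volume_size=50,
    train_max_run=36000,
    input_mode='File',
    output_path='s3://my-sagemaker-bucket/od-output',   # placeholder bucket, same region as the job
    sagemaker_session=sess,
)
```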

The next thing is the training, validation and testing datasets that AWS SageMaker requires. If you are using Ground Truth, it will automatically generate the labels in the JSON format SageMaker needs, but if you are using your own labelling software you will need to convert the JSON you get into the format SageMaker requires.

We’ll be sharing our script to convert data labels from LabelImg to AWS SageMaker JSON format in our subsequent posts.

Training with AWS GPUs can be pretty fast compared to running training jobs on your local laptop. But the results might not always be as desired, and this is where Hyper-parameter Tuning comes into play.

Hyper-parameter tuning can be an arduous job, but it’s okay to start with the standard hyper-parameters outlined in the official AWS SageMaker Object Detection tutorial Jupyter Notebook here. Again, they might not transfer straight to your task. The other option is to use the hyper-parameters given in TensorFlow’s Object Detection Model Zoo and play around with those. The config files of the various TensorFlow Object Detection models, containing their hyper-parameters, can be found here.
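As an illustration, this is roughly how such starting hyper-parameters are attached to the estimator sketched earlier. The values below are in the spirit of the defaults the official notebook uses and are not tuned for any particular dataset; num_classes and num_training_samples must match your own data.

```python
# Hedged sketch: starting hyper-parameters for the built-in Object Detection algorithm.
od_model.set_hyperparameters(
    base_network='resnet-50',
    use_pretrained_model=1,
    num_classes=2,                 # must match your label map
    epochs=30,
    learning_rate=0.001,
    lr_scheduler_step='10,20',
    lr_scheduler_factor=0.1,
    mini_batch_size=16,
    optimizer='sgd',
    momentum=0.9,
    weight_decay=0.0005,
    image_shape=512,
    num_training_samples=1000,     # set to the actual size of your training set
)

# Kicking off the job; the official notebook wraps each S3 URI in
# sagemaker.session.s3_input with the appropriate content_type.
# od_model.fit({
#     'train': 's3://my-sagemaker-bucket/train',
#     'validation': 's3://my-sagemaker-bucket/validation',
#     'train_annotation': 's3://my-sagemaker-bucket/train_annotation',
#     'validation_annotation': 's3://my-sagemaker-bucket/validation_annotation',
# })
```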

But be warned: training and tuning an ML algorithm is by far the toughest aspect of the task and can take anywhere from days to weeks to get the right parameter values.

Training and then manually testing your object detection model requires you to deploy the trained model every time by creating an endpoint and then invoking it from the AWS CLI with a test image to visualize the results. This last part is called inference, and you don’t need the GPU instances used earlier for it. Since it is not that compute-heavy, an ml.m4.xlarge instance can do the job perfectly well (unless you really want to move the model into industry-grade production where you have strict SLAs to meet).
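To make that concrete, here is a hedged sketch (reusing the od_model estimator from the earlier snippets) of deploying to an ml.m4.xlarge endpoint and invoking it for a single test image with boto3; the file name is a placeholder, and the AWS CLI equivalent of the call is aws sagemaker-runtime invoke-endpoint.

```python
import json
import boto3

# Deploy the trained model behind a real-time endpoint on a CPU instance.
detector = od_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

runtime = boto3.client('sagemaker-runtime')
with open('test.jpg', 'rb') as f:          # placeholder test image
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName=detector.endpoint,        # endpoint name created by deploy()
    ContentType='image/jpeg',
    Body=payload,
)
detections = json.loads(response['Body'].read())['prediction']
# Each detection is [class_id, confidence, xmin, ymin, xmax, ymax] with coordinates normalized to [0, 1].
```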

Getting from a deployed endpoint to visualized results via the AWS CLI requires a couple of scripts that you need to code up. We will be sharing them in our next posts, but for now you can refer to the official Object Detection tutorial Jupyter Notebook for the visualization code snippet and run it on your local machine during inference.

Once the training is done you will want to visualize the results. In case you are not using SageMaker’s Notebook instances, a script along the lines of the one below can do the visualization work for you.
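Here is a minimal sketch, assuming the normalized [class_id, score, xmin, ymin, xmax, ymax] detections returned by the endpoint above and a hypothetical two-class label map.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

CLASS_NAMES = ['helmet', 'no_helmet']   # hypothetical, must match the training labels

def visualize(image_path, detections, threshold=0.5):
    """Draw the endpoint's detections on top of the test image."""
    img = Image.open(image_path)
    width, height = img.size
    fig, ax = plt.subplots(1)
    ax.imshow(img)
    for class_id, score, x0, y0, x1, y1 in detections:
        if score < threshold:
            continue
        # Coordinates come back normalized to [0, 1]; scale them to pixels.
        left, top = x0 * width, y0 * height
        box_w, box_h = (x1 - x0) * width, (y1 - y0) * height
        ax.add_patch(patches.Rectangle((left, top), box_w, box_h,
                                       fill=False, edgecolor='red', linewidth=2))
        ax.text(left, top, '{}: {:.2f}'.format(CLASS_NAMES[int(class_id)], score),
                color='white', backgroundcolor='red', fontsize=8)
    plt.show()

# Example (placeholder image, detections from the inference snippet above):
# visualize('test.jpg', detections)
```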

Also don’t forget to delete/end your endpoint otherwise it can shoot your AWS bill.

Do I need to know anything else about AWS SageMaker?

Yes, absolutely, a lot in fact !!!

This article only scratches the surface of the AWS SageMaker ecosystem. Specifically, we picked only the Object Detection task to give a flavour of what an end-to-end process in SageMaker looks like. Different tasks like NLP will have their own unique set of problems. We also only talked about using SageMaker’s built-in algorithms and models; most of the time, in practical situations, you will want your own model to be trained, deployed and shipped out as a container, which is what our subsequent posts will cover. For many of these options AWS provides support for popular frameworks like PyTorch, TensorFlow, etc., but you need to check how much flexibility and freedom you get to access those layers.

In addition, there is another way to train and deploy your AI models without using SageMaker as such: you can directly use AWS GPU instances as IaaS and build your own models from scratch, without SageMaker’s built-in support for popular frameworks like TensorFlow, MXNet, etc. (although this is a pretty tough route to take).

Conclusion

The primary goal of this article was to give a sneak peek into AWS SageMaker, which provides really easy-to-use services for moving into AI and ML experimentation. There are still doubts over how robust SageMaker is compared to your own ML rig with physical GPUs, but the high initial cost and the investment of time and effort can significantly discourage many AI enthusiasts from getting started. As an example, just configuring the CUDA libraries for an NVIDIA GPU on macOS can be a nightmare and take days to set up and run as desired.

Wrapping this up, there are a few things people need to understand. Knowing core ML concepts and mathematics is one thing, but bringing your model into production requires a good understanding of cloud concepts these days. It’s one thing to run the TensorFlow Object Detection API on your local machine and another to ship software for which you can potentially get paid. The other thing is the privacy and data-leakage concerns that inherently come up with any AI model. As a startup or an individual AI consultant you need to understand how those can be handled in a cloud ecosystem, since many companies are now adopting AWS as their primary go-to infrastructure provider.

In our subsequent posts we will focus more on the custom scripts required for Visual Intelligence models running on SageMaker. We will also cover how to train and run your own TensorFlow, PyTorch and Keras models on SageMaker GPU instances using containers.

Hope this helped in some way.

Cheers !!!

Team TechStreet
