Machine Learning on AWS: SageMaker vs. DIY

Ed Eastwood · Published in Version 1 · Aug 16, 2022

I’ve been interested in machine learning ever since taking Andrew Ng’s machine learning course on Coursera back in 2015. Sadly, that course is no longer open to new registrations, but the Deep Learning Specialisation is also well worth a look if you’re interested.

I’m a nuts-and-bolts type of person who likes to understand how everything works. The Coursera course was great for this, even if I was out of my depth with some of the maths. When I started dabbling with Kaggle competitions to practise what I’d learnt, this mindset took me down a DIY rabbit hole of Python, scikit-learn, TensorFlow and Spark (with a bit of Scala thrown in just to keep it interesting).

I spent weeks learning the details and perfecting my model for Kaggle’s Titanic competition. A friend from work and I discussed the project. He went home that evening, plugged the training data into AWS’s SageMaker, and came out with a score I’ve not been able to beat since. I secretly think he might have got lucky or put in more work than he let on, but even so, it shows the power of the tools available.

If you’re not familiar with SageMaker, AWS describe it as:

“a fully managed service to build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.”

Amongst other things, it provides a hosted notebook that you can use to build models and make predictions. It also provides Autopilot, which takes care of pretty much all of the hard work: upload your data, configure the job, and it’ll do the feature engineering, algorithm selection, and hyperparameter tuning for you.

Over recent years, I’ve spent much more time using AWS than I have on machine learning, so seeing how well I can do on a Kaggle challenge with SageMaker seems well overdue. Perhaps now is the time — five years after my first Kaggle submission — to catch up with Will from my old job…

What is the Titanic Challenge?

Kaggle is a great place to start with machine learning. There are loads of practical challenges complete with datasets and a community discussing their approach to solving each problem. The Titanic challenge is their introductory competition, something like a “Hello World” for data science.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

The challenge’s training data is small: just 61 KB across roughly 900 records, so building a model shouldn’t take much computing power. The data lends itself to several algorithms, including logistic regression and random forest classification, both of which are, by machine learning standards, relatively intuitive.

The DIY Approach

I started by installing Jupyter notebook on my laptop, since I already had Python up and running. Then I installed the libraries that I’d need: numpy, pandas and scikit-learn. Next, I jumped into the notebook and built my solution:

  1. Load the data into dataframes.
  2. Split the data into a training and test dataset.
  3. Prepare the data, e.g., filling blanks, deriving additional features.
  4. Build a random forest model.
  5. Use the model to predict outcomes for the ‘secret’ records that Kaggle holds back for scoring.

The full code is on GitHub if you’d like to take a closer look. It gave me a reasonable score of 0.77990 (78% accuracy).
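To give a feel for steps 1 to 4, here’s a minimal sketch of the approach. This isn’t the exact code from my repo; it’s a simplified illustration, with column names taken from the standard Kaggle Titanic dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Load the Kaggle training data into a dataframe
df = pd.read_csv("train.csv")

# 3. Minimal preparation: fill blanks and one-hot encode categoricals
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna("S")  # most common port of embarkation
features = pd.get_dummies(
    df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
)

# 2. Hold back some records to test the model against
X_train, X_test, y_train, y_test = train_test_split(
    features, df["Survived"], test_size=0.2, random_state=42
)

# 4. Build a random forest model and check its accuracy
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```

Most of the real effort (and most of the score) comes from the feature engineering hidden behind step 3; the model itself is only a few lines.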

The AWS Options

Costs

If you’re following along with this blog, be aware that AWS resources are chargeable. The free tier is helpful for experimentation, but if you go beyond this, bills from large instances quickly mount up.

The free tier covers up to 250 hours of medium-sized Studio or notebook instances per month. There are up to 744 hours in a month, so don’t forget to turn instances off when you’re not using them! It also covers up to 50 hours of extra-large instances for training, but with Autopilot you’ll most likely exceed this and use bigger instances, so keep an eye on costs. I should have known better, but I learnt this the hard way last month.

Check the SageMaker pricing and consider setting billing alerts to give you an early warning of mounting costs. Don’t forget to turn off your instances and clear up resources when you’ve finished.
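If you want to automate that early warning, here’s a minimal sketch using boto3 and AWS Budgets. The budget name, limit, threshold, and email address are placeholders to adjust:

```python
import boto3

# Look up the current account ID rather than hard-coding it
account_id = boto3.client("sts").get_caller_identity()["Account"]

budgets = boto3.client("budgets")
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "sagemaker-experiments",        # placeholder name
        "BudgetLimit": {"Amount": "25", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                # placeholder address
                {"SubscriptionType": "EMAIL", "Address": "you@example.com"}
            ],
        }
    ],
)
```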

DIY Notebook

It is possible to take the code from my DIY notebook, paste it into a SageMaker-hosted notebook, change a couple of lines of code to load the data from an S3 bucket rather than a local directory, and get the same results. In fact, it’d be easier than going DIY because there’s no need to install dependencies. As well as being easier, you’d have access to powerful cloud infrastructure to train your model on. It’s not a problem running a 1,000-record random forest on your laptop, but you might be waiting longer than a coffee break training a neural network on a huge dataset.
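The change really is that small. A sketch of the kind of edit involved, assuming the s3fs package is available (pandas needs it to read s3:// paths); the bucket and key are placeholders:

```python
import pandas as pd

# DIY version: read from a local directory
# df = pd.read_csv("train.csv")

# SageMaker version: read the same file straight from S3
df = pd.read_csv("s3://my-ml-bucket/titanic/train.csv")
```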

Autopilot

For this article, I wanted to take full advantage of SageMaker to demonstrate how simple the process can be, which is why I decided to use SageMaker Autopilot. The steps I describe next are for comparing the different approaches; if you’re looking for a tutorial or detailed explanation, there are better resources elsewhere.

High level steps in the notebook are:

  • Import the required libraries.
  • Review the training data.
  • Configure the AutoML job.
  • Run the AutoML job.
  • Make predictions with the best-performing candidate model.

Running the AutoML job is where the hard work happens. Autopilot analyses training data, runs feature engineering jobs, and tunes the model automatically.
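To give a feel for how little code that involves, here’s a minimal sketch using the high-level AutoML class from the SageMaker Python SDK, rather than the exact boilerplate from the example notebook; the S3 path is a placeholder:

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()

# Configure the AutoML job: point it at the target column and cap
# the number of candidate models it will try
automl = AutoML(
    role=sagemaker.get_execution_role(),
    target_attribute_name="Survived",
    max_candidates=50,
    sagemaker_session=session,
)

# Run the job: Autopilot analyses the data, engineers features,
# selects algorithms, and tunes hyperparameters automatically
automl.fit(inputs="s3://my-ml-bucket/titanic/train.csv", wait=True, logs=False)
```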

This job took just over 75 minutes: 4 minutes analysing the data, 9 minutes on feature engineering, 48 minutes tuning the model, and 8 minutes each for the explainability and insights reports. That’s longer than I expected and something I’ll look into in future; I expect the job could be optimised to reduce both the time taken and the costs incurred.

Here’s a link to the notebook in my GitHub for reference, but if you’re interested in doing this yourself you’ll be far better off starting with the Autopilot California housing example notebook. That’s all I did, so better to start with the original.

There’s still a reasonable amount of code, but look closer and you’ll see that the vast majority is boilerplate. Change the data sources from the example and you’re pretty much ready to go. I’ve not done any analysis or feature engineering; SageMaker did it all. Once you understand how the tool works, you could drop in a new dataset and tune a completely different model with just a few minutes of work and a short wait for the automated tuning. Autopilot also suggests ways in which the data could be improved; these are available through SageMaker Studio in notebooks created by each job. The video linked earlier shows how to use these.

Outcome

I didn’t deploy an endpoint, since this was a one-off job. Instead, I ran a batch transform to push the competition’s test data through the same pipeline (feature engineering and so on) as the training data and took predictions from there. I downloaded the predictions from S3, manually added headers and a passenger ID column to match the format Kaggle expects, and submitted the data. The result was a slightly disappointing score of 0.77511 (77.5% accuracy): disappointing because it’s about half a percentage point worse than my DIY effort, but still impressive given that I didn’t have to do any feature engineering, algorithm selection, or hyperparameter tuning. With a bit more time, effort, and an improved understanding of the toolset, I’m confident I could improve by another couple of percentage points.
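For reference, here’s a rough sketch of that batch-transform step, continuing from the automl object in the earlier snippet. The instance type, model name, and S3 paths are illustrative, not the exact code from my notebook:

```python
# Pick the best candidate from the AutoML job and wrap it as a model
best = automl.best_candidate()
model = automl.create_model(name="titanic-best-candidate", candidate=best)

# Batch transform: run the test data through the same inference
# pipeline (feature engineering included) that training used
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/titanic/predictions/",
)
transformer.transform(
    data="s3://my-ml-bucket/titanic/test.csv",  # assumes no header row
    content_type="text/csv",
    split_type="Line",
    wait=True,
)
```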

Conclusion

I saw similar end results from Autopilot as I did from my DIY analysis. Compute for my DIY analysis was free, whereas Autopilot cost me around $20. The big benefit is that whilst Autopilot took me an evening of hacking example code I didn’t fully understand, the DIY approach took many evenings and weekends of study, loads of data analysis and prep, and hours of experimentation. Instead of needing to know data science, you can get good results by knowing how to use the tool, plus a bit of Python.

For the DIY approach, a proper practitioner could have done far better than I did in much less time. But then they could have done even better, in even less time, with a tool like SageMaker.

If I were doing this for work, Autopilot would have been the right choice; the time saving was easily worth $20. For a hobby, I’ll stick with my notebooks. I learnt far more that way and enjoyed the journey.

What Else Can SageMaker Do?

I’ve only used part of SageMaker so far. It’s a much more powerful tool than a hosted notebook with some clever libraries.

  • Studio gives a full environment for your machine learning work.
  • Ground Truth allows you to outsource the manual labelling of data, making it easier to collect training data: think Mechanical Turk for spreadsheets.
  • Models can be deployed to hosted endpoints, allowing predictions through API calls (see the sketch after this list). These can be configured to be as highly available and performant as you’d expect from a cloud service and become part of your application.
  • Workflows are available for a software engineering pipeline, with version control, approvals and automated deployment.
  • Access to hardware optimised for model training through AWS Neuron, and tools such as SageMaker Neo, which optimises models for inference on different platforms.
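As promised above, here’s a minimal sketch of what calling a hosted endpoint from application code looks like. The endpoint name and the feature row are placeholders:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Send one CSV row of passenger features (no target column) to the endpoint
response = runtime.invoke_endpoint(
    EndpointName="titanic-survival",
    ContentType="text/csv",
    Body="3,male,22.0,1,0,7.25,S",
)
print(response["Body"].read().decode("utf-8"))
```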

More to Learn

If you want to find out more, there are plenty of helpful resources online. My favourites so far have been:

  • Kaggle: practical data science challenges, interesting data sets to experiment on, and a knowledgeable and supportive community to discuss solutions with.
  • AWS Machine Learning Specialty: a professional certification that focuses on deploying, operating, and streaming data into AWS machine learning resources. Particularly useful when productionising your machine learning workload in the cloud.
  • DeepLearning.AI Deep Learning Specialisation: an in-depth study of deep learning, including the theory behind it and best practice in using it. Much deeper and more theoretical than the AWS Specialty; more academic than practitioner.

About the Author:
Ed Eastwood is an AWS Architect here at Version 1.
