My one month experience using AWS SageMaker

Filipe Pacheco
5 min read · Oct 15, 2023


… from the previous episode

As a follow-up to my previous Medium post on the ever-evolving nature of data science and the need for continuous upskilling, I want to share some insights on the second stage of my upskilling plan. In this post, I will discuss my hands-on work with ML in AWS and its relevance to modern-day data science.

  • LLM — Large Language Models
  • Upskill in ML on AWS
  • Become a MultiCloud Practitioner

Summary

As I continued my journey to upskill as a data scientist, I knew that I needed to find a course that could help me fast-track my learning and progress. After some research, I stumbled upon Udemy’s “Become an AWS ML Engineer” course, which piqued my interest. I had been wanting to delve deeper into cloud environments like AWS, and this course seemed like the perfect opportunity to do so.

For the second stage of my upskilling plan, I focused on ML in AWS, with a particular emphasis on the SageMaker service. This experience gave me my first practical, hands-on exposure to AWS. Throughout the month of September, I worked through the course material and used a variety of AWS services, including IAM, S3, EC2, Lambda, and SageMaker. While I won’t go into great detail on the first four (I’m covering them in other posts more related to the MultiCloud Practitioner stage), I’ll be sharing my insights on using SageMaker here.

AWS SageMaker

Amazon SageMaker is a comprehensive machine learning (ML) service that’s purpose-built for data scientists and developers. Its capabilities include everything from data preparation and model building to efficient and reliable model training and deployment. While SageMaker is just one of many ML applications offered by AWS, it’s specifically tailored to enhance the productivity of those working with data, with a particular focus on ML development.

It’s worth noting that since I work with Databricks on a daily basis, my experiences and insights on SageMaker may be biased due to my prior knowledge and familiarity with Databricks. Nevertheless, I will share my honest reflections and observations on my experience with SageMaker.

Ground Truth

The first service I utilized within the SageMaker domain was Ground Truth. This service is specifically designed for users who need to label unlabeled data. For instance, if you’re creating a new type of ML model but can’t find any labeled data online, Ground Truth can be immensely helpful. You can opt for automated data labeling or even pay a third-party workforce to do the labeling for you.

If your project requires labeling data from scratch, Ground Truth should definitely be on your radar. I’ve never encountered another service that offers such a structured approach to scaling the labeling process and keeping it reliable. The closest thing in Databricks, I would say, is the Marketplace, but it doesn’t work the same way Ground Truth does.
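To give a sense of what this looks like in practice, here is a minimal sketch of starting a labeling job with boto3. Every name, S3 path, and ARN below is a placeholder I made up for illustration; in particular, the pre-labeling and annotation-consolidation Lambdas are AWS-managed functions whose ARNs depend on your region and task type.

    import boto3

    sm = boto3.client("sagemaker")

    # Start a Ground Truth labeling job (all names, paths, and ARNs are placeholders).
    sm.create_labeling_job(
        LabelingJobName="document-classification-labels",
        LabelAttributeName="label",
        RoleArn="arn:aws:iam::<account-id>:role/<ground-truth-execution-role>",
        InputConfig={
            "DataSource": {
                "S3DataSource": {"ManifestS3Uri": "s3://<bucket>/input.manifest"}
            }
        },
        OutputConfig={"S3OutputPath": "s3://<bucket>/labels/"},
        HumanTaskConfig={
            "WorkteamArn": "arn:aws:sagemaker:<region>:<account-id>:workteam/private-crowd/<team>",
            "UiConfig": {"UiTemplateS3Uri": "s3://<bucket>/template.liquid"},
            "PreHumanTaskLambdaArn": "<aws-managed-pre-labeling-lambda-arn>",
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": "<aws-managed-consolidation-lambda-arn>"
            },
            "TaskTitle": "Classify documents",
            "TaskDescription": "Pick the category that best matches each document.",
            "NumberOfHumanWorkersPerDataObject": 1,
            "TaskTimeLimitInSeconds": 300,
        },
    )

The automated-labeling option I mentioned is configured through an additional LabelingJobAlgorithmsConfig argument, while the human workforce is controlled by the WorkteamArn.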

Canvas

AWS recently introduced a new service called Canvas, which takes the No-Code approach to a whole new level. With Canvas, you can import your dataset from various sources (both inside and outside AWS), select the type of training, and choose from at least 10 different types of ML models, all without having to write a single line of code. Additionally, you can choose between a quick training run or a more thorough one.

Employee-turnover Classification Model.

One of the things that really struck me about SageMaker Canvas is how visually appealing it is. As a No-Code solution, it needs to have a clean, user-friendly interface that guides the user effortlessly through the process of training the model.

Another intriguing feature of Canvas is the ability to train the same model with different hyperparameters or even different datasets, and to save and register each version of the model for future use. This functionality streamlines the MLOps process, making it easier and more accessible to users. Looking toward the future, I can definitely see how technologies like Canvas could play a role in revolutionizing the entire MLOps pipeline.
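Canvas drives all of this from the UI, but the same versioning idea is exposed programmatically through the SageMaker Model Registry, where each trained artifact becomes a new version inside a model package group. Here is a rough boto3 sketch, with the group name, container image, and S3 path being placeholders of my own:

    import boto3

    sm = boto3.client("sagemaker")

    # A model package group collects every version of the same model (name is made up).
    sm.create_model_package_group(
        ModelPackageGroupName="employee-turnover",
        ModelPackageGroupDescription="Versions of the employee-turnover classifier",
    )

    # Register one trained artifact as a new, reviewable version inside that group.
    sm.create_model_package(
        ModelPackageGroupName="employee-turnover",
        ModelApprovalStatus="PendingManualApproval",
        InferenceSpecification={
            "Containers": [
                {
                    "Image": "<inference-container-image-uri>",           # placeholder
                    "ModelDataUrl": "s3://<bucket>/models/model.tar.gz",  # placeholder
                }
            ],
            "SupportedContentTypes": ["text/csv"],
            "SupportedResponseMIMETypes": ["text/csv"],
        },
    )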

Studio

According to AWS, SageMaker Studio is the first fully-integrated development environment (IDE) for ML. While I wouldn’t necessarily label Databricks as an IDE, it does deliver a robust platform for all aspects of data and analytics development. Nevertheless, I found SageMaker Studio to be a compelling platform, especially given its wide range of features.

AWS SageMaker Studio.

If I were not already using Databricks, SageMaker Studio would definitely be my preferred platform for creating ML models. The Studio provides a seamless Jupyter Notebook interface, allowing you to develop your own Python code in the cloud and choose the EC2 instance type that meets your needs.

Beyond this capability, there are four additional features that stood out to me:

  • Integration with Git platforms: SageMaker Studio allows you to choose between CodeCommit, GitHub, and GitLab for code versioning.
  • Data treatment: The platform features No/Low-Code data treatment and even includes a feature store application for model training.
  • AutoML: The built-in AutoML support proved to be a simple and effective way to hand off hyperparameter tuning, freeing up my time to focus on the feature store.
  • MLOps: Lastly, it’s worth mentioning that SageMaker Studio also boasts robust MLOps capabilities, including model versioning and quick, easy endpoint creation for hosting ML applications (a minimal sketch of this workflow follows this list).
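To make that workflow concrete, here is a minimal sketch using the SageMaker Python SDK from a Studio notebook. The training script name, S3 paths, and role ARN are placeholders I made up for illustration; the general pattern is an estimator, a fit() call against data staged in S3, and a deploy() call that creates the endpoint.

    import sagemaker
    from sagemaker.sklearn.estimator import SKLearn

    session = sagemaker.Session()
    role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

    # Define the training job: a scikit-learn script run on a managed instance.
    estimator = SKLearn(
        entry_point="train.py",      # your training script (placeholder name)
        framework_version="1.2-1",
        instance_type="ml.m5.large",
        instance_count=1,
        role=role,
        sagemaker_session=session,
    )

    # Launch the managed training job against data already staged in S3.
    estimator.fit({"train": "s3://<bucket>/train/"})

    # Create a real-time endpoint that hosts the trained model.
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
    )

That final deploy() call is what backs the “quick and easy endpoint creation” mentioned above: one method turns a trained model into a hosted, callable endpoint.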

Comparing Databricks and SageMaker

Both Databricks and SageMaker are powerful platforms for developing machine learning models. Databricks offers a comprehensive, end-to-end platform with seamless integration of MLflow for MLOps capabilities, collaborative dashboards and notebooks, and the ability to build custom libraries for large-scale machine learning projects. However, Databricks can be costly for smaller scale projects.

On the other hand, SageMaker provides a user-friendly platform with no-code capabilities and AutoML. It offers tools such as Jupyter notebooks, full integration with Git platforms, excellent support for data treatment, and straightforward MLOps. However, being a closed-source platform, it can limit customization for advanced data scientists, and its simplified workflows can sometimes mean less control over the machine learning development process. Its costs can also add up quickly if you use many of its features.

When choosing which platform to use, it’s crucial to consider your specific use case, such as the complexity of the ML model, team size, required computing resources, and available budget.

Conclusion

In short, AWS SageMaker and Databricks both provide powerful machine learning tools for data scientists and developers. SageMaker is a user-friendly platform with strong no-code capabilities across its Ground Truth, Canvas, and Studio services, while Databricks offers a comprehensive end-to-end platform with seamless MLflow integration.

While I haven’t yet compared the AutoML capabilities of both AWS SageMaker and Databricks, it’s something that I plan to explore in my future work.

If you’re interested in exploring the code I developed during this journey of upskilling, you can find it in my personal GitHub account at the following link: Repo. Since I used Jupyter notebooks, you’ll be able to see the output of each command, which I hope you’ll find helpful and informative.


Filipe Pacheco

Senior Data Scientist | AI, ML & LLM Developer | MLOps | Databricks & AWS Practitioner