Building reliable machine learning pipelines with AWS Sagemaker and Comet.ml

Gideon Mendels
7 min read · Jul 31, 2019


This tutorial is Part II of a series. See Part I here.

Successfully executing machine learning at scale involves building reliable feedback loops around your models. As your pipeline grows, you will reach a point where your data can no longer fit in memory on a single machine, and your training processes will have to run in a distributed way. Regular retraining and hyperparameter optimization will become necessary as new data becomes available and your underlying feature distributions change.

There is also the added complexity of trying new modeling approaches on your data and communicating the results across teams and other stakeholders.

In this blog post, we will illustrate how to use AWS Sagemaker and Comet.ml to simplify this process of monitoring and improving your training pipeline.

Data complexity + model needs are growing

The challenges involved in creating functional feedback loops extend beyond data concerns like tracking changing data distributions to issues at the model level. Robust models require frequent retraining so that hyperparameters are optimal in the face of new, incoming data.

With each iteration, it becomes harder to manage subsets and variations of your data and models. Keeping track of which model iteration ran on which dataset is key to reproducibility.

Take that complexity + multiply it by team complexity

Managing these complex pipelines can become even more difficult in team settings. Data scientists will often store results from several models in log files, making it impossible to reproduce models or communicate results effectively.

Traditional software development tools are not optimized for the iterative nature of machine learning or the scale of data machine learning requires. The lack of tools and processes to make collaboration easy across members of the same data science teams (and across functions like engineering) has led to dramatically slower iteration cycles. Organizations also constantly suffer the pain of slow on-boarding times for new employees and bear the risk of employees churning along with their work and proprietary knowledge.

Tutorial

This tutorial covers how to integrate Comet.ml with AWS Sagemaker’s Tensorflow Estimator API. We will be adapting a script that trains the Resnet model on the CIFAR10 dataset with Tensorflow.

Instead of logging experiment metrics to Tensorboard, we’re going to log them to Comet.ml.

This allows us to keep track of various hyperparameter configurations, metrics, and code across different training runs.

  • AWS SageMaker provides convenient and reliable infrastructure to train and deploy machine learning models.
  • Comet.ml automatically tracks and monitors Machine Learning experiments and models.

You can check out Comet.ml here and learn more about AWS Sagemaker here.
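Before wiring anything into Sagemaker, it helps to see the basic Comet.ml logging pattern on its own. Below is a minimal sketch; the API key, workspace name, and logged values are placeholders:

from comet_ml import Experiment

# Creating an Experiment opens a live run that streams to comet.ml
experiment = Experiment(api_key="YOUR_API_KEY",
                        project_name="comet-sagemaker",
                        workspace="YOUR_WORKSPACE")

# Log hyperparameters once, then metrics as training proceeds
experiment.log_parameters({"learning_rate": 0.1, "batch_size": 128})
experiment.log_metric("accuracy", 0.87, step=1000)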

Environment Setup

When using AWS Sagemaker, your account comes with multiple pre-installed virtual environments that contain Jupyter kernels and popular Python packages such as scikit-learn, Pandas, NumPy, TensorFlow, and MXNet.

  1. Create a Sagemaker account. You’ll be guided through all the steps for getting your credentials to submit a job to the training cluster here.
  2. Set up your Comet.ml account here. Once you log in, we’ll take you to the default project, where the Quickstart Guide provides your project API key.
  3. Create a Sagemaker notebook instance.
  4. From the notebook instance, open a new terminal (using the New dropdown) and activate the tensorflow_p36 virtual environment:

$ source activate tensorflow_p36

  5. Still in the terminal, clone the Sagemaker example from https://github.com/comet-ml/comet-sagemaker into your Sagemaker instance and install its dependencies:

$ git clone https://github.com/comet-ml/comet-sagemaker.git && cd comet-sagemaker && pip install -r requirements.txt

Integrating Sagemaker Estimator API with Comet.ml

The Sagemaker Estimator API uses a Tensorflow (TF) EstimatorSpec to create a model, and train it on data stored in AWS S3. The EstimatorSpec is a collection of operations that define the model training process (the model architecture, how the data is preprocessed, what metrics are tracked, etc).

To get Comet to work with the TF EstimatorSpec, we have to create a training hook that wraps our Comet Experiment object in a TF SessionRunHook object. We will adapt code from the tf.train.LoggingTensorHook to log metrics to Comet, as sketched below. You can find a reference implementation here.
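The gist of the hook looks roughly like the following. This is a minimal sketch: the class name matches the reference implementation, but the constructor arguments, tensor dictionary, and logging interval here are illustrative simplifications.

import tensorflow as tf
from comet_ml import Experiment

class CometSessionHook(tf.train.SessionRunHook):
    """Logs the values of selected graph tensors to a Comet.ml experiment."""

    def __init__(self, tensors, api_key, project_name, workspace, every_n_steps=100):
        # `tensors` maps metric names to graph tensors, e.g. {"loss": loss_op}
        self.tensors = tensors
        self.every_n_steps = every_n_steps
        self.experiment = Experiment(api_key=api_key,
                                     project_name=project_name,
                                     workspace=workspace)
        self._step = 0

    def before_run(self, run_context):
        # Ask the session to also evaluate our metric tensors on this step
        return tf.train.SessionRunArgs(fetches=self.tensors)

    def after_run(self, run_context, run_values):
        # run_values.results holds the evaluated tensors requested in before_run
        self._step += 1
        if self._step % self.every_n_steps == 0:
            for name, value in run_values.results.items():
                self.experiment.log_metric(name, value, step=self._step)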

Initializing your Comet.ml Project

  1. Add your Comet.ml API key, project name, and workspace name to the Experiment object in CometSessionHook. You can find the Experiment object here.
  2. We initialize the CometSessionHook object in the model_fn function within the resnet_cifar_10.py file. The model_fn function defines the tensors and parameters we would like to track (see the sketch after this list).
  3. The hyperparameters for this experiment are set in main.py, but feel free to adjust the default values. They include the learning rate, batch size, momentum, num_data_batches, and weight decay.
  4. Run python main.py (or ! python main.py if running from within a notebook).
  5. You will receive a message in the Sagemaker notebook output that your experiment is live in Comet.ml, with a direct URL to the experiment.
  6. You can change the parameters in the main.py file and rerun the script.
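To make the wiring concrete, here is a rough sketch of how the hook plugs into model_fn. This is a simplified illustration rather than the actual comet-sagemaker code: build_resnet is a hypothetical helper, and the real script reads its credentials and hyperparameters from main.py instead of hard-coding them.

def model_fn(features, labels, mode, params):
    # Build the network and loss (build_resnet is a hypothetical helper)
    logits = build_resnet(features, params)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.MomentumOptimizer(params["learning_rate"], params["momentum"])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    # Wrap the Comet experiment in the session hook defined above
    comet_hook = CometSessionHook(
        tensors={"loss": loss},
        api_key="YOUR_API_KEY",
        project_name="comet-sagemaker",
        workspace="YOUR_WORKSPACE")

    # Passing the hook via training_hooks runs it alongside every training step
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op,
                                      training_hooks=[comet_hook])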

Monitoring experiments in Comet.ml

  1. Once you run the script, you’ll be able to see your different model runs in Comet.ml through the direct URL. As an example for this tutorial, we have created a Comet project that you can view at https://www.comet.ml/dn6/comet-sagemaker.
  2. Let’s see how we can use Comet to get a better understanding of our model.
  • First, let’s sort our models by accuracy by clicking on the accuracy column in the experiments table. These columns are populated from the hyperparameters provided by the user, or automatically inferred from command-line arguments in the code. They can also be customized to track the relevant parameters and metrics for your model using the Customize Columns option.
  • Now let’s add a few visualizations to our project so that we can see how our hyperparameters are affecting our experiments.
  • Then let’s use a line chart to visualize our learning curves, a bar chart to track our final accuracy scores across experiments, and a parallel coordinates chart to visualize which parts of our hyperparameter space are producing the best results.
  • We can filter our experiments using Comet’s query builder, so that we only analyze the models that meet our threshold criteria. For example, perhaps you only want to see experiments above a certain accuracy rate.
  • We can see that our charts all change based on the filters applied to the experiments.
  • Our parallel coordinates chart shows us the parts of the parameter space we have explored, as well as where we have gaps. In this example, we see that a larger Resnet size and batch size produce a more accurate model.

However, this best-performing experiment was also allowed to run much longer than the others. To get a better estimate of model performance, we can exclude it from our charts by hiding it in the experiment table.

  • Our charts update to reflect the Experiment Table, and we can now see that, given a similar training time, a larger Resnet size does not necessarily produce the best results.

We can use these insights to keep iterating on our model design until we have satisfied our requirements.

Comet.ml allows you to create visualizations like bar charts, line plots, and parallel coordinates charts to track your experiments. These experiment-level and project-level visualizations help you quickly identify your best-performing models and understand your parameter space.

If you’d like to share your results publicly, you can generate a link through the Project’s Share button. Alternatively, you can also directly share experiments with collaborators by adding people as collaborators in your Workspace Settings.

Tutorial Summary

You have learned how to prepare and train a Resnet model using Comet.ml and AWS Sagemaker’s Tensorflow Estimator API. To summarize the tutorial highlights, we:

  • Trained a Resnet model on the CIFAR-10 dataset in an Amazon Sagemaker notebook instance using one of Sagemaker’s pre-installed virtual environments
  • Explored different iterations of the training experiment by varying the hyperparameters
  • Used Comet.ml to automatically capture our model’s various hyperparameter configurations, metrics, and code across different training runs.

Sign up for a free trial of Comet.ml here.


Gideon Mendels

Co-founder/CEO of Comet.ml — a machine learning experimentation platform helping data scientists track, compare, explain, and reproduce ML experiments.