Screenshot from https://aws.amazon.com/sagemaker/

Machine Learning Workflow with SageMaker

Matthieu
Published in Akeneo Labs · Feb 12, 2018 · 7 min read

So you’re working on Machine Learning, you’ve got prediction models (like a neural network performing image classification for instance), and you’d love to create new models.

The thing is: each time you create a model, you have to train it and evaluate it. You may want to train on several powerful machines at once, generate a report about its performance, and also compare the models.

Yes? No? Anyway that’s what I’ve been experiencing.

Now here comes Amazon SageMaker, which offers a workflow that structures the way I can do those operations, while leveraging AWS horsepower. And I like it. The problem is: it’s new, so documentation is still rather poor, and most of the time a Google search doesn’t turn up any StackOverflow thread or blog post answering my questions yet.

So I’ve decided to address this last issue by writing this post as a reference for my colleagues, for me in the future, and hopefully for other SageMaker enthusiasts out there. It’s provided as is. Feedback is welcome, especially if you’ve got better solutions :)

BTW, this post is not about explaining how SageMaker works. It’s more about tips & tricks. I strongly encourage you to read the documentation as well: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works.html

Q1) How to run a SageMaker Notebook locally, from a laptop?

You can run a notebook directly on a SM (let’s abbreviate SageMaker as SM from now on) notebook instance. So why would I want to run it on my laptop? Well, because:

  • starting a notebook instance takes 6 minutes
  • I want to use a debugger to step through some functions I call from the notebook, and I can’t do this from a SM notebook instance
  • I want to commit/push/pull my notebook, and doing this from a remote notebook is not straightforward
  • I don’t want to have a notebook instance running the whole night long just to keep my notebook variables and sessions alive.
  • I may be offline in the airplane. Why should this prevent me from creating and troubleshooting a custom TensorFlow estimator?

SM’s documentation claims that you can run the notebook 1) from a notebook instance and also 2) from your laptop. And they are right. But while the first one is well explained, the second lacks documentation. So here is how to do it.

The notebooks provided as examples begin with the following code:

import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

Let’s install the awscli and sagemaker pip packages, start a Jupyter notebook on my laptop, and run the code above. The last line fires an error:

The current AWS identity is not a role: arn:aws:iam::4242424242:user/matthieu therefore it cannot be used as a SageMaker execution role

Indeed, when you create a SM notebook instance in AWS console, you assign it a role. It’s used as a credential to train models, retrieve datasets, etc. But I didn’t assign any such role on my laptop. I just configured awscli to use my user AWS credentials.

I tried to solve the problem by setting the awscli configuration in the file ~/.aws/config, as described here. But nothing of the sort helped. (Still, I think it would be the best way to go. The solution might be in boto3’s documentation.)

So here is my solution:

# forget about role = get_execution_role(). Instead do:
role = 'arn:aws:iam::4242424242:role/matthieu' # copy/paste the ARN of the role attributed to a notebook instance
# It should contain ":role/" instead of ":user/"

Now, it would be great for your notebook to work both on your laptop and on a real SM Notebook instance. And also to avoid storing the role in your code. So I created the following function:

import sagemaker


def get_execution_role(sagemaker_session=None):
    """
    Returns the role ARN whose credentials are used to call the API.
    In an AWS notebook instance, this will return the ARN attributed to the
    notebook. Otherwise, it will return the ARN stored in settings
    at the project level.
    :param: sagemaker_session(Session): Current sagemaker session
    :rtype: string: the role ARN
    """
    try:
        role = sagemaker.get_execution_role(sagemaker_session=sagemaker_session)
    except ValueError:
        try:
            # Read your role from a configuration file, from environment
            # variables, etc. It's up to you.
            from some_module_of_your_project import get_sagemaker_role_from_settings
            arn = get_sagemaker_role_from_settings()
        except ImportError as ee:
            print('Could not import module to load the role from settings')
            raise ee
        if ':role/' in arn:
            role = arn
        else:
            message = 'The current AWS identity is not a role: {}, ' \
                      'therefore it cannot be used ' \
                      'as a SageMaker execution role'
            raise ValueError(message.format(arn))
    return role


role = get_execution_role()  # Voila! This works everywhere
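
In case it helps, here is a minimal sketch of what get_sagemaker_role_from_settings() could look like. It reads the ARN from an environment variable; the variable name SAGEMAKER_ROLE_ARN is my own invention, not an official one, and any settings mechanism your project already has would do just as well:

import os


def get_sagemaker_role_from_settings():
    """Minimal sketch: read the SageMaker role ARN from an environment
    variable. SAGEMAKER_ROLE_ARN is an arbitrary name, adapt it to
    your project's settings conventions."""
    return os.environ['SAGEMAKER_ROLE_ARN']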

Q2) How to download a file from S3?

It’s an easy one, probably covered somewhere, just much harder to find than the upload_to_s3() function that appears in almost all the samples. So here it is:

import os

import boto3  # Python library for the Amazon API
from botocore.exceptions import ClientError


def download_from_s3(url):
    """ex: url = s3://sagemakerbucketname/data/validation.tfrecords"""
    url_parts = url.split("/")  # => ['s3:', '', 'sagemakerbucketname', 'data', ...]
    bucket_name = url_parts[2]
    key = os.path.join(*url_parts[3:])
    filename = url_parts[-1]
    if not os.path.exists(filename):
        try:
            # Create an S3 resource and download the object
            s3 = boto3.resource('s3')
            print('Downloading {} to {}'.format(url, filename))
            s3.Bucket(bucket_name).download_file(key, filename)
        except ClientError as e:
            if e.response['Error']['Code'] == "404":
                print('The object {} does not exist in bucket {}'.format(
                    key, bucket_name))
            else:
                raise


def upload_to_s3(bucket, channel, file):
    """From SM examples, with `bucket` made an explicit argument. Like here:
    https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-transfer-learning.ipynb"""
    s3 = boto3.resource('s3')
    data = open(file, "rb")
    key = channel + '/' + file
    s3.Bucket(bucket).put_object(Key=key, Body=data)
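
A quick usage sketch, reusing the bucket and file names from the docstring example above (they are made-up values, adapt them):

# Downloads s3://sagemakerbucketname/data/validation.tfrecords
# to ./validation.tfrecords, unless the local file already exists
download_from_s3('s3://sagemakerbucketname/data/validation.tfrecords')

# Uploads ./validation.tfrecords to the 'data' channel of the bucket
upload_to_s3('sagemakerbucketname', 'data', 'validation.tfrecords')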

Q3) How to reload a model and an endpoint from a notebook?

Yesterday I trained a model with estimator = TensorFlow(...) and estimator.fit(...). Then I created an endpoint based on this model with predictor = estimator.deploy(...).

Then I turned the notebook off, and lost my precious estimator and predictor objects. I can redeploy the model from AWS console in the browser, but how to do it programmatically?

First, I can attach to a previous training job (whose name is in the logs, or at least in the AWS console; I could also have printed it with estimator.latest_training_job.job_name):

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow.attach(training_job_name='sagemaker-tensorflow-py2-cpu-2018-01-01-01-01-42-042')
# cf https://github.com/aws/sagemaker-python-sdk/issues/31

I can then deploy my model again with predictor = estimator.deploy(...) if the training step has been completed, or reload an existing endpoint with the following code:

from sagemaker.tensorflow import TensorFlowPredictor

predictor = TensorFlowPredictor('my_existing_endpoint_name')
result = predictor.predict(['my request body'])
# cf https://github.com/aws/sagemaker-python-sdk/issues/36
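
For the redeploy case, here is a minimal sketch of the deploy() call; the instance count and type are example values to adapt to your needs:

predictor = estimator.deploy(
    initial_instance_count=1,        # example values, adapt them
    instance_type='ml.m4.xlarge',
)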

Q4) How to test a TensorFlow model before training on SageMaker?

After creating an estimator like this:

# Inspired by the distributed MNIST example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='my_estimator.py',
    output_path=output_path,
    role=role,
    training_steps=10000,
    evaluation_steps=100,
    train_instance_count=8,  # train on several instances in parallel
    train_instance_type='ml.p2.xlarge',  # BTW, by default you can use only one such instance at once. If you hit this limitation, you can open a ticket with AWS support.
    hyperparameters=hyperparameters,
)

It can be uploaded to instances on AWS to start training like this:

estimator.fit(train_data_location)

After the ~6 minutes required to deploy the model, I saw that the model I described in my_estimator.py didn’t work, because of syntax errors, mismatched tensor dimensions, etc. “After 6 minutes” is the most important part of the previous sentence.

So let’s solve this issue.

What happens in this case is that we create a custom TensorFlow Estimator. This is a high-level TensorFlow interface which is pretty well documented on the TensorFlow website (here and there), and also in SageMaker’s documentation (here and here). I won’t cover it all again; instead I advise you to read the TensorFlow links if you are not familiar with this API. But let’s explain the relationship between TF estimators and SM here.

Taking a look at the same MNIST example as above, we can see that we have to create functions very similar to those describing a TF Estimator:

# inspired from https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/mnist.py

def model_fn(features, labels, mode, params):
    # ...

def serving_input_fn(params):
    # ...

def train_input_fn(training_dir, params):
    # same as the train input function of a TF estimator,
    # except for the training_dir argument.
    # ...

def eval_input_fn(training_dir, params):
    # same as the eval input function of a TF estimator,
    # except for the training_dir argument.
    # ...
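
To make the signature difference concrete, here is a hypothetical sketch of such a train_input_fn. SageMaker passes the directory containing the training data as the first argument; the parsing logic is elided, and the dataset_name parameter matches the hyperparameters used later in this post:

import os
import tensorflow as tf


def train_input_fn(training_dir, params):
    """Hypothetical sketch: read TFRecords from the directory provided
    by SageMaker, then build batches for training."""
    filename = os.path.join(training_dir, params['dataset_name'])
    dataset = tf.data.TFRecordDataset(filename)
    # ... map a parsing function over the records, shuffle, etc.
    dataset = dataset.batch(32)
    return dataset.make_one_shot_iterator().get_next()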

So here is the code I run before trying to fit() a model on AWS:

"""This script aims to run lines of code that triggers most errors that the
Tensorflow Estimator will encounter once deployed to a machine to fit
a dataset. The goal is to avoid the overhead time of ~6min required to
start the fitting process on SageMaker.

To some extend, this code replicates what SageMaker does internally."""

import
os

import my_estimator
# import mnist as my_estimator to load model defined in ./mnist.py
import tensorflow as tf
from tensorflow.python.estimator.util import fn_args
import numpy as np

def adapt_args_to_SM(directory, input_fn):
input_fn_args = fn_args(input_fn)
kwargs = {}

if len(input_fn_args) == 2:
# In this case, we're dealing with a SageMaker input function,
# such as:
# def train_input_fn(directory, params):
# pass
def
wraper_function(params):
return input_fn(directory, params)
return wraper_function
else:
# In this case, we assume that
# we're dealing with a pure Tensorflow input function,
# such as: lambda: my_input_fn(FILE_TRAIN, True, 500)
return
input_fn


def sanity_checks():
tf.logging.set_verbosity(tf.logging.INFO)

hyperparameters = {
'bottleneck_tensor_size': 2048,
'dataset_name': 'inception_v3.tfrecords',
'dense_sizes': [],
'nb_categories': 394,
}

classifier = tf.estimator.Estimator(
model_fn=my_estimator.model_fn,
params=hyperparameters,
)

bottleneck_path = 'some/path'
print("About to start the training. You can monitor the network by "
"running the command `tensorboard --logdir={}`".format(
classifier.config.model_dir
))

# https://www.tensorflow.org/get_started/premade_estimators#train_evaluate_and_predict
classifier.train(
steps=5, # training 5 steps, just to check the network architecture
input_fn=adapt_args_to_SM(directory=bottleneck_path,
input_fn=my_estimator.train_input_fn),
)

eval_result = classifier.evaluate(
steps=5,
input_fn=adapt_args_to_SM(directory=bottleneck_path,
input_fn=my_estimator.eval_input_fn))

predictions = classifier.predict(
input_fn=adapt_args_to_SM(directory=bottleneck_path,
input_fn=my_estimator.predict_input_fn))

# performing prediction maybe a more realistic way: See
# https://github.com/tensorflow/tensorflow/blob
# /18003982ff9c809ab8e9b76dd4c9b9ebc795f4b8/tensorflow/docs_src
# /programmers_guide/saved_model.md#performing-the-export
print('Done')
pass


if
__name__ == '__main__':
sanity_checks()
# When this works, I'm confident I can try to fit() the model and it's worth waiting 6 minutes to make sure of it

Enjoy :) And thanks for your feedback.
