Save Money and Prevent Skew: One Container for Sagemaker and Lambda

A tutorial on building a container for both services

Lou Kratz
Published in TDS Archive · Nov 19, 2021



After this article was written, AWS released Sagemaker Serverless Inference, which you can simply use instead.

Product lifecycles often require infrequent machine learning inference. Beta releases, for example, may only receive a small amount of traffic. Hosting model inference in these scenarios can be expensive: model inference servers are always on even if no inference requests are being processed.

A good solution to underutilization is a serverless offering such as AWS Lambda, which runs code on demand and charges only for the compute time you actually use. This makes it an attractive environment for infrequent inference, but it often demands a different software stack and imposes packaging restrictions. Thus, when it’s time to scale beyond serverless (e.g., when a feature goes to GA or large batches call for a GPU), engineers need to rewrite the inference code for real-time endpoints.

Rewriting inference code costs engineers time and introduces a risk of environment skew. Like training/serving skew, environment skew is model performance change due to differences in software dependencies or how the input data is handled. In other words, rewriting the inference code can unexpectedly cause your model to underperform.

In this tutorial, I show how to create a single docker container for both AWS Lambda and Sagemaker Endpoints. With this approach, you can start out running inference on serverless for infrequently accessed models, and then move to an always-on Sagemaker Endpoint when required. Thus you can keep costs low initially, and scale efficiently without the risk of environment skew or having to rewrite the inference code.

The full code is available on github.

A Fork In Your Container

Both Sagemaker and Lambda support bringing your own container. Sagemaker requires an HTTP server, while Lambda uses a runtime interface client as its entry point. The key idea behind this project is to fork the container’s startup logic depending on which environment it’s running in, while sharing the same inference code:

Image by Author

Forking in entry.sh is actually rather easy: Sagemaker passes a serve argument to the container on launch, while Lambda passes the function handler name. Thus, our entry.sh script only needs to check its first command-line argument to know whether it’s running in Sagemaker or Lambda:

#!/bin/bash

if [ "${1:-}" = "serve" ]; then
    # Start an HTTP server for Sagemaker endpoints.
    exec /usr/local/bin/python serve.py
else
    # Run the Lambda runtime interface client with the handler name.
    exec /usr/local/bin/python -m awslambdaric ${1:-}
fi

(I’ve omitted the Lambda emulator for local testing, but it’s in the github repo)

The Inference Code

A key benefit to this approach is that you can use the same inference code in both environments. In this example, I’ll classify images using a model from the gluoncv model zoo, but the basic concept should be extendable to your own models.

Making your inference code modular is key here: the serve.py and lambda.py files should only have minimal logic for validating input. To achieve this, inference.py has two primary functions:

  • load_model — loads the model into memory
  • infer — takes the input and the model, returns inference results.

Loading models requires some care, since Sagemaker can deploy models from the model registry and Lambda can’t. When a registry model is deployed, the artifact is mounted at /opt/ml/model in the container, which we’ll identify by the variable model_path:

import os
from glob import glob
from os.path import join

import gluoncv
from mxnet.gluon import SymbolBlock


def load_model(model_path=None):
    """
    Loads a model from model_path, if found, or a pretrained model
    specified in the MODEL_NAME environment variable.
    """
    # Try to load from model_path if we are running on Sagemaker.
    if model_path and os.path.exists(model_path):
        symbol_file = glob(join(model_path, '*symbol.json'))[0]
        params_file = glob(join(model_path, '*.params'))[0]
        return SymbolBlock.imports(symbol_file, 'data', params_file)
    else:
        # Running in Lambda, so load the network from the model zoo.
        model_name = os.environ['MODEL_NAME']
        return gluoncv.model_zoo.get_model(model_name,
                                           pretrained=True, root='/tmp/')

In the Lambda case, the code above lets the user specify the model in the MODEL_NAME environment variable and loads the model from the model zoo. Alternatively, you could package the model in the container, or load it from S3.
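If you’d rather pin Lambda to a specific artifact instead of the model zoo, loading from S3 works too. Here is a rough sketch of that alternative (not part of the repo; the load_model_from_s3 helper, the s3_uri format, and the archive layout are assumptions):

import os
import tarfile
from glob import glob
from os.path import join

import boto3
from mxnet.gluon import SymbolBlock


def load_model_from_s3(s3_uri, local_dir='/tmp/model'):
    """Download a model.tar.gz from S3 and load it, mirroring the Sagemaker path."""
    # s3_uri is assumed to look like s3://my-bucket/path/model.tar.gz
    bucket, key = s3_uri.replace('s3://', '').split('/', 1)
    os.makedirs(local_dir, exist_ok=True)
    archive = join(local_dir, 'model.tar.gz')
    boto3.client('s3').download_file(bucket, key, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(local_dir)
    symbol_file = glob(join(local_dir, '*symbol.json'))[0]
    params_file = glob(join(local_dir, '*.params'))[0]
    return SymbolBlock.imports(symbol_file, 'data', params_file)

Lambda only allows writes under /tmp, which is why the archive is downloaded and extracted there.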

The infer function is then a straightforward example of inference with the gluon API:

import mxnet
from gluoncv.data.transforms.presets import imagenet


def infer(uri, net):
    """
    Performs inference on the image pointed to by `uri`.
    """
    # Download and decompress the image (open_image is a helper in inference.py)
    img = open_image(uri)
    # Preprocess the image
    transformed_img = imagenet.transform_eval(img)
    # Perform the inference
    pred = net(transformed_img)
    prob = mxnet.nd.softmax(pred)[0].asnumpy()
    ind = mxnet.nd.topk(pred, k=5)[0].astype('int').asnumpy().tolist()
    # Accumulate the results
    if hasattr(net, 'classes'):
        results = [{
            'label': net.classes[i],
            'prob': str(prob[i])
        } for i in ind]
    else:
        results = [{'label': i, 'prob': str(prob[i])} for i in ind]
    # Compose the results
    return {'uri': uri, 'results': results}
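Because inference.py knows nothing about Sagemaker or Lambda, you can sanity-check it directly on your laptop before building the container. A quick local test might look like this (the model name and image URL are just examples):

import os

import inference

# Take the Lambda path: pull a pretrained model from the gluoncv model zoo.
os.environ['MODEL_NAME'] = 'resnet50_v2'
net = inference.load_model()

# Classify a publicly accessible image and print the top-5 results.
print(inference.infer('https://example.com/cat.jpg', net))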

The Sagemaker Code

When the container launches in a Sagemaker Endpoint, it starts an HTTP server specified in serve.py. The server handles inference requests on the POST /invocations endpoint. I’ve used flask for this example, but any HTTP framework should work:

import argparse
import json

from flask import Flask, Response, request

import inference

# HTTP Server
app = Flask(__name__)
# The neural network
net = None


@app.route("/ping", methods=["GET"])
def ping():
    return Response(response="\n", status=200)


@app.route("/invocations", methods=["POST"])
def predict():
    global net
    # do prediction
    try:
        lines = request.data.decode()
        data = json.loads(lines)
        results = inference.infer(data['uri'], net)
    except ValueError as e:
        error_message = f"Prediction failed with error '{e}'"
        return Response(response=error_message, status=400)
    output = json.dumps(results)
    return Response(response=output, status=200)

The invocations endpoint simply decodes the request and passes the input to the inference.infer function specified above. This keeps things simple: all logic is specified in the inference file.
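For reference, a deployed endpoint built from this container is called through the Sagemaker runtime like any other. A minimal sketch, assuming an endpoint named my-endpoint and the JSON payload used above:

import json

import boto3

runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',  # hypothetical endpoint name
    ContentType='application/json',
    Body=json.dumps({'uri': 'https://example.com/cat.jpg'}),
)
print(json.loads(response['Body'].read()))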

The network itself should be loaded before starting the server in the main block, which is also a convenient place to handle command-line arguments:

def parse_args(args=None):
    parser = argparse.ArgumentParser(
        description='Server for inference on an image.'
    )
    parser.add_argument(
        "--model-path", type=str, default='/opt/ml/model',
        help="The model artifact to run inference on."
    )
    parser.add_argument(
        "--port", type=int, default=8080,
        help="Port to run the server on."
    )
    parser.add_argument(
        "--host", type=str, default="0.0.0.0",
        help="Host to run the server on."
    )
    return parser.parse_args(args)


if __name__ == "__main__":
    # parse command line arguments
    args = parse_args()
    # load the model
    net = inference.load_model(args.model_path)
    # start the server
    app.run(host=args.host, port=args.port)

You’ll notice that model_path defaults to /opt/ml/model, which is where Sagemaker mounts artifacts from the model registry.
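Since the server is plain HTTP, you can also test it outside of Sagemaker by running serve.py (or the container with the serve argument) locally and posting a request to it, for example:

import requests

# Assumes serve.py (or the container) is listening on localhost:8080.
payload = {'uri': 'https://example.com/cat.jpg'}
response = requests.post('http://localhost:8080/invocations', json=payload)
print(response.status_code, response.json())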

The Lambda Code

The Lambda code is even easier: all you need is a function to handle requests:

import json
import logging

import inference

# Load the model when the Lambda container starts up.
net = inference.load_model()


def handler(event, context):
    global net
    try:
        return inference.infer(event['uri'], net)
    except Exception as e:
        logging.error(f'Could not perform inference on {event}: {e}')
        return json.dumps({'error': 'Unable to perform inference!'})

The model is loaded when the Lambda container cold starts, which adds some latency to the first request, but it stays in memory for as long as the Lambda remains warm.
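Calling the function from other code is equally simple; a sketch with boto3, assuming the function was deployed under the hypothetical name image-inference:

import json

import boto3

lambda_client = boto3.client('lambda')

response = lambda_client.invoke(
    FunctionName='image-inference',  # hypothetical function name
    Payload=json.dumps({'uri': 'https://example.com/cat.jpg'}),
)
print(json.loads(response['Payload'].read()))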

Keeping It Lean

AWS Lambda is an excellent way to reduce inference costs for infrequently used models, but it can cost more than an always-on endpoint if the number of invocations climbs after a feature release or a surge in product popularity. Switching to an always-on Sagemaker Endpoint keeps those costs in check, but could require a rewrite of the inference code, which takes time and may introduce environment skew. The container described here works in both environments, making it easy and fast to switch between the two and get the most inference for your dollar.

The full code is available on github.

We use the technique I described here to save inference costs at Bazaarvoice. If this kind of work strikes your fancy, check out our job openings.

If you like this story, please consider supporting me by buying me a coffee or signing up for medium using my referral.
