Monitoring Amazon SageMaker Endpoint With Datadog

Yuming Qiao
Tech @shift.com
4 min read · Jul 25, 2021


SageMaker Endpoint Traces in Datadog

Amazon SageMaker and Datadog

Amazon SageMaker is a powerful tool for data scientists and software engineers to prepare, build, train, and deploy high-quality machine learning models. A SageMaker Endpoint is a fully managed AWS service that provides an API for real-time inference.

Just as with any other service, it is crucial for data scientists and engineers to understand an endpoint's performance so that they can take action to improve its latency.

The Datadog agent collects traces that give deep insight into the performance of a running application. In this post, I will show how to set up Datadog to monitor the performance of SageMaker Endpoints.

Background

We manage the Docker image for our SageMaker Endpoints ourselves. The serving code is written in Python with the Flask framework. We deploy the endpoints into a private AWS subnet with a NAT gateway. We set up the network this way to ensure the ML endpoints only serve internal requests, while still being able to communicate with the outside world when necessary (for example, to let the Datadog agent send traces to the Datadog server).
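
For context, here is a minimal sketch of what such a Flask serving app might look like. It is not the actual code from our endpoints; the prediction logic is a placeholder, but the /ping and /invocations routes are what SageMaker expects from a container serving real-time requests.

# sagemaker_endpoints/server.py: a minimal, hypothetical sketch of the serving app
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(payload):
    # Placeholder for the real model inference logic.
    return {'score': 0.0}

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker calls this route to check that the container is healthy.
    return '', 200

@app.route('/invocations', methods=['POST'])
def invocations():
    # SageMaker forwards real-time inference requests to this route.
    payload = request.get_json(force=True)
    return jsonify(predict(payload))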

Steps

1. Install the Datadog Python library, ddtrace.
pip install ddtrace==0.38.4

Note that due to a version constraint from another package, 0.38.4 is the latest version we can use. You should use the latest version available if possible.

2. Wrap functions with traces.

Since ddtrace automatically integrates with the Flask framework, you get many traces out of the box, including the total time of a route like /invocations. You can also wrap the code you are interested in with a trace to get more granular insight.

from ddtrace import tracer

# This will measure the time it takes to get a key from Redis.
@tracer.wrap(name='get_key_from_redis')
def get_key_from_redis(key):
    ...
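
Besides the decorator, ddtrace also lets you trace an arbitrary block of code by using tracer.trace as a context manager. A small sketch (the Redis client and key here are placeholders, not from our codebase):

from ddtrace import tracer

def get_value(redis_client, key):
    # Only the Redis round trip is measured, not the surrounding logic.
    with tracer.trace('redis.get_key'):
        return redis_client.get(key)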

3. Specify the host and port of the Datadog agent

You need to set two environment variables:

DD_AGENT_HOST="datadog.demo.shift.com"
DD_TRACE_AGENT_PORT=8126

I will discuss the details of the Datadog hostname in the last section. For now, let's assume the Datadog agent is running and has the hostname datadog.demo.shift.com .
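
How you set these variables depends on how you deploy. One option is to put them in the container definition of the SageMaker model; below is a rough boto3 sketch under that assumption, where the model name, image URI, role ARN, security group, and subnet IDs are all placeholders:

import boto3

sm = boto3.client('sagemaker')

sm.create_model(
    ModelName='demo-model',  # placeholder
    ExecutionRoleArn='arn:aws:iam::123456789012:role/demo-sagemaker-role',  # placeholder
    PrimaryContainer={
        'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-endpoint:latest',  # placeholder
        'Environment': {
            'DD_AGENT_HOST': 'datadog.demo.shift.com',
            'DD_TRACE_AGENT_PORT': '8126',
        },
    },
    # Places the endpoint in the private subnets described in the Background section.
    VpcConfig={
        'SecurityGroupIds': ['sg-0123456789abcdef0'],  # placeholder
        'Subnets': ['subnet-0123456789abcdef0'],       # placeholder
    },
)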

4. Prefix the command to run your server with ddtrace-run.

In prod, we also use Gunicorn on top of Flask. Our original command to run the server is:

gunicorn -b 0.0.0.0:8080 -w 1 sagemaker_endpoints.server:app

Running it with ddtrace is just a matter of adding the ddtrace-run prefix:

ddtrace-run gunicorn -b 0.0.0.0:8080 -w 1 sagemaker_endpoints.server:app

5. Redeploy the SageMaker endpoint and see traces appear in Datadog!

The SageMaker endpoint takes 10.2 ms to serve this inference request, and 90% of that time is spent getting a value from Redis. The endpoint in this demo is very fast because of the model it uses. For endpoints with more complex models, the traces can help identify the performance bottleneck.

6. Set up a fixed Datadog Hostname

As I mentioned in (3), you need to specify the host and port of the Datadog agent with DD_AGENT_HOST and DD_TRACE_AGENT_PORT. In local development, you can set this up easily in docker-compose:

version: '2.2'
services:
  datadog:
    build: datadog
    image: datadog/agent:7.29.1
    environment:
      # to use datadog please use your own DD_API_KEY
      - DD_API_KEY=${YOUR_KEY}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    ports:
      - 8126:8126
  webserver:
    image: ${IMAGE_URI}
    mem_limit: 4g
    cpus: 1
    depends_on:
      - datadog
    environment:
      - DD_AGENT_HOST=datadog
      - DD_TRACE_AGENT_PORT=8126
      - DD_ENV=dev
    ports:
      - 8080:8080
    command: bash -c "ddtrace-run flask run --port=8080"
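
With this file, docker-compose up starts the agent and the web server together; the web server reaches the agent through the datadog service name, and traces from local requests to port 8080 show up in Datadog under the dev environment.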

In prod, things are a little different. You will likely have many SageMaker endpoints, and you want to monitor all of them. A clean way to do this is to set up a standalone Datadog Agent Service. The Datadog Agent Service has a load balancer with a friendly hostname that listens on port 8126 for traces. Your SageMaker endpoints send their traces to this hostname on port 8126, and the load balancer forwards the traffic to an auto-scaling group of Datadog agents. This way you can have as many SageMaker endpoints as your business needs, and the number of Datadog agents behind the scenes scales up and down automatically.

Network Diagram

In a nutshell, to set up the Datadog Agent Service, you need to (a rough boto3 sketch of the load balancer and DNS pieces follows the list):

  • Identify the VPC and subnets of the SageMaker endpoints.
  • Create a private Route 53 hosted zone in that VPC.
  • Create a network load balancer in the same subnets as the SageMaker endpoints.
  • Create a target group with protocol=TCP, port=8126.
  • Add a listener to the load balancer on port 8126 that forwards traffic to the target group.
  • Copy the DNS name of the load balancer. In Route 53, create an A record with a friendly name like datadog.demo.shift.com and point it at the load balancer's DNS name.
  • Create an ECS task definition and an ECS service; our image URI is public.ecr.aws/datadog/agent:7.29.1. The ECS service should run in the same subnets as the SageMaker endpoints and register with the network load balancer and target group created above.
  • Deploy the Datadog Agent Service and have your SageMaker endpoints send traces to the friendly hostname datadog.demo.shift.com . You should then see the traces in the Datadog UI.
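
The exact setup depends on your infrastructure tooling, but as a rough illustration, here is a hedged boto3 sketch of the load balancer, target group, listener, and Route 53 pieces; all IDs and the hosted zone are placeholders, and the ECS service itself is omitted.

import boto3

elbv2 = boto3.client('elbv2')
route53 = boto3.client('route53')

# Placeholders: use the VPC, subnets, and hosted zone of your own environment.
VPC_ID = 'vpc-0123456789abcdef0'
SUBNETS = ['subnet-0123456789abcdef0', 'subnet-0123456789abcdef1']
HOSTED_ZONE_ID = 'Z0123456789EXAMPLE'  # the private Route 53 hosted zone

# Internal network load balancer in the same subnets as the SageMaker endpoints.
nlb = elbv2.create_load_balancer(
    Name='datadog-agent-nlb',
    Type='network',
    Scheme='internal',
    Subnets=SUBNETS,
)['LoadBalancers'][0]

# Target group for the Datadog agents; traces arrive over TCP on port 8126.
tg = elbv2.create_target_group(
    Name='datadog-agent-tg',
    Protocol='TCP',
    Port=8126,
    VpcId=VPC_ID,
    TargetType='ip',
)['TargetGroups'][0]

# Listener on port 8126 that forwards traffic to the target group.
elbv2.create_listener(
    LoadBalancerArn=nlb['LoadBalancerArn'],
    Protocol='TCP',
    Port=8126,
    DefaultActions=[{'Type': 'forward', 'TargetGroupArn': tg['TargetGroupArn']}],
)

# Friendly hostname pointing at the load balancer's DNS name.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'datadog.demo.shift.com',
            'Type': 'A',
            'AliasTarget': {
                'HostedZoneId': nlb['CanonicalHostedZoneId'],
                'DNSName': nlb['DNSName'],
                'EvaluateTargetHealth': False,
            },
        },
    }]},
)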

Note

I initially used Datadog agent 7.23.1, but that agent could not pass the target group's health check (TCP, port 8126); upgrading the agent to 7.29.1 solved the problem.
