Serverless Inference Service on AWS Fargate.

Gautam Kumar
Published in DeepLearning-101
Jun 3, 2019

AWS launched Fargate in 2017, which lets you run containers without having to manage servers or clusters. In March 2019 AWS also announced Deep Learning Containers. By combining the two, you can build serverless machine learning (ML) and deep learning (DL) pipelines. Currently Fargate only supports CPU, and a request has been made for GPU support. In ML and DL workflows, training and inference are the two major components, with training being more compute heavy than inference. This blog will walk you through setting up and running serverless inference on AWS Fargate on a CPU machine. A blog on serverless training will follow once Fargate has GPU support.

Serverless Computing

Let's follow the steps below to create a serverless inference service.

Container: You will need a container that has your trained model in it. Alternatively, you can take a standard serving container (e.g. tensorflow/serving) and point it at the model via the command you pass when starting the model server, which is what I did below.
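Before deploying, it can help to verify the container locally. Here is a minimal sketch, assuming you have Docker installed and a local copy of the model under ./saved_model_half_plus_two_cpu (the path and model name are illustrative):

# Pull the official TensorFlow Serving image
docker pull tensorflow/serving

# Run it locally, mounting the model directory and exposing the REST port
docker run --rm -p 8501:8501 \
  --mount type=bind,source="$(pwd)/saved_model_half_plus_two_cpu",target=/models/saved_model_half_plus_two_cpu \
  -e MODEL_NAME=saved_model_half_plus_two_cpu \
  tensorflow/serving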

Cluster: Let’s create the Fargate cluster by running the command below, or you can use the default cluster. It’s just a logical entity; we have not created any resources yet.

aws ecs create-cluster --cluster-name fargate-cluster
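To confirm the cluster exists and is ACTIVE, you can describe it (a quick sanity check, not required for the rest of the walkthrough):

aws ecs describe-clusters --clusters fargate-cluster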

Task definition: In order to run your container in Fargate you have to prepare a task definition. Basically, a task definition is a configuration file where you specify your container, resource limits, and the command you want to run inside the container. You can also use the task definition to connect your container to other AWS services. You can see many task definition examples here.

Change the account number and role in the task definition below.

{
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "executionRoleArn": "arn:aws:iam::xxxxx:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "command": [
        "tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=saved_model_half_plus_two_cpu --model_base_path=/models/saved_model_half_plus_two_cpu"
      ],
      "entryPoint": [
        "sh",
        "-c"
      ],
      "name": "FargateTFInference",
      "image": "tensorflow/serving",
      "memory": 10240,
      "cpu": 0,
      "essential": true,
      "portMappings": [
        {
          "hostPort": 8500,
          "protocol": "tcp",
          "containerPort": 8500
        },
        {
          "hostPort": 8501,
          "protocol": "tcp",
          "containerPort": 8501
        },
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/TFInference",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ],
  "volumes": [],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "10240",
  "placementConstraints": [],
  "family": "FargateTFInference"
}

I’ve used the official TensorFlow Serving container and added a very simple model called half_plus_two, which takes a number, divides it by two, and adds two.

Register the task: By running the command below, you will receive a revision id for the task, which is required when creating a service.

aws ecs register-task-definition --cli-input-json file://task_definition.json
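If you need to look the revision up later, you can query the latest registered task definition; a small sketch (the --query path assumes the default JSON output of the AWS CLI):

aws ecs describe-task-definition --task-definition FargateTFInference --query 'taskDefinition.revision'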

Create the service: Let’s create the service by running the command below. Update your security groups and subnets in the command. Make sure that your security group allows inbound traffic on the ports your service listens on. You can control the number of containers you want to run by changing ‘desired-count’.

aws ecs create-service --cluster fargate-cluster \
--service-name tensorflow_inference \
--task-definition FargateTFInference:1 \
--desired-count 1 \
--launch-type "FARGATE" \
--network-configuration "awsvpcConfiguration={subnets=[subnet-xxxxx],securityGroups=[sg-yyyy],assignPublicIp='ENABLED'}"

The above command will create a service called tensorflow_inference. A comparison of services and tasks is available here. An ECS service under the hood creates tasks, so basically a service is a task that keeps running.
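As noted above, the security group must allow inbound traffic on the serving ports. A minimal sketch, assuming the placeholder group id sg-yyyy and that you want the REST port (8501) open to the world (tighten the CIDR for real deployments); the second command simply checks that the service is running:

# Open the TF Serving REST port on the security group used by the service
aws ec2 authorize-security-group-ingress --group-id sg-yyyy --protocol tcp --port 8501 --cidr 0.0.0.0/0

# Check the service status and running task count
aws ecs describe-services --cluster fargate-cluster --services tensorflow_inference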

Get the external IP: Go to the ECS cluster fargate-cluster; under Services you will see a service called tensorflow_inference → click on it, and there will be a tab called Tasks listing the running tasks. Click on a task and you will see the public IP associated with it.
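If you prefer the CLI over the console, the same public IP can be found by walking from the task to its network interface. A hedged sketch, where <task-arn> and <eni-id> are placeholders you substitute from the previous command's output:

# List the task ARN behind the service
aws ecs list-tasks --cluster fargate-cluster --service-name tensorflow_inference

# Describe the task and look for "networkInterfaceId" in the attachment details
aws ecs describe-tasks --cluster fargate-cluster --tasks <task-arn>

# Look up the public IP attached to that network interface
aws ec2 describe-network-interfaces --network-interface-ids <eni-id> --query 'NetworkInterfaces[0].Association.PublicIp'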

Run Inference: Let’s run a very simple inference:

curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://<external_ip>:8501/v1/models/saved_model_half_plus_two_cpu:predict

{
    "predictions": [2.5, 3.0, 4.5
    ]
}
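If the predict call does not respond, TensorFlow Serving's model status endpoint is a quick sanity check (same assumed external IP and model name as above):

curl http://<external_ip>:8501/v1/models/saved_model_half_plus_two_cpu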

Conclusion: In the above steps we did not launch any instances, so we don’t need to manage any infrastructure. We just asked Fargate to run the inference container as a service, and it did.

What’s Next: You can add an application Elastic Load Balancer (ELB) and increase the desired count of tasks. That way your service will be accessible from a single ELB DNS name. You can also add auto scaling to your service for horizontal scalability.
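For example, scaling out and registering the service with Application Auto Scaling could look roughly like this; a sketch under assumptions (the desired count, min/max capacities, and the resource id built from the cluster and service names above are illustrative):

# Bump the number of running tasks behind the service
aws ecs update-service --cluster fargate-cluster --service tensorflow_inference --desired-count 3

# Register the service's desired count as a scalable target for auto scaling
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/fargate-cluster/tensorflow_inference \
  --min-capacity 1 --max-capacity 4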
