How to Set up an AWS Ubuntu EC2 for Amazon ECS with GPU Enabled for Deep Learning Frameworks

Brandon Odaniel
Xylem | AI and Big Data
8 min read · Jan 2, 2024

by Yu Yang, Ph.D., PE, PMP

Deep learning frameworks such as PyTorch make it easy to train machine learning models that require GPUs. How to deploy those frameworks into a production environment in a cost-effective way, however, is still a debated topic in machine learning circles.

One approach is to deploy the deep learning frameworks as containerized applications, built with Docker and run on Amazon Web Services' Elastic Container Service (ECS).

Benefits achieved with this approach include:

  • Enhanced flexibility and control over the infrastructure
  • Lower costs
  • Broader application versatility compared to existing "out of the box" ML Ops platforms
  • Scalability
  • High availability
  • Security
  • CI/CD integration

The purpose of this blog post is to describe how to set up an AWS Ubuntu EC2 instance for Amazon ECS with GPU support enabled for hosting deep learning frameworks.

Reference Coursework/Training

A number of technologies are employed below to achieve the purpose of this blog post. If you feel you need more background or training as you read, here are some excellent resources.

The Missing Semester of Your CS Education: a great MIT online course covering Linux shell commands.

Instructions

First, let's take a quick look at the general framework, shown in the image below (Figure 1). The general framework consists of:

  1. Set up an Ubuntu EC2 instance (an Amazon Virtual Machine)
  2. Install the necessary drivers (or interfaces).
  3. Connect to Amazon ECS, whose agent runs on the instance as a Docker container and manages the web app image that we deploy to the Amazon Elastic Container Registry (ECR).

Figure 1. The Relations between EC2 and ECS

Launch an Ubuntu Instance on AWS

The first step is to create an Auto Scaling group from the EC2 console. This is needed in order to define a capacity provider for Amazon ECS workloads.

A capacity provider requires an Auto Scaling group, which in turn is built from a launch template.

When you create an Auto Scaling group, AWS will guide you through creating a launch template. Be sure to include 'allow-ssh-from-vpn' in the security group settings so that you can use your laptop's terminal to connect to the EC2 instance via SSH. Do not forget to choose a free Ubuntu image (e.g., ami-06aa3f7caf3a30282 for Ubuntu 20.04). You can also specify the instance type in the launch template, or you can decide on it later. For instance, I used 'p3.2xlarge' as the instance type.
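If you prefer to script this step, here is a minimal AWS CLI sketch; the template name, key pair, security group ID, and subnet ID below are placeholders rather than values from this setup:

# Create a launch template with the Ubuntu 20.04 AMI and a GPU instance type
aws ec2 create-launch-template \
  --launch-template-name gpu-ecs-template \
  --launch-template-data '{"ImageId":"ami-06aa3f7caf3a30282","InstanceType":"p3.2xlarge","KeyName":"my-key-pair","SecurityGroupIds":["sg-0123456789abcdef0"]}'

# Create the Auto Scaling group from that template with a capacity of one
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name gpu-ecs-asg \
  --launch-template LaunchTemplateName=gpu-ecs-template,Version='$Latest' \
  --min-size 0 --max-size 1 --desired-capacity 1 \
  --vpc-zone-identifier subnet-0123456789abcdef0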

Figure 2. A Snapshot of the Launch Template for the Auto Scaling Group

After you create an Auto Scaling group with a capacity of one, an EC2 instance will automatically be attached to it. On the same page, you can find 'Elastic IPs' in the left panel. It is recommended to assign an Elastic public IP to the created EC2 instance, especially since the instance will need to fetch some packages online (e.g., a newer version of CUDA from NVIDIA).
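Allocating and associating an Elastic IP can also be done from the CLI; a minimal sketch, where the instance ID and allocation ID are placeholders for values returned by your own account:

# Allocate a new Elastic IP and note the AllocationId in the output
aws ec2 allocate-address --domain vpc

# Associate that allocation with the EC2 instance created by the Auto Scaling group
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0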

Assign an IAM role to the EC2 Instance

This step is essential to enable communication between the EC2 instance and the ECS cluster.

On the details page of the EC2 instance you created, click 'Actions', then 'Security', and modify the IAM role from there. For our web application, which needs to use an Amazon S3 bucket, I attached three policies to the IAM role.

The first two are standard AWS-managed policies. The third, ec2LogPermissions, allows the EC2 instance to send log information to the ECS service. You can alternatively add it as an inline policy (JSON format preferred, as sketched below).
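The exact policy JSON depends on your account; purely as an illustration, here is a hedged sketch of attaching such an inline log policy with the AWS CLI, assuming it only needs the basic CloudWatch Logs actions (the role name and resource scope are placeholders):

# Write the inline policy document to a local file
cat > ec2-log-permissions.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Attach the policy inline to the instance's IAM role
aws iam put-role-policy \
  --role-name my-ecs-instance-role \
  --policy-name ec2LogPermissions \
  --policy-document file://ec2-log-permissions.json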

Create an ECS Cluster

  1. Via the AWS Management Console, navigate to the Amazon ECS service.
  2. Select “Clusters” from the sidebar and then click the “Create Cluster” button.
  3. Choose the “EC2 Linux + Networking” template. This allows configuration of the ECS cluster with specific EC2 instance types that are GPU enabled, networking settings (like VPC, subnets, and security groups), and IAM roles.
  4. For the Auto Scaling group, you can choose the one created in the first section.
  5. After configuring these settings, review your choices and click “Create” to launch your cluster.
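The equivalent cluster can also be created from the CLI; a minimal sketch, using the same cluster name you will later put into /etc/ecs/ecs.config:

# Create the ECS cluster and confirm it exists
aws ecs create-cluster --cluster-name cluster_name_you_created
aws ecs describe-clusters --clusters cluster_name_you_created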

Install Docker, CUDA, and the NVIDIA Container Toolkit on the EC2 Instance

After connecting via SSH to the EC2 instance you created earlier, install Docker using the commands below.

Install Docker:

sudo apt-get update
sudo apt-get -y install ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

# Check that Docker is running
sudo systemctl status docker

# Optional: avoid typing sudo with docker
sudo usermod -aG docker $USER
newgrp docker

To test for a successful installation, run the command "docker run hello-world".

Next, install NVIDIA CUDA for the GPU. Here is the official link from NVIDIA (NVIDIA CUDA Toolkit 12.1 Downloads). The detailed commands are shown below.

sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

# Add CUDA to the current shell's paths (adjust the version to match what was installed)
export PATH=/usr/local/cuda-12.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH
nvcc --version

Run the command "nvcc --version"; it should report the version of the CUDA compiler you just installed.
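The two export lines above only apply to the current shell session. As an optional convenience (not part of the original steps), you can persist them in ~/.bashrc; adjust the CUDA version to match what apt actually installed:

# Persist the CUDA paths across SSH sessions
echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc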

We can now install the NVIDIA Container Toolkit. Here is the official link from NVIDIA (Installing the NVIDIA Container Toolkit - NVIDIA Container Toolkit 1.14.3 documentation). The detailed commands are as follows.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify that containers can see the GPU
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base nvidia-smi

After running "sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base nvidia-smi", you should see the familiar nvidia-smi table listing the instance's GPU.

Install the ECS Agent on the EC2 Instance

Configuring an EC2 instance with a GPU is a VERY TRICKY step. If you follow the instructions solely from the Amazon webpage (Working with GPUs on Amazon ECS — Amazon Elastic Container Service ), you might encounter numerous configuration errors. However, it’s possible to selectively use the configuration information from this webpage. It’s also recommended to install the ECS agent using an alternative method found on another webpage (Installing the Amazon ECS container agent — Amazon Elastic Container Service ). By blending information from both pages, you can find the correct approach for this task. Let’s start with the configuration.

sudo mkdir -p /etc/ecs
echo ECS_CLUSTER=cluster_name_you_created | sudo tee -a /etc/ecs/ecs.config
cat /etc/ecs/ecs.config
sudo vim /etc/ecs/ecs.config

#add this to the config file
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_ENABLE_GPU_SUPPORT=true
ECS_NVIDIA_RUNTIME=nvidia
ECS_CLUSTER=cluster_name_you_created

------------------------------------------------

We can then install the ECS agent as follows.

sudo curl -O https://s3.us-east-1.amazonaws.com/amazon-ecs-agent-us-east-1/amazon-ecs-init-latest.amd64.deb
sudo dpkg -i amazon-ecs-init-latest.amd64.deb
sudo systemctl start ecs

NOTE: If the ecs-agent container keeps restarting, try deleting the checkpoint file with "sudo rm /var/lib/ecs/data/agent.db"; the agent should then stabilize.

Now you can run "docker ps" to check whether a container named "ecs-agent" is running.

Test a Sample Task in the ECS Cluster

Now, you can return to the ECS cluster you created to check whether the EC2 instance is connected to it after executing the commands above. The figure below highlights the details to pay attention to. First, ensure that the capacity provider uses the same Auto Scaling group in its ASG (Auto Scaling Group) setting. If no capacity provider exists, you can create a new one based on the Auto Scaling group you previously created. In the 'Container Instances' section, you should be able to click the Instance ID, which will take you to the EC2 instance you created earlier.
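If you need to create that capacity provider by hand, here is a hedged CLI sketch; the provider name is a placeholder, and the ARN is that of the Auto Scaling group created earlier:

# Create a capacity provider backed by the existing Auto Scaling group
aws ecs create-capacity-provider \
  --name gpu-capacity-provider \
  --auto-scaling-group-provider "autoScalingGroupArn=<your-asg-arn>,managedScaling={status=ENABLED,targetCapacity=100}"

# Attach it to the cluster as the default strategy
aws ecs put-cluster-capacity-providers \
  --cluster cluster_name_you_created \
  --capacity-providers gpu-capacity-provider \
  --default-capacity-provider-strategy capacityProvider=gpu-capacity-provider,weight=1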

Figure 3. An Example of ECS Cluster Details after Connecting to the EC2 Instance

Now, you can use a task definition to create a sample task with the JSON code provided below, and then run the task in the ECS cluster you've created. In the sample task logs, you may observe the status of the task change from PENDING to RUNNING, and finally to STOPPED. You may also find that the container successfully ran the nvidia-smi command and exited with Exit Code '0'. Additionally, you can check this from the EC2 instance using the command 'docker logs ecs-agent'.

{
  "containerDefinitions": [
    {
      "memory": 200,
      "essential": true,
      "name": "cuda",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
        {
          "type": "GPU",
          "value": "1"
        }
      ],
      "command": [
        "sh", "-c", "nvidia-smi"
      ],
      "cpu": 100
    }
  ],
  "family": "example-ecs-anywhere-gpu"
}
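If you prefer the CLI to the console for this step, here is a minimal sketch, assuming the JSON above is saved locally as gpu-task.json:

# Register the task definition and run it once on the EC2-backed cluster
aws ecs register-task-definition --cli-input-json file://gpu-task.json
aws ecs run-task --cluster cluster_name_you_created --task-definition example-ecs-anywhere-gpu --launch-type EC2

# Check the task afterwards (it should stop with exit code 0)
aws ecs list-tasks --cluster cluster_name_you_created --desired-status STOPPED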

Push Your Image to ECR

Follow the directions documented here for how to push images to AWS ECR.
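For quick reference, the push typically follows the pattern below (a hedged sketch; the region, account ID, and repository name are placeholders for your own values):

# Authenticate Docker with your private ECR registry
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Create the repository once, then tag and push the local image
aws ecr create-repository --repository-name my-web-app
docker tag my-web-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app:latest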

If You Get Stuck: Tips for Dealing with AWS Services/Products

Components of AWS are like powerful Lego bricks. If you get stuck with your "Lego build" at any point in this documented process, here are four lessons we've learned to help get unstuck.

  • Search the official AWS documentation.
  • Use GPT-4 for general questions about AWS and shell commands.
  • Look at the source code of the AWS service you are interested in. For instance, you can examine the open-source code and guidelines for AWS ECS from its master repository.
  • If you have AWS Enterprise support and are still stuck on an issue after 1-2 weeks of effort, take advantage of AWS technical support: submit a support ticket, request an online meeting, and get guidance from their team.

Please feel free to leave us a comment if you run into any challenges along the way. We will reply to your questions as soon as we can.

Acknowledgements

Tremendous thanks to all the colleagues who helped me get this working! Special thanks to John Blake Duffie, Alex Lilley, Chris Hardison, and Reinaldo Maciel, who spent hours in discussions and meetings with me.


Brandon O’Daniel, Data Geek, AI/Big Data Director @ Xylem, MBA, Proud Dad of three awesome kids :)