AWS ECS GPU Monitoring

Ozer Cevikaslan
Published in Dolap Tech
May 5, 2021

Hi everyone! Today I’ll take you through the basics of monitoring GPU metrics from ECS EC2 instances in CloudWatch.

A Little Back Story

Dolap, our platform, receives around 9M product submissions each month, and the number is steadily increasing. We have to moderate these products to deliver appropriate and authentic content to our users. To do that, we developed a machine learning solution that scans the submitted product images and scores them against the categories it was trained on. We host it on AWS ECS.

Toward the last steps of releasing our ML app to production, we realized that AWS does not gather GPU metrics from instances for us.

It turns out we have to read the metrics from the hardware on the instance and then send them to CloudWatch ourselves. Luckily, AWS has written a script that does exactly this. You can find it in the article below.
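To make the idea concrete, here is a minimal Python 3 sketch of the collect-then-publish step. It is not the AWS script itself: the function names, the metric names, the `GPUMonitoring` namespace, and the dimension layout are all assumptions for illustration. It assumes the `nvidia-ml-py` package (imported as `pynvml`) and `boto3` are installed, and that the instance role allows `cloudwatch:PutMetricData`.

```python
def build_metric_data(instance_id, gpu_index, gpu_util, mem_util, temp_c, power_w):
    """Shape one GPU sample into the MetricData list that put_metric_data expects.

    Metric names and dimensions here are illustrative, not the ones AWS uses.
    """
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]
    samples = [
        ("GPU Usage", "Percent", gpu_util),
        ("Memory Usage", "Percent", mem_util),
        ("Temperature (C)", "None", temp_c),
        ("Power Usage (Watts)", "None", power_w),
    ]
    return [
        {"MetricName": name, "Dimensions": dimensions, "Unit": unit, "Value": float(value)}
        for name, unit, value in samples
    ]


def read_gpu(gpu_index):
    """Read utilization, temperature, and power draw for one GPU through NVML."""
    import pynvml  # imported lazily: only needed on a machine with an NVIDIA driver

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
        return util.gpu, util.memory, temp_c, power_w
    finally:
        pynvml.nvmlShutdown()


def publish(instance_id, gpu_index=0, namespace="GPUMonitoring", region="eu-central-1"):
    """Read one sample from the GPU and push it to CloudWatch."""
    import boto3  # imported lazily so build_metric_data can be used without AWS access

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    gpu_util, mem_util, temp_c, power_w = read_gpu(gpu_index)
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(
            instance_id, gpu_index, gpu_util, mem_util, temp_c, power_w
        ),
    )
```

Calling `publish("i-0123456789abcdef0")` in a loop with a sleep between iterations would approximate what a monitoring script does; the instance ID shown is a placeholder.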

Automating the initiation of metric gathering

We can copy the existing Launch Configuration and create a new one whose user data follows the steps explained in the article above.

You can tell that the commands in the image below differ from the ones in the article; I’ll explain why in the next section.

What’s up with ECS?

First, I should state that we used the GPU-optimized ECS AMI (amzn2-ami-ecs-gpu-hvm-2.0.20210413-x86_64-ebs).

When we ran the commands from the article above on our ECS-hosted machines, we ran into conflicts between the Python version and the dependencies the script requires.

Let me take you through what we do.

  • Install wget and download the script using wget.
sudo yum install wget -y
wget https://s3.amazonaws.com/aws-bigdata-blog/artifacts/GPUMonitoring/gpumon.py
  • Change the region and namespace.
sudo sed -i "22s/us-east-1/eu-central-1/" gpumon.py
sudo sed -i "25s/DeepLearningTrain/Image Classifier/" gpumon.py
  • Install pip for Python 2.7 to have a working Python environment for the script.
wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
sudo python get-pip.py
  • Install dependencies required by the script.
sudo pip2.7 install nvidia-ml-py boto3
  • Run the script in the background. (nohup keeps the process running even after you exit the shell or terminal.)
nohup python gpumon.py &
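
One caveat about the `sed` step: `22s/…/…/` only touches line 22, and `sed` exits successfully even when nothing matches, so a changed upstream script would be left unpatched without any warning. A hedged alternative, sketched here in Python 3, replaces the default values wherever they occur in the file and fails loudly if one is missing. The helper names `patch_defaults` and `patch_file` are hypothetical; the default strings are the ones used in the commands above.

```python
from pathlib import Path


def patch_defaults(source, replacements):
    """Replace each old literal with its new value; raise if an old literal is absent.

    Unlike the line-addressed sed command, this replaces every occurrence in the text.
    """
    for old, new in replacements.items():
        if old not in source:
            raise ValueError(f"expected to find {old!r} in the script, but it is missing")
        source = source.replace(old, new)
    return source


def patch_file(path, replacements):
    """Apply patch_defaults to a file in place."""
    script = Path(path)
    script.write_text(patch_defaults(script.read_text(), replacements))


# Example call (commented out so that importing this sketch has no side effects):
# patch_file("gpumon.py", {
#     "us-east-1": "eu-central-1",
#     "DeepLearningTrain": "Image Classifier",
# })
```

The trade-off is that a whole-file replacement also rewrites the strings if they appear in comments, which is usually harmless but worth knowing.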

Reviewing GPU Metrics on CloudWatch

Finally, you can see the GPU usage, memory usage, power usage, and temperature metrics in CloudWatch.
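
If the dashboard stays empty, it can help to check from code whether any metrics have arrived. The sketch below, in Python 3 with `boto3`, lists the metric names published under a namespace. The helper name `list_metric_names` is hypothetical, and the namespace and region are the ones set by the `sed` commands earlier; adjust them if yours differ. It assumes credentials that allow `cloudwatch:ListMetrics`.

```python
def list_metric_names(cloudwatch, namespace):
    """Return the sorted, de-duplicated metric names found under a namespace."""
    names = set()
    kwargs = {"Namespace": namespace}
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        names.update(metric["MetricName"] for metric in page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        kwargs["NextToken"] = token  # fetch the next page of results


def main():
    import boto3  # imported lazily so the helper above can be exercised without AWS access

    cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")
    for name in list_metric_names(cloudwatch, "Image Classifier"):
        print(name)
```

Running `main()` on a machine with suitable credentials prints one metric name per line; an empty result suggests the script is not publishing, or is publishing to a different region or namespace.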

If you have any suggestions, please don’t hesitate to leave a comment :)

We’re hiring! Join us if experiences like the one in this article excite you.
