AWS CPU and GPU Monitoring: Best Practices for Performance Optimization

Lucas Braga
TrackIt
May 24, 2024 · 6 min read

When running applications on Amazon Web Services (AWS), monitoring the performance of CPU and GPU resources helps ensure optimal performance, cost-efficiency, and scalability. AWS provides various tools that enable users to monitor the utilization and health of compute resources running in the cloud.

CPU and GPU monitoring becomes particularly relevant for use cases where the demand for high-performance computing is critical. These include:

  1. Animation and Visual Effects (VFX): Efficient CPU and GPU utilization ensures smooth rendering and real-time processing, crucial for creating high-quality animations and visual effects.
  2. Scientific Research: Monitoring CPU and GPU performance aids in running complex simulations and analyzing large datasets, essential for advancing research in fields such as astrophysics, climate modeling, and molecular dynamics.
  3. Financial Modeling: Optimizing CPU and GPU resources is vital for performing complex calculations and simulations in financial services, enabling accurate risk assessment, algorithmic trading, and portfolio optimization.
  4. Healthcare and Biotechnology: Monitoring CPU and GPU usage facilitates the analysis of genomic data, simulation of protein structures, and drug discovery processes, supporting advancements in personalized medicine and biotechnology.
  5. Gaming and Virtual Reality (VR): Efficient CPU and GPU performance is critical for delivering immersive gaming experiences and realistic simulations in virtual reality, ensuring smooth gameplay and high-fidelity graphics.

The sections below explore the key concepts and best practices for monitoring CPU and GPU performance on AWS.

Understanding CPU and GPU Metrics

CPU Utilization: Percentage of Allocated CPU Resources in Use

Analyzing CPU utilization is crucial for optimizing performance and managing costs effectively on AWS. CPU utilization refers to the percentage of allocated CPU resources that are actively being used by applications and processes. Monitoring CPU utilization metrics helps identify bottlenecks, scale resources as needed, and ensure efficient resource allocation.
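
On AWS, this metric is available out of the box as the CPUUtilization metric in the AWS/EC2 namespace. The snippet below is a minimal sketch, using boto3 and a placeholder instance ID, of reading it from CloudWatch:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    # Placeholder instance ID -- replace with a real one.
    INSTANCE_ID = "i-0123456789abcdef0"

    # Average CPU utilization over the last hour, in 5-minute periods.
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f'{point["Average"]:.1f}%')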

CPU Credit Usage and Balance (for Burstable Instances): Consumption and Accrual of CPU credits

CPU credit usage and balance are important for burstable instances on AWS (e.g., T2, T3, T3a, and T4g). These instances accrue CPU credits during periods of low usage, which can then be spent during bursts of activity. Monitoring CPU credit usage and balance helps determine how efficiently CPU resources are being utilized by burstable instances, supporting informed decisions about instance sizing and optimization.
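
For example, the credit balance can be read with the same CloudWatch API used for CPU utilization; the sketch below (the instance ID and threshold are placeholders) flags a balance that is running low:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

    # CPUCreditBalance is reported by burstable (T-family) instances.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Minimum"],
    )

    datapoints = stats["Datapoints"]
    if datapoints and min(p["Minimum"] for p in datapoints) < 50:
        print("CPU credit balance is running low -- consider resizing or changing instance type.")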

GPU Utilization & GPU Memory Usage: Percentage of Allocated GPU Resources in Use & Memory Consumption

On the GPU side, monitoring GPU utilization and memory usage is essential for applications that leverage GPU resources, such as machine learning, rendering, and scientific computing workloads. GPU utilization indicates the percentage of allocated GPU resources actively being used, while GPU memory usage tracks the amount of GPU memory being consumed by processes and applications.

AWS Monitoring

Amazon CloudWatch is a powerful tool for monitoring and analyzing metrics, logs, and events across AWS services. It offers a centralized platform to collect and visualize CPU, GPU, and other performance metrics, and to set alarms on them. CloudWatch can be integrated with other AWS services for comprehensive monitoring and automation of resource management tasks.

Setting Up CloudWatch Metrics

Amazon EC2 publishes basic CPU metrics such as CPUUtilization to CloudWatch automatically, with no agent required. Collecting additional metrics, such as memory, disk, or GPU data, requires the CloudWatch Agent; it comes pre-installed on some AWS-managed AMIs (for example, the Deep Learning AMIs), but otherwise must be installed and configured on the EC2 instances. For GPU monitoring, additional configuration is necessary to collect and publish GPU-specific metrics.

It is important to ensure that the EC2 instance has the necessary permissions to publish to CloudWatch Metrics (the cloudwatch:PutMetricData action in its instance role). Users have the flexibility to define and publish custom metrics tailored to their applications. Further customization of metric collection can be achieved by selecting appropriate aggregation periods and granularity levels.
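
As a sketch of that workflow, the snippet below publishes a hypothetical custom metric; the namespace, metric name, and instance ID are illustrative, and the call only succeeds if the instance role grants cloudwatch:PutMetricData:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Hypothetical application-level metric; namespace and name are illustrative.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Rendering",
        MetricData=[
            {
                "MetricName": "FramesRenderedPerMinute",
                "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
                "Value": 42.0,
                "Unit": "Count",
            }
        ],
    )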

Monitoring Best Practices

Setting Up Alarms and Notifications

CloudWatch alarms can be configured to trigger notifications based on predefined thresholds to enable proactive resource management. For example, alarms can be set to notify when CPU or GPU utilization exceeds a certain threshold, allowing admins to take actions such as shutting down or resizing unused resources to optimize cost and performance.
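
A minimal sketch of such an alarm is shown below, assuming an existing SNS topic for notifications; the topic ARN and instance ID are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when average CPU utilization stays above 80% for 15 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="high-cpu-utilization",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:ops-alerts"],  # placeholder topic
    )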

Utilizing Dashboards

Utilizing custom dashboards can significantly enhance a company’s ability to visualize and analyze key metrics effectively. Dashboards can be created to display real-time and historical data, providing insights into resource utilization trends and performance metrics.
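
Dashboards can also be created programmatically. The sketch below builds a simple dashboard with a single CPU widget using the CloudWatch dashboard body format; the dashboard name, region, and instance ID are placeholders:

    import json
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    dashboard_body = {
        "widgets": [
            {
                "type": "metric",
                "x": 0, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "title": "EC2 CPU Utilization",
                    "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
                    "stat": "Average",
                    "period": 300,
                    "region": "us-west-2",
                },
            }
        ]
    }

    cloudwatch.put_dashboard(
        DashboardName="compute-monitoring",  # placeholder name
        DashboardBody=json.dumps(dashboard_body),
    )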

Optimizing Resource Allocation

Optimizing resource allocation is a continuous process that involves adjusting instance types and sizes based on monitoring data. By analyzing CPU and GPU metrics, underutilized or overburdened resources can be identified, paving the way for informed decisions that help optimize performance and cost efficiency. This process involves resizing instances, choosing appropriate instance types, and implementing auto-scaling policies to dynamically adjust resources based on workload demands.
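
For workloads that run in an Auto Scaling group, a target tracking policy on average CPU utilization is one common way to implement this. A minimal sketch follows; the group name, policy name, and target value are placeholders:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Keep the group's average CPU utilization around 50%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="render-farm-asg",  # placeholder
        PolicyName="target-cpu-50",              # placeholder
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 50.0,
        },
    )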

GPU-Specific Monitoring

For GPU-specific monitoring on AWS, users can utilize the NVIDIA System Management Interface (nvidia-smi), which is a command-line utility designed for monitoring NVIDIA GPUs. This tool provides detailed information about GPU utilization, memory usage, temperature, and more.
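
For instance, the sketch below reads nvidia-smi output programmatically on a GPU instance; it assumes the NVIDIA driver and nvidia-smi are already installed:

    import subprocess

    # Query per-GPU utilization and memory usage in a machine-readable format.
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )

    for line in output.strip().splitlines():
        index, util, mem_used, mem_total = [field.strip() for field in line.split(",")]
        print(f"GPU {index}: {util}% utilization, {mem_used}/{mem_total} MiB memory used")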

Additionally, GPU performance can be monitored on Amazon EC2 instances that are optimized for GPU workloads, ensuring efficient utilization of GPU resources for tasks such as machine learning, rendering, and data processing.

Examples

1. High-Performance Computing (HPC) Applications

Scenario: Organizations running simulations, scientific research, or other heavy compute jobs on Amazon EC2 instances that utilize GPUs.

Implementation: Setting up monitoring enables the automatic shutdown of instances during inactivity or underutilization, saving costs without human intervention. Python scripts can be utilized to monitor usage and trigger the necessary actions based on predefined thresholds.
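
The sketch below illustrates this idea, assuming GPU utilization has already been published to CloudWatch as a custom metric; the namespace, metric name, threshold, and instance ID are all illustrative:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    ec2 = boto3.client("ec2")

    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder
    NAMESPACE = "Custom/GPU"              # illustrative custom namespace
    METRIC_NAME = "GPUUtilization"        # illustrative custom metric
    IDLE_THRESHOLD = 5.0                  # percent
    LOOKBACK_MINUTES = 60

    stats = cloudwatch.get_metric_statistics(
        Namespace=NAMESPACE,
        MetricName=METRIC_NAME,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=LOOKBACK_MINUTES),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )

    datapoints = stats["Datapoints"]

    # Stop the instance if every datapoint in the window is below the idle threshold.
    if datapoints and all(p["Average"] < IDLE_THRESHOLD for p in datapoints):
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
        print(f"Stopped idle instance {INSTANCE_ID}")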

2. Machine Learning Model Training

Scenario: Data scientists training models on GPU-enabled instances. These tasks can be resource-intensive and expensive, especially when models are in the training phase for prolonged periods of time.

Implementation: Automating the monitoring of CPU/GPU usage helps maintain efficient utilization rates. If the GPU is underutilized for a specified duration, an auto-shutdown Lambda function can be used to stop or terminate the instance, optimizing resource use and controlling costs.
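
A minimal sketch of such a Lambda function is shown below. It assumes a low-GPU-utilization CloudWatch alarm publishes to an SNS topic that invokes the function, and that the alarm's only dimension is the instance ID; the exact shape of the notification payload should be verified against the alarm's actual SNS message:

    import json
    import boto3

    ec2 = boto3.client("ec2")

    def lambda_handler(event, context):
        # The SNS notification wraps the CloudWatch alarm payload as a JSON string.
        message = json.loads(event["Records"][0]["Sns"]["Message"])

        # Pull the InstanceId dimension off the alarm that fired.
        dimensions = message.get("Trigger", {}).get("Dimensions", [])
        instance_ids = [
            d.get("value") for d in dimensions if d.get("name") == "InstanceId"
        ]

        if message.get("NewStateValue") == "ALARM" and instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
            print(f"Stopped underutilized instance(s): {instance_ids}")

        return {"stopped": instance_ids}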

Setting up Automated GPU and CPU monitoring on Amazon EC2 instances

The ‘Automated GPU and CPU Monitoring on AWS EC2 Instances’ Gist provides a comprehensive solution for setting up automated GPU and CPU monitoring on EC2 instances using a combination of Terraform configurations, Python scripts, and PowerShell scripts. The setup is designed to handle the creation of IAM roles, Lambda functions, and CloudWatch alarms, with a specific focus on Windows systems.

Conclusion

As companies increasingly transition to cloud-based workflows, the need for efficient CPU and GPU monitoring becomes paramount. Effective CPU and GPU monitoring ensures optimal performance, resource utilization, and cost management, aligning with the demands of resource-intensive workflows. By implementing automated monitoring systems, organizations can streamline operations, identify potential bottlenecks, and optimize resource usage.

About TrackIt

TrackIt, based in Marina del Rey, CA, is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows including AWS Studio in the Cloud (SIC), Retail workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.

About Lucas Braga

A data scientist by training and DevOps engineer by experience, Lucas has been a DevOps Engineer at TrackIt since 2021. With 8 years of experience spanning Media & Entertainment, TV, and Design, he brings a unique perspective to projects.

An out-of-the-box thinker and serial problem solver, Lucas excels at finding innovative solutions.
