AWS CloudWatch: Monitoring Spot Instances

Berkay Çınar
Published in Trendyol Tech
5 min read · Feb 16, 2023

In this blog post, I will briefly explain Amazon CloudWatch, including how to monitor Spot instances, set alarms, and take corresponding actions based on those alarms.

Observability and monitoring are crucial for understanding the performance and behavior of software systems. They allow for proactive identification and resolution of issues, and aid in troubleshooting and debugging. By providing insight into the inner workings of a system, observability helps to ensure the reliability and stability of software.

Amazon CloudWatch is a monitoring service that provides data and operational insights for various AWS resources. It helps users monitor and troubleshoot their applications, as well as optimize resource utilization.

CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. It also provides a range of visualization and alerting tools to help users analyze and act on this data. CloudWatch Logs allows users to monitor, store, and access their log files from Amazon Elastic Compute Cloud (EC2) instances, Auto Scaling, and other cloud resources. It can be used to troubleshoot issues and identify patterns in log data.

We use Fluent Bit, which integrates seamlessly into an EKS cluster, to stream log data from Kubernetes to CloudWatch. Once the Fluent Bit configuration is complete, log data from containers running in the EKS cluster is streamed to CloudWatch in real time. By using Fluent Bit and CloudWatch together, we gain valuable insights into the operation of our applications and infrastructure.

Alarms are essential for identifying and resolving issues that arise in a system. Setting alarms in Amazon CloudWatch typically involves using either the built-in metrics or logs collected by Fluent Bit, depending on the specific use case and monitoring needs. In this regard, there are various ways to configure alarms using metrics, and the following are among the most common methods:

  1. Single metric alarm: This type of alarm is based on a single metric, such as CPU utilization or network traffic. You can set a threshold for the metric, and the alarm will trigger when the metric value crosses the threshold.
  2. Composite alarm: A composite alarm combines the states of other alarms, and you define a Boolean rule expression over those alarms. For example, you might create a composite alarm that triggers only when a high CPU utilization alarm AND a high request count alarm are both in the ALARM state.
  3. Anomaly detection alarm: With Amazon CloudWatch anomaly detection, you can create alarms that trigger when a metric value deviates significantly from the expected behavior. This can be useful for detecting unusual spikes in traffic or unexpected drops in performance.
  4. Percentile-based alarm: With percentile-based alarms, you can create alarms that trigger when a percentile statistic of a metric crosses a threshold. For example, you might create an alarm that triggers when the 99th percentile of response times is greater than 500ms.
  5. High-resolution custom metric alarm: Amazon CloudWatch supports high-resolution custom metrics, which can be used to monitor detailed performance metrics with sub-minute granularity. You can create alarms based on these metrics just like you would with standard metrics.
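As a rough illustration of the first, second, and fourth alarm types, the sketch below builds the request parameters for each one as plain Python dictionaries. The alarm names, thresholds, periods, and the helper function names are assumptions made for this example; the dictionary keys follow the CloudWatch `PutMetricAlarm` and `PutCompositeAlarm` request parameters.

```python
from typing import Dict, List


def single_metric_alarm(instance_id: str, threshold: float = 80.0) -> Dict:
    """Parameters for a single-metric alarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,    # consecutive breaching datapoints (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def percentile_alarm(load_balancer: str, threshold_seconds: float = 0.5) -> Dict:
    """Parameters for a p99 response-time alarm (500 ms = 0.5 s)."""
    return {
        "AlarmName": f"p99-latency-{load_balancer}",  # hypothetical naming scheme
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",  # reported in seconds
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        # Percentiles use ExtendedStatistic instead of Statistic.
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def composite_alarm(name: str, child_alarm_names: List[str]) -> Dict:
    """Parameters for a composite alarm that fires when all children are in ALARM."""
    rule = " AND ".join(f'ALARM("{child}")' for child in child_alarm_names)
    return {"AlarmName": name, "AlarmRule": rule}
```

With boto3 installed and credentials configured, these dictionaries could be passed as keyword arguments, for example `boto3.client("cloudwatch").put_metric_alarm(**single_metric_alarm("i-0123456789abcdef0"))` or `put_composite_alarm(**composite_alarm(...))`.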

How to monitor Spot instances?

When using Spot node group instances to host applications, it is important to be aware that Amazon EC2 may not always be able to fulfill your request for Spot instances. This can happen when the demand for Spot instances exceeds the available capacity, or when the Spot price exceeds the maximum price that you are willing to pay.

If EC2 is unable to fulfill a request for Spot instances, it can cause problems with your Auto Scaling group. For example, if you have configured your Auto Scaling group to scale out when the demand for EC2 instances increases, and there are not enough Spot instances available to fulfill the request, the scale-out action will fail.

A CloudWatch alarm can be used to monitor the difference between the InService and Desired instance counts of the Auto Scaling group. This alarm notifies you when the difference between the two metrics exceeds a certain threshold, allowing you to take action to ensure that your Auto Scaling group is operating at the desired capacity.
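The alarm described above could be defined with CloudWatch metric math, as in the following sketch. It assumes that group metrics collection is enabled on the Auto Scaling group, so that `GroupDesiredCapacity` and `GroupInServiceInstances` are published in the `AWS/AutoScaling` namespace. The function name, alarm name, period, and evaluation settings are illustrative assumptions, not the configuration used in the original setup.

```python
from typing import Dict, Optional


def _asg_metric(metric_id: str, metric_name: str, asg_name: str) -> Dict:
    """One input metric for the metric-math expression (not returned itself)."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/AutoScaling",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name}
                ],
            },
            "Period": 60,
            "Stat": "Average",
        },
        "ReturnData": False,
    }


def capacity_gap_alarm(
    asg_name: str, max_gap: int = 0, sns_topic_arn: Optional[str] = None
) -> Dict:
    """Alarm parameters that fire when Desired - InService stays above max_gap."""
    params = {
        "AlarmName": f"{asg_name}-capacity-gap",  # hypothetical naming scheme
        "Metrics": [
            _asg_metric("desired", "GroupDesiredCapacity", asg_name),
            _asg_metric("in_service", "GroupInServiceInstances", asg_name),
            {
                "Id": "gap",
                "Expression": "desired - in_service",
                "Label": "Desired minus InService",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
        "EvaluationPeriods": 5,  # tolerate normal launch delays (assumed value)
        "Threshold": max_gap,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params
```

The result is meant to be passed to `put_metric_alarm` as keyword arguments. Requiring several consecutive breaching periods is a design choice here: a short gap is normal while new instances are still launching.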

CloudWatch Events (now Amazon EventBridge) allows users to create rules that match incoming events and route them to various targets, such as Amazon SNS topics, AWS Lambda functions, and more.
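To make the idea of a rule matching an event more concrete, here is a small sketch of an event pattern for the Spot interruption warning that EC2 emits, together with a deliberately simplified matcher. The `matches` function is a hypothetical helper written for this example; it only handles exact value matching and is not the full matching logic that the service implements.

```python
from typing import Dict

# Event pattern for the two-minute Spot interruption warning.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def matches(pattern: Dict, event: Dict) -> bool:
    """Simplified matcher: every pattern key must exist in the event, and the
    event value must be one of the listed values (nested dicts recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Sample event with made-up identifiers, trimmed to the relevant fields.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate",
    },
}

print(matches(SPOT_INTERRUPTION_PATTERN, sample_event))
```

A rule with this pattern could route the warning to the same SNS topic as the capacity alarm, so that both signals arrive in one place.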

When the InService/Desired alarm is in the ALARM state because EC2 is unable to fulfill a request for Spot instances, there are several possible ways to resolve the issue. We will explore two of them.

One potential solution is to use Auto Scaling actions to launch On-Demand instances as a backup mechanism. By doing so, the system can ensure that the desired capacity of the group is met even if the number of available Spot instances is limited.
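One way to approximate this backup mechanism is a mixed instances policy on the Auto Scaling group, which keeps a fixed share of capacity On-Demand rather than reacting to a shortage dynamically. The sketch below only builds the policy dictionary. The launch template name, instance types, and default values are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Dict, List


def mixed_instances_policy(
    launch_template_name: str,
    instance_types: List[str],
    on_demand_base: int = 1,
    on_demand_percentage_above_base: int = 20,
) -> Dict:
    """Build a MixedInstancesPolicy that reserves part of the group for On-Demand."""
    if not 0 <= on_demand_percentage_above_base <= 100:
        raise ValueError("on_demand_percentage_above_base must be within 0..100")
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types widen the set of Spot pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always keep this many On-Demand instances.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of capacity above the base that is On-Demand.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }
```

The dictionary is intended as the `MixedInstancesPolicy` argument of the Auto Scaling `create_auto_scaling_group` or `update_auto_scaling_group` calls. Raising the On-Demand percentage trades cost for a lower chance of running under the desired capacity.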

The second solution is a Lambda function that subscribes to an SNS topic. The CloudWatch alarm publishes to this topic when the ratio of requested instances to in-service instances in the group exceeds a certain threshold, and the message triggers the function. The function can then notify you that the Auto Scaling group is unable to meet the demand.

The Lambda that subscribes to the notification topic can perform one or more of the following actions:

  1. Attempt to request additional Spot instances: The function can try to request additional Spot instances from EC2, using a higher maximum price if necessary.
  2. Scale out using On-Demand instances: If it is not possible to obtain additional Spot instances, the function can scale out the Auto Scaling group using On-Demand instances instead.
  3. Send an alert: If the scale-out action is still not possible, the function can send an alert (e.g. via email or Slack) to notify the appropriate personnel of the issue.
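The escalation order in the list above can be sketched as a handler that walks through the remediation steps and alerts only if none of them succeeds. The AWS calls themselves are left out: `remediations` and `send_alert` are hypothetical callables that you would supply, and the function names are assumptions for this example. The parsing relies on the JSON message that CloudWatch alarms publish to SNS, which includes `AlarmName`, `NewStateValue`, and `NewStateReason`.

```python
import json
from typing import Callable, Dict, List, Tuple


def parse_alarm(event: Dict) -> Dict:
    """Extract the alarm name, state, and reason from an SNS-delivered alarm."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "name": message["AlarmName"],
        "state": message["NewStateValue"],
        "reason": message.get("NewStateReason", ""),
    }


def handle_capacity_alarm(
    event: Dict,
    remediations: List[Tuple[str, Callable[[], bool]]],
    send_alert: Callable[[str], None],
) -> str:
    """Try each remediation in order; alert if none of them succeeds.

    Returns the name of the step that succeeded, "alerted" if all steps
    failed, or "ignored" if the alarm is not in the ALARM state.
    """
    alarm = parse_alarm(event)
    if alarm["state"] != "ALARM":
        return "ignored"  # e.g. the OK notification after recovery
    for step_name, action in remediations:
        try:
            if action():
                return step_name
        except Exception:
            # A failing step should not block the next fallback.
            continue
    send_alert(f"{alarm['name']}: capacity not restored ({alarm['reason']})")
    return "alerted"
```

In a real function, a `lambda_handler(event, context)` entry point would call `handle_capacity_alarm` with steps such as "request more Spot capacity" and "scale out with On-Demand", plus an alert function that posts to email or Slack. Passing the steps in as callables keeps the ordering logic testable without AWS access.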

The best solution will depend on your specific use case and requirements. It may be necessary to try a few different solutions and monitor the results to determine what works best for your needs.

Overall, Amazon CloudWatch is a powerful monitoring service that can help users improve the performance, reliability, and security of their applications and infrastructure on AWS.
