Auto Fix Ec2 Instance Status checks failed using CloudFormation

Praveen Vallepu
3 min readSep 28, 2022

I recently got an email from one of my customers saying that their Ec2 instances status checks are failing and need an automated solution to fix the issue. Sometimes instance checks fail in the middle of the night, so he needs to wake up and reboot the instance manually to resolve the issue. Due to this issue, the customer is looking more for an automated solution instead of fixing the issue manually.

AWS monitors the health of the Ec2 instance using instance and system status checks. If a status check fails, the Ec2 instance becomes unreachable. Instance status checks failure may happen due to Network configurations issue, memory issue, and boot-up issue. Below CloudFormation stack helps to reboot the instance using CloudWatch Alarm when it fails the status checks.

What is in CloudFormation Stack:

  1. Instance Auto recovery Alarm resource

AWS CloudWatch Alarm will be created with specific thresholds to reboot the instance when status checks fail.

2. Parameter section

The Parameter section auto-populates instance IDs in the region to create a recovery alarm in the resource section.

Ec2StatusChecksRecovery CloudFormation Template:

https://github.com/praveenv4/Task/blob/main/Cloudformation/Ec2StatusChecksRecovery.yaml

Create Ec2StatusChecksRecovery CloudFormation Stack:

  1. Log in to the AWS account and go to Cloudformation service
  2. Choose Create Stack and click on “with new resources (standard).”

3. In this example, our template is ready, so select “Upload a template file” and click on “Choose file” then select the file from your local machine.

4. After choosing the template click on Next, provide a name like “Ec2StatusChecksRecovery” and click on drop-down, then select the Ec2 instance ID that you are having the issue with.

5. Now, Click on Next then click on Create Stack.

6. After a few minutes, Stack creation will be completed.

Reproduce the issue and verify the automation:

  1. I have created the above CloudFromation stack for one of my sandbox instances.

2. Let's connect to the instance and run “ifconfig eth0 down” command on the server, which will deactivate the specified network interface.

3. From the below image, we can see that one of the Ec2 instance status checks failed.

4. In the below Image, we can see that the CloudWatch Alarm identified the status checks issue and executed the reboot action.

5. Once the CloudWatch Alarm executed the reboot action, the Instance passed the status checks.

Conclusion:
Using the above solution, we can automatically reboot the Ec2 instances when there is a status check failure.

Updated by Praveen Vallepu | on 27 SEPTEMBER 2022

--

--

Praveen Vallepu

Sr. AWS DevOps Engineer | 5x AWS Certified | Certified Kubernetes Administrator (CKA) | Python | Docker | Terraform