EMR Cluster stuck at provisioning

Yi Ai
The Cloud Builders Guild
2 min readMay 1, 2020

Recently I have been facing an issue where the EMR cluster gets stuck at provisioning, ever since we started using a custom encrypted AMI.

The EMR service tried three times to launch a new instance to run as the master node, and each time the instance was terminated after a timeout (~16 minutes). Eventually the cluster was terminated with the error: Terminated with errors: Failed to start the job flow due to an internal error.

I raised a support ticket, and the main response from AWS was as follows:

This is causing EMR to be unable to install the required EMR daemons, and the instance is being terminated because of the timeout.

Due to our shared responsibility model, issues related to the OS of the instance are primarily the responsibility of the customer to sort out. This is because, for security reasons, AWS doesn't have access to your OS or know exactly what was installed on it. However, as premium support, I can provide general assistance based on what I see from my side regarding your OS.

I can see that you're using Red Hat, and based on the fields "type=1305", "auid=xxxxxx", "ses=xxxx", I suspect the issue to be either:

- the AuditD daemon being misconfigured, as a type "1305" syscall is related to "CONFIG_CHANGE", causing the daemon to hang for a very long time, or
- SELinux being misconfigured, as I can see it is initialized at some point and then subsequently disabled ("selinux=0"). Additionally, the AuditD logs which are causing the errors share the same session ID/user ID as the one used when disabling SELinux.

For your convenience I've attached the EC2_console.log to the case for your perusal. From your end you can only get it while an instance is still running, or shortly after it has been terminated, using either the console or the "get-console-output" AWS CLI command.

In terms of further troubleshooting, one of the key recommendations is that you should be using Amazon Linux as the base AMI for custom AMIs in EMR. Should you still want to try with Red Hat, you could completely disable SELinux in your AMI if you're not going to use it, instead of enabling and then disabling SELinux.
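If you want to pull the console log yourself rather than wait for support, a small boto3 sketch like the one below can fetch it while the failed master instance is still visible in EC2. This is only a sketch based on the "get-console-output" call mentioned above; the instance ID and region are placeholders, not values from my setup.

```python
import base64

import boto3

# Assumption: the short-lived master instance ID is taken from the EMR console
# or from list-instances on the failing cluster; replace with your own values.
INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical instance ID
REGION = "ap-southeast-2"             # hypothetical region

ec2 = boto3.client("ec2", region_name=REGION)

# GetConsoleOutput returns the most recent console output, base64-encoded.
# Latest=True asks for the latest output instead of the cached boot output.
response = ec2.get_console_output(InstanceId=INSTANCE_ID, Latest=True)

output = response.get("Output")
if output:
    console_log = base64.b64decode(output).decode("utf-8", errors="replace")
    # Look for the AuditD / SELinux lines AWS support pointed at,
    # e.g. "type=1305" (CONFIG_CHANGE) or "selinux=0".
    for line in console_log.splitlines():
        if "type=1305" in line or "selinux" in line.lower():
            print(line)
else:
    print("No console output available yet; the instance may already be gone.")
```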

Following the advice from AWS, I changed the base AMI from Red Hat to Amazon Linux, and the problem was solved!
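For reference, using a custom AMI with EMR is just a matter of passing the AMI ID when the cluster is created. Below is a minimal boto3 sketch, assuming the custom encrypted AMI is now built on Amazon Linux; the cluster name, release label, subnet, roles and AMI ID are all placeholder values, not the ones from my environment.

```python
import boto3

emr = boto3.client("emr", region_name="ap-southeast-2")  # hypothetical region

# Assumption: the custom encrypted AMI is now based on Amazon Linux and its ID
# is passed via CustomAmiId (supported for EMR release 5.7.0 and later).
response = emr.run_job_flow(
    Name="custom-ami-cluster",              # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",              # hypothetical EMR release
    CustomAmiId="ami-0123456789abcdef0",    # your Amazon Linux based custom AMI
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # hypothetical subnet
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Started cluster:", response["JobFlowId"])
```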

For more details, please check out the official AWS documentation on using a custom AMI with EMR.

I hope this helps if you run into the same problem.
