Creating a Custom AMI on Amazon EMR

Amber
Feb 28, 2022


Amazon EMR is AWS’ big data processing platform, offering customers the ability to run applications such as Spark, Hive, Presto, and HBase on a distributed framework. Implementing EMR in a customized manner, however, is often easier said than done: the complexities behind the implementation can quickly become a “black box” when problems arise on custom AMIs. In this blog post, I will walk through some of the benefits of using a custom AMI on EMR, as well as provide some best-practice tips on troubleshooting and implementation. But to begin our discussion: if custom AMIs are such a hassle to implement, why do so many customers use them?

For many EMR customers, the ability to create a custom AMI is the determining factor for selecting EMR as their managed Hadoop service in the AWS ecosystem. Many customers have rigid compliance and infrastructure requirements that demand a customized environment on which their analytic applications can run. EMR release 5.7.0 and later support specifying a custom AMI for cluster deployment.

During cluster startup, the EMR service runs any user-supplied scripts, referred to as “bootstrap actions”, before application provisioning. While customization of EMR clusters can be achieved with bootstrap actions, the time they take to execute can slow cluster startup and even cause provisioning failures if the timeout is reached before the cluster reaches the “READY” state.
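
To make that concrete, here is a minimal boto3 (AWS SDK for Python) sketch of attaching a bootstrap action at cluster launch. The script path, cluster name, and instance settings are hypothetical placeholders, not values from this post:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal sketch: the bootstrap script path and cluster settings below
# are hypothetical placeholders, not values referenced in this post.
response = emr.run_job_flow(
    Name="cluster-with-bootstrap-action",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Bootstrap actions run on every node before applications provision;
    # long-running scripts here delay (or time out) cluster startup.
    BootstrapActions=[
        {
            "Name": "install-custom-packages",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install.sh",  # hypothetical
                "Args": [],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```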

Baking any needed software, libraries, and dependencies into the Linux AMI used for EMR can alleviate long startup times, leading to a streamlined startup for job flow execution. Additionally, rigorous and complex OS-level configurations that cannot be achieved with the limited nature of bootstrap actions can be baked into the custom AMI.

Creating a custom AMI

How can you efficiently create a custom AMI if your use case requires it? Creating a custom AMI can be as straightforward as the AWS documentation outlines (a boto3 sketch of the last two steps follows the list):

· Launch an EC2 instance from the base Amazon Linux AMI.

· Install any needed software/packages/customizations on the instance.

· Create a new AMI from the customized EC2 instance.

· Launch an EMR cluster with the newly created custom AMI.
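
Assuming the instance has already been customized, the last two steps might look like the following in boto3; the instance ID, AMI name, and cluster settings are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
emr = boto3.client("emr", region_name="us-east-1")

# Step 3: create an AMI from the customized instance (hypothetical ID).
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="emr-custom-ami-v1",
    Description="Amazon Linux 2 base plus our packages",
)
ami_id = image["ImageId"]

# Wait until the AMI is available before launching the cluster.
ec2.get_waiter("image_available").wait(ImageIds=[ami_id])

# Step 4: launch an EMR cluster that boots every node from the custom AMI.
emr.run_job_flow(
    Name="cluster-on-custom-ami",
    ReleaseLabel="emr-5.30.0",
    CustomAmiId=ami_id,  # supported on EMR release 5.7.0 and later
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```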

Some Caveats…

Beware of using the EMR AMI as the base image: Because of the complexities around implementing a custom AMI, many customers try to take a shortcut by assuming that the EMR AMI used by default on EMR clusters will be a good starting point as a base AMI. When they apply their customizations to the EMR AMI and attempt to launch an EMR cluster with it, the cluster fails to provision, leaving them stumped. It is important to note that this is not a supported process for creating a custom AMI! Custom AMIs created from the base EMR AMI are not supported and will lead to application provisioning errors upon cluster startup. Instead, use the base Linux AMI for your supported EMR version: the Amazon Linux AMI (EMR versions 5.7 to 5.29) or the Amazon Linux 2 AMI (EMR versions 5.30 and 6.x).
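
If you need a supported starting point, one common approach (a suggestion of mine, not something prescribed by EMR) is to resolve the latest Amazon Linux 2 AMI from the public SSM parameters AWS maintains:

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

# AWS publishes the latest Amazon Linux 2 AMI IDs as public SSM parameters;
# this resolves the current x86_64 GP2 image for the region.
param = ssm.get_parameter(
    Name="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"
)
print(param["Parameter"]["Value"])  # e.g. an ami-... ID for us-east-1
```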

Be sure to use the correct instance architecture: AWS has been advocating Graviton2 instances for performance and cost optimizations, leading to widespread adoption by EMR customers. It is important to note that the underlying EC2 instance architecture of the custom AMI needs to match the instance type that it is launched on. For instance, if you were using m6g instances, which use an underlying arm64 architecture, you cannot use the same custom AMI you would use for a non-Graviton2 instance such as m5. Whether it is x86_64 or arm64, the AMI needs to match the instance type that is used in the EMR cluster.
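
A quick sanity check, sketched here with a hypothetical AMI ID, is to compare the AMI’s architecture against the architectures supported by the instance type you intend to use:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Architecture of the custom AMI (hypothetical AMI ID).
ami_arch = ec2.describe_images(ImageIds=["ami-0123456789abcdef0"])[
    "Images"
][0]["Architecture"]

# Architectures supported by the instance type used in the cluster.
supported = ec2.describe_instance_types(InstanceTypes=["m6g.xlarge"])[
    "InstanceTypes"
][0]["ProcessorInfo"]["SupportedArchitectures"]

if ami_arch not in supported:
    raise SystemExit(f"AMI is {ami_arch}, but m6g.xlarge supports {supported}")
```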

Using Instance Fleets: To stabilize their clusters, many customers take advantage of instance fleets over instance groups, because a number of different instance types and a range of subnets can be specified for fulfilling EC2 capacity requests. Although this increases the options for fulfilling an EC2 request, selecting the correct custom AMI for each instance architecture is tricky. Creating an EMR cluster with multiple custom AMIs is not supported from the EMR console; however, the AWS CLI, CloudFormation, the AWS SDK, and the RunJobFlow API can all be used to provision EMR with this configuration. Please see the AWS documentation for more specific information on how to implement instance fleets with custom AMIs.
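
As a rough sketch of that configuration via the SDK (all instance types and AMI IDs below are hypothetical; note that both core instance types are x86_64, matching their AMIs):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Instance-fleets cluster where each instance type maps to its own
# custom AMI via the per-type CustomAmiId field (IDs are placeholders).
emr.run_job_flow(
    Name="fleet-with-custom-amis",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [
                    {"InstanceType": "m5.xlarge",
                     "CustomAmiId": "ami-0aaaaaaaaaaaaaaaa"},
                ],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 2,
                "InstanceTypeConfigs": [
                    # Both types are x86_64, so both AMIs must be too.
                    {"InstanceType": "m5.xlarge",
                     "CustomAmiId": "ami-0aaaaaaaaaaaaaaaa"},
                    {"InstanceType": "r5.xlarge",
                     "CustomAmiId": "ami-0bbbbbbbbbbbbbbbb"},
                ],
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```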

Troubleshooting Errors on Custom AMI During Cluster Provisioning

Although issues with a custom AMI may occur at any point in the lifespan of the cluster, most issues manifest during cluster launch. Therefore, this discussion around troubleshooting custom AMI issues will focus on EMR clusters with a custom AMI failing to provision upon cluster launch. Identifying the cause of the failures can be difficult because the errors encountered can be highly specific to the individual configuration installed on the AMI. It is important to understand that any libraries, software, or other configurations must all be compatible with the Hadoop and EMR framework installed on the nodes at cluster startup.

In general, the best way to deal with any sort of “black box” scenario when custom AMIs fail to provision successfully is to take an iterative approach to building the custom AMI. Say there are five different software packages you would like to install on the AMI, but you are not sure which one is causing the issue: install package 1 and test whether EMR launches successfully, then continue with package 2, and so forth. Although this may seem time consuming, it can save you a big headache when trying to narrow down where provisioning failed.

When attempting to resolve an issue with clusters failing to provision with a fresh custom AMI, understanding the process for cluster provisioning can prove to be very helpful. At a very high level, the process for EMR cluster provisioning is as follows:

· Core and then master nodes are launched by the EMR service

· Devices are mounted

· DNS is configured for EMR clusters launched in a VPC

· EMR-specific services are created (service-nanny, logpusher, instance controller, hadoop-state-pusher)

· Bootstrap actions are run

· Provision Node script is run

· Once applications provision successfully and all nodes are running, the cluster reaches a “READY” state

Since issues with a custom AMI can occur at any stage of cluster provisioning, the startup sequence can help you pinpoint where to start looking for issues on your custom AMI. For example, if the cluster fails to provision and has no cluster logs present in S3 (and this is not due to a permissions issue), you can conclude the failure occurred before the EMR-specific services were created, and the EC2 console logs can be used for troubleshooting.
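
In that scenario, retrieving the EC2 console output for one of the cluster’s instances is straightforward; the instance ID below is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# When provisioning fails before the EMR logpusher exists, the EC2 console
# output is often the only record of what happened during early boot.
output = ec2.get_console_output(InstanceId="i-0123456789abcdef0")  # hypothetical
print(output.get("Output", "console output not yet available"))
```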

When the instance controller process starts on a node, the Node Provisioner (/usr/share/aws/emr/node-provisioner/) process begins and is responsible for provisioning all applications on the node via Apache Bigtop. The Node Provisioner uses Puppet to install and configure applications as needed for the EMR framework. Since the Node Provisioner is executed during cluster startup and during the reconfiguration process, it is likely an area to troubleshoot when clusters utilizing a custom AMI fail to provision.

Provided you have S3 logging configured at the cluster level, the Node Provisioner logs are located at s3://<LOG_URI_LOCATION>/<CLUSTER ID>/node/<EC2 INSTANCE ID>/provision-node/apps-phase/0/{UUID}/puppet.log.gz. Additionally, the stderr.gz and stdout.gz files located in the same directory may also need to be considered.
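
Here is a minimal sketch for locating and scanning those logs with boto3, assuming the layout above; the bucket, cluster ID, and instance ID are placeholders, and the {UUID} directory is discovered by listing rather than guessed:

```python
import boto3
import gzip

s3 = boto3.client("s3")

# All identifiers below are hypothetical placeholders.
bucket = "my-emr-logs"
prefix = "j-ABC123DEF456/node/i-0123456789abcdef0/provision-node/apps-phase/"

# List everything under the apps-phase prefix so the {UUID} directory
# does not have to be known in advance.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(("puppet.log.gz", "stderr.gz", "stdout.gz")):
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            text = gzip.decompress(body).decode("utf-8", errors="replace")
            # Surface only the lines that usually point at the failure.
            for line in text.splitlines():
                if "Error" in line or "Failed" in line:
                    print(key, "|", line)
```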

Although examining these logs may initially seem daunting, don’t be overwhelmed. The puppet logs will give you an indication of where the issue is, but may not specify the exact problem. Even so, having an area of the custom AMI to focus on will likely help you narrow down which component may be responsible for the conflicts causing cluster provisioning failures.

Due to the highly individualized nature of custom AMIs, issues can occur for any number of reasons and at any point during the cluster lifetime. Please keep in mind that the steps outlined for troubleshooting custom AMI issues are meant as a high-level “getting started” guide and do not encompass the wide variety of issues that may occur.

Once initial issues are overcome, implementing a custom AMI on EMR can bring substantial benefits to EMR workloads: resolving compliance and infrastructure requirements and shortening overall cluster startup times.

If you are interested in learning more on creating custom AMIs on EMR, please see the links below:

Using a custom AMI

Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2 Systems Manager


Amber

I am an AWS Big Data architect with 10+ years of experience with databases, ETL, data warehousing, and all things data.