Don’t let Dr.Who hijack your EMR cluster
Opening port 8088 (the YARN ResourceManager) to the world will get your EMR cluster taken over by an external exploit in about 3 minutes. Although doing so is clearly wrong and negligent, quite a few people (myself included) may think that opening ports momentarily in a DEV environment is harmless. This experience quickly taught me it is not.
This short article documents what happens inside a YARN-managed cluster when it is exposed directly to the internet. As such, it ought to apply to all YARN deployments, not just EMR. Builds as recent as emr-5.24.0 are affected. I hope it serves as a reminder that securing infrastructure resources is solely our responsibility when operating in a public cloud.
At the risk of stating the obvious, let’s start with what should be done to access one of the EMR application consoles without an SSH tunnel. In the EMR cluster Summary tab, locate the security groups:
Edit the Master security group (i.e. the traffic rules for the EMR master node) to allow inbound traffic to the required port only from your current IP address:
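The same rule can also be added programmatically. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the function names, the rule description, and the default port are illustrative choices, not part of EMR itself. It builds an ingress permission for a single /32 address and rejects anything that is not a plain IPv4 address, so a range such as 0.0.0.0/0 cannot slip in by accident.

```python
import ipaddress


def single_ip_ingress_rule(ip: str, port: int = 8088) -> dict:
    """Build an EC2 ingress permission that lets one IPv4 address reach one TCP port."""
    # IPv4Address raises a ValueError subclass on malformed input, including CIDR ranges.
    address = ipaddress.IPv4Address(ip)
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [
            {"CidrIp": f"{address}/32", "Description": "temporary dev access"}
        ],
    }


def allow_my_ip(group_id: str, ip: str, port: int = 8088) -> None:
    """Apply the rule to a security group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rule builder works without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[single_ip_ingress_rule(ip, port)],
    )
```

Calling allow_my_ip with the Master security group ID and your own public address would add the rule; both values are placeholders you need to supply.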
Note: Connecting to the EMR master using an SSH tunnel is a more secure approach, documented under the link “Enable Web Connection” and described in detail here.
Opening the door
The remainder of this article explores what happens when port 8088 is exposed to the public:
Within just a few minutes (3–4 in my experience), your YARN ResourceManager console will list applications submitted by user dr.who. This is the default identity Hadoop assigns to unauthenticated HTTP requests, so it indicates that the workload was submitted through an unsecured YARN REST endpoint (as described in this forum post):
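The same list can be retrieved from the ResourceManager REST API, which is handy for a quick check from a script. The sketch below assumes the standard Cluster Applications endpoint (/ws/v1/cluster/apps) with its user query parameter; the host name you pass in and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def apps_url(rm_host: str, user: str = "dr.who", port: int = 8088) -> str:
    """Build the ResourceManager URL that lists applications for one user."""
    query = urllib.parse.urlencode({"user": user})
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload: dict) -> list:
    """Return (id, user, state) tuples from a Cluster Applications response."""
    # The API returns {"apps": null} when nothing matches, hence the "or {}".
    apps = (payload.get("apps") or {}).get("app", [])
    return [(a.get("id"), a.get("user"), a.get("state")) for a in apps]


def fetch_apps(rm_host: str, user: str = "dr.who") -> list:
    """Query a live ResourceManager; needs network access to the master node."""
    with urllib.request.urlopen(apps_url(rm_host, user), timeout=10) as resp:
        return summarize_apps(json.load(resp))
```

A non-empty result from fetch_apps on a cluster where nobody deliberately submitted work as dr.who is a strong hint that the endpoint has been reached from outside.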
Shortly thereafter, without having yet submitted any “legit” work, slave nodes will suddenly show unusually high levels of outbound network traffic:
Unsurprisingly, the network traffic is accompanied by all available CPU resources being maxed out:
Peeking inside a slave node reveals lots of suspicious processes owned by yarn:
One of the processes we were able to observe was /tmp/.mingetty, which had established a Stratum connection (Bitcoin mining) to a random host (in this case traced to Melbourne, Australia):
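A rough way to repeat this inspection on a node is to list everything running as yarn and flag executables launched from /tmp. The sketch below assumes a Linux node with a procps-style ps; the function names are hypothetical, and the /tmp check is only a heuristic that mirrors the one process observed here, not a reliable detector.

```python
import subprocess


def parse_ps(output: str) -> list:
    """Parse lines of 'pid pcpu args' into dicts, skipping blank or short lines."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)  # pid, cpu, then the full command line
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        rows.append({"pid": int(pid), "cpu": float(cpu), "command": command})
    return rows


def launched_from_tmp(rows: list) -> list:
    """Heuristic: keep processes whose executable lives under /tmp."""
    return [r for r in rows if r["command"].startswith("/tmp/")]


def yarn_processes(user: str = "yarn") -> list:
    """Run ps for one user; empty column headers suppress the header line."""
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "pcpu=", "-o", "args="],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ps(result.stdout)
```

Running launched_from_tmp(yarn_processes()) on an affected node should surface entries like the one above, while legitimate YARN containers normally run a Java binary from outside /tmp.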
At this point, in-flight Spark jobs will begin timing out with Lost executor # errors and similar symptoms. Soon, SSH sessions to the slave node will slow to a crawl, and 10–15 minutes into the intrusion the node won’t even respond to new SSH connection attempts. Eventually, the only option left is to terminate the cluster.
If this behavior persists for more than a few minutes, you will receive an EC2 Abuse Report email from AWS making you aware of the issue, asking you to address it, and to document its resolution in writing:
The recommendation is straightforward: don’t ever expose EMR ports to the world, not even for 5 minutes, not even for a development environment. Use either an SSH tunnel or a specific set of IP addresses to interact with your cluster during development. Infrastructure security in a public cloud is our responsibility, and it requires at least the same measures used on-premises.
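One way to keep yourself honest is to audit the Master security group for rules that expose a port to everyone. The sketch below works on a single security group dictionary in the shape returned by the EC2 DescribeSecurityGroups API (for example, one element of the SecurityGroups list from boto3's describe_security_groups); the function name and the sample data are hypothetical.

```python
OPEN_TO_WORLD = ("0.0.0.0/0", "::/0")


def world_open_rules(security_group: dict, port: int = 8088) -> list:
    """Return the world-open CIDRs of every inbound rule that covers the port."""
    findings = []
    for perm in security_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            covers_port = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            low = perm.get("FromPort", 0)
            high = perm.get("ToPort", 65535)
            covers_port = low <= port <= high
        else:
            covers_port = False
        if not covers_port:
            continue
        cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        findings += [c for c in cidrs if c in OPEN_TO_WORLD]
    return findings


# Illustrative data only: a group with 8088 open to the world and SSH restricted.
example_group = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 8088, "ToPort": 8088,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.7/32"}]},
    ],
}
```

An empty list from world_open_rules means no inbound rule exposes the port to the whole internet; anything else should be tightened before the cluster does any work.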