Easy Steps to Optimise your AWS EMR Performance and Reduce Cost

Published in FinBox · 6 min read · Feb 2, 2019

A major challenge when using AWS EMR is reducing cost without giving up performance. Doing it well means understanding many concepts, such as partitioning your RDDs, specialised instances, Spot Fleets, Spot Blocks, Ganglia, UI configuration, logging, pricing and many other quirks of EMR. So where do you start? These easy steps will get you some quick results.

Repartitioning your RDD

The first and the most obvious thing is to partition your RDDs in Spark in such a way that it ensures maximum resource utilization. But that’s easier said than done. Understanding the inner workings of Spark does take its own time.

A quick improvement in efficiency can be achieved by repartitioning your RDD into 2 × (number of cores in your cluster) partitions before map-style transformations.

This may not be the optimal setting every time, but it is usually a significant improvement over an unpartitioned job.
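As a rough sketch of the rule of thumb above, the helper below derives the partition count from the cluster's core count. The function name `target_partitions` and the 4-node, 4-core cluster are hypothetical; the commented lines show where the number would go in a PySpark job, assuming an existing RDD called `rdd`.

```python
def target_partitions(total_cores: int, factor: int = 2) -> int:
    """Rule-of-thumb partition count: factor * total cores in the cluster."""
    if total_cores < 1 or factor < 1:
        raise ValueError("total_cores and factor must be positive")
    return factor * total_cores


# Hypothetical cluster: 4 nodes with 4 cores each.
num_partitions = target_partitions(4 * 4)
print(num_partitions)  # 32

# In a PySpark job (assumes an existing RDD called `rdd`):
# rdd = rdd.repartition(num_partitions)
```

Keeping the factor as a parameter makes it easy to experiment with other multiples once you have measured how your job behaves.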

Use Specialised Instances

You can see an improvement by simply using the correct instance type for your use case. An instance from the c family would be appropriate for the TASK nodes of an EMR cluster running a compute-heavy Spark job.

However, Spark is incredibly memory intensive and you may often run into memory errors. It is important to provision enough memory for your executors and driver. You can refer to this insightful blog on allocating Spark resources. It’s a good idea to use an instance from the r family for the MASTER node of the EMR cluster, especially if your job requires shuffling.

Use Spot Instances

Spot Instances can help reduce your EC2 costs by 50–80%. If you already know what they are, you can move on to the next section.

Below is a description as given by AWS. You can read more here.

A Spot Instance is an unused EC2 instance that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2, and adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price.

Using Spot Fleets

You will face two major issues while using Spot Instances.

AWS won’t provision a Spot Instance if none are available. If you are using your Spark job for regular ETL, this is a catastrophic situation.
AWS may reclaim your Spot Instances at any moment if spare capacity runs out or the Spot price rises above your maximum price. If you are using your Spark job for regular ETL, this is an apocalyptic situation.

These issues are addressed by the last two topics of this post: Spot Fleets and Spot Blocking.

A Spot Fleet is a group of Spot and On-Demand instances of different types. You can configure your EMR cluster to be comprised of one or more fleets.

The advantage of using a Spot Fleet is that instead of specifying the exact instance types you want, you specify your compute or memory capacity requirement. AWS will provision whichever of the listed instances are available to fulfil that requirement. This greatly increases your chance of getting Spot Instances.

How to configure Fleets?

Requesting Instance Fleet from AWS Dashboard

Step 1 —

Define your target. If you need a total of 16 cores, set your target to 16.

Step 2 —

Define a list of instances that would be suitable for your Spark job and assign them weights. For example, a c4.xlarge has 4 cores, so the weight of c4.xlarge would be 4. Similarly, you can list more instance types, like c4.2xlarge with a weight of 8 and c4.4xlarge with a weight of 16.

This will result in AWS provisioning any of the following combinations (instance count in parentheses):

c4.xlarge (2), c4.2xlarge (1), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (0), c4.4xlarge (1)
c4.xlarge (4), c4.2xlarge (0), c4.4xlarge (0)
c4.xlarge (0), c4.2xlarge (2), c4.4xlarge (0)
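To see where these combinations come from, here is a small sketch that enumerates every mix of instance counts whose weights sum exactly to the target. The function name `exact_combinations` is made up for illustration, and the code only models the arithmetic; it says nothing about which combination AWS will actually choose.

```python
from itertools import product


def exact_combinations(target, weights):
    """Return every mix of instance counts whose weights sum to target."""
    names = list(weights)
    # An instance type can appear at most target // weight times.
    ranges = [range(target // weights[name] + 1) for name in names]
    combos = []
    for counts in product(*ranges):
        total = sum(count * weights[name] for count, name in zip(counts, names))
        if total == target:
            combos.append(dict(zip(names, counts)))
    return combos


weights = {"c4.xlarge": 4, "c4.2xlarge": 8, "c4.4xlarge": 16}
for combo in exact_combinations(16, weights):
    print(combo)
```

For a target of 16 and these three weights, the enumeration yields exactly the four combinations listed above.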

Configuring Spot Fleet from AWS Dashboard

You can follow the same process for your memory requirement. If your cumulative memory requirement is 160 GB, then set your target to 160. Your list of instances could be r5.xlarge (32 GB of memory) with a weight of 32, r5.2xlarge (64 GB of memory) with a weight of 64, and so on.

*You do not have to restrict the list to instances from a single family.
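The memory-based weighting can be sketched as a small helper that turns a table of instance memory sizes into fleet entries. The field names mirror the example configuration in this post; the helper name `memory_weighted_configs` and the default bid of 100% are assumptions made for illustration.

```python
# Memory per instance type in GB (values taken from the text above).
MEMORY_GB = {"r5.xlarge": 32, "r5.2xlarge": 64}


def memory_weighted_configs(memory_gb, bid_percent=100):
    """Build InstanceTypeConfigs entries weighted by memory in GB."""
    return [
        {
            "InstanceType": instance_type,
            "WeightedCapacity": gb,
            "BidPriceAsPercentageOfOnDemandPrice": bid_percent,
        }
        for instance_type, gb in memory_gb.items()
    ]


core_fleet = {
    "Name": "CoreFleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 160,  # cumulative memory requirement in GB
    "InstanceTypeConfigs": memory_weighted_configs(MEMORY_GB),
}
```

Because the weight is just a number you choose, the same pattern works for cores: swap the memory table for a table of core counts and set the target accordingly.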

Step 3—

Specify the TimeoutDurationMinutes. This is the amount of time AWS will spend trying to provision Spot capacity that fulfills your requirements before giving up. The EMR API accepts values from 5 to 1440 minutes; the example configuration in this post uses 60.

Step 4—

Specify the TimeoutAction. If AWS is unable to provision an EMR cluster within the TimeoutDurationMinutes, then it will carry out the TimeoutAction. TimeoutAction can either be SWITCH_TO_ON_DEMAND or TERMINATE_CLUSTER.
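Steps 3 and 4 together make up the `SpotSpecification` part of a fleet. A minimal sketch, assuming you build the configuration in Python before handing it to AWS: the helper name `spot_specification` is hypothetical, and it only checks the two `TimeoutAction` values named above, not any limits AWS itself enforces.

```python
VALID_TIMEOUT_ACTIONS = {"SWITCH_TO_ON_DEMAND", "TERMINATE_CLUSTER"}


def spot_specification(timeout_minutes, timeout_action):
    """Build the LaunchSpecifications block for an instance fleet."""
    if not isinstance(timeout_minutes, int) or timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be a positive integer")
    if timeout_action not in VALID_TIMEOUT_ACTIONS:
        raise ValueError(f"unknown TimeoutAction: {timeout_action!r}")
    return {
        "SpotSpecification": {
            "TimeoutDurationMinutes": timeout_minutes,
            "TimeoutAction": timeout_action,
        }
    }


launch_specifications = spot_specification(60, "SWITCH_TO_ON_DEMAND")
print(launch_specifications)
```

Validating the action locally catches typos before you submit the cluster request.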

Here’s an example of a cluster configuration:

[
  {
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 1,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.xlarge"
      },
      {
        "WeightedCapacity": 1,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c5.4xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "m4.2xlarge"
      },
      {
        "WeightedCapacity": 1,
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "SizeInGB": 32,
                "VolumeType": "gp2"
              },
              "VolumesPerInstance": 1
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "c4.4xlarge"
      }
    ],
    "Name": "MasterFleet"
  },{
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 160,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 60,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": 16,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.2xlarge"
      },
      {
        "WeightedCapacity": 32,
        "BidPriceAsPercentageOfOnDemandPrice": 100,
        "InstanceType": "r3.4xlarge"
      }
    ],
    "Name": "CoreFleet"
  }
]

Two fleets together make up the EMR cluster. I have named the MASTER node fleet ‘MasterFleet’ and the CORE node fleet ‘CoreFleet’.
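A configuration this long is easy to misread, so it can help to print a one-line summary per fleet before launching. The sketch below does that for a trimmed copy of the configuration (EBS settings and some instance types are left out for brevity); the function name `summarize_fleets` is hypothetical.

```python
import json

# Trimmed copy of the configuration above.
FLEETS_JSON = """
[
  {"Name": "MasterFleet", "InstanceFleetType": "MASTER",
   "TargetOnDemandCapacity": 0, "TargetSpotCapacity": 1,
   "InstanceTypeConfigs": [
     {"InstanceType": "r3.xlarge", "WeightedCapacity": 1},
     {"InstanceType": "r3.2xlarge", "WeightedCapacity": 1}]},
  {"Name": "CoreFleet", "InstanceFleetType": "CORE",
   "TargetOnDemandCapacity": 0, "TargetSpotCapacity": 160,
   "InstanceTypeConfigs": [
     {"InstanceType": "r3.2xlarge", "WeightedCapacity": 16},
     {"InstanceType": "r3.4xlarge", "WeightedCapacity": 32}]}
]
"""


def summarize_fleets(fleets):
    """Return one readable line per fleet: name, role, targets, types."""
    lines = []
    for fleet in fleets:
        types = ", ".join(
            f"{cfg['InstanceType']} (weight {cfg['WeightedCapacity']})"
            for cfg in fleet["InstanceTypeConfigs"]
        )
        lines.append(
            f"{fleet['Name']} [{fleet['InstanceFleetType']}] "
            f"spot={fleet['TargetSpotCapacity']} "
            f"on-demand={fleet['TargetOnDemandCapacity']}: {types}"
        )
    return lines


for line in summarize_fleets(json.loads(FLEETS_JSON)):
    print(line)
```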

The options given for the Core fleet are r3.4xlarge (weight = 32) and r3.2xlarge (weight = 16). AWS will provision a combination whose weights add up to at least the specified target of 160.
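The arithmetic behind the target is simple to check. This sketch covers only the single-type case, where the whole fleet is filled with one instance type; AWS is free to mix types. The helper name `instances_needed` is hypothetical.

```python
import math


def instances_needed(target, weight):
    """Smallest instance count whose combined weight reaches the target."""
    return math.ceil(target / weight)


target = 160
for instance_type, weight in {"r3.2xlarge": 16, "r3.4xlarge": 32}.items():
    count = instances_needed(target, weight)
    print(instance_type, count, count * weight)
```

With a target of 160, ten r3.2xlarge or five r3.4xlarge instances hit the target exactly. A target that is not a multiple of the weight gets rounded up, which is how the provisioned capacity can end up above the target.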

If AWS can’t provision an EMR cluster for me in 60 minutes, it will go ahead and provision On-Demand instances instead. I would rather pay the On-Demand price and keep my ETL running than save the money and not run the ETL at all.

Using Spot Blocking

Spot Blocks are Spot Instances that are provisioned for a fixed duration of 1 to 6 hours. AWS charges more for Spot Blocks than for regular Spot Instances, but still much less than the On-Demand price. This way, you can ensure that your ETL process won’t be interrupted, since AWS won’t reclaim your instances before the block ends.
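In a fleet configuration, a block is requested through the `SpotSpecification`. The sketch below assumes the `BlockDurationMinutes` field from the EMR API, which does not appear in the example configuration in this post; the helper name `spot_block_specification` and its defaults are hypothetical. As far as I know, AWS has since stopped offering Spot Blocks to new customers, so check current availability before relying on this.

```python
def spot_block_specification(hours, timeout_minutes=60,
                             timeout_action="SWITCH_TO_ON_DEMAND"):
    """Build a SpotSpecification that requests a Spot Block of `hours`."""
    if not isinstance(hours, int) or not 1 <= hours <= 6:
        raise ValueError("Spot Blocks run for 1 to 6 whole hours")
    return {
        "SpotSpecification": {
            "TimeoutDurationMinutes": timeout_minutes,
            "TimeoutAction": timeout_action,
            # Assumed field name: block length expressed in minutes.
            "BlockDurationMinutes": hours * 60,
        }
    }


print(spot_block_specification(3))
```

Pick the shortest block that comfortably covers your job's usual runtime, since longer blocks cost more.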

And there you have it. Happy Distributed Computing to you. If you are looking for more tips on reducing your AWS bills, check this series of blogs out.