Running Apache Spark on AWS

Acast Tech Blog · Dec 17, 2019

By Mariusz Strzelecki

Amazon Web Services (AWS), with its S3 storage and instantly-available computing power, is a great environment to run data processing workloads.

When you Google “how to run Apache Spark in AWS” you’re most likely to be pointed to the Elastic MapReduce service (AWS EMR), but this isn’t the only way you can wrangle your data using Spark.

At Acast we tested two other ways of running Apache Spark-based workloads on AWS infrastructure: AWS Glue Jobs and Fargate containers. Both run “serverless”, so they reduce maintenance overhead to the absolute minimum and give you better control of costs and more efficient use of computing power.

Apache Spark on EMR

EMR (Elastic Map Reduce) is an Amazon-managed Hadoop distribution. It runs on EC2 nodes and the hosts are initialized by installing data processing libraries (like Apache Spark), tuned to use S3 in the most performant way.

The intended usage pattern for EMR is to start the cluster by defining the jobs you want to run in the initial run_job_flow call, and to set the KeepJobFlowAliveWhenNoSteps flag to false (so-called short-lived clusters). In this way, the cluster starts, executes all planned work and terminates itself.
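
As a sketch of that pattern (the cluster name, instance types, roles and script path below are placeholders, not a production setup):

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Start a short-lived cluster that runs one Spark step and shuts itself down.
response = emr.run_job_flow(
    Name="nightly-etl",                        # placeholder name
    ReleaseLabel="emr-5.28.0",                 # an EMR release current in late 2019
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])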

However, this method has some drawbacks. When one of your jobs fails, the cluster is shut down and you have no control over the retry policy. To rerun only the failed part of the pipeline, you need to examine the logs and prepare new job definitions for the next EMR cluster to execute.

Another way of running EMR is to start a long-lived cluster that is always up and benefits from an auto-scaling policy to borrow as much AWS computing power as you need. In this scenario you start with a small number of workers, but when tasks are waiting in the YARN (Hadoop’s computing resource manager) queue, new nodes are spun up to speed up the process.

Moreover, you have continuous access to the cluster’s master node, so you can run a short analysis any time you need to. At first glance this scenario looks perfect, but has two important downsides. The first is the cost — you pay for a small EMR cluster at all times, even if it’s not in use. The second is the immaturity of EMR in keeping long-lived clusters functional. Sometimes, due to down-scaling, nodes are just powered off — which may result in internal HDFS corruption (something that has happened a lot in my past projects).

Also, running EMR just to start a pipeline consisting of Apache Spark jobs isn’t that simple. While AWS’s UI console wizard hides the complexity brilliantly, if you plan to run pipelines in automated ways you need to understand (and set) more than 20 parameters specifying node fleets, cluster configuration, IAM roles, optional autoscaling rules, and more.

And, the more applications you choose to be installed, the longer you’ll wait for the cluster to initialize (because there’s no “golden image”, but all the services are installed during bootstrap). With Spark only, it takes four or five minutes to start the cluster — but if you need Jupyter or Hue as well, be prepared to wait for at least three times as long for your cluster to be ready.

When it comes to paying for EMR, you pay for the EC2 instances that power your cluster (according to their lifetime), plus an additional fee for the EMR processes installed on them. You can cut costs by running EC2 Spot instances, but be aware that a shortage of spare capacity may delay your cluster startup — or the cluster might not start at all.

Spot instances are great for short-lived “exploratory” clusters, where you can intervene if something goes wrong, but I wouldn’t recommend them for running nightly pipelines — unless you like being woken up by SLA-measuring monitoring because of data delays.

Apache Spark on Glue

Sometimes, you don’t need all Hadoop cluster features to run your data processing pipeline. Apache Spark bundles a lot of Hadoop libraries and data format drivers, and is able to do pretty much everything if you use S3 (and not HDFS) as the data storage.

For cases like this, AWS Glue Jobs is the number one choice — a serverless, managed ETL environment, somewhere between SQL queries (that are limited when it comes to proper data management) and running a fully featured Hadoop cluster.

In AWS Glue you first upload your PySpark (or Scala Spark) application to S3 and create a job definition that specifies IAM roles and the “computing power” needed for the ETL. Then you start your application, optionally specifying parameters, and the AWS Glue service spawns a small, short-lived, containerized cluster that runs your app. You pay only for the time the job runs, billed per second with a 10-minute minimum.

The cluster size for AWS Glue jobs is set as a number of DPUs, between 2 and 100. A DPU is a processing unit comprising 4 vCPUs and 16 GB of RAM, and the underlying Spark configuration maximizes resource allocation for the job itself. The pricing is fairly simple — you pay $0.44 for one DPU-hour, so after a few experiments running your ETL on various data sizes, you’ll know how costly your data processing system will be.
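
For illustration, creating and starting such a job from boto3 might look like this sketch (the job name, role and script location are placeholders):

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Define the job: where the script lives, which role it assumes,
# and how many DPUs a run may use.
glue.create_job(
    Name="sample-etl",                                   # placeholder
    Role="arn:aws:iam::xxxx:role/glue_job_role",         # placeholder IAM role
    Command={
        "Name": "glueetl",                               # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/jobs/etl.py",  # placeholder path
        "PythonVersion": "3",
    },
    MaxCapacity=10,  # DPUs allocated to each run
)

# Start a run, optionally passing parameters to the script.
run = glue.start_job_run(
    JobName="sample-etl",
    Arguments={"--input_date": "2019-12-01"},
)
print(run["JobRunId"])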

The intended use of AWS Glue jobs is to define your job once then run it multiple times, changing the parameters. However, at Acast, we use it in a SQL-like way — every time we run the process, we create a new Glue Job, then the job is executed and removed.

This way we don’t need to worry about code versioning, and our Airflow-based ETL orchestrator already covers all possible concurrency and parallelism cases.
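
A minimal sketch of that run-and-remove pattern, reusing the create_job call from the previous sketch and a simple polling loop:

import time
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

def run_and_remove(job_name):
    """Run an already-created Glue job to completion, then delete its definition."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    try:
        # Poll until the run reaches a terminal state.
        while True:
            state = glue.get_job_run(
                JobName=job_name, RunId=run_id
            )["JobRun"]["JobRunState"]
            if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
                return state
            time.sleep(30)
    finally:
        # Drop the job definition whatever the outcome, so nothing lingers.
        glue.delete_job(JobName=job_name)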

Apache Spark on Fargate

Finally, there are jobs that don’t require enough power to justify starting multiple nodes. Thankfully, Apache Spark is also capable of running well outside clustered environments, in “local” mode. In this scenario, the Spark driver and all executors live within one JVM (Java Virtual Machine) as separate threads, so the parallelism of your application can be as high as the number of available CPU cores.
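
As a quick sketch (the app name and S3 paths are placeholders), switching to local mode is just a matter of the master URL:

from pyspark.sql import SparkSession

# "local[*]" runs the driver and executors in one JVM,
# using as many worker threads as there are CPU cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-etl")                      # placeholder name
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/raw/")   # placeholder path
df.groupBy("podcast_id").count() \
  .write.parquet("s3a://my-bucket/aggregated/")  # placeholder path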

Still, Apache Spark running in local mode in an AWS environment can easily wrangle multiple gigabytes of data per minute — which means it can process medium-sized data sets without waiting for a cluster to initialize (either a “real” EMR one or a “virtual” Glue one). And when you pack your job and all its dependencies into a Docker container, you can execute it on AWS infrastructure in several more ways: on plain EC2 nodes, on AWS ECS on EC2, on AWS Batch, or on AWS Fargate.

At Acast we run Spark jobs in local mode on AWS Fargate. It’s a managed container service (similar in spirit to AWS Glue), and its pricing model is fairly simple: you pay $0.04 per vCPU-hour and ~$0.005 per GB-hour of RAM. What’s more, since December 3, 2019, Fargate supports a Spot execution model that cuts the cost by up to 70% for interruption-tolerant tasks.
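
For example, a boto3 sketch of requesting Spot capacity might look like the one below (the cluster and task definition names are placeholders, and the cluster must have the FARGATE_SPOT capacity provider enabled):

import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# Request the task on Fargate Spot via a capacity provider strategy;
# launchType must be omitted when a strategy is given.
ecs.run_task(
    cluster="MyCluster",
    taskDefinition="samplejob:1",
    capacityProviderStrategy=[{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxxx"],     # placeholder
            "securityGroups": ["sg-xxxx"],  # placeholder
            "assignPublicIp": "DISABLED",
        }
    },
)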

However, if you try to run your ETL image built on Spark downloaded from the official site and based on Hadoop 2.7 (the default selection at the time of writing this post), you’ll encounter issues with accessing AWS services (like S3) on Fargate.

This is because Hadoop 2.7.3’s hadoop-aws library loads the AWS SDK in version 1.7.4. While the version numbers may not give you any insight, maybe the release dates will: the latest push to 1.7.4 was in 2016. At that time AWS Fargate had not been officially released, so it isn’t supported by that SDK, and Spark running on this legacy library will not look for credentials inside Fargate’s IAM role.

Thankfully, Spark 3.0, the new package based on Hadoop 3.2, is available. This one bundles AWS SDK version 1.11.375, which supports Fargate IAM roles once the proper credentials provider is set.
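
A minimal sketch of that configuration, assuming the S3A connector is used for S3 access:

from pyspark.sql import SparkSession

# ContainerCredentialsProvider (from the AWS SDK) reads the task IAM role
# that Fargate exposes via AWS_CONTAINER_CREDENTIALS_RELATIVE_URI.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)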

Unfortunately, AWS Fargate requires some initial setup before running its first job.

Setting up an ECS cluster and a CloudWatch log group:

aws ecs create-cluster --cluster-name MyCluster
aws logs create-log-group --log-group-name samplejob

And creating a task definition:

aws ecs register-task-definition --family samplejob \
  --cpu 512 --memory 1024 \
  --network-mode awsvpc --requires-compatibilities FARGATE \
  --task-role-arn arn:aws:iam::xxxx:role/task_role \
  --execution-role-arn arn:aws:iam::xxxx:role/task_execution_role \
  --container-definitions '[{
    "image": "xxxx.dkr.ecr.eu-west-1.amazonaws.com/samplejob:latest",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "samplejob",
        "awslogs-region": "eu-west-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "name": "samplejob"
  }]'

After which you can run the job from Airflow using the ECSOperator (sketched after the CLI example below) or with the CLI:

aws ecs run-task --cluster MyCluster --launch-type FARGATE --task-definition samplejob:1 \
--network-configuration '{"awsvpcConfiguration": {"subnets": ["xxxx"], "securityGroups": ["xxxx"], "assignPublicIp": "DISABLED" }}'
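
For reference, a rough equivalent from Airflow, assuming the Airflow 1.10-era contrib ECSOperator and a dag object defined elsewhere (subnet and security group IDs are placeholders):

from airflow.contrib.operators.ecs_operator import ECSOperator

# Hypothetical task in an existing DAG that launches the task
# definition registered above on Fargate.
run_samplejob = ECSOperator(
    task_id="run_samplejob",
    cluster="MyCluster",
    task_definition="samplejob:1",
    launch_type="FARGATE",
    overrides={},  # no container overrides needed here
    network_configuration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxxx"],     # placeholder
            "securityGroups": ["sg-xxxx"],  # placeholder
            "assignPublicIp": "DISABLED",
        }
    },
    dag=dag,
)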

Summary

As you can see, there’s no “right” way to run Spark workloads on AWS infrastructure. At Acast we use all three ways described above, depending on what best fits the scenario we develop.

Our main, long-running pipelines usually run on EMR (mostly for historical reasons), and smaller jobs that we run during the day are powered by AWS Glue. Containers on AWS Fargate power our streaming-like mini-batching jobs that run 24/7 in a cost-efficient way.

If you’re just starting your journey with Spark on AWS, I recommend looking at AWS Glue first. It’s the easiest method to use and has minimal management overhead.
