Using Terraform to Quick Start PySpark on AWS

📖 Step-by-step guide on how to use Terraform to spin up an EMR cluster with PySpark and Anaconda on AWS.

Dat Tran
idealo Tech Blog
6 min read · Jan 7, 2019


A long time ago I wrote a post about how to use boto3 to jump-start PySpark with Anaconda on AWS, and I also wrote a fair amount of code for it. Fast forward to 2019: that code is old and outdated. Technology has also evolved, and these days people use Terraform to manage their infrastructure, which we also do here at idealo.de. In this post, I will discuss why we use it and describe, step by step, how you can use Terraform to quick-start an EMR cluster on AWS.

Motivation

In our team we frequently use Amazon’s EMR, either to process large amounts of data that don’t fit into memory or to speed up our prediction pipelines. For example, let’s say we’ve trained a deep learning model on GPUs using TensorFlow and afterwards want to score it on millions of records. Here it makes sense to use Apache Spark to handle the distributed computing part, in particular since there is a nice wrapper around TensorFlow for DataFrames on Apache Spark called tensorframes. One major challenge was, and still is, the setup. It’s really time-consuming, especially since we don’t run a 24/7 Spark cluster, which means setting it up from scratch can easily cost us 20–25 minutes each time. So it’s better to have a “one-click” solution that automates this process.

The Solution

In the past I already faced this problem and used boto3 to create this kind of one-click solution. That project, however, is outdated and hard to maintain. For example, I had to do my own logging, and defining the dependencies between the individual steps was not trivial; you had to spend a lot of time coding up that logic. Moreover, the boto3 solution lacked features like creating security groups and IAM roles for each run and automatically deleting them when stopping the cluster. You could code this up, but it would still be time-consuming. So we looked for another solution that would be easy to maintain on the one hand and, on the other, offer all the features we need in production: version control, an easy syntax, re-usable code, state management, logging and so on. Luckily, we already had good experiences with Terraform, an open-source infrastructure-as-code tool created by HashiCorp. You can use it to build, change, and version infrastructure safely and efficiently. For example, we’ve already used it to provision our Redshift cluster on AWS, as well as other low-level applications running on EC2.

Terraform is written in Go and is cloud-agnostic: it supports AWS, IBM Cloud, Azure, Google Cloud Platform and many others. The syntax is also pretty easy; it uses the HashiCorp Configuration Language (HCL), with JSON as an alternative format. For further information, you can also check its documentation.

The Example

The full code can be found on our GitHub repository.

1. Install Terraform. The easiest way on a Mac is to use brew:
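A minimal sketch of the install, assuming Homebrew is available (newer Homebrew setups may require HashiCorp’s own tap instead of the core formula):

    # install Terraform via Homebrew
    brew install terraform

    # on newer Homebrew setups the HashiCorp tap may be required instead:
    # brew tap hashicorp/tap && brew install hashicorp/tap/terraform

    # verify the installation
    terraform --version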

2. Clone the repository and adjust the bootstrap_actions.sh and pyspark_quick_setup.sh scripts if necessary. The bootstrap_actions.sh script defines the bootstrap actions, i.e. everything that needs to run on all nodes (master and slaves) before the software configuration is invoked. This is pretty handy if you need to distribute Anaconda to all the nodes and install important packages in one go. The pyspark_quick_setup.sh script adds another step to the job flow, which runs after the bootstrap process has finished and only affects the master node. In our case, this step script sets up the necessary environment variables automatically so that we can use PySpark with Anaconda.
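To give an idea of what such a step script does, here is a hypothetical excerpt of the kind of environment variables it might export on the master node so that PySpark picks up the Anaconda interpreter; the exact path and script contents are assumptions, not the repository’s actual code:

    # hypothetical excerpt: point Spark at the Anaconda Python distributed by the bootstrap action
    # (the install path depends on where the bootstrap script put Anaconda)
    export PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python
    export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda/bin/python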

3. Set all the necessary parameters for the EMR cluster in the terraform.tfvars file, e.g. the number of slave instances, the instance types for master/slaves, the Spark version, subnet ID, VPC ID, key pair name, region and so on (a rough sketch of such a file is shown after the two remarks below).

  • It is recommended to change ingress_cidr_blocks to your own IP instead of using 0.0.0.0/0; this restricts inbound traffic from the outside and is better for security reasons.
  • You also need to set up your own VPC and subnet on AWS, which is recommended anyway. You can think of a VPC as your own private data center within the AWS infrastructure; it offers a lot of advantages, e.g. better security management, internal load balancers, firewalls and many more.
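As a sketch, a terraform.tfvars file could look roughly like this; apart from ingress_cidr_blocks, the variable names and values below are illustrative guesses, so check the variables defined in the repository for the real ones:

    # hypothetical terraform.tfvars -- variable names other than ingress_cidr_blocks
    # are guesses; all values are placeholders
    region              = "eu-central-1"
    key_name            = "my-key-pair"
    vpc_id              = "vpc-xxxxxxxx"
    subnet_id           = "subnet-xxxxxxxx"
    instance_type       = "m4.xlarge"
    core_instance_count = 2
    release_label       = "emr-5.20.0"
    ingress_cidr_blocks = "203.0.113.10/32"  # your own IP instead of 0.0.0.0/0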

4. Start the cluster with:
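Assuming the standard Terraform workflow (the repository may wrap this in a helper script), the commands look roughly like this:

    # initialize the working directory and download the AWS provider
    terraform init

    # show the execution plan and, after confirmation, provision the EMR cluster
    terraform apply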

You should see the name and ID of the cluster as well as the IP of the master node if everything works as expected.

5. Now you can SSH into the master node and get started with PySpark immediately. After you’re done with your Spark job, you can easily destroy the cluster with terraform destroy. This terminates the cluster and also removes all the resources associated with it, like the security groups, IAM roles and the scripts on S3. Cool, right?
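A hedged sketch of that workflow, assuming the default hadoop user on the EMR master node and the key pair configured above:

    # connect to the master node (use the key pair and IP that Terraform printed)
    ssh -i ~/.ssh/my-key-pair.pem hadoop@<master-public-ip>

    # on the master node: start an interactive PySpark shell backed by Anaconda
    pyspark

    # back on your machine: tear down the cluster and its associated resources
    terraform destroy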

Notes

  • Everything in Terraform is a graph: it builds a dependency graph from the Terraform configurations and then walks this graph to generate plans, refresh state and more. For further information on how Terraform uses graphs, have a look at the YouTube talk “Applying Graph Theory to Infrastructure as Code”.
  • Since the execution plan can be represented as a graph, it is useful to visualize it as well. Terraform provides a built-in way to do this via its CLI (Graphviz is needed), but the output is not really nice. A better alternative is blast-radius, an open-source project that interactively visualizes Terraform’s dependency graphs using d3.js (see the commands sketched after this list).
  • You can use terraform plan to create an execution plan, which is very useful for checking all the variables and parameters again before you actually provision the cluster.
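The corresponding commands, sketched under the assumption that Graphviz and blast-radius are installed locally (check the blast-radius README for its exact usage):

    # render Terraform's built-in dependency graph with Graphviz
    terraform graph | dot -Tsvg > graph.svg

    # interactive d3.js visualization of the same graph via blast-radius
    blast-radius --serve .

    # dry run: show what would be created without touching AWS
    terraform plan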

Summary

Hopefully, you found this useful and will use it in your next Spark project. My team and I will continue adding features that fit our purposes, but everyone is welcome to contribute to the project as well.

If you found this article useful, give me a high five 👏🏻 so others can find it too, and share it with your friends. Follow me here on Medium (Dat Tran) or on Twitter (@datitran) to stay up-to-date with my work. Thanks for reading!
