Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and Terraform
Imagine you want to start building data pipelines in Spark or implement a model with Spark ML. The first step, before anything else, is to deploy a Spark cluster. To make this easy, you could set up a Dataproc cluster in minutes; it’s a fully managed cloud service that includes Spark, Hadoop, and Hive. Now imagine doing this many times, reproducing it in other projects, or your organization wanting to make your Dataproc configuration a standard.
This is where a new approach comes in: Infrastructure as Code. IaC is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools [Wikipedia].
Cloud Build, Terraform, and their recent integration with GitHub make it possible to deploy or update GCP components on demand, as easily as making a git push to a repository.
Architecture and Scenario
Our objective is to make a simple repeatable pipeline to deploy a Dataproc Cluster.
The workflow starts when we push our code to a remote repository already linked with Cloud Build, which automatically executes all the steps in the cloudbuild.yaml file; these steps mainly deploy a Dataproc Cluster in our GCP project. Our job for this scenario will be to:
- Fork the boilerplate repository
- Connect Cloud Build with your GitHub repo
- Write the Cloud Build code needed to deploy a Terraform container (cloudbuild.yaml)
- Write the Terraform code needed to build a Dataproc Cluster (main.tf)
This means that, with simple configuration and only two files, we can have a pipeline to continuously deploy and destroy Dataproc clusters. Before starting, let’s first review some definitions of the components we’ll be using.
Cloud Build executes your build as a series of build steps, where each build step is run in a Docker container [GCP Documentation]. For this scenario, we’ll use Cloud Build to run a Docker container with Terraform.
Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions [Terraform Documentation]. Here, we’ll use Terraform to build a custom Dataproc cluster on the Google Cloud Platform.
Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way [Dataproc page]. Since our purpose only goes up to deployment, we’ll just test basic functionality to check that everything works correctly.
GitHub needs no introduction, but it is more than a Git repository hosting service: we’ll also use a Marketplace app to automatically build containers on commits to our GitHub repository.
Before and during execution
Please read carefully to avoid the most common errors at execution time.
- Check that the Cloud Build API is enabled by visiting the link
- Check that the Dataproc API is enabled
- Grant the necessary permissions to the Cloud Build service account; for this article we grant the Owner role, but remember that in a real case you should give only the required privileges
- Enable the Cloud Resource Manager API
- Validate that you have enough resources available in your region to avoid quota problems; to update your machine type, edit the file terraform.tfvars
- Have a staging bucket (GCS) already created and defined in terraform.tfvars
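As a hedged illustration, a terraform.tfvars along these lines is where you would point the machine types and the staging bucket; the variable names here are assumptions based on this article’s setup, so adjust them to match variables.tf in your fork:

```hcl
# Hypothetical terraform.tfvars — variable names are illustrative,
# adjust them to match variables.tf in the repository.
project             = "my-gcp-project"        # your GCP project ID
region              = "us-central1"           # pick a region where you have quota
zone                = "us-central1-a"
staging_bucket      = "my-dataproc-staging"   # must already exist in GCS
master_machine_type = "n1-standard-2"         # downsize if you hit quota limits
worker_machine_type = "n1-standard-2"
```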
Fork the boilerplate repository
To make it simpler, I published a repository with the code needed for the next steps; please fork it. I’ll explain all the files in the following paragraphs.
The project structure looks like the image, and each file or folder has a different purpose.
cloudbuild.yaml=file used by Cloud Build to run Terraform dockers.
clouddestroy.yaml=file used by Cloud Build to run a Terraform docker that could destroy the Dataproc cluster (use carefully, and read about State Locking).
environments/master=principal branch that we'll be using
environments/master/backend.tf=file that indicates where the metadata related to Terraform (the state) will be saved.
environments/master/main.tf=principal file that implements all the components of the Dataproc cluster
environments/master/outputs.tf=references some output variables generated during Dataproc creation time.
environments/master/terraform.tfvars=file for defining manual input variable values like project name, region, machine types, etc. that will be used in main.tf. Normally it's updated more frequently than variables.tf
environments/master/variables.tf=space for defining all variables and their default values.
environments/master/versions.tf=version required for Terraform
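For example, backend.tf typically points Terraform at a GCS bucket for remote state; a minimal sketch (the bucket name and prefix are placeholders, not the repository’s exact values) could look like this:

```hcl
# Hypothetical backend.tf — keep Terraform state in a GCS bucket
# so Cloud Build runs share one state and benefit from state locking.
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"  # placeholder, must exist beforehand
    prefix = "env/master"                 # folder for this environment's state
  }
}
```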
Connect Cloud Build with a GitHub repo
First, let’s create a simple GitHub repository. In the following image, since it is not our first time making this connection, we could grant the Marketplace app access to the repository directly, but we’ll proceed as if it were the first time.
Make the first push containing all the code from the boilerplate.
git add .
git commit -m "[ADD] Committing boilerplate"
git remote add origin https://github.com/YOUR_GITHUB_ACCOUNT/YOUR_GITHUB_REPO.git
git push -u origin master
Now let’s make the connection by accessing the Cloud Build App page; note that this app offers a free tier and charges per build minute above it.
You need to grant the app the required permissions on your GitHub account and select the repository you will be using; then select and authorize a GCP project to connect to the repository; and finally create a push trigger that will run a build every time a push is made to any branch.
Once this is complete, Cloud Build is ready to build containers each time you push changes to any branch in your GitHub repository.
Writing the code on Cloud Build needed to deploy a Terraform container (cloudbuild.yaml)
Now the project is linked with GitHub, and each push will execute all the steps described in cloudbuild.yaml. This file is used by Cloud Build; to make each step work, you need to indicate the Docker image name, and by default Cloud Build will look in the Docker Hub registry, or among the builders provided by Google and the community, and deploy the image, which is available only during the execution of that step. Additionally, you can submit arguments, which is what allows us to pass Terraform commands.
If you observe the file, it contains 4 steps. In all cases, the code runs assuming it is located at the root of the repository, which is why you will see cd environments/$BRANCH_NAME in steps 2 to 4. In this case, Cloud Build only runs Terraform commands; to keep the code organized, I divided it into 4 steps.
Step 1: uses the alpine image to simply print the branch we are using.
Step 4: deploys a Terraform docker to launch the terraform apply command. It is used to apply the changes required to reach the desired state of the configuration, or the pre-determined set of actions generated by a terraform plan execution plan [Terraform docs].
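Putting the steps together, a cloudbuild.yaml along these lines would match the workflow described. This is a sketch under the assumption that the intermediate steps run terraform init and terraform plan; the step ids and image tags are illustrative, so check the boilerplate repository for the exact file:

```yaml
steps:
# Step 1: use alpine to print the branch being built
- id: 'branch-name'
  name: 'alpine'
  entrypoint: 'sh'
  args: ['-c', 'echo "Building branch: $BRANCH_NAME"']

# Step 2 (assumed): initialize Terraform and the remote backend
- id: 'tf-init'
  name: 'hashicorp/terraform'
  entrypoint: 'sh'
  args: ['-c', 'cd environments/$BRANCH_NAME && terraform init']

# Step 3 (assumed): compute the execution plan
- id: 'tf-plan'
  name: 'hashicorp/terraform'
  entrypoint: 'sh'
  args: ['-c', 'cd environments/$BRANCH_NAME && terraform plan']

# Step 4: apply the plan and create/update the Dataproc cluster
- id: 'tf-apply'
  name: 'hashicorp/terraform'
  entrypoint: 'sh'
  args: ['-c', 'cd environments/$BRANCH_NAME && terraform apply -auto-approve']
```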
Writing the Terraform code needed to build a Dataproc Cluster (main.tf)
We have the repository connected to Cloud Build and the steps defined for deploying the Terraform Dockers. Now it’s time to write the code required to deploy a Dataproc cluster. The code needs to follow the Terraform syntax, and if you look at the code below, most of it is self-explanatory:
- The first part is the provider, in this case GCP, and the variables for the project, region, and zone.
- The second part enables the APIs needed to create the components.
- The networking module defines a custom network for the cluster; this part can be omitted if you prefer to use the default network.
- Then comes the Dataproc definition: we deploy a cluster with 1 master and 2 workers, and here we mainly use the variables from terraform.tfvars or variables.tf to set the amount of RAM, CPU, disk, and other necessary components.
- Finally, the firewall rules, to allow ingress connections for SSH and the Hive JDBC port.
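As a hedged sketch of the main.tf structure, the provider, cluster, and firewall parts could look like the following. Resource names, variable names, and the firewall details are illustrative assumptions, and the real file in the repository also enables APIs and defines a network module:

```hcl
# Hypothetical excerpt of main.tf — names and variables are illustrative.
provider "google" {
  project = var.project
  region  = var.region
  zone    = var.zone
}

# Dataproc cluster: 1 master, 2 workers, sized via tfvars variables
resource "google_dataproc_cluster" "spark_cluster" {
  name   = "spark-cluster"
  region = var.region

  cluster_config {
    staging_bucket = var.staging_bucket

    master_config {
      num_instances = 1
      machine_type  = var.master_machine_type
    }

    worker_config {
      num_instances = 2
      machine_type  = var.worker_machine_type
    }
  }
}

# Firewall rule: allow ingress for SSH (22) and Hive JDBC (10000)
resource "google_compute_firewall" "allow_ssh_hive" {
  name    = "allow-ssh-hive"
  network = "default"  # or the custom network created by the module

  allow {
    protocol = "tcp"
    ports    = ["22", "10000"]
  }

  source_ranges = ["0.0.0.0/0"]  # restrict this range in a real deployment
}
```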
With all the steps completed, we just need to make a simple git push to start building our Dataproc Cluster. To see the steps and check for any errors, we can monitor the execution in the Cloud Build interface.
And finally, after around 8 minutes, we have a working Dataproc cluster, so you can now deploy a custom cluster in any project.
Conclusion and Future Work
This article's intention was to show how simple it is to start building infrastructure with code on GCP, and not only for Dataproc: you can easily deploy almost any GCP component. If you want to go further, there are many resources from Google Cloud, like the Cloud Foundation Toolkit. Please feel free to drop a comment or send me a message on LinkedIn.