Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and Terraform

Antonio Cachuan
Jan 7

Imagine you want to start building data pipelines in Spark or implement a model with Spark ML. The first step, before anything else, is to deploy a Spark cluster. To make that easy, you could set up a Dataproc cluster in minutes: it's a fully managed cloud service that includes Spark, Hadoop, and Hive. Now imagine doing it many times, reproducing it in other projects, or having your organization turn your Dataproc configuration into a standard.

This is where a different approach comes in: Infrastructure as Code. IaC is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools [Wikipedia].

Cloud Build, Terraform, and their recent GitHub integration make it possible to deploy or update GCP components on demand, as easily as pushing to a repository.

Architecture and Scenario

Components used

Our objective is to build a simple, repeatable pipeline to deploy a Dataproc cluster.

The workflow starts when we push our code to a remote repository already linked to Cloud Build. The push automatically executes all the steps in the cloudbuild.yaml file, which ultimately deploys a Dataproc cluster in our GCP project. Our job for this scenario will be to:

  1. Fork the boilerplate repository
  2. Connect Cloud Build with your GitHub repo
  3. Write the Cloud Build configuration needed to deploy a Terraform container (cloudbuild.yaml)
  4. Write the Terraform code needed to build a Dataproc cluster (main.tf)
  5. Execute!

This means that with simple configuration and only two files we can have a pipeline that continuously deploys (and destroys) Dataproc clusters. Before starting, let's review some definitions of the components we'll be using.

Important Information

Cloud Build

Cloud Build executes your build as a series of build steps, where each build step is run in a Docker container [GCP Documentation]. For this scenario, we'll use Cloud Build to run Terraform inside a Docker container at each step.
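As a minimal, illustrative sketch of what a build step looks like (the hashicorp/terraform image is a public builder on Docker Hub; pinning a specific tag is advisable in practice):

    steps:
      # Each step runs in its own container; here the Terraform image
      # simply prints its version and the container is discarded afterwards.
      - name: 'hashicorp/terraform'
        args: ['version']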

Terraform

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions [Terraform Documentation]. Here, we are going to use Terraform to build a custom Dataproc cluster on Google Cloud Platform.
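To give a feel for Terraform's declarative syntax, here is a minimal, hypothetical configuration (the project ID and bucket name are placeholders) that manages a single Cloud Storage bucket:

    # Configure the Google Cloud provider (placeholder project and region).
    provider "google" {
      project = "my-gcp-project"
      region  = "us-central1"
    }

    # A trivial example resource: a Cloud Storage bucket managed by Terraform.
    resource "google_storage_bucket" "example" {
      name     = "my-example-bucket-1234"
      location = "US"
    }

Running terraform init, plan, and apply against a file like this creates the bucket; later in the article the same commands run inside Cloud Build instead of on your machine.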

Dataproc

Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way [Dataproc page]. Since our purpose here only goes as far as deployment, we will just test basic functionality to verify that the cluster works correctly.

GitHub

GitHub needs no introduction, but it is more than a Git repository hosting service: we'll also use a Marketplace app to automatically trigger builds on every commit to our GitHub repository.

Development

Before and during execution

Please read this carefully to avoid the most common errors at execution time.

  • Check that the Cloud Build API is enabled.
  • Check that the Dataproc API is enabled.
  • Grant the necessary permissions to the Cloud Build service account. For this article we grant the Owner role, but remember that Owner is not a recommended role; for a real project, grant only the privileges that are actually needed.
  • Enable the Cloud Resource Manager API; Cloud Build needs this API enabled.
  • Validate that you have enough resources available in your region so you avoid problems with the CPU quota; to reduce the CPU size requested, edit the machine type in terraform.tfvars.
  • Have a staging bucket (GCS) already created and defined in terraform.tfvars with your own bucket name, as in the sketch after this list.
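A minimal sketch of what terraform.tfvars could contain; the variable names here (project, region, zone, cluster_machine_type, staging_bucket) are illustrative and assume matching declarations in variables.tf, so check the boilerplate for the actual names:

    # Hypothetical values; adjust to your own project, region, and quota.
    project              = "my-gcp-project"
    region               = "us-central1"
    zone                 = "us-central1-a"
    cluster_machine_type = "n1-standard-2"             # smaller machines help stay under the CPU quota
    staging_bucket       = "my-dataproc-staging-bucket"

Requesting smaller machine types (or fewer workers) keeps the total vCPU count under the default regional quota.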

Fork the boilerplate repository

To make things simpler, I published a repository with the code needed for the next steps; please fork it. I'll explain all the files in the following paragraphs.

The project structure looks like the image below, and each file or folder has a different purpose.

Project structure

Connect Cloud Build with a GitHub repo

First, let's create a simple GitHub repository. In the following image, since this is not the first time making this connection, we could grant the Marketplace app access to the repository directly; nevertheless, we'll walk through it as if it were the first time.

Creating a simple repository

Make the first push containing all the code from the boilerplate.

Google Cloud Build App Page

Now let's make the connection from the Cloud Build app page. Keep in mind that the app offers a free tier and charges per build minute above it.

Pay-as-you-go model

You need to grant the app the required permissions on your GitHub account and select the repository you will be using. Then you select and authorize the GCP project that will be connected to the repository, and finally create a push trigger that runs a build every time a push is made to any branch.

1. Granting the app access to the Github repository
2. Selecting the repository
3. Creating a push trigger
4. Set up complete

Once this is complete, Cloud Build is ready to run a build each time you push changes to any branch of your GitHub repository.

Writing the Cloud Build configuration needed to deploy a Terraform container (cloudbuild.yaml)

Now that the project is linked with GitHub, each push will run all the steps described in the cloudbuild.yaml file. This file is read by Cloud Build; for each step you need to indicate a Docker image name, and by default Cloud Build looks for it in the Docker Hub registry or among the builders provided by Google and the community. The image is only available during the execution of its step. Additionally, you can pass arguments to each step, which is what allows us to run Terraform commands.

If you look at the file, it contains four steps. In every case, the code runs from the root of the repository, which is why you will see cd environments/$BRANCH_NAME in steps 2 to 4. In this pipeline Cloud Build only runs Terraform commands; to keep the code tidier I divided it into four steps (a sketch of the file follows the step descriptions below):

Step 1: uses the alpine image just to print the branch we are working on.

Step 2: runs a Terraform container to launch the init command, which initializes a working directory containing Terraform configuration files [Terraform docs].

Step 3: runs a Terraform container to launch the plan command, which creates an execution plan [Terraform docs].

Step 4: runs a Terraform container to launch the apply command, which applies the changes required to reach the desired state of the configuration, i.e. the set of actions generated by a terraform plan execution plan [Terraform docs].
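A sketch of what such a cloudbuild.yaml can look like, assuming the environments/$BRANCH_NAME layout described above; the step ids, image tags, and exact arguments are illustrative rather than copied from the boilerplate:

    steps:
      # Step 1: print the branch that triggered the build.
      - id: 'branch name'
        name: 'alpine'
        entrypoint: 'sh'
        args: ['-c', 'echo "Building branch: $BRANCH_NAME"']

      # Step 2: terraform init inside the environment folder for this branch.
      - id: 'tf init'
        name: 'hashicorp/terraform'
        entrypoint: 'sh'
        args: ['-c', 'cd environments/$BRANCH_NAME && terraform init']

      # Step 3: terraform plan to preview the changes.
      - id: 'tf plan'
        name: 'hashicorp/terraform'
        entrypoint: 'sh'
        args: ['-c', 'cd environments/$BRANCH_NAME && terraform plan']

      # Step 4: terraform apply to create or update the Dataproc cluster.
      - id: 'tf apply'
        name: 'hashicorp/terraform'
        entrypoint: 'sh'
        args: ['-c', 'cd environments/$BRANCH_NAME && terraform apply -auto-approve']

The -auto-approve flag skips Terraform's interactive confirmation, which is what allows the apply step to run unattended inside Cloud Build.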

Writing the Terraform code needed to build a Dataproc Cluster (main.tf)

We have the repository connected to Cloud Build and the steps defined for running the Terraform containers. Now it is time to write the code required for deploying a Dataproc cluster. The code needs to follow Terraform syntax, and if you look at the code below (a sketch of this file follows the list) most of it is self-explanatory:

  • The first part is the provider, in this case GCP, and the variables for the project, region, and zone.
  • The second part enables the APIs needed to create the components.
  • The networking module defines a custom network for the cluster; this part can be omitted if you prefer to use the default network.
  • Then comes the Dataproc definition: we deploy a cluster with 1 master and 2 workers, mostly using the variables from terraform.tfvars or variables.tf to set the amount of RAM, CPU, disk, and other necessary settings.
  • Finally, the firewall rules allow ingress connections for SSH and the Hive JDBC port.
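As a hedged sketch of the provider and Dataproc portions of such a main.tf (resource names, variable names, and machine sizes are illustrative and may differ from the boilerplate; the corresponding variable blocks are assumed to live in variables.tf):

    # Provider configuration; values come from terraform.tfvars via variables.tf.
    provider "google" {
      project = var.project
      region  = var.region
      zone    = var.zone
    }

    # A Dataproc cluster with 1 master and 2 workers.
    resource "google_dataproc_cluster" "spark_cluster" {
      name   = "spark-cluster"
      region = var.region

      cluster_config {
        staging_bucket = var.staging_bucket

        master_config {
          num_instances = 1
          machine_type  = var.cluster_machine_type
          disk_config {
            boot_disk_size_gb = 50
          }
        }

        worker_config {
          num_instances = 2
          machine_type  = var.cluster_machine_type
          disk_config {
            boot_disk_size_gb = 50
          }
        }
      }
    }

The real file also enables the required APIs, defines the custom network, and adds the firewall rules for SSH and the Hive JDBC port, following the same resource-block pattern.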

Execute!

With all the steps completed, we just need to make a simple git push to start building our Dataproc cluster. To follow the steps and check for errors, we can monitor the execution in the Cloud Build interface.

Cloud Build execution

And finally, after around 8 minutes, we have a working Dataproc cluster, so now you can deploy a custom cluster in any project.

Dataproc cluster deployed

Conclusion and Future Work

This article's intention was to show how simple it is to start building infrastructure as code on GCP, and it is not only for Dataproc: you can easily deploy almost any GCP component. If you want to go further, there are many resources from Google Cloud, such as the Cloud Foundation Toolkit. Please feel free to drop a comment or send me a message on LinkedIn.

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Antonio Cachuan

GCP Professional Data Engineer. When code meets data, success is assured 🧡. Happy to share code and ideas 💡 linkedin.com/in/antoniocachuan/
