Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and Terraform

Antonio Cachuan
Google Cloud - Community
8 min read · Jan 7, 2020


Imagine you want to start building data pipelines in Spark or implement a model with Spark ML. The first step, before anything else, is to deploy a Spark cluster. To make that easy, you could set up a Dataproc cluster in minutes: it’s a fully managed cloud service that includes Spark, Hadoop, and Hive. Now imagine doing it many times, reproducing it in other projects, or your organization wanting to make your Dataproc configuration a standard.

This is where a new approach comes in: Infrastructure as Code. IaC is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools [Wikipedia].

Cloud Build, Terraform, and Cloud Build’s recent GitHub integration make it possible to deploy or update GCP components on demand, as easily as pushing to a Git repository.

Architecture and Scenario

Components used

Our objective is to build a simple, repeatable pipeline for deploying a Dataproc cluster.
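To give a sense of what the definition files look like, here is a minimal sketch of a Terraform configuration for such a cluster. The project ID, region, cluster name, and machine types are illustrative placeholders, not values prescribed by this tutorial.

provider "google" {
  project = "your-project-id"  # placeholder: replace with your GCP project ID
  region  = "us-central1"      # placeholder region
}

# A small Spark cluster: one master node and two workers.
resource "google_dataproc_cluster" "spark_cluster" {
  name   = "spark-cluster"     # placeholder cluster name
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "n1-standard-2"
    }
    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-2"
    }
  }
}

Once a file like this is committed, the same cluster can be reproduced in any project by changing only the provider values.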

The workflow starts when we push our code to a remote repository already linked to Cloud Build, which then automatically executes all the steps defined in cloudbuild.yaml, as sketched below.
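A cloudbuild.yaml for this workflow could look roughly like the following sketch, with one build step per Terraform command. The step IDs and the pinned hashicorp/terraform image version are assumptions for illustration.

steps:
# Each step runs the public hashicorp/terraform image, whose entrypoint is the terraform CLI.
- id: 'tf init'
  name: 'hashicorp/terraform:0.12.19'  # assumed version; pin whichever you actually use
  args: ['init']
- id: 'tf plan'
  name: 'hashicorp/terraform:0.12.19'
  args: ['plan']
- id: 'tf apply'
  name: 'hashicorp/terraform:0.12.19'
  args: ['apply', '-auto-approve']

Running apply with -auto-approve inside CI is a deliberate trade-off: the git push itself becomes the approval, so it makes sense to protect the branch that triggers the build.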

