Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and Terraform

Antonio Cachuan
Google Cloud - Community
8 min read · Jan 7, 2020


Imagine you want to start building data pipelines in Spark or implement a model with Spark ML. The first step, before anything else, is to deploy a Spark cluster. To make that easy, you could set up a Dataproc cluster in minutes: it’s a fully managed cloud service that includes Spark, Hadoop, and Hive. Now imagine doing it many times, reproducing it in other projects, or your organization wanting to make your Dataproc configuration a standard.

This is where a new approach comes in: Infrastructure as Code. IaC is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools [Wikipedia].

Cloud Build, Terraform, and Cloud Build’s recent GitHub integration make it possible to deploy or update GCP components on demand, as easily as pushing to a Git repository.

Architecture and Scenario

Components used

Our objective is to build a simple, repeatable pipeline for deploying a Dataproc cluster.
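To give a sense of what the definition files look like, here is a minimal sketch of a Terraform configuration for such a cluster. The project ID, region, cluster name, and machine types are illustrative placeholders, not values prescribed by this tutorial.

provider "google" {
  project = "your-project-id"  # placeholder: replace with your GCP project ID
  region  = "us-central1"      # placeholder region
}

# A small Spark cluster: one master node and two workers.
resource "google_dataproc_cluster" "spark_cluster" {
  name   = "spark-cluster"     # placeholder cluster name
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "n1-standard-2"
    }
    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-2"
    }
  }
}

Once a file like this is committed, the same cluster can be reproduced in any project by changing only the provider values.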

The workflow starts when we push our code to a remote repository already linked to Cloud Build, which then automatically executes all the steps defined in cloudbuild.yaml, as sketched below.
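A cloudbuild.yaml for this workflow could look roughly like the following sketch, with one build step per Terraform command. The step IDs and the pinned hashicorp/terraform image version are assumptions for illustration.

steps:
# Each step runs the public hashicorp/terraform image, whose entrypoint is the terraform CLI.
- id: 'tf init'
  name: 'hashicorp/terraform:0.12.19'  # assumed version; pin whichever you actually use
  args: ['init']
- id: 'tf plan'
  name: 'hashicorp/terraform:0.12.19'
  args: ['plan']
- id: 'tf apply'
  name: 'hashicorp/terraform:0.12.19'
  args: ['apply', '-auto-approve']

Running apply with -auto-approve inside CI is a deliberate trade-off: the git push itself becomes the approval, so it makes sense to protect the branch that triggers the build.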

