Build Your Spark Image on GCP

Alexey Novakov
SE Notes by Alexey Novakov
4 min read · Jul 29, 2019
Cloud Build logo

You want to run Apache Spark on Kubernetes, but pulling random DockerHub images is not an option for your organisation due to security concerns.

Fortunately, the Apache Spark GitHub repository contains shell scripts to build a Docker image from sources. This is really great, because it makes it easy to build an image locally and push it to a private Docker registry to be used by a Kubernetes deployment. Let’s create a build job using GCP Cloud Build and a fork of the official Spark GitHub repository.
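For a rough idea of what those scripts do, here is a sketch of the same build performed locally; the registry host below is just a placeholder for your private registry:

# Compile Spark from sources (from the repo root), then build and push the image.
# "my-registry.example.com" is a placeholder — use your own private registry.
./build/sbt package
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.4.3 build
./bin/docker-image-tool.sh -r my-registry.example.com -t v2.4.3 push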

Clone Spark repo on GitHub

We need a Git repository with the Spark sources to build a Docker image from. GCP Cloud Build allows you to link an external repository on GitHub or Bitbucket. Let’s go to the official Spark repository and fork it:

Official Spark Git repo. Click the Fork button on the right-hand side.

Now we have our own repo to add a build file to:

This is my Spark fork Git repo
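If you also want to edit the fork locally, clone it first; the username below is a placeholder for your own GitHub account:

git clone https://github.com/<your-github-user>/spark.git
cd spark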

Push Cloud Build YAML file

GCP Cloud Build is driven by its own build configuration in the form of a YAML file. Let’s create our file and push it to the root folder of the forked Spark repo.

Content of cloudbuild.yaml:

steps:
- name: 'gcr.io/cloud-builders/git'
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    git fetch --tags --depth=1 && git checkout tags/$$SPARK_VER
  env:
  - 'SPARK_VER=${_SPARK_VER}'

- name: 'gcr.io/$PROJECT_ID/scala-sbt'
  entrypoint: bash
  args:
  - -c
  - |
    ./build/sbt package
  timeout: 20m0s

- name: gcr.io/cloud-builders/docker
  entrypoint: bash
  args:
  - '-c'
  - |
    ./bin/docker-image-tool.sh -r gcr.io/$PROJECT_ID -t $$SPARK_VER build
  env:
  - 'SPARK_VER=${_SPARK_VER}'

tags: ['cloud-build-spark']
images: ['gcr.io/$PROJECT_ID/spark']
timeout: 25m0s

There are 3 steps in it:

  1. Switch the current Git workdir to the Git tag of a particular Spark release. We do a git checkout, as you can see. _SPARK_VER is a parameter of the Cloud Build job (see the note on variable escaping after this list).
  2. Compile and build Spark from sources using SBT. We set a 20-minute timeout for this step, as compilation takes quite a lot of time and the default step timeout is shorter.
  3. Call the docker-image-tool.sh shell script, which calls docker build under the hood.
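A note on the two dollar-sign forms in the config: ${_SPARK_VER} is a user-defined substitution which Cloud Build resolves before the step starts, while $$ is an escaped dollar sign, so bash itself expands SPARK_VER from the step’s env at run time. A minimal illustration:

# In cloudbuild.yaml:
#   ${_SPARK_VER}  is replaced by Cloud Build (user-defined substitution)
#   $$SPARK_VER    reaches the shell as $SPARK_VER ($$ escapes the $)
# So with _SPARK_VER set to v2.4.3, the first step effectively runs:
git fetch --tags --depth=1 && git checkout tags/v2.4.3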

A few more important things

images: ['gcr.io/$PROJECT_ID/spark']

which asks the current Cloud Build job to push the newly built image to GCP Container Registry.

timeout: 25m0s

The entire build is limited to a duration of 25 minutes. The default timeout is lower, so we need to override it.

Make sure to push the above file “cloudbuild.yaml” into your Spark fork repo:
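Assuming a local clone of the fork, the usual Git commands do the job:

git add cloudbuild.yaml
git commit -m "Add Cloud Build configuration"
git push origin master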

cloudbuild.yaml in the Git repo

Create Cloud Build trigger

Go to your GCP web console and add a trigger:

Adding new Build trigger (new job)

After clicking “Add Trigger”, select GitHub as the source and choose your Spark fork repo from the list. On the next screen, use cloudbuild.yaml as the name of the config file and add the _SPARK_VER variable. See the screenshot below:

Trigger creation (set config name and our variable)

_SPARK_VER is set to Spark Git Tag v2.4.3

Run trigger for master branch

After you have created the trigger, run it for the master branch. You can go into the current run and observe the build logs:

Build is in progress
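You can also follow the same logs from the command line with gcloud; the build ID below is a placeholder taken from the list output:

# List recent builds, then stream the log of a particular one
gcloud builds list --limit=5
gcloud builds log <BUILD_ID> --stream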

Check Docker image is there

The Cloud Build job took 23 minutes in my case to build and push the image to Container Registry. Go to Container Registry to see that the image is there:

Image is available in my Container Registry
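The same check can be done from the command line; the project ID below is a placeholder:

gcloud container images list --repository=gcr.io/<your-project-id>
gcloud container images list-tags gcr.io/<your-project-id>/spark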

Conclusion

GCP Cloud Build works with popular Git hosting services and lets you build a Docker image for almost any programming language, making that image available in Container Registry. Of course, all this makes sense if you are already building stuff with GCP.

Although Cloud Build is still in Beta, it is already capable of doing things like fetching a repo and performing a list of build steps to produce project artefacts such as Docker images.

Using a specific Git tag is just my own requirement, because I wanted an officially released version. You can also build a Spark image with the same build configuration directly from the master branch; just skip the step that switches to the Git tag.
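As a closing sketch, this is roughly how the freshly built image can be used to run Spark on Kubernetes. The API server address and project ID are placeholders, and the SparkPi example class ships with Spark itself:

# Submit the classic SparkPi example to a Kubernetes cluster,
# using the image we just pushed to Container Registry.
spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=gcr.io/<your-project-id>/spark:v2.4.3 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar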
