Build Your Spark Image on GCP
You want to run Apache Spark on Kubernetes, but pulling random DockerHub images is not an option for your organisation due to security concerns.
Fortunately, the Apache Spark GitHub repository contains shell scripts to build a Docker image from sources. This is great, because it makes it easy to build an image locally and push it to a private Docker registry for use by a Kubernetes deployment. Let’s create a build job using GCP Cloud Build and a fork of the official Spark GitHub repository.
Fork the Spark repo on GitHub
We need a Git repository with the Spark sources to build a Docker image from. GCP Cloud Build allows you to link an external repository on GitHub or Bitbucket. Go to the official Spark repository and fork it:
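If you also want a local copy to work with, you can clone your fork; this is a sketch in which `<your-github-user>` is a placeholder for your own GitHub account:

```shell
# Clone your fork of Spark locally (<your-github-user> is a placeholder)
git clone https://github.com/<your-github-user>/spark.git
cd spark

# List the release tags to pick the version to build, e.g. v2.4.3
git tag --list 'v2.4.*'
```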
Now we have our own repo to add a build file to:
Push Cloud Build YAML file
GCP Cloud Build is driven by its own build configuration in the form of a YAML file. Let’s create the file and push it to the root folder of the forked Spark repo.
Content of cloudbuild.yaml:
steps:
- name: 'gcr.io/cloud-builders/git'
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    git fetch --tags --depth=1 && git checkout tags/$$SPARK_VER
  env:
  - 'SPARK_VER=${_SPARK_VER}'
- name: 'gcr.io/$PROJECT_ID/scala-sbt'
  entrypoint: bash
  args:
  - -c
  - |
    ./build/sbt package
  timeout: 20m0s
- name: gcr.io/cloud-builders/docker
  entrypoint: bash
  args:
  - '-c'
  - |
    ./bin/docker-image-tool.sh -r gcr.io/$PROJECT_ID -t $$SPARK_VER build
  env:
  - 'SPARK_VER=${_SPARK_VER}'
  tags: ['cloud-build-spark']
images: ['gcr.io/$PROJECT_ID/spark']
timeout: 25m0s
There are 3 steps in it:
- Switch the current Git workdir to the Git tag of a particular Spark release via git checkout. _SPARK_VER is a parameter of the Cloud Build job.
- Compile and build Spark from sources using SBT. The step timeout is set to 20 minutes, as compilation takes quite a lot of time and the default timeout is shorter, so we need to override it.
- Call the shell script docker-image-tool.sh, which calls docker build under the hood.
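For reference, the three steps above can be sketched locally, outside Cloud Build, assuming you have a JDK, the bundled SBT launcher, and Docker installed; gcr.io/my-project is a placeholder registry:

```shell
# Step 1: switch the workdir to the release tag
git checkout tags/v2.4.3

# Step 2: compile and build Spark from sources with SBT
./build/sbt package

# Step 3: build the Docker image (docker-image-tool.sh runs docker build under the hood)
./bin/docker-image-tool.sh -r gcr.io/my-project -t v2.4.3 build

# Optionally push the image to the registry
./bin/docker-image-tool.sh -r gcr.io/my-project -t v2.4.3 push
```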
A few more important things:
images: ['gcr.io/$PROJECT_ID/spark']
asks the current Cloud Build job to push the newly built image to GCP Container Registry.
timeout: 25m0s
limits the entire build to 25 minutes. The default timeout is lower, so we need to override it.
Make sure to push the above file “cloudbuild.yaml” into your Spark fork repo:
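Pushing the build file can look like this, assuming you work from a local clone of your fork and commit to its master branch:

```shell
# Add the Cloud Build configuration to the repo root and push it
git add cloudbuild.yaml
git commit -m "Add Cloud Build configuration for Spark Docker image"
git push origin master
```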
Create Cloud Build trigger
Go to your GCP web console and add a trigger:
After clicking “Add Trigger”, select GitHub and choose your Spark fork repo from the list. On the next screen, use cloudbuild.yaml as the name of the config file and add the _SPARK_VER variable. See the screenshot below:
_SPARK_VER is set to the Spark Git tag v2.4.3
Run the trigger for the master branch
After you have created the trigger, run it for the master branch. You can open the current run and observe the build logs:
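As an alternative to running the trigger from the web console, the same build can be submitted from a local clone with the gcloud CLI; this sketch assumes the Cloud SDK is installed and a default project is configured:

```shell
# Submit the build defined in cloudbuild.yaml from the repo root,
# passing the Spark version as a substitution variable
gcloud builds submit --config cloudbuild.yaml --substitutions=_SPARK_VER=v2.4.3 .
```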
Check the Docker image is there
The Cloud Build job took 23 minutes in my case to build and push the image to Container Registry. Go to Container Registry to see that the image is there:
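You can also verify the image from the command line; this assumes gcloud is set up and PROJECT_ID is replaced with your actual project:

```shell
# List image repositories in the project's Container Registry
gcloud container images list --repository=gcr.io/PROJECT_ID

# Show the tags of the spark image; v2.4.3 should appear
gcloud container images list-tags gcr.io/PROJECT_ID/spark
```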
Conclusion
GCP Cloud Build works with popular Git servers, lets you build a Docker image for almost any programming language, and makes that image available in Container Registry. Of course, all this makes sense if you are already building stuff on GCP.
Although Cloud Build is still in beta, it can already fetch a repo and perform a list of build steps to produce project artefacts such as Docker images.
Using a specific Git tag is just my requirement, because I wanted an officially released version. You can also build the Spark image directly from the master branch using the same build configuration; just skip the switch to the Git tag.