Running a serverless batch workload on GCP with Cloud Scheduler — Adding Docker and Container Registry to the mix
This quick-start guide is part of a series that shows how to leverage Google Cloud Platform components to run batch workloads in a simpler way.
If you need a bit of context before getting started, please take a look at the first part of the series, where I describe the architecture used to get the batch workload running.
To begin, let me introduce the solution we are going to use to run the batch workload this time, built with the following GCP components:
If you read the first part you will notice that we have two new players in the mix: Container Registry and Docker.
The stars of this article will be Container Registry and Docker, which will enable us to run a much more complex batch job. We will talk more about it later on…
There are many articles talking about working with Container Registry… Why should you read this one?
Instead of showing command lines and comparing approaches, I want to show you a working example of how to build a Continuous Integration flow around your batch workload and automate everything!
The themes covered in this post are:
- The Batch workload
- Connect to Source Repository
- Setting up the batch workload entrypoint
- Batch workload execution explained
Without further ado, let’s go!
You will be provided with a GitHub repository containing the working example.
1 — The Batch workload
As the candidate for the more complex batch workload, I’ve chosen a combination of behave, a Python library for Behavior-Driven Development (BDD), and Alpha Vantage, a set of free APIs for realtime and historical data on stocks, forex (FX), and digital/cryptocurrencies.
The GitHub repository with the code is: alpha_vantage_bdd
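To give a flavor of what the BDD code works with, here is a minimal sketch of a helper that a behave step implementation might call to check a quote returned by Alpha Vantage. The function name and step wording are illustrative assumptions, not the repository’s actual code; the JSON keys, however, follow Alpha Vantage’s TIME_SERIES_DAILY response format.

```python
# Hypothetical helper for a behave step file (e.g. steps/stock_steps.py).

def latest_close(payload):
    """Return the most recent closing price from a TIME_SERIES_DAILY payload."""
    series = payload["Time Series (Daily)"]
    latest_day = max(series)  # ISO dates sort lexicographically, so max() is newest
    return float(series[latest_day]["4. close"])

# Inside a step implementation you would assert on it, roughly:
#   @then('the latest close price should be positive')
#   def step_impl(context):
#       assert latest_close(context.response) > 0
```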
So, let’s take a look at the scenarios that will be executed:
We run them with the following command at the root of the repository:
behave features/ --tags=-wip
We should see the following output:
That’s really cool, but it’s a local execution. How do we send this code to our Google Cloud project and run the batch workload there?
2 — Connect to Source Repository
We are going to use a Cloud Source Repository to extend our git workflow to GCP.
“Cloud Source Repositories are fully featured, private Git repositories hosted on Google Cloud Platform. Extend your Git workflow by connecting to other GCP tools, including Cloud Build, App Engine, Stackdriver, and Cloud Pub/Sub.”
Go to this page to start your Source Repository configuration. Once you have selected your repository name, which is alpha_vantage_bdd in this case, select Push code from a local Git repository. There will be 3 options for pushing your code; I chose Google Cloud SDK.
Follow the instructions presented:
What we did was push our GitHub repository, alpha_vantage_bdd, to a Cloud Source Repository, which lives inside our Google Cloud project. The Cloud Source Repository works as a remote repo for our origin repo.
You typed commands similar to the following:
git remote add google https://source.developers.google.com/p/my_project/r/alpha_vantage_bdd
git push google master
After pushing to the source repository you will be able to see this:
You can also mirror your GitHub repository directly instead of using it as a remote, following this guide, but looking at the open issues 73122477 and 133100479, I still prefer the remote repository approach.
3 — Setting up the batch workload entrypoint
Now that we have the code inside GCP, we are going to build a Docker image for it. This is the Dockerfile:
The important file is the script that will be executed as the Docker entrypoint:
Every time the Docker container is run (which, in our case, happens when the Compute Engine VM is created), this entrypoint will execute the behave command and send the output to Stackdriver Logging. After it’s done, the VM will be deleted.
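The actual entrypoint script is embedded in the post; as a rough illustration of the same flow, here is a hypothetical Python version. The project, zone, and instance name are placeholders, and the real repository may implement this differently.

```python
# Sketch of an entrypoint: run the tests, then delete the VM we run on.
import subprocess

def delete_vm_command(project, zone, instance):
    # gcloud call the entrypoint can use to delete its own VM;
    # --quiet skips the interactive confirmation prompt.
    return ["gcloud", "--quiet", "compute", "instances", "delete",
            instance, "--project", project, "--zone", zone]

def main():
    # On Container-Optimized OS the container's stdout/stderr is shipped to
    # Stackdriver Logging, so behave's console output becomes our log output.
    subprocess.run(["behave", "features/", "--tags=-wip"], check=False)
    subprocess.run(delete_vm_command("my_project", "us-east1-b",
                                     "batch-job-executor"), check=False)
```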
Interesting and all… but what about all that “automate, automate everything” talk? How do we automatically build this Docker image?
We are going to use Cloud Build for that, and that’s where Container Registry will join the game!
“Cloud Build lets you build software quickly across all languages. Get complete control over defining custom workflows for building, testing, and deploying across multiple environments such as VMs, serverless, Kubernetes, or Firebase.”
Let’s see the build flow with Cloud Build:
1 — Code is pushed to the Cloud Source Repository. This happens whenever git push google master is executed on the origin repo that we configured previously.
2 — Cloud Build is triggered by the commit on the master branch.
3 — Cloud Build builds the Docker image and stores it inside Cloud Storage.
4 — The image is marked in Container Registry as the latest version.
To achieve that, go to this page to start the Cloud Build configuration:
Choose the Source Repository we created. Then it’s really simple: we leave every field at its default value except the image name. We use the Dockerfile as the build configuration and mark the image with the :latest label, so we always get an image that is up to date with the code.
Don’t worry, Cloud Build will store the previous images for you, in case you need a quick rollback.
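Under the hood, the trigger we configured amounts to a build request along these lines, sketched here as the body Cloud Build accepts (my_project is a placeholder; the docker builder step is the standard one Cloud Build provides):

```python
# Sketch of the build request the trigger generates for us.

def build_request(project, repo="alpha_vantage_bdd"):
    image = "gcr.io/{}/{}:latest".format(project, repo)
    return {
        "steps": [{
            "name": "gcr.io/cloud-builders/docker",  # Cloud Build's Docker builder
            "args": ["build", "-t", image, "."],     # build from our Dockerfile
        }],
        "images": [image],  # push the tagged image to Container Registry
    }
```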
Once we press Create trigger, Cloud Build will be connected to our Source Repository:
We can test the trigger by pressing Run trigger, and once it’s done, our Docker image will show up on Container Registry:
As you can see, it’s tagged with latest; this is what guarantees we always have a fresh Docker image on our Compute Engine VM!
4 — Batch workload execution explained
Now that we have the Docker image inside Container Registry, it’s a piece of cake. Remember the Cloud Function we dove into in the first post? We are going to use it again! We will just change the Compute Engine configuration, but let me show you the execution flow first:
1 — Cloud Function is triggered by Pub/Sub and calls the Compute Engine API to create a VM
2 — Compute Engine retrieves the latest image from Container Registry and starts the VM
3 — The VM entrypoint starts the batch process that runs the automated tests for the Alpha Vantage APIs
Once the automated tests are done, the output is sent to Stackdriver Logging.
To update the Cloud Function from the last post, let’s change the Compute Engine configuration: go to this page.
Select the “Deploy a container image to this VM instance” checkbox, and if you are curious you can click on the “Learn more” link.
For the Container image, we are going to use the one created in the previous steps, which is stored in Container Registry. This is important: remember to use the :latest tag, so we always retrieve the up-to-date image.
At the bottom of the UI, click on the Equivalent REST link, so we can get the configuration that will be used in our Cloud Function.
Once we have it, go to the Cloud Functions UI and update the vmConfig variable, replacing the startup script with the new gce-container-declaration configuration. This is how our code is going to look:
After that, hit the deploy button and we are ready!
If we go back now to our Cloud Scheduler job and trigger it manually, we can see everything working together.
As a reminder, doing that will publish to our Pub/Sub topic, which starts our execution flow.
Press the Run Now button:
Go to the Compute Engine page after a few seconds and you will see a new VM running with the prefix batch-job-executor followed by the execution time. It’s a little trick so we always have a unique name, in case we need to track problems later.
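The naming trick can be sketched as follows (the exact timestamp format is an assumption):

```python
from datetime import datetime

def unique_vm_name(prefix="batch-job-executor"):
    # Appending the execution time keeps instance names unique between runs,
    # which makes tracking a specific execution in the logs easier later.
    return "{}-{}".format(prefix, datetime.utcnow().strftime("%Y%m%d%H%M%S"))
```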
After a few more seconds you will see that the icon before the VM name has changed; that’s because the VM is being deleted. Once the deletion is done, the VM will be gone from the instances page.
Finally, to make sure it actually did something, we go to the Stackdriver Logging page, and when we filter for the VM name we can see the results for the VM running the Container Registry image! 👌🏻
One last thing! To show our Continuous Integration working: whenever we do a git push google master, Cloud Build will run and create a new Container Registry image for us, tagging it with latest. On the image below, you can see that only the most recent image is tagged with latest, which means that the next time Cloud Scheduler runs, it will pick up the new version!
And that’s it for today!
This is the second post of a series showing how to run batch workloads in a simpler way, using Google Cloud Platform.
In this post, we showed a more complex batch workload to help you get started, and to make it easy to update the batch workload we used a combination of Google Source Repositories, Cloud Build, Container Registry, and Docker.
Thank you for your time! And stay tuned for the next post, where we will connect with Pub/Sub once again, to decouple our batch workload results and show you how to send notifications to Google Chat! Cheers!
References
- Github repository: https://github.com/mesmacosta/alpha_vantage_bdd
- Google Source Repositories: https://cloud.google.com/source-repositories/
- Google Container Registry: https://cloud.google.com/container-registry/
- Google Cloud Build: https://cloud.google.com/cloud-build/
- Alpha Vantage Python: https://github.com/RomelTorres/alpha_vantage