How to Pass the Google Professional Cloud DevOps Engineer Exam
Today I took the Google Cloud DevOps exam and passed. This is my 5th Google certification (not including the various Google Cloud course completions and skill badges), and certainly one of the trickier ones.
Let me share with you my experience of this exam, including the topics I encountered, some general tips, and some guidance on how to improve your relevant knowledge.
A Recap of the Syllabus
You can see the official syllabus here. Here is my distilled version, which includes links to some Medium articles I’ve previously written on these topics:
- Designing the org hierarchy
- Infra-as-Code, including Cloud Foundation Toolkit, Config Connector, Terraform and Helm
- Managing multiple environments
- CI/CD tooling, including Cloud Build and Jenkins
- CI/CD and GitOps Google best practices, including git branching, triggers, PRs, approval flows, etc
- Artifact Registry, vulnerability management, and Binary Authorization
- Deployment strategies, including canary, blue/green, rolling updates, traffic splitting
- Secrets management
- SRE, including SLIs, SLOs, SLAs and error budgets, capacity planning, incident management, postmortems, and blameless culture
- Operations and GCO, including monitoring, logging, alerting, Ops agent, metrics explorer, dashboards, Prometheus, Cloud Profiler, Cloud Trace
Topics and Scenarios I Encountered in the Exam
SRE — SLOs and Error Budgets
This topic was expected and I was well-prepared for it. But here are some hints on where to focus your studies:
- Learn how to set SLOs, given a current level of performance. From an article I wrote previously: “Set the SLO threshold at a level which, if barely met, would keep the vast majority of customers happy. And it should be set at a level where the cost of providing this reliability does not exceed the value of the service.”
- Furthermore, if you don’t yet have SLOs defined for a measured service, SRE recommends generally setting the SLO to be slightly lower (i.e. less aggressive) than the current acceptable performance.
- Remember that breaching an SLO, or consuming error budget at a rate that will breach the SLO, has consequences! Think about the agreed consequences between the SRE team and the team developing features.
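To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes a simple availability SLO measured over a rolling 30-day window; the function names and the example numbers are my own, purely for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Speed of budget consumption; 1.0 means the budget lasts exactly the window."""
    return observed_error_ratio / (1.0 - slo)


# Hypothetical service: 99.9% availability SLO, currently failing 0.5% of requests.
budget = error_budget_minutes(0.999, window_days=30)
rate = burn_rate(observed_error_ratio=0.005, slo=0.999)
print(f"Error budget: {budget:.1f} minutes per 30 days")
print(f"Burn rate: {rate:.1f}x")
```

With these assumed numbers, the budget works out to about 43 minutes per 30 days, and a burn rate of roughly 5x means the budget would be gone in about six days if nothing changes. That is the kind of situation where the agreed consequences kick in.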
Logging
Of course, I expected to see GCO and Cloud Logging. But the exam focussed quite a lot on logging aggregation. I would recommend that you check out my article on how to design for Monitoring and Logging.
Be sure that you understand:
- How log sinks work.
- How log aggregation works.
- How you can route logs to different projects for logging centralisation.
- How you can reduce logging costs, e.g. by making use of exclusion filters, and by disabling sinks you don’t need. (Which includes understanding which sinks you can remove or disable.)
- The value of structured logs and how to write them.
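On the last point, here is a minimal sketch of writing a structured log from Python. It assumes the code runs somewhere that forwards stdout to Cloud Logging and parses JSON lines (for example Cloud Run or GKE), where `severity` and `message` are treated as special fields. The helper name `log_structured` and the `order_id` field are hypothetical.

```python
import json
import sys


def log_structured(severity: str, message: str, **fields) -> str:
    """Write one JSON log line to stdout and return it."""
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


# Extra keyword arguments become queryable fields in the log entry payload.
log_structured("ERROR", "payment failed", order_id="A123")
```

The benefit over a plain text line is that you can filter on individual fields, which also makes exclusion filters and sink filters much easier to write.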
Monitoring
Again, I fully expected this topic. I would recommend checking your knowledge of designing metrics scopes, and different patterns for doing so. Think about how to ensure you provide the right operational visibility to different groups of users. E.g. dev teams, infra teams, SRE teams, etc.
SRE — Incident Management
The exam questions definitely leaned towards the practical side of effective incident management.
Make sure you’re aware of the different recommended roles for incident management, including incident commander, communications lead, and operations lead.
Be sure you understand the best practices around:
- Establishing communication.
- The importance of early incident resolution, versus root cause analysis.
- Blameless postmortems.
- How to embed these best practices into an organisation.
Deployment Strategies
You definitely need to know your way around the various deployment strategies:
- Rolling updates
- Blue / green
- Canary releases
- A/B Testing
You need to know how they differ, when to use each, and also how you would implement each (where possible) with the usual Google Cloud compute services, i.e. Cloud Run, GKE, App Engine, and Compute Engine managed instance groups. In particular, be sure you understand how you can use Anthos Service Mesh to do this in the GKE environment.
Cloud Run — Traffic Splitting, Versioning, Tags
This is related to the topic above. But I must confess… My knowledge around this topic was a little weak. So I had a few head-scratching moments in the exam!
Make sure you understand:
- Cloud Run revisions, i.e. immutable versions of a service.
- How to deploy a revision that will not serve traffic (as always, learn both with the console and with `gcloud`).
- How to use tagged revisions, to test a revision that is not serving production traffic.
- How to split traffic between revisions.
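As a memory aid for the `gcloud` side of the points above, here is a small sketch that only builds the commands as argument lists and prints them; it does not run anything. The service name, image path, tag and revision names are hypothetical, and I have left out flags such as `--region` that you may need depending on your configuration. Insisting that the percentages add up to 100 is my own safety check, not a `gcloud` rule.

```python
import shlex


def deploy_no_traffic_cmd(service: str, image: str, tag: str) -> list[str]:
    """Deploy a new revision that receives no traffic but is reachable via a tag."""
    return ["gcloud", "run", "deploy", service,
            "--image", image, "--no-traffic", "--tag", tag]


def split_traffic_cmd(service: str, weights: dict[str, int]) -> list[str]:
    """Split traffic between named revisions, e.g. 90/10 for a canary."""
    if sum(weights.values()) != 100:
        raise ValueError("traffic percentages must add up to 100")
    spec = ",".join(f"{revision}={percent}" for revision, percent in weights.items())
    return ["gcloud", "run", "services", "update-traffic", service,
            "--to-revisions", spec]


# Hypothetical names, for illustration only.
print(shlex.join(deploy_no_traffic_cmd(
    "hello", "europe-west1-docker.pkg.dev/my-project/my-repo/hello:v2", "canary")))
print(shlex.join(split_traffic_cmd(
    "hello", {"hello-00001-abc": 90, "hello-00002-def": 10})))
```

The first command gives you a tagged revision you can test on its own URL without touching production traffic. The second one shifts a slice of real traffic to it.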
GitOps, Git Repo Organisation and Folder Structure
Fortunately, I’ve been writing on this topic lately. Even so, it can be quite difficult to know when to use separate repos, when to use separate folders in your repo, when to use different branches, etc.
You could do a lot worse than reading and comprehending Google’s best practices for using Terraform. (I’ll probably write an article on this topic very soon.)
Definitely make sure you understand:
- Separation of duties.
- Branching strategies, including pipeline layers and environments.
- Cloud Build
- Artifact Registry, including vulnerability scanning and Binary Auth
Config Connector
Yeah, I confess… I’m aware of it, and I knew that it “lets you manage Google Cloud resources through Kubernetes”. But that was the complete extent of my knowledge on this topic.
So my advice is… Learn a bit about this!
Ops Agent Troubleshooting
Yay! I had definitely prepared for this topic, so I was pleased to see it come up.
My suggestion is to get familiar with…
- This content.
- Ops Agent common installation issues and how to spot them.
- How to configure service accounts to run the agent.
- How to check the agent has permission to write metrics and logs to GCO.
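For the permissions point, here is a tiny sketch of the check I have in mind. It assumes the agent's service account needs the Logs Writer and Monitoring Metric Writer roles. The list of granted roles is hard-coded here for illustration; in practice you would read it from the project's IAM policy, and you would also check the VM's access scopes.

```python
# Roles assumed to be required for the Ops Agent to write logs and metrics.
REQUIRED_ROLES = {
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
}


def missing_ops_agent_roles(granted_roles) -> list[str]:
    """Return the required roles that the service account does not have."""
    return sorted(REQUIRED_ROLES - set(granted_roles))


# Hypothetical service account that can write logs but not metrics.
granted = ["roles/logging.logWriter", "roles/storage.objectViewer"]
print(missing_ops_agent_roles(granted))
```

If logs are arriving but metrics are not (or the other way round), a missing role like this is one of the first things worth ruling out.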
Performance Optimisation, Cloud Profiler and Cloud Trace
For the most part, you need to know what these tools are, when they are useful, and how to use them.
However, you might also want to make sure you know how to interpret the data presented, and what sort of actions you should take next. For example:
- What should you do if wall time is high, but CPU time is low? This would indicate that the application is probably I/O-bound, so you might want to look into that.
- What if your heap size is typically much larger than used heap? Well, then you’ve probably oversized your heap, right?
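If the wall time versus CPU time distinction feels abstract, this little sketch shows it using nothing but the Python standard library. It is not Cloud Profiler, just the same idea in miniature: `time.sleep` stands in for waiting on I/O, and the two workload functions are made up for illustration.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def io_like():
    # Waiting (as for disk or network) uses wall time but almost no CPU time.
    time.sleep(0.2)


def cpu_like():
    # A busy loop uses CPU for nearly all of its wall time.
    total = 0
    for i in range(2_000_000):
        total += i * i


for name, fn in (("io_like", io_like), ("cpu_like", cpu_like)):
    wall, cpu = measure(fn)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

The waiting function shows a wall time far above its CPU time, which is the I/O-bound signature described above. For the busy loop the two numbers should be close together.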
Also, there was at least one question that was reminiscent of the kind of “how to size a VM for performance” question that you might get on the Professional Cloud Architect exam. Do you remember your machine families? Spot instances? The relationship between network egress bandwidth and number of vCPUs allocated?
Wrapping-Up
I hope this was useful for you! Before I did the exam, I wasn’t really sure what to expect. I probably should have read more articles… Like this one! Fortunately, the exam was fine, but it would definitely have been useful to know where my weaknesses might be, in advance!
Now this exam is out of the way, I can spend the weekend relaxing, and then get back to writing my Google Cloud Adoption and Migration series!
Catch you later!
Other Useful Resources
- The Google DevOps Engineer SRE Learning Path
- This list of useful resources and tips from Ammett W. Make sure you check out his prep sheet.
Before You Go
- Please share this with anyone that you think will be interested. It might help them, and it really helps me!
- Please give me claps! You know you clap more than once, right?
- Feel free to leave a comment 💬.
- Follow and subscribe, so you don’t miss my content.