Experiences that Prepared Me for the Cloud DevOps Engineer Exam

Aron Eidelman
Google Cloud - Community
6 min read · Nov 15, 2022

Disclosure: I am a Google employee. The ideas reflected in this post are personal and do not reflect my employer’s views.

I joined Google several months ago as a Cloud Operations Advocate. As part of my ramp-up time, I prepared to take the Cloud DevOps Engineer certification since it overlapped the most with the use cases I’m focused on in my role. Without making assumptions about job titles or specific products, I want to tune in to the experience that other engineers have on Google Cloud. I saw some of my own experiences reflected in the exam content, and that kind of overlap with real work is what supports the validity of a technical certification.

Google’s SRE handbook had a good amount of bearing on the exam content, which surprised me. What I wanted to avoid more than anything was a 2-hour round of “feature and configuration trivia,” otherwise known as “multiple choice that you could ace with reference docs.” This was no such exam. It is good to know general configuration patterns, but the best mark of knowledge based on experience is having a deep, intuitive sense of how things can go wrong. I liked that the exam asked questions in this direction and that I could use my experience to reason through the possibilities.

“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.” — Brian Redman

In this post, which I’ll add to over the coming weeks, I want to share several challenging experiences from before I joined Google that gave me a deeper understanding of why it makes sense to do things a certain way. (For those who came for general study tips, I’ve added some to the final section.)

I tapped into these experiences while studying new material for the exam, thinking, “How would I have had better outcomes in the past if I had done X or used Y?” I found this approach helped me integrate new information. It also helps to learn from other people’s experiences, as I did by reading the SRE handbook, and as I hope some readers will by reading this post.

Hidden Tradeoffs

A few years ago, I worked with a customer on integrating a solution to prevent user account takeover. The problem ranged from bots enumerating through credentials to criminals committing account fraud. Since the activity occurred within an application, observing specific actions at the account level was necessary.

The developers would typically need to add the solution’s SDK to their login flow so that it could log regular attempts and intercept malicious ones. Developers didn’t love needing to write and maintain extra code around the SDK, so the solution provider came up with a “codeless” variant: a customer could add an edge function to their favorite CDN, and boom, it would magically zero in on the relevant requests.

In reality, there was still some configuration required. It just wasn’t the application developers who needed to do it. The edge function relied on response status codes, custom headers, or content in the response body to know if a login attempt had succeeded or failed. Since that could change dramatically from application to application, a person from the solution provider needed to step through the customer’s app manually, the way a regular user would, and test out various requests and responses. Only then would they know how that particular app represented successful and failed logins and have the information they needed to write the configuration.

To understand how “custom” this could get, keep in mind that not every development team follows the HTTP RFCs when choosing status codes. Sometimes, every login attempt receives a 200 response. From there, the difference between responses could be very subtle. The configuration occasionally hinged on the string “error” or “denied” appearing in the response body, or on an opaque header simply being absent for failed logins.
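As a rough illustration of what such a per-application configuration has to encode, here is a minimal Python sketch. It is not the solution provider’s actual format or code: `LoginRule`, `login_failed`, and the `X-Session-Token` header are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoginRule:
    """Hypothetical per-application rule for recognizing a failed login."""
    failure_statuses: frozenset = frozenset()   # e.g. {401, 403}, if the app uses them
    failure_body_markers: tuple = ()            # substrings that signal failure
    success_header: Optional[str] = None        # header expected only on success


def login_failed(rule: LoginRule, status: int,
                 headers: Mapping[str, str], body: str) -> bool:
    """Return True if the response looks like a failed login under this rule."""
    if status in rule.failure_statuses:
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in rule.failure_body_markers):
        return True
    if rule.success_header is not None:
        # Header names are case-insensitive, so compare in lowercase.
        present = {name.lower() for name in headers}
        if rule.success_header.lower() not in present:
            return True
    return False


# An app that answers 200 to every attempt (assumed for illustration).
rule = LoginRule(failure_body_markers=("error", "denied"),
                 success_header="X-Session-Token")

print(login_failed(rule, 200, {"X-Session-Token": "abc"}, '{"result": "ok"}'))  # False
print(login_failed(rule, 200, {}, '{"message": "Access denied"}'))              # True
```

Every field in the rule is something a person had to discover by hand for that one application, which is why the configuration is only as good as the responses it was written against.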

So what would happen if, post-configuration, the application developers decided to change the response for a failed login attempt?

What if they inadvertently removed the indicator necessary for the configuration to work?

In this case, the solution’s ability to detect and block malicious traffic could be at stake. And since security succeeds when nothing bad is happening, things might still appear to be working.

So the developers would be better off at least writing some tests to preserve the indicators so they’d know if they were potentially breaking the solution by making a change.
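A minimal sketch of such a test, using Python’s standard `unittest` module, might look like the following. The handler `build_failed_login_response`, the marker string, and the `X-Session-Token` header are all hypothetical stand-ins for whatever the real application and edge configuration use.

```python
import json
import unittest

# Assumption: these are the indicators the edge configuration depends on.
FAILURE_MARKER = "denied"
SUCCESS_ONLY_HEADER = "X-Session-Token"


def build_failed_login_response(username: str):
    """Hypothetical stand-in for the application's failed-login handler."""
    body = json.dumps({"message": f"Access denied for {username}"})
    headers = {"Content-Type": "application/json"}
    return 200, headers, body


class FailedLoginContractTest(unittest.TestCase):
    """Fails if a change removes an indicator the edge function relies on."""

    def test_body_keeps_failure_marker(self):
        _, _, body = build_failed_login_response("alice")
        self.assertIn(FAILURE_MARKER, body.lower())

    def test_success_header_is_absent_on_failure(self):
        _, headers, _ = build_failed_login_response("alice")
        names = {name.lower() for name in headers}
        self.assertNotIn(SUCCESS_ONLY_HEADER.lower(), names)
```

Running it with `python -m unittest` in CI would at least flag a change to the failed-login response before release, although it cannot confirm that the edge function itself still behaves correctly.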

But that would entail writing code, perhaps even more than just implementing the SDK.

The other problem was that most customers only used the CDN with the edge function in production environments.

They had no way to justify a CDN for staging. As a result, there was no way to see whether the edge function was working, even manually, before production.

Suppose they bit the bullet and, in desperation, added a comment in their code, “Before changing this response, make sure to ask the solution provider to update the configuration for the edge function.” Yikes, I know, but still, would that work?

How would they ensure the third-party solution provider published the new edge function configuration at the same moment the company deployed the latest version of its application? What if the company needed to roll back the most recent version? Because there was no automation for updating the configuration, and even the submission of the configuration file was entirely manual, any release that touched the login responses would hit a bottleneck.

This operational gap could easily slip through the cracks in testing or deployment, and a mere change in team membership could lead to the configuration being forgotten completely. That risk seemed to cancel out much of the value of the “codeless” approach.

While it took away some initial coding from development, it added manual work to the release process and reduced confidence in it.

As a result, the “codeless” approach might be nice for some cookie-cutter scenarios and proofs of concept, but most customers would be better off with the SDK.

It was a helpful scenario to remember for the exam because it reinforced the following points:

  • If developers cannot test a feature, or if the team cannot automate a portion of the application deployment, consider how the resulting issues could affect production. How would they impact users? How long would it take to (1) realize a problem and then (2) fix it? Some key areas, such as security and availability, may be too sensitive to gamble with, even if you can’t guarantee them 100%.
  • Always think in terms of tradeoffs as opposed to pure improvements. If something seems purely good (e.g., a “codeless” add-on), question what you are bargaining away and whether you can afford to do so. You might be able to, but you don’t want to be surprised if you have already committed and then realize it entails manual work, higher risk, and lower release velocity.
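The question of how long it takes to realize a problem and then fix it can be put into rough numbers. The sketch below adds the two durations into an exposure window and estimates how many malicious attempts would go unblocked during it. The function names and every figure are illustrative assumptions, not measurements from the scenario above.

```python
from datetime import timedelta


def exposure_window(time_to_detect: timedelta, time_to_fix: timedelta) -> timedelta:
    """Total time the protection is silently broken: detection plus repair."""
    return time_to_detect + time_to_fix


def unblocked_attempts(window: timedelta, attempts_per_hour: float) -> float:
    """Malicious attempts that get through while the window is open."""
    return window.total_seconds() / 3600 * attempts_per_hour


# Illustrative numbers only: two days to notice, one day for a manual
# configuration update, 500 malicious login attempts per hour.
window = exposure_window(timedelta(days=2), timedelta(days=1))
print(window)                           # 3 days, 0:00:00
print(unblocked_attempts(window, 500))  # 36000.0
```

Even with made-up inputs, the shape of the result is the point: when detection is slow and the fix is manual, both terms grow, and the cost lands in production.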

Realistic SLAs

Out now!

Study Resources

My colleague, Ammett, put together a great post with resources for the Cloud DevOps Exam. In particular, I used the prep sheet he created to double-check that I’d covered all the necessary sections.

Another colleague, Luke, had suggested closely reviewing the SRE handbook. Just before the exam, he reassured me not to lose hope even if the exam felt too difficult halfway through.

While I did not join a study group or work with anyone else preparing for the exam, it did help to discuss the exam topics with people who had direct experience in the relevant areas.

One discussion group you can join, Reliability Engineering, has a lean coffee format wherein you can propose topics to discuss, and people can vote on their favorites. A discussion about SLOs in that group gave me a great mental model that helped me during the exam and helped me come up with my post on why to prioritize symptoms over causes.

Aron Eidelman

DevSecOps at Google, Board Chair at Azure Printed Homes, Dadalorian at Home