A framework to build Cloud Operating Model and Governance — Part I

Kapil Gupta
Google Cloud - Community
10 min read · Jan 12, 2022

TL;DR

One of the basic promises of Cloud is to give developers more agility to build their products and applications. Developers are expected to have more control over provisioning resources in the cloud environment and destroying them at will. As enterprises of all sizes move to the cloud with this goal, one of the biggest challenges they face is deciding on the right Governance and Operating model. Should it be centralized or decentralized, or is this debate irrelevant now? I have been part of many debates on this topic, both as a cloud customer and as a Customer Engineer at Google working with many customers. There is no right or wrong answer to this.

The goal of this post is to give readers a thought framework, with several workable ideas and methods, that can help build a Cloud Operating and Governance Model that gives developers a lot more freedom while giving governance teams the right level of control to avoid mishaps.

This is a very long topic and a lot can be said about it, so I am dividing this post into 2 parts.

So, let’s get going…

A Little Bit of History (that I have observed in the last 19–20 years)

The term Governance is almost always considered a “bad” word in modern working environments, and is often treated as a synonym for “Control to stop you from using what you want to build your products and applications”. IMO, that is not a fair assessment; there is/was a reason Governance was set up this way. Here are some reasons I saw:

  • Simplified IT portfolio: Managing a technology’s lifecycle in an enterprise is a long and tough task. Not putting Governance on it introduces many challenges.
  • Cost Control: Bringing a technology into an on-prem environment can cost hundreds of thousands to millions of dollars in licensing. Long-term commitments are generally needed for cost efficiency. Once a technology was introduced, organizations did not want to invest further in a competing technology with similar capabilities.
  • Supportability: Stability is always top of mind when introducing a technology in a large organization. “Who is going to jump on a call at 3 AM on a Sunday when things go wrong?” I heard this question hundreds of times when I was responsible for introducing new tech in a large Fortune 50 organization. It is a 100% fair question. So supportability was a key requirement; many centralized specialist teams were built to support various IT components, and developers depended on them to make any changes in the environment.
  • Full stack ownership: When you own a data center you are fully responsible for the whole stack, from data center security, to servers, to platforms, to network devices, and on and on. You have to support all of it, which pushed organizations to pick technologies carefully after long evaluation processes. Such evaluations can take months before developers get a chance to try the technology out.

Many such factors pushed organizations to adopt a heavy-handed Centralized Governance and Operating model that not only resulted in long change cycles and “Controls” but also forced developers to use certain languages, runtimes, databases, platforms, etc. Many times in my past life, I struggled to get even a simple architecture change approved because of long manual reviews.

Some of the fundamental promises of Cloud are agility, a “pay as you go” cost model and independence/flexibility for developers. So any governance model that breaks these promises will not work for Cloud. Also, with cloud adoption, organizations do not have to choose between agility/innovation and Governance (consistency, cost and security); they can have both.

The biggest question for organizations that are adopting Cloud at scale is whether the Cloud Governance Model should be centralized or decentralized.

Let’s explore this in detail and try to come up with a framework that will help answer this question.

First, let’s put down a basic definition of “Governance” (it doesn’t matter whether you are on-prem or in the Cloud).

The Cloud Cake

I always suggest thinking of cloud as a layered cake, like the OSI Model. Each layer of this cake serves a different purpose, has different complexity levels and needs different skill sets to work with. All these layers are glued together with the Governance and Operating model. Let’s dig into these layers.

  1. Cloud Foundation: This layer consists of the foundational components that make your cloud environment work. For example, Hybrid Connectivity, Identity Synchronization, VPC structure, Resource Hierarchy/Organization structure, Billing, Organization Policies, IAM, etc.
  2. Cloud Services: These are the services that a CSP like Google Cloud provides to its customers. For example, Google Compute Engine, Google Kubernetes Engine, Cloud SQL, Cloud Run, etc.
  3. Applications: These are the applications/products created by developers in the cloud using one or more services (layer #2). For example, a developer may create a REST API written in Go on top of GKE.
  4. Security (Cross Section): Security cuts across all the layers above, but with a different focus and role in each. For example, for the Cloud Foundation layer, security is more focused on Network Segmentation, Network Security, Organization policies, Firewall rules, etc., while for the Cloud Services layer it is more focused on IAM, Guardrails, SIEM, CSPM, audit logging, access logging, etc. On top of this, Security is responsible for satisfying regulations, compliance and audits.
  5. Operations (Cross Section): Like Security, Operations cuts across layers but with different roles and focus. For example, for the Cloud Foundation layer, operations may be more focused on network observability, while for the Applications layer it will probably focus more on app SLAs/SLOs, app logging, app monitoring, performance optimization, etc. For Cloud Services, operations are mostly taken care of by the Cloud Service Provider.
  6. CCoE (Cloud Center of Excellence): A body in the organization that is responsible for Cloud adoption and success, overseeing Financial Operations, building reference architectures, etc.

Governance Stages

From the implementation point of view, there are 3 stages of Governance. This DOES NOT mean that the governance rules applied in these stages are different; rather, it is extremely important to keep all 3 stages consistent. For example, engineers who build their products under certain governance rules should not be surprised at run time by a different set of guardrails.

Story of 2 Extreme Cases

Let’s take an extreme case where every developer is responsible for all layers of their “application/product segment”. Imagine Tim is a Product Developer or a Product Owner. What are some of the responsibilities he’ll have in this scenario?

This is just an example, and there might be more hats that Tim has to wear. The end of the year is approaching and there is a security audit that the company has to go through. So now, for his “application/product segment”, Tim needs to work with the auditors, understand regulations, produce logs and reports, fill the gaps and then close the audit. Even after that, Tim needs to report on it to leadership. This can take weeks. In the meantime, Tim’s product needs to accommodate new features that the Business is requesting. Activities like these are way beyond product development. Tim’s focus should be on product features.

Let’s take another scenario from the other extreme, where one or many centralized teams are responsible for all the layers. For example, developers need to go to this centralized team to create/support/maintain a GCP project for them, enable services, build the required infrastructure (GCE, Cloud SQL, Pub/Sub, etc.), and so on. This is extremely hard, if not impossible, to scale. Many organizations have similar operating models for running on-prem IT (change cloud tasks to on-prem tasks) but with adoption of DevOps principles this model . On many occasions I have seen this model fail in the cloud, and IMHO it should be avoided at any cost.

Almost always, the answer is somewhere in between these 2 extreme models.

Requirements

What does an Engineering team need in a Landing Zone to host the code?

  • Do experimentation easily
  • Able to deploy code/application whenever needed
  • Ability to spin up resources in cloud whenever needed
  • Don’t have to wait for approvals for something that is done frequently
  • Cloud services that are needed to build a product/application are enabled (do not need extra approvals)
  • Observe the behavior of the environment and the code hosted in it
  • SLO/SLA are met for the end consumer (people or the code)
  • Environment is expandable to accommodate potential future growth

What does a Governance model need in a Landing Zone?

  • Default security setup does not allow unintended operations
  • Right level of access is granted to the dev, ops and customers
  • Budgets are set to avoid unintended costs. Observe the cost of the environment
  • Enforced consistency to deploy and manage assets in the cloud
  • Identify and correct drift from the desired state (this goes beyond what is done in Kubernetes)
  • Security is enforced consistently across various environments
  • Resources are configured consistently to manage risks related to onboarding, drift, discoverability, and recovery
  • Ability to know which assets are deployed in the Cloud environment, where they are deployed, and when they are changed
  • Enforce clear asset ownership.

Pillars of delivering a Cloud Governance and Operating Model

1. Automation (GitOps, Workflows and Self-Service)

Automation allows us to achieve consistency, auditability, agility and the ability to let customers self-serve. Over the past several years, many organizations have put a lot of effort into adopting DevOps for application development, but these principles were not applied to infrastructure builds, which often created roadblocks in product delivery pipelines. GitOps takes many of the DevOps principles to the next level and applies them to Cloud asset and infrastructure builds. The key concept of GitOps is to use a Git repository to store the environment’s desired state and to deploy/manage cloud assets declaratively. It enables you to predictably create, change, and improve your cloud infrastructure by using code. Having Infrastructure as Code (IaC) in a Git repository (or repositories) means it is version controlled, auditable, scannable and open for collaboration. On top of this, infrastructure is built using auditable and reviewable build and release pipelines. There are many tools available to support IaC, like Terraform (from HashiCorp), Pulumi, etc. Terraform is in wide use and is usually my choice for IaC. An agent, for example Argo or Kubernetes Config Connector, makes sure that the current state has not drifted from the desired state.

Here is a great post for getting started with GitOps using Terraform in Google Cloud: https://cloud.google.com/architecture/managing-infrastructure-as-code

GitOps tremendously supports Cloud Governance and makes it much more objective to achieve. Organizations that are going through their DevOps adoption journey can easily adopt GitOps practices.
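To make the desired-state idea concrete, here is a minimal sketch of the comparison a GitOps agent performs between what is declared in Git and what actually exists in the cloud. This is not how Terraform or Argo are implemented; the `plan` and `reconcile` functions, the `Change` type and the resource names are all hypothetical, and plain dictionaries stand in for both the Git repository and the cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    resource: str
    action: str  # "create", "update", or "delete"


def plan(desired: dict, actual: dict) -> list:
    """Compare the desired state (from Git) with the actual state (from the cloud)."""
    changes = []
    for name, spec in desired.items():
        if name not in actual:
            changes.append(Change(name, "create"))
        elif actual[name] != spec:
            changes.append(Change(name, "update"))
    for name in actual:
        if name not in desired:
            # Anything not declared in Git is treated as drift.
            changes.append(Change(name, "delete"))
    return sorted(changes, key=lambda c: c.resource)


def reconcile(desired: dict, actual: dict) -> dict:
    """Return the state that results from applying every planned change."""
    result = dict(actual)
    for change in plan(desired, actual):
        if change.action == "delete":
            del result[change.resource]
        else:
            result[change.resource] = desired[change.resource]
    return result


# Hypothetical example: one resource edited by hand, one created outside Git.
desired_state = {
    "bucket-logs": {"location": "us-central1", "versioning": True},
    "vm-api": {"machine_type": "e2-medium"},
}
actual_state = {
    "bucket-logs": {"location": "us-central1", "versioning": False},
    "vm-test": {"machine_type": "e2-small"},
}

for change in plan(desired_state, actual_state):
    print(change.action, change.resource)
# update bucket-logs
# create vm-api
# delete vm-test
```

Because the plan is computed from the repository rather than from someone's memory of what was deployed, every change is reviewable before it is applied, and running the plan again after reconciliation yields no changes.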

Workflows stitch various pieces of automation together, and also enable observability and process telemetry. Self-Service is a way to deliver automation to the intended audience.

2. Developer Advocacy

Customer centricity is critical to every business’ success, and tons of papers have been written on it. When we talk about business/product innovation, the first question asked is, “Who is your customer and how is this innovation going to help them?” A successful business or product idea always has its customer at the center. Cloud adoption must work the same way. Who is the customer for Cloud environments? It’s the teams that are developing products and applications in the cloud environment; those developers are the customers of your Cloud environment. The Cloud Governance and Operating model must put engineers and developers at the center and be built around them. Failing to do this generally results in slow adoption, friction, grievances and longer change/innovation cycles. A successful Operating and Governance model enables developers to use the cloud platform to its full potential to build innovative products for their organization.

3. Guardrails

Invisible security and controls are no longer just cool-sounding words; they are a reality now. Many cloud-native security products and processes do not get in the way of developers but still enforce the desired level of security using guardrails. There are many definitions of what Guardrails are, but IMO Guardrails are the way to automatically enforce policies in a Cloud environment, and they are built to prevent configuration mishaps. Guardrails (like the name says) keep the product engineering teams aligned with the Governance rules without getting in the way of day-to-day build/deployment. Guardrails should prevent drift in the environment when engineering teams try to go beyond established governance controls.
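As an illustration of guardrails as automatically enforced policies, the sketch below evaluates a resource configuration against a small set of rules before it is deployed. It is not the API of Organization Policies or of any other product; the rule names, the `ALLOWED_REGIONS` set and the resource fields are assumptions made up for this example.

```python
from typing import Callable, List, Optional

# A guardrail inspects one resource and returns a violation message,
# or None when the resource complies.
Guardrail = Callable[[dict], Optional[str]]

ALLOWED_REGIONS = {"us-central1", "europe-west1"}  # hypothetical policy


def require_owner_label(resource: dict) -> Optional[str]:
    if not resource.get("labels", {}).get("owner"):
        return "missing 'owner' label"
    return None


def restrict_region(resource: dict) -> Optional[str]:
    region = resource.get("region")
    if region not in ALLOWED_REGIONS:
        return f"region {region!r} is not allowed"
    return None


def deny_public_ip(resource: dict) -> Optional[str]:
    if resource.get("public_ip", False):
        return "public IP addresses are not allowed"
    return None


GUARDRAILS: List[Guardrail] = [require_owner_label, restrict_region, deny_public_ip]


def evaluate(resource: dict, guardrails: List[Guardrail] = GUARDRAILS) -> List[str]:
    """Run every guardrail and collect the violations (empty list means compliant)."""
    violations = []
    for guardrail in guardrails:
        message = guardrail(resource)
        if message is not None:
            violations.append(message)
    return violations


# Hypothetical resource that breaks all three rules.
vm = {"name": "vm-test", "region": "asia-east1", "public_ip": True}
for violation in evaluate(vm):
    print(f"{vm['name']}: {violation}")
```

The same rule set can run in a pull request check, in the deployment pipeline and against the live environment, which is one way to keep the rules identical wherever they are applied. A compliant resource produces no output, so developers only notice the guardrail when they step outside it.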

Many organizations try to establish guardrails with 100% perfection, but IMO (borrowing from SRE principles) the goal should be to achieve n-9s of coverage, with the rest handled as exceptions (like an “error budget”). Continuous monitoring of and feedback on guardrails is absolutely needed to analyze their effectiveness and gaps.
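The arithmetic behind an n-9s target can be sketched as follows. This is one possible reading of the idea, assuming coverage is measured as the share of resources that pass all guardrails; the function names and the numbers in the example are hypothetical.

```python
def exception_budget(total_resources: int, nines: int) -> int:
    """Largest number of exceptions that still meets an n-9s coverage target.

    2 nines = 99%, 3 nines = 99.9%, and so on. Integer division keeps the
    calculation exact and rounds the budget down.
    """
    if total_resources < 0 or nines < 1:
        raise ValueError("total_resources must be >= 0 and nines must be >= 1")
    return total_resources // 10 ** nines


def coverage(total_resources: int, exceptions: int) -> float:
    """Fraction of resources that comply with every guardrail."""
    if total_resources <= 0:
        raise ValueError("total_resources must be positive")
    return (total_resources - exceptions) / total_resources


def within_budget(total_resources: int, exceptions: int, nines: int) -> bool:
    return exceptions <= exception_budget(total_resources, nines)


# Hypothetical estate: 20,000 resources with a 99.9% (3 nines) target.
print(exception_budget(20_000, 3))   # 20
print(within_budget(20_000, 12, 3))  # True
print(within_budget(20_000, 25, 3))  # False
```

Tracking the number of granted exceptions against this budget gives the governance team an objective signal: while exceptions stay within budget they can be approved quickly, and when the budget is exhausted the guardrails themselves probably need to be revisited.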

Conclusion of Part I

So far we have discussed a little bit of history to build some context on why legacy Governance and Operating models were created and how they help an on-prem environment. With Cloud adoption, a new thought process is needed, and for that we proposed thinking of cloud as a layered stack, along with stages of Governance and 3 pillars on which a new Governance and Operating model can be built.

In the next part we’ll try to build a working model that many organizations have either already adopted or are adopting to build their Cloud Governance and Operating model.

A framework to build Cloud Operating Model and Governance — Part II
