Google Cloud Adoption: Site Reliability Engineering (SRE), and Best Practices for SLI / SLO / SLA

Dazbo (Darren Lester)
Google Cloud - Community
24 min read · Feb 5, 2024

Welcome to the continuation of the Google Cloud Adoption and Migration: From Strategy to Operation series.

In the previous part, we looked at how to reorganise your existing infra teams, how to go about upskilling and reskilling, and how to build an effective Cloud Centre of Excellence (CCoE). In this part, I’m going to cover the embedding and best practices of Site Reliability Engineering (SRE), including:

  • The problem statement: why do we need SRE?
  • What is SRE?
  • The Key Tenets of SRE.
  • Eliminating Toil.
  • Service Levels (KUJ, SLI, SLO, Error Budget, Alerting policies, and SLA).
  • Best Practices for SRE-Based Monitoring and Alerting.
  • Walkthrough: defining SLOs, notification channels and alerting with the Console.
  • Walkthrough: defining SLOs, notification channels and alerting with Terraform.
  • Adopting SRE in your Organisation.

Why Do We Need SRE?


It is common for traditional enterprises to have adopted an Ops / Dev split.

Let me start by saying: the traditional Ops/Dev split is a bad model. In such a split, we have two separate teams, each with very different goals:

  • The Dev Team is responsible for developing new features. The addition of new features makes a given service more useful to the customer. And the continuous creation of features is what allows an organisation to retain a competitive edge. The performance of the Dev Team is typically measured using metrics such as: Number of changes per unit of time, code churn, and commit-to-deploy time.
  • The Ops Team is responsible for reliability and stability. They are typically measured using a different set of measures, such as: Service uptime / availability, incidents per unit of time, Mean Time to Resolution (MTTR), Mean Time Between Failures (MTBF) and SLA compliance.

This approach leads to an “Us-And-Them” mentality. The Dev Team creates new features, packages them, throws them over the fence to the Ops Team, and the Ops Team then deploys them into Production.

The Problem with Separate Dev and Ops Teams

Note how the two teams have no common goals. And even worse, their respective goals are in opposition. Why? Because the introduction of new features inevitably leads to a reduction in stability and reliability.

“There is no world in which new features are deployed and work seamlessly, 100% of the time.” — Dazbo, 2024

Also, there’s no overlap of responsibility in the various development and deployment stages. And there’s very little skillset overlap. So, delivery phasing looks something like this:

Phases and responsibilities

But It’s So Much Worse Than This

The traditional Ops Team suffers a number of further issues…

An excess of toil

Toil is defined as reactive, repetitive, low-value work. I.e. the kind of mundane activities that keep the lights on, but don’t result in any innovation, new capabilities, or new business value.

When teams have too much toil, it results in:

  • Low morale. Doing the same low-value tasks, day-in and day-out, results in boredom and frustration. And it results in people asking themselves, “What am I doing here?”
“Life? Don’t talk to me about life.”
  • Career stagnation. The members of the Ops Team are continually doing the same old activities. They don’t get to work on projects, learn new skills, or add new value.
  • Attrition. If people believe they have no career prospects and their morale is low, then they’re more likely to leave.
  • Low-Talent Bias. Whilst this is something of a generalisation, when all the above is true, there is a tendency to only retain those people who are most fearful of change. These are often the same people that are reluctant to do things a different way, and who are reluctant to adopt new technologies. They are also fearful that they will not be able to secure a new role in a different organisation. So over time, the high rate of attrition eliminates the talent in the Ops pool. Then the organisation is in huge trouble.

SRE to the Rescue

“SRE is awesome.” — Dazbo, 2023

SRE originated in Google in 2003, when an engineer by the name of Ben Treynor Sloss was asked to run a production operations team. (OMG, I just realised that at the time I’m writing this, SRE is over 20 years old!!) He recognised the inherent tension between development and operations, and designed a new approach. This new approach — called Site Reliability Engineering — is an opinionated and prescriptive approach that:

  1. Applies software engineering principles (like automation, CI/CD, testing, retros) to the problem of operations.
  2. Provides operational visibility, through clearly defined metrics.
  3. Recognises the tension between Operations and Development, and explicitly confronts this tension by contracting when new features may be deployed, given the currently measured level of reliability.

What Is SRE?

The goal of SRE is: to protect, provide for and progress software and systems with consistent focus on availability, latency, performance and capacity.

The term SRE is used to refer to any and all of:

  • The practice (or perhaps the discipline) of Site Reliability Engineering.
  • An SRE team or function.
  • An individual site reliability engineer. I.e. one whose responsibility it is to carry out SRE activities. In this context, the plural of such engineers is typically referred to as “SREs”.

From a team perspective: rather than a Dev Team and an Ops Team, we now have a Dev Team and an SRE Team. You might think that this sounds like the same two teams, but with different branding.

Rick and Morty

Fortunately, this is far from the truth. The SRE team differs from a traditional Ops Team in many ways. As I explain SRE, the differences will become clear.

Let me summarise a number of the key tenets of Site Reliability Engineering. Then later, I’ll dive into some of these:

  • SRE mandates the ongoing measurement of reliability, and provides guidance on how to do so.
  • SRE mandates the use of service levels; particularly for reliability. When the system is within compliance of service levels, then more features may be deployed. Rules are contracted regarding what is allowed when service levels are breached, or at risk of being breached.
  • The SRE Team has the authority to stop feature deployments, if reliability is being compromised.
  • Shared ownership: the mutually agreed service levels, along with agreed contracts between Dev and SRE, results in the two teams aligning behind common goals.
  • SRE states that 100% reliability is not achievable. Recall that I said earlier: “There is no world in which new features are deployed and work seamlessly, 100% of the time.” Thus, service levels are set in acknowledgement of this fact.
  • SRE prescribes approaches to learning from incidents, including a blameless culture, and blame-free post mortems.
  • SRE sets limits on the amount of toil that is acceptable, and prescribes measures for eliminating toil. This includes adoption of software engineering practices, including automation, infrastructure-as-code, MVPs, iteration, and canary releases.

How Does This Help?

I’ll briefly cover how SRE addresses the issues I described earlier:

  • The “Us-and-Them” mentality is largely removed. The two teams have common goals and agreements in place. They collaborate much more closely, and they have a better understanding of each other (as well as more skillset overlap).
  • The contracting and authority of SRE to push back on dev changes means that there’s a prescribed way to achieve the ideal balance between feature velocity and overall stability.
  • SREs spend much less time on toil, and much more time on value-adding activity. So their morale is much higher.
  • SREs spend more time on project work, and using modern software tools and practices. Thus, they continue to maintain their skills, and their work is much more interesting. This also contributes to higher morale.
  • Attrition levels are much lower in SRE teams relative to traditional Ops teams. In fact, being an SRE is a very attractive role and results in the attraction of talent.

Additionally, Google estimates that between 40% and 90% of any given system’s TCO comes after launch into Production. So if we have a framework that can a) improve the reliability of the system and b) do it for less overall cost, then this will lead to a significant reduction in the overall TCO of the system.

Eliminating Toil

A small amount of toil is unavoidable. And in small doses, it can be cathartic. But as I described above, too much toil is terrible for individuals, and terrible for the organisation.

SRE recommends setting a limit for the amount of toil allocated to any given team member. In Google, they set this limit at 50%. But in reality, the work ends up looking like this:

SRE Time Allocation

Here are some tips to help you to achieve this:

  • Measure the time spent on different activities. Toil counts as any time spent being on-call, restoring service in response to incidents, or performing repetitive tasks.
  • Identify toil activities that are candidates for automation. Approach this with a business case / return-on-investment mindset. I.e. think about how much time is spent over a year (or 3 years, or 5 years…) performing a given repetitive task. Then estimate the effort — and consequently the cost — to automate the task (and maintain the automation). (See the sketch after this list.)
  • For those candidates that have a positive ROI, prioritise them, and then get your SRE Team to automate. It is the SREs themselves that design and build the automation. This is interesting and cool work, and results in the elimination of mundane activities.
  • As more and more toil is removed, more time is freed up for innovation and value-adding activity.
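
To make the return-on-investment comparison concrete, here is a minimal sketch (in Node.js, to match the sample app used later in this article). All of the numbers are hypothetical placeholders, not recommendations:

// Rough toil-automation ROI sketch. All inputs are hypothetical.
function automationRoi({ hoursPerWeekOnTask, hourlyRate, yearsToAssess, hoursToAutomate, annualMaintenanceHours }) {
  const annualToilCost = hoursPerWeekOnTask * 52 * hourlyRate;
  const automationCost = (hoursToAutomate + annualMaintenanceHours * yearsToAssess) * hourlyRate;
  const savings = annualToilCost * yearsToAssess - automationCost;
  return { annualToilCost, automationCost, savings, worthAutomating: savings > 0 };
}

console.log(automationRoi({
  hoursPerWeekOnTask: 4,      // time spent on this repetitive task
  hourlyRate: 75,             // fully-loaded engineering cost per hour
  yearsToAssess: 3,           // assessment horizon
  hoursToAutomate: 80,        // one-off effort to build the automation
  annualMaintenanceHours: 10, // ongoing effort to keep the automation working
}));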

Service Levels

Service Levels sound… fascinating

Talking about service levels used to put me into a coma. But in the context of SRE, it’s actually very interesting!

As I’ve mentioned above, there is a tension between service reliability, and feature velocity.

  • As we increase the reliability goal, we reduce our ability to deploy new features.
  • Increasing reliability also typically comes with increased cost, e.g. through infrastructure redundancy.

One of the key goals of SRE is to establish service levels that define the lowest acceptable level of reliability. This not only minimises wasteful infrastructure, but it also allows for the maximum delivery of new features.

And so, for any new system deployed to cloud, we can define a prescriptive approach for:

  1. Defining a set of key user journeys that can be used to represent the reliability and availability of the system. This set of KUJs is the best indicator of the overall health of the system.
  2. Identifying a set of service level indicators (SLIs), mapped to each of these key user journeys. These are specific, numeric measures that can be used to measure the reliability of a service in the context of that user journey.
  3. Creating a service level objective (SLO) for each SLI, which defines a minimum threshold of reliability over a given time frame.
  4. Defining an error budget, which defines the maximum tolerable non-reliability of the service in a time frame.
  5. Defining alerting policies, such that alerts are triggered to appropriate channels (e.g. email, SMS, Slack, ServiceNow, PagerDuty), when the SLO is breached or under threat of being breached.
  6. Defining the actions that need to be taken, when the alert is received.

(Although I haven’t got into the “individual system / migration design” portion of this series yet, it makes sense to include the SRE steps that we would follow for any design, in this article.)

Availability and Reliability

Note that the terms reliability and availability are often used synonymously, but their meanings are subtly different. Availability refers to the amount of time that a system (or service) is available over a given duration. Note that if a system has significantly degraded performance — e.g. such that a request to a website takes a painful amount of time to respond from the perspective of the user — then the service should be considered unavailable.

Reliability is the ability of a system to perform its required functions under stated conditions for a specified period of time. It is subtly different, because a system may be available, yet be performing incorrectly or giving the wrong answers. In this situation, the system is available, but not reliable.

So availability is an aspect of reliability.

In order to set our service levels, we need specific measures for reliability and availability. But what should we measure?

Key User Journeys (KUJ)

A given system will typically have many different user journeys. I.e. interactions with the system to perform a specific function. Even though we use the term “user journey”, it’s important to note that not all users need to be human. We can have human actors, and we can have system actors. Thus, we have these categories of user journey:

  • Human-initiated — e.g. a user logging onto a system, adding an item to a basket, or completing checkout.
  • System-initiated — where one system or service interacts with another. This might be one service calling another; it might be a process in response to a message landing on a topic; or it might be some sort of scheduled event.

A typical system will have MANY user journeys. But for the purposes of measuring system reliability, we want to pick just a handful. We want to find the key user journeys (sometimes called “critical user journeys” or CUJs) which represent the core logic of the system.

In the SRE section of our solution design, we might record our key / critical user journeys like this:

Example key / critical user journeys

Service Level Indicator (SLI)

The SLI is what we actually measure. Each SLI must be specific and quantitatively measurable. Furthermore, to be most useful, an SLI should be closely aligned to a user’s perception of the reliability of the system, rather than being some sort of infrastructure metric.

(Of course, you should also measure, monitor and alert on infrastructure metrics — e.g. storage used, or CPU thresholds exceeded. But the important thing is that these infrastructure thresholds do not mandate the same sort of response as an SLO breach will.)

Here are the guidelines for defining the SLIs for a system:

  • All SLIs should be expressed as a proportion or percentage, according to this pattern (see the sketch after the example SLIs below):
    SLI = (good events ÷ valid events) × 100%
  • SLIs should be a measure of user experience.
  • Aim to have around 3–5 SLIs in total.
  • SLIs should be used as measures for your KUJs. So, if you have three KUJs, you might aim to have one SLI per KUJ.

Here are a few sample SLIs:

  • Coverage example:
    “% of messages that are successfully transformed and written to destination topic”
  • Availability example:
    “% of successful /login requests”
  • Latency example:
    “% of successful /login requests serviced within 1 second”
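
As a rough illustration of the SLI pattern above, here’s a minimal Node.js sketch that computes the latency SLI example from hypothetical request counts:

// SLI = (good events / valid events) * 100, expressed as a percentage.
function computeSli(goodEvents, validEvents) {
  if (validEvents === 0) return null; // no traffic in the window: the SLI is undefined
  return (goodEvents / validEvents) * 100;
}

// Hypothetical counts for: "% of successful /login requests serviced within 1 second"
const loginRequests = 120000;     // valid /login requests in the measurement window
const fastLoginRequests = 119400; // those that succeeded within 1 second
console.log(`Latency SLI: ${computeSli(fastLoginRequests, loginRequests).toFixed(2)}%`); // Latency SLI: 99.50%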

Service Level Objective (SLO)

Whereas SLI is what we measure, SLO defines the acceptable threshold, over a specified rolling period of time. Thus, we create an SLO by taking a given SLI, and adding a threshold and a time period.

It’s important to note that SLO is an internal objective. It is used to trigger remediation action by the SRE team. The SLO should be set at a level that is an appropriate early warning level, before the customer experiences significant pain in using our service. More on this later.

Here are some general guidelines around defining SLOs:

  • We define an SLO for each SLI.
  • Every SLO is defined with a threshold, and a time window. I.e. an SLO is a threshold goal for the SLI, as measured over a period of time. Often, we measure over a rolling 30-day period.
  • Whereas the SLO defines the minimum level of reliability, the flip-side is the error budget, which defines the maximum allowed level of unreliability. More on this later.
  • SLOs need to be achievable, reasonable and agreed.
  • One important point to note is that SLOs are ALWAYS less than 100%. Remember when I said: “There is no world in which new features are deployed and work seamlessly, 100% of the time.” Consequently, if we were to set a 100% SLO, then we are implicitly stating that no functional changes (e.g. feature deployment) will ever be allowed.
  • Remember that the more aggressive the SLO, the more costly the solution. So we should set the SLO threshold at a level which, if barely met, would keep the vast majority of customers happy. And it should be set at a level where the cost of providing this reliability does not exceed the value of the service. (So, be analytical. Ask yourself: “What is the true cost of non-availability of this service?”)
  • Be mindful of the use case, and be mindful of where it is measured. For example, it might be realistic to set a 99.5% availability SLO for a service at the load balancer. But it is unrealistic to expect this level of availability as measured from a sample of mobile user devices, given that mobile devices will often experience poor network connectivity, and this is beyond your control.
  • There must be documented consequences of an SLO breach. For example, a consequence might be: preventing any further feature changes, until reliability is restored. (Remediation changes, of course, will be allowed.)
  • Your initial SLOs may be best guesses at the right level of reliability. When making initial guesses, err on the side of achievable. You can tighten up the SLOs later.
  • Review SLOs periodically.

Here are some example SLOs:

Sample SLOs

Error Budget

Error budget is the flip-side of the SLO. Whereas the SLO defines the minimum level of reliability, the error budget defines the maximum acceptable level of unreliability. It is the amount of allowed unreliability over a given period of time. It is defined as:

Error budget = 100% − SLO

For example, consider an availability SLO set at 99.5% over 30 days. Our corresponding error budget would be 0.5%. We can convert this to an amount of time:

  • 30 days = 43200 minutes.
  • 0.5% of 43200 minutes = 216 minutes (or 3.6 hours)

So, with a 99.5% availability (often referred to as “two and a half nines”) requirement, we are allowed 3.6 hours of non-availability per month. This gives us breathing room. This error budget can be consumed by such things as:

  • Inevitable failures — e.g. a network failure
  • Planned downtime — e.g. for disruptive maintenance and upgrades
  • Deploying new features
  • Risky experiments
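
Here’s a quick Node.js sketch of the same arithmetic as the 99.5% example above, generalised into a helper:

// Convert an availability SLO and its window into an allowed-downtime error budget.
function errorBudgetMinutes(sloPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;      // e.g. 30 days = 43200 minutes
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 100% - 99.5% = 0.5%
  return windowMinutes * budgetFraction;
}

const minutes = errorBudgetMinutes(99.5, 30);
console.log(`${minutes} minutes (${(minutes / 60).toFixed(1)} hours) of allowed downtime`); // 216 minutes (3.6 hours)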

Knowing about error budget leads to a very useful concept: error budget burn rate. This is the rate at which our error budget is being consumed. This is super important, because it allows us to alert when our current rate is unsustainable, and is on track to breach our SLO. This is much more useful than simply waiting until the SLO is breached!

Let me drop a few useful terms, and then I’ll walk you through an example.

  • Error budget: 100% − SLO
  • (Actual) burn rate: the rate at which error budget is being consumed (typically measured over a short time)
  • Base burn rate: the maximum allowed burn rate which, if sustained, would result in all our error budget being consumed at the end of the SLO time period.
  • Burn rate multiplier: actual burn rate divided by base burn rate

Now for the walkthrough:

  • We have an application that processes 100 messages per day. So we would expect to process 3000 messages in 30 days.
  • We set a coverage SLO of 99.0% over 30 days. This means that our SLO will be breached if we successfully process fewer than 2970 messages over 30 days.
  • Our error budget is 100%-99% = 1%. This means that we are allowed to fail to process up to 30 messages over 30 days.
  • So our base burn rate is 1 message per day. I.e. if we were to fail to process exactly 1 message per day, then after 30 days, we will have exactly consumed the error budget. And we will have just met our SLO.

The chart below shows how quickly our error budget would be consumed, given different burn rates:

How quickly error budget is consumed, given different burn rates

We can see that if the actual burn rate were 3x higher than the base burn rate, then our error budget would be consumed in one third of the month, i.e. after 10 days.

If the actual burn rate were 10x higher than the base burn rate, then our error budget would be consumed in just 3 days.

So we can continuously measure the actual burn rate, and set a multiplier threshold that will trigger an alert. This is perhaps one of the best SLO metrics for alerting.
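
Here’s a minimal Node.js sketch of that idea, reusing the walkthrough numbers above; the 3x threshold and the 2-day lookback are illustrative choices, not prescriptions:

// Burn rate multiplier = actual burn rate / base burn rate.
// The base burn rate is the pace that consumes exactly the whole error budget over the SLO window.
function burnRateMultiplier({ badEvents, totalEvents, sloPercent, windowDays, lookbackDays }) {
  const budgetEvents = totalEvents * (100 - sloPercent) / 100; // e.g. 3000 * 1% = 30 failed messages
  const baseBurnPerDay = budgetEvents / windowDays;            // e.g. 1 failed message per day
  const actualBurnPerDay = badEvents / lookbackDays;           // measured over a short lookback window
  return actualBurnPerDay / baseBurnPerDay;
}

const multiplier = burnRateMultiplier({
  badEvents: 6,      // failures observed in the lookback window (hypothetical)
  totalEvents: 3000, // expected messages over the SLO window
  sloPercent: 99.0,
  windowDays: 30,
  lookbackDays: 2,
});

console.log(`Burn rate multiplier: ${multiplier.toFixed(1)}x`); // 3.0x
if (multiplier >= 3) console.log('Slow burn alert: on track to exhaust the error budget early.');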

Alerting Policy

We should trigger alerts whenever the SLO is breached, or at risk of being breached.

Typical alerts might be:

  • Slow burn rate alert — 3x burn rate threshold exceeded over the last 5 hours.
  • Fast burn rate alert — 10x burn rate exceeded over the last 60 minutes.
  • Error budget 90% consumed.
  • SLO breached. (Error budget exhausted.)

In each case, we should also consider the notification channel we want to use. For example, the slow burn rate alert might be sent to an email group mailbox. But the fast burn rate alert might result in a priority message to PagerDuty.

Service Level Agreement (SLA)

The SLA is a commitment — typically a contractual agreement — between the provider and consumer of your service. (Whereas SLO is an internal objective, the SLA is an external agreement.) It is a commitment to maintain a level of reliability over a period of time. There is often a penalty if the service provider fails to meet the SLA.

SLAs should be more tolerant than the internal SLO. This gives an opportunity to take corrective action before the SLA is breached. For example, you might have an internal SLO of 99.95%, but set the SLA at 99.9%.

The SLA is agreed with the customer. You should aim to agree an SLA at a level where a breach would cause genuine inconvenience to the customer.

Example SLO and SLA

The image below represents the customer experience, as our measured level of service (the SLI) deteriorates. When the SLO is under threat, we need to take remedial action. When the SLO is breached, we’re at risk of further degradation and then breaching the SLA. Once the SLA is breached, the customer won’t be happy.

Customer experience, as we breach SLO and then SLA

Hands-On Demo: Defining an SLO in the Cloud Console

Here I’ll do a quick demo of how we can set up an SLO and associated alerting for a service.

First, we can clone a sample Node.js application. Then we’ll deploy it using App Engine. The app.yaml is already supplied.

# Clone the demo app
git clone https://github.com/haggman/HelloLoggingNodeJS.git
cd HelloLoggingNodeJS

# Create App Engine
gcloud app create --region=us-central1
# Deploy Hello Logging app to App Engine
gcloud app deploy

The demo application returns good responses to HTTP requests, but occasionally returns an error response. The rate of error response is configurable. Here’s the function that achieves this:

//Generates an uncaught exception every 1000 requests
app.get('/random-error', (req, res) => {
  error_rate = parseInt(req.query.error_rate) || 1000
  let errorNum = (Math.floor(Math.random() * error_rate) + 1);
  if (errorNum==1) {
    console.log("Called /random-error, and it's about to error");
    doesNotExist();
  }
  console.log("Called /random-error, and it worked");
  res.send("Worked this time.");
});

Initially, we can see that it generates a random error at a frequency of 1 in 1000 requests. Thus, 999 out of 1000 requests will be good, so we should achieve an SLI of 99.9%.

We can test the app by firing a request to the /random-error service:

Testing our sample app’s /random-error

Let’s now generate constant requests into the service from a Cloud Shell session. Kick this off and leave it running:

while true; \
do curl -s https://$DEVSHELL_PROJECT_ID.appspot.com/random-error \
-w '\n' ; sleep .1s; done

Here, we’re generating a new request every 0.1 seconds. Thus, a rate of 10 requests per second. At this rate, we expect to see an error roughly once every 100 seconds.

Now we’ll look at GCO Monitoring in the Console:

The initial view when we open Google Cloud Operations Monitoring

Let’s define our SLO. Navigate to SLOs in the menu:

The SLO view in Cloud Monitoring

Click on the “default” for the App Engine service we deployed:

Viewing the App Engine Service in Cloud Monitoring

Click on Create SLO. We will first be asked to specify the SLI:

Define the SLI that this SLO will be associated with

Set the SLI as shown above. Click Continue. Then we can set the SLO itself:

Defining the SLO

Here we can choose the duration of the SLO, and the threshold. In this demo, we’ll go with a rolling 7-day period. And we’ll set the SLO to 99.5%.

On the next screen, GCO has helpfully suggested an SLO name, and even shows the JSON for this SLO:

SLO ready to be finalised

Once we create the SLO, we can take a look at it in Monitoring:

Observing our SLO

Note how the error budget has automatically been set. Since the SLO threshold is 99.5%, we have an error budget that allows for an error rate of 0.5%. But our error rate is currently 0.1%. So we’re not (yet) going to consume our error budget at a rate that puts our SLO under threat.

We’ve created the SLO, but we haven’t yet created an alerting policy. When our SLO is breached (or under threat of being breached), we need some sort of notification.

We can create the SLO alerting policy by clicking on the “Create SLO Alert” button. We then see a screen like this:

Defining the SLO alerting policy

Here, I’ve renamed the policy to “Slow burn rate”, I’ve set the burn rate threshold to 3, and I’ve set the lookback duration to 10 minutes. So, this alert will trigger if the actual error budget burn rate is greater than 3x of the base burn rate, over the last 10 minutes.

Now we need to define where the alert will go:

Setting up alert notification

There are loads of out-of-the-box alerting channel options:

Alerting channels

For this demo, I’ve just used my email address:

Sending alerts to your email

Finally, it is a really good idea to add some sort of helpful information that will be included in your alert. Note that you can use markdown.

Setting up helpful alert information

Now we just need to actually trigger our alert! We can do this by increasing the rate that our demo app returns an error response. Let’s change the app so that it now errors once in every 20 requests. (I.e. a 5% error rate.) This means we’ll be running at an SLI of approximately 95%.

Increase the error frequency

We need to redeploy the app:

gcloud app deploy

After deploying, we see that the error budget is consumed much more quickly:

Error budget being depleted

Our alert gets triggered, and the email looks like this:

Email for our SLO alert

Eventually, we run out of error budget:

Error budget consumed

And that’s how you create the SLI, SLO and alerting policy from the Google Cloud Console!

Defining the SLO with Terraform

But what if we’d like to automate this? This is easy enough to do with Terraform.

Assuming you’ve already created a resource for your App Engine service, you can create an SLO like this:

# Define the SLO for successful HTTP responses
resource "google_monitoring_slo" "request_success_slo" {
service = google_monitoring_custom_service.appengine_service.service_id
slo_id = "req-success-slo"
display_name = "Request success SLI for /random-error service"

goal = 0.995
rolling_period_days = 30

request_based_sli {
good_total_ratio {
good_service_filter = join(" AND ", [
"metric.type=\"appengine.googleapis.com/http/server/response_count\"",
"resource.type=\"gae_app\"",
"resource.labels.module_id=\"default\"",
"metric.labels.http_path=\"/random-error\"",
"NOT metric.labels.response_code_class=\"2xx\"",
"NOT metric.labels.response_code_class=\"3xx\"",
])

total_service_filter = join(" AND ", [
"metric.type=\"appengine.googleapis.com/http/server/response_count\"",
"resource.type=\"gae_app\"",
"resource.labels.module_id=\"default\"",
"metric.labels.http_path=\"/random-error\"",
])
}
}
}

Here we have:

  • Defined an SLO with a unique ID.
  • Set the threshold to 99.5%.
  • Set the SLO duration to be rolling 30 days.
  • Defined the SLO ratio of good events to total events. The good events are requests that return a 2xx response code, whereas the total events include requests with any response code.

Now let’s define our email notification channel:

# Define an email notification channel
resource "google_monitoring_notification_channel" "email_alert" {
display_name = "Email Alert"
type = "email"
labels = {
email_address = "bob@gmail.com"
}
}

This is pretty self-explanatory.

Let’s define the alerting policy, such that the email is sent when we see our slow burn rate above the 3x threshold:

# Define a slow burn rate alert policy
resource "google_monitoring_alert_policy" "slow_burn_alert" {
display_name = "Slow Burn Rate Alert for /random-error"
combiner = "OR"

conditions {
display_name = "SLO Slow Burn Rate Condition"
condition_threshold {
filter = "select_slo_burn_rate(\"projects/my-prj/services/${google_monitoring_custom_service.appengine_service.service_id}/serviceLevelObjectives/req-success-slo\", 60m)"
duration = "3600s" # 60 minutes
comparison = "COMPARISON_GT"
threshold_value = 3
aggregations {
per_series_aligner = "ALIGN_RATE"
}
}
}

notification_channels = [google_monitoring_notification_channel.email_alert.id]
documentation {
content = "Follow guidance in [runbook](path/to/runbook)."
mime_type = "text/markdown"
}
}

Here we define an alert that includes a single condition: “SLO Slow Burn Rate Condition”. It uses the “select_slo_burn_rate” selector in the filter condition. There are many different time-series selectors you can use, and they are listed here. Then, it looks back over the last 60 minutes, and triggers if the burn rate is greater than 3x of the base burn rate.

And that’s how you would set up SLOs, notification channels, and alerting policies, using Terraform.

Adopting SRE in your Organisation

There are a few things that need to happen, in order to properly adopt SRE in your organisation:

Establish Your SRE Team Structure

As already discussed in this article: when adopting public cloud, it’s really important to establish SRE capability. This capability is a significant evolution of the role of a traditional operations team.

There are many ways an organisation might choose to organise SRE capability. Google recommends a handful of approaches, depending on organisation size, cloud maturity, etc. I’ll summarise them here:

  • The Kitchen Sink / “Everything SRE” Team — here there is a single centralised SRE team, responsible for all SRE activities across the platform and all services running on the platform. This is recommended for smaller organisations with few applications.
  • The Infrastructure SRE Team — here an infra SRE team is complementary to other SRE Teams. This particular SRE team is a central “hub” team, which works alongside application SRE teams. This structure is recommended for larger organisations with many development teams, and where there is a need for centralised standards and approach.
  • Product / Application SRE Teams — these teams are distributed, with each aligned to a particular application (if sufficiently large or critical), set of applications, or business area. I.e. a set of SREs aligned to a specific application or business area. Ideally, these teams will tend to act as spokes to the Infra SRE hub team. Without the hub, these teams have a tendency to do their own thing, and we see uncontrolled divergence. This approach of product SRE Teams is recommended for organisations that are sufficiently large, or that have outgrown the Kitchen Sink approach.
  • Embedded SREs — in this model, SREs are embedded into development teams. Thus, these SREs tend to be aligned to a specific application or business area. But they are not part of a dedicated SRE Team. These SREs will tend to only be embedded for the duration of the project.
    This approach is useful for product or delivery teams that need SRE capability for a fixed amount of time. However, the SREs can become isolated from the “SRE practice”, and there can be a lack of knowledge sharing between teams. Again, if using this approach, it works best when deployed alongside a “hub” team.
  • Consulting SREs — here, teams are augmented with SRE specialists who are there to upskill and advise the rest of the team. The role of the consultant SREs is to mature the capability of the SRE practice. It may be common to bring in this expertise from an external consultancy.

Identify and Upskill Suitable Candidates for SRE Teams

I’ve touched on this in the previous article. Good candidates will be:

  • System admins.
  • Engineers with automation and scripting experience.
  • Engineers with infra-as-code experience.
  • Engineers with strong IT service management skills, particularly around problem management.
  • Enthusiastic and motivated engineers who want to embrace modern technology, and are not afraid of change.

Having identified these candidates, they can be upskilled in the relevant areas. One way to do this is to put them through training for the Google Professional Cloud DevOps Engineer certification.

Adopt an SRE Culture

Adopting SRE requires that the organisation embraces:

  • Automation.
  • Infrastructure-as-code.
  • Low tolerance for toil.
  • SRE authority, including well-defined responses if an SLO is breached.
  • A blameless post mortem culture.

It is important for this message to be communicated strongly from the top.

Wrap-Up

That’s it for this article on how to establish SRE in your organisation! Next, we’ll look at landing zone design and technical Google Cloud onboarding.

Before You Go

  • Please share this with anyone that you think will be interested. It might help them, and it really helps me!
  • Please give me claps! You know you clap more than once, right?
  • Feel free to leave a comment 💬.
  • Follow and subscribe, so you don’t miss my content. Go to my Profile Page, and click on these icons:
Follow and Subscribe

Links

Series Navigation
