Speed and resiliency: two sides of the same coin

Shikha Srivastava
Published in IBM Cloud
May 7, 2021 · 11 min read

Speed and innovation are crucial to the success of any business. The ability to innovate, test, validate, and release rapidly is key for businesses to stay ahead of their competition. At the same time, it is important to ensure that business-critical services have built-in resiliency, performance, and scalability.

Speed and resiliency are two sides of the same coin: customer confidence in the business. To achieve this confidence, mission-critical services should be built on cloud native principles in combination with site reliability engineering (SRE) principles.

The goals of this post are to examine:

  • What is cloud native and how it ties to SRE
  • What is SRE and how SRE practices can be part of the development lifecycle
  • How to measure SRE
  • SRE organization, and how to measure its effectiveness
  • What does SRE have to do with AI and ML

The following diagram shows how adopting cloud native leads to SRE efficiency and, in turn, to earning customer confidence. Let’s dive into the flow.

What is cloud native and how it relates to SRE

Ask yourself or your colleagues what cloud native, as-a-Service, or cloud-first means. You will get different answers. Responses might range from “cloud-first” to “born in the cloud” to “cloud native means microservices and containerization”.

The Cloud Native Computing Foundation (CNCF) defines cloud native as follows:

“Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”

Essentially, cloud native is about the balance between resiliency and agility. It is an approach to building and running responsive, scalable, and fault-tolerant applications that can run anywhere: in public, private, or hybrid clouds. Another lens for understanding cloud native is the Twelve-Factor App (https://12factor.net/), a set of best practices that guide the building of applications with built-in performance, automation, resiliency, elasticity, and diagnosability.

Let’s explore the meaning of the following cloud native terms:

Designed for automation:

  • Automation of development tasks
  • Test automation
  • Automation of infrastructure provisioning, updates, and upgrades

Designed for resiliency:

  • High availability
  • Fault tolerance and graceful degradation
  • Backup & restore

Designed for elasticity:

  • Automated scale up and down

Designed for performance:

  • Responsiveness with SLO and SLI defined
  • Efficiency and capacity planning

Designed for diagnosability:

  • Logs, traces, and metrics

Designed for efficient delivery:

  • Modular, microservices based
  • Automated deployments, upgrades, and updates
  • Efficient build process

These concepts describe SRE practices in a nutshell. Applying these practices in the development life cycle steers the architecture toward common standards.
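To make one of these practices concrete, consider “designed for diagnosability”. The following minimal sketch (my illustration, not from any specific product) emits structured logs that a log aggregator can index, alongside a simple request counter that a metrics system could scrape; the event and field names are hypothetical.

```python
# Sketch of "designed for diagnosability": structured (JSON) logs plus a
# request counter. In a real service the counter would come from a
# metrics library and be exposed for scraping.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
request_count = 0

def handle_request(user_id: str) -> None:
    global request_count
    start = time.monotonic()
    request_count += 1
    # ... business logic would run here ...
    logging.info(json.dumps({
        "event": "request_handled",          # hypothetical event name
        "user_id": user_id,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
        "total_requests": request_count,
    }))

handle_request("user-42")
```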

An important thing to note is that merely containerizing an application as-is does not achieve cloud native characteristics. These days it is possible to containerize almost any application; however, it takes additional effort to create a containerized application that can be automated and orchestrated effectively so that it behaves as a cloud native application on a platform such as Kubernetes. One example is using Kubernetes health probes, such as liveness and readiness probes, to enable graceful degradation; for more details, see the blog Are your Kubernetes readiness probes checking for readiness? Going through all the patterns is beyond the scope of this article. Kubernetes provides a portable, extensible platform for managing containerized workloads and services that facilitates both declarative configuration and automation. How each of the cloud native practices can be achieved with Kubernetes will be the topic of a subsequent blog. Some additional resources are included at the end of this post.
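As a rough sketch of the probe pattern just mentioned, here is a minimal service exposing separate liveness and readiness endpoints that a Kubernetes deployment might be configured to probe. The endpoint paths and the dependency check are assumptions for illustration, not a prescribed implementation.

```python
# Minimal sketch: separate liveness and readiness endpoints.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ready() -> bool:
    # Hypothetical check, e.g., database reachable, caches warmed.
    # Returning False tells Kubernetes to stop routing traffic here
    # without restarting the container.
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":      # liveness: is the process alive?
            self.send_response(200)
        elif self.path == "/readyz":     # readiness: can we serve traffic?
            self.send_response(200 if dependencies_ready() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ProbeHandler).serve_forever()
```

Failing the readiness probe takes the pod out of the load balancer without restarting it, while failing the liveness probe triggers a restart; that separation is what enables graceful degradation.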

Many applications are complex and take years to build. Additionally, many are built in a layered architecture, with contributions from a number of teams and technology groups. With a layered architecture, any user action might go several levels deep: from user interaction to authorization to a backend business logic service to automation processing, with additional layers possible depending on the use case. To reduce complexity, improve efficiency, and speed up development, it is critical to apply the lens of cloud native to each layer of the architecture when delivering such a service. Cloud native practices apply to the software delivery model as well.

What is SRE and how SRE practices can be part of the development lifecycle

Have you ever heard the expression “SRE is what happens when you ask a software engineer to design an operations team”? If you do a Google search, all of the results point to Google SRE. The Google SRE guide is a great place to learn about SRE.

The role of the SRE is to keep the organization focused on what matters most to users: ensuring that the platform and services are reliable. If you are familiar with the traditional disciplines of development and operations, SRE bridges the two. The goal of SRE is to codify every aspect of operations in order to build resiliency into infrastructure and applications. This implies that reliability deliverables are delivered via the same CICD pipeline as development, managed with version control tools, and checked for issues with test frameworks.
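As a hedged illustration of what “checked for issues by using test frameworks” can look like, here is a reliability check written as an ordinary pytest test that lives in version control and runs in the same pipeline as feature tests. The endpoint URL and latency bound are hypothetical.

```python
# A reliability check as code: versioned, reviewed, and run in CI
# alongside feature tests. Values below are illustrative assumptions.
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/readyz"   # hypothetical endpoint
LATENCY_BUDGET_S = 0.5                         # hypothetical SLO-derived bound

def test_service_is_ready_within_budget():
    start = time.monotonic()
    with urllib.request.urlopen(SERVICE_URL, timeout=LATENCY_BUDGET_S) as resp:
        assert resp.status == 200
    assert time.monotonic() - start < LATENCY_BUDGET_S
```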

In summary, SRE treats operations as a software delivery problem: it uses a software engineering approach to solve operational problems.

In an Embedded SRE model (described in the SRE organization section below), Development and SRE collaborate throughout the lifecycle of MVP delivery. As the MVP progresses through technical feature specification and development, the SRE collaborates with Development and offering management (OM) to ensure cloud native practices are enabled. For example, together they identify critical user journeys and the associated key SLIs and SLOs for each component.

The SRE should understand service design, including front end, back end, business logic, and database dependencies. This understanding is critical in order to document all failure points and deliver automation for service restoration. By using service design knowledge, the SRE should ensure delivery of the required automation that is described in the cloud native section.

As illustrated in the following diagram, Development and SRE collaborate to deliver functionality and reliability for MVP by using the same CICD delivery pipelines and release processes while focusing on their success metrics.

No organization starts from scratch, and shifting left for legacy services might not be as easy as for new ones. Incubating shift-left SRE with new services is a good way to start, then extending it iteratively to existing legacy services.

In some development models, there is the concept of “DONE, DONE, DONE”, which implies code: DONE, test automation: DONE, and documentation: DONE. Enabling SRE in a development organization implies DONE, DONE, DONE, and DONE; the additional “DONE” is for SRE enablement.

Measuring SRE

Now, as organizations build a development process in which SRE and Development collaborate to deliver each MVP, the question is: how do we measure the effectiveness of this process? For this measure, we need to look at the critical metrics committed to both externally and internally.

Service Level Agreement (SLA) — An SLA reflects customer expectations. It sets a promise to the consumer in terms of service availability and performance, and there are business consequences if promises are not kept.

Service Level Objectives (SLO) — SLOs are the reliability and performance goals a service sets for itself, and they are visible internally. Every service should have an availability SLO. The SLO determines how much investment is needed in the reliability of a service; more critical services should have higher SLOs. From the SRE perspective, the SLO defines the goal that SRE teams have to reach and measure themselves against.
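One way to see how an SLO drives investment is to translate it into an error budget. The following sketch, with an illustrative 99.9% target, computes the unavailability a service can tolerate over a 30-day window:

```python
# Sketch: translating an availability SLO into an error budget.
# The SLO value is illustrative; more critical services set higher targets.
SLO = 0.999                       # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60     # a 30-day rolling window

error_budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed unavailability per 30 days: {error_budget_minutes:.1f} minutes")
# -> roughly 43.2 minutes; a 99.99% SLO would shrink this to ~4.3 minutes,
#    which is why tighter SLOs demand much greater reliability investment.
```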

Now the question is: how is an SLO defined? The metrics that define SLOs should be limited to those that truly measure performance as the customer experiences it. Every service should consider client-side impact when defining these metrics.

Service Level Indicator (SLI) — An SLI is a metric that enables measurement of compliance against an SLO. Think of SLIs as a set of Key Performance Indicators (KPIs) that matter to customers. It is important that SRE, Development, and OM reach agreement on the SLIs that define the SLO, and hence the SLA.

See the following diagram for examples:

Here is how these three metrics (SLI, SLO, and SLA) are related: the service collects the KPIs that define its SLIs, sets thresholds on those metrics based on its SLOs, and monitors against those thresholds so that it does not violate the SLA.

In other words, SLIs are the metrics in the monitoring system, SLOs are the thresholds and alert rules applied to those metrics, and SLAs are the external commitments backed by the SLOs.
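Here is a minimal sketch of that relationship in code, with fabricated counter values: the SLI is computed from raw monitoring counters, and the SLO is the threshold an alert rule fires on.

```python
# Sketch of the SLI -> SLO relationship. Counter values are made up.
total_requests = 1_000_000
successful_requests = 999_100    # non-error responses

availability_sli = successful_requests / total_requests   # the measured SLI
AVAILABILITY_SLO = 0.999                                   # the internal target

if availability_sli < AVAILABILITY_SLO:
    print(f"ALERT: SLI {availability_sli:.4%} below SLO {AVAILABILITY_SLO:.1%}")
else:
    print(f"OK: SLI {availability_sli:.4%} meets SLO {AVAILABILITY_SLO:.1%}")
```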

The SLI and SLO definitions should be collaboratively agreed upon by the Development, SRE, and Service Offering teams. Following the principle of “you build it, you run it”, each service in the layered architecture should identify the KPIs for its service and make them measurable. These KPIs are the SLIs that define the SLO for each service.

As mentioned earlier, the SLA is external and should not be stricter than the SLO. The SLA is normally a looser objective than the internal SLO and relies on a subset of the metrics that make up the SLO.

Resiliency isn’t something that just happens; it takes time and iteration. It is the result of the organization’s support for operationalizing an SRE model that is itself sustainable and resilient. SRE is only as good as the organization supporting it.

Effective SRE depends on how well the SRE model is established.

SRE organization, and how to measure the effectiveness

No matter how well we architect and design a service, failure at some point is inevitable. Designing for 100% uptime is futile: aiming for it slows down the development of new features and functions, which adversely affects customer satisfaction as well.

Minimizing mean time to recovery (MTTR) is the other side of the SRE coin. By recovering quickly when things go wrong, customers still perceive the service as reliable. How quickly a service returns to a running state depends highly on the operational model of the SRE team.

Let’s look at some of the SRE team types.

Embedded SRE refers to SRE embedded in the functional development squads. Development and SRE work together to deliver application performance and reliability by using the same development CICD delivery pipelines and release processes, but they each focus on their own metrics of success.

  • Development focuses on the speed of release of new function
  • SRE focuses on enabling resiliency and reliability for the feature functions being delivered

Dedicated SRE refers to the dedicated SRE team responsible for keeping the service up and running. This team is efficient once the service is mature and stable with established automation and runbooks in place.

Platform SRE refers to the SRE team that takes care of the underlying platform where the services run, including Kubernetes clusters, network, storage, and so on.

Many organizations start with a pure Dedicated SRE + Platform SRE model, which is arguably a traditional operations model. They soon realize that SRE needs to start early in the cycle, needs to know the service components really well, and must be part of the Software Development Life Cycle (SDLC) to boost reliability. Once that realization sets in, they move to a hybrid Embedded + Dedicated + Platform model.

A good way to measure the effectiveness of the SRE model is to explore MTTR and other associated parameters, as illustrated in the following diagram:

MTTR (Mean Time to Recovery) is the average time to recover the service in the event of an outage. MTTR depends on the following key metrics:

TTD: Time to detect an outage, or an alert indicating a potential outage. This depends on (1) the quality of the ticketing system and how quickly the correct SRE is notified, and (2) monitoring key SLIs with threshold alerts so that an outage is detected automatically.

TTE: Time to engage. This depends on quick routing of the issue to the right SRE, and it is where the SRE model is most critical in reducing MTTR.

TTF: Time to restore the service. This depends on how well the SREs know the failure points and how much recovery automation is in place.

TTBD: Time to build and deploy fixes for identified bugs or additional identified automation.
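To make the decomposition concrete, here is a small sketch with fabricated timestamps for a single incident; averaging the total across incidents yields MTTR.

```python
# Sketch: one incident broken into the phases defined above.
# TTBD (follow-up fixes and automation) happens after recovery
# and is tracked separately here. All timestamps are fabricated.
from datetime import datetime

outage_start     = datetime(2021, 5, 1, 10, 0)
alert_fired      = datetime(2021, 5, 1, 10, 4)    # TTD ends: outage detected
sre_engaged      = datetime(2021, 5, 1, 10, 9)    # TTE ends: right SRE engaged
service_restored = datetime(2021, 5, 1, 10, 35)   # TTF ends: service recovered

ttd = alert_fired - outage_start
tte = sre_engaged - alert_fired
ttf = service_restored - sre_engaged
total_recovery = service_restored - outage_start

print(f"TTD={ttd}, TTE={tte}, TTF={ttf}, time to recovery={total_recovery}")
# Shrinking any phase (better alerting, routing, or recovery
# automation) shrinks MTTR.
```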

The SRE model should be designed with the goal of minimizing MTTR to gain customer confidence.

What does SRE have to do with AI and ML

This pattern of shifting SRE practices to the left, putting SRE measures in place, and optimizing the SRE organization reduces MTTR. As I mentioned previously, failure at some point is inevitable; the best that can be done is to be prepared for it. This preparedness is even more critical now that businesses are going digital at lightning speed. SREs have to deal with a variety of IT data, including logs, tickets, metrics, events, alerts, and more. As businesses move to hybrid and multicloud, SREs are observing an explosion of this IT data. This shift has put tremendous stress on SRE teams, as the increase in data has brought complexity that limits the ability to respond quickly.

This is what is motivating IT leaders to turn towards AIOps. AIOps enables SREs to harness AI and machine learning (ML) to prevent incidents and to respond to them faster.

Gartner defined AIOps as Artificial Intelligence for IT Operations. “AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection, and causality determination”.
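As a toy illustration of one AIOps building block named in that definition, anomaly detection, the following sketch flags latency samples sitting more than two standard deviations from the mean. Real AIOps platforms use far richer models, and the data here is fabricated.

```python
# Toy anomaly detection: flag latency samples more than two standard
# deviations from the mean. Data is fabricated for illustration.
import statistics

latencies_ms = [102, 98, 105, 99, 101, 97, 480, 103, 100]  # one spike

mean = statistics.mean(latencies_ms)
stdev = statistics.stdev(latencies_ms)

anomalies = [x for x in latencies_ms if abs(x - mean) > 2 * stdev]
print(f"mean={mean:.0f}ms stdev={stdev:.0f}ms anomalies={anomalies}")
# -> flags the 480 ms spike while leaving normal samples alone.
```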

IBM Cloud Pak for Watson AIOps uses AI and ML to make sense of IT data through Observe, Learn, Act, and Optimize, relieving the manual toil associated with the challenging SRE role and enabling organizations to speed up the development and delivery of new features and functions. The details will be the subject of a future blog. To learn more about IBM Cloud Pak for Watson AIOps, see the product documentation.

Conclusion

Finding a balance of speed and resiliency requires a shift in mental model. SRE is not just a set of practices and policies; it is a culture and a mindset for how to develop software. I hope you found this blog interesting and informative. If you or your organization have not embraced some of these practices, socialize these ideas so that you can incorporate them and take advantage of speed and resiliency in your practice.

Thanks to Stacy Pedersen for reviewing this article!


Shikha Srivastava
IBM Cloud

Shikha is a Distinguished Engineer & Master Inventor at IBM. She is the lead architect for IBM Multi-Cloud Manager.