Producing Observability Design to Support a Hybrid Cloud Strategy

Published in Hybrid Cloud How-tos, Nov 8, 2022

The IBM CIO Hybrid Cloud Platform team has been working on a major cloud transformation project. We’re building a hybrid cloud platform for our entire internal infrastructure using Red Hat technology, aiming for a modern platform that promotes agility, innovation, and efficiency while reducing operational costs.

Implementing a hybrid cloud platform has clear benefits, but operating this new platform creates new complexities. Virtual machines can become hundreds of pods and containers, monolithic applications transform into a complicated set of smaller independent components, and you can no longer troubleshoot a small number of servers and components to diagnose performance problems. In this more complex scenario, the answer to finding root causes and identifying how components interact across different regions and clusters is observability.

In this article, I will discuss how we developed our observability strategy to support the organization’s hybrid cloud journey, some lessons learned, and the results of the platform adoption from an observability perspective.

Designing an observability strategy

Our first step in designing an observability strategy for this new hybrid cloud platform was to evaluate the size of the challenge so that we could decide how to approach it.

After defining the organization’s strategy, main requirements, resources, and budget for the project, we started to organize the team to develop the hybrid cloud transformation’s observability strategy.

Our observability target is to support the IBM CIO Hybrid Cloud organization in its hybrid cloud transformation. Specifically, we want to help the organization manage the new, complex infrastructure; reduce operations efforts; improve application and infrastructure reliability, performance, and efficiency; and contribute to data-driven strategies.

We based our definitions on these functional requirements:

- Define observability targets for all applications using Site Reliability Engineering’s (SRE) four golden signals (latency, traffic, errors, and saturation). These targets let us assign monitoring maturity levels and assess how well we are doing.

- Use existing tools to reduce internal development as much as possible. The wisdom of this early decision was reinforced after some frustrating attempts to develop internal solutions.

- The observability solutions must cover the new hybrid cloud and existing legacy platforms. This gives us full observability for applications that, for example, are on the hybrid cloud but access data on mainframe systems.

- Funding comes from the organization rather than using internal cost recovery. Initially, we tried funding through cost recovery by charging teams for the new tools. But this proved to be a poor way to expand observability because some teams opted out over the cost.

- We will provide two observability solution models: self-service and consumption-as-a-service. This was another important decision because some teams prefer to take care of their own monitoring configuration while others don’t have sufficient knowledge and want another team to do it for them.
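To make the first requirement concrete, here is a minimal sketch of how the four golden signals could be summarized for one application over a time window. It is an illustration only, not the team's implementation: the `Request` record, the `golden_signals` function, and the use of CPU utilization as the saturation measure are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request; the field names are illustrative."""
    duration_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Summarize a window of requests as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request")
    if window_s <= 0:
        raise ValueError("window_s must be positive")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value with at least 95% of
    # the observations at or below it.
    rank = max(1, math.ceil(0.95 * len(durations)))
    failed = sum(1 for r in requests if not r.ok)

    return {
        "latency_p95_ms": durations[rank - 1],        # latency
        "traffic_rps": len(requests) / window_s,      # traffic
        "error_rate": failed / len(requests),         # errors
        "saturation": cpu_utilization,                # saturation (assumed: CPU)
    }


if __name__ == "__main__":
    sample = [Request(100.0, True)] * 9 + [Request(900.0, False)]
    print(golden_signals(sample, window_s=5.0, cpu_utilization=0.7))
```

A maturity assessment could then compare each signal against a per-application target, but how those levels are defined is not described in this article.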

Supporting cultural transformation

Observability is not only about using tools to provide more visibility into systems; it can also completely change how teams work. Observability puts more focus on the application layer than on infrastructure, which changes the perception of urgency. Sometimes it’s not critical when an infrastructure component is down, but high latency on a system is always important. If this happens, the team shouldn’t need to go server by server to investigate a problem. Instead, they can check the observability solution and extract all the information they need.

For this reason, we invested significant time creating documentation, videos, knowledge-transfer sessions, training, and quick-starts to help teams reach their observability targets.

Defining the solution design and architecture

To keep the architecture open and evolving, and to promote innovation and best practices, we continually update it based on new market offerings and customer needs.

As the architecture reference diagram below shows, our target is to have an environment that merges proprietary agents with generic agents (OpenTelemetry) or manual instrumentation to avoid vendor lock-in. Another important idea is to manage API calls to our tools through a dedicated solution, which adds a security layer for external entities that access our data. We also want to provide a workflow solution to orchestrate all interactions and a database to maintain the important static external data required to create dashboards.
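As a rough illustration of what manual instrumentation means in practice, the sketch below wraps a unit of work in a timed span using only the Python standard library. It deliberately does not use the OpenTelemetry SDK or any vendor agent; the `span` helper, the in-memory `SPANS` list, and the `lookup_order` function are hypothetical names invented for this example.

```python
import time
from contextlib import contextmanager
from typing import Dict, List

# Collected spans. In a real setup these would be exported to a tracing
# backend instead of being kept in memory (assumption for this sketch).
SPANS: List[Dict[str, object]] = []


@contextmanager
def span(name: str, **attributes):
    """Record how long the wrapped block takes and whether it failed."""
    start = time.perf_counter()
    record: Dict[str, object] = {
        "name": name,
        "attributes": attributes,
        "error": None,
    }
    try:
        yield record
    except Exception as exc:
        # Keep the failure on the span, then let the caller see it.
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        SPANS.append(record)


def lookup_order(order_id: str) -> dict:
    """Hypothetical business function instrumented by hand."""
    with span("lookup_order", order_id=order_id):
        return {"id": order_id, "status": "shipped"}


if __name__ == "__main__":
    lookup_order("42")
    print(SPANS)
```

Keeping the instrumentation behind a small, neutral helper like this is one way to limit lock-in: the code that emits spans does not need to change if the backend that receives them does.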

Functional requirements

Our solution will:

- Define all tools required to cover the requirements.

- Provide application performance management (APM), synthetic performance monitoring, real-user monitoring (RUM), and logging solutions for heterogeneous environments.

- Integrate all tools as a single observability solution.

- Notify the relevant teams when a problem happens.

- Automate the resolution of infrastructure issues as much as possible.

- Manage infrastructure resources efficiently.

Important architecture decisions

Our solution will:

- Implement hybrid solutions (SaaS and on-premises) but prioritize SaaS solutions where they fit.

- Balance multi-tenant and single-tenant use with role-based access controls.

- Avoid creating new infrastructure components as much as possible, and set any that must be created to be highly available.

- Not migrate legacy components or tools that will be decommissioned soon.

Summarizing our observability solution

The following diagram summarizes our observability solution. It starts with onboarding the application, passes through application data collection, then makes infrastructure optimization suggestions, and finally, delivers all the observability benefits.

The workflow below shows the observability solution interactions. The APM tool (Instana) collects all the application and infrastructure data. It processes the data to create topologies and traces; improve root-cause analysis (RCA); identify issues; send data to the logging tool for dashboarding; send alerts to IT operations management and IT service management tools for event correlation, ticketing, event automation (Ansible), and user notification; and send data to the application resource management (ARM) solution (Turbonomic), which helps the application and platform teams use the infrastructure with greater efficiency and lower hosting costs.
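The correlation and routing step in this workflow can be pictured with a small sketch: alerts are grouped per application, and each group becomes either an automated remediation or a single ticket. This is a simplified illustration and not how Instana, Ansible, or the ticketing tools actually exchange data; the `Alert` fields, the `PLAYBOOKS` mapping, and the playbook file name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    application: str
    component: str
    issue: str


# Hypothetical mapping from known issues to automated remediation playbooks.
PLAYBOOKS: Dict[str, str] = {"disk_full": "cleanup_disk.yml"}


def correlate(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Group alerts by application so each group becomes one incident."""
    groups: Dict[str, List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[alert.application].append(alert)
    return dict(groups)


def route(alerts: List[Alert]) -> List[dict]:
    """Produce one action per application from the correlated alerts."""
    actions = []
    for application, group in correlate(alerts).items():
        if all(a.issue in PLAYBOOKS for a in group):
            # Every issue in the group has a known fix: automate it.
            playbooks = sorted({PLAYBOOKS[a.issue] for a in group})
            action = {"type": "automate", "playbooks": playbooks}
        else:
            # At least one issue needs a human: open a single ticket.
            action = {"type": "ticket", "issues": sorted({a.issue for a in group})}
        action["application"] = application
        action["alert_count"] = len(group)
        actions.append(action)
    return actions


if __name__ == "__main__":
    incoming = [
        Alert("billing", "db", "disk_full"),
        Alert("billing", "api", "high_latency"),
        Alert("search", "node1", "disk_full"),
    ]
    for action in route(incoming):
        print(action)
```

Grouping before routing is what keeps the ticket count down: the two `billing` alerts in the usage example produce one ticket rather than two.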

Reviewing our results

We reduced operational efforts by providing:

- An easy way to identify dependencies

- Automation with RCA details where required

- Automatic diagnoses and RCA through artificial intelligence (AI)

- Fewer tickets by correlating and contextualizing them

We improved customer experience by:

- Resolving critical business issues faster

- Delivering better applications by analyzing their performance

- Supporting the need to implement new features and find problems in real-time

- Identifying and fixing performance problems before user experience is impacted

We improved business visibility with:

- Centralized information about application dependencies

- Centralized information for dashboards and reports

- Identification and quantification of issues with high business impact

Evolving our observability solution

Rather than deploying our entire observability solution at once, we evolved it incrementally. As the image below shows, we started with infrastructure monitoring, then added application availability monitoring, heterogeneous application monitoring, SRE, and finally SRE + AIOps.

We found that our observability solution was very successful. We monitor 2,000 applications, support 12,000 user accounts, resolve 18,500 incidents automatically, and have full observability into more than 650 applications.

Conclusion

Observability is an important piece of a hybrid cloud strategy and can make our journey easier and more exciting. It helps us connect the dots of complex hybrid solutions, providing intelligence, visibility, and efficiency for infrastructure and applications spread across different technologies and platforms.

Having a good understanding of the organization’s strategy, main requirements, and operational models can guide you to create a well-designed architecture with the best tools and solutions. Building an excellent technical solution and having organizational support to transform the culture are the necessary ingredients for a successful observability project.

Tiago Dias Generoso is a Distinguished IT Architect | Senior SRE | IBM Master Inventor based in Pocos de Caldas, Brazil. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.