SRE with IBM Cloud Pak for Data

How IBM Cloud Pak for Data keeps its quality and reliability

Jingdong Sun
IBM Data Science in Practice
8 min read · Mar 22, 2021

a busy multilayer highway exchange

IBM Cloud Pak for Data is a comprehensive Data and AI platform that can be deployed on any cloud or on premises.

SRE (Site Reliability Engineering) was originally developed at Google to maintain service site reliability and keep human-detectable disruption within a desired objective (a Service-Level Objective, or SLO). It plays a critical role in the reliability and availability of cloud services.

Cloud Pak for Data, as an on-premises platform, does not need SRE to maintain its “site” reliability per se. However, as a cloud native offering, Cloud Pak for Data needs to give customers high reliability at the platform level for all integrated services.

For example, let's say a customer is planning to run Cloud Pak for Data with Watson Studio, Watson Knowledge Catalog, Watson Machine Learning, Watson OpenScale, Data Virtualization, DataStage, and Db2, as the image below shows. Keeping all of these services working together harmoniously and reliably is a challenge.

Four planes of services and integration: on the first level, there are Watson and IBM tools. On the second level, common services/dependencies, projects, collaborations, notifications, Refinery, and Spark. On the third level, the Cloud Pak for Data control plane, and on the fourth, Red Hat OpenShift.

Integrating SRE technology into the Cloud Pak for Data release pipeline can help resolve this challenge and fulfill the following goals:

  1. Monitor and validate the Cloud Pak for Data platform and services.
  2. Ensure that the Cloud Pak for Data platform and services integrate and meet reliability, scalability, and operational objectives (SLOs).
  3. Gain and share first-hand operation experience with Cloud Pak for Data customers.

This blog provides details of (1) the Cloud Pak for Data SRE architecture and framework, (2) SRE running environments, and (3) how it works.

Cloud Pak for Data SRE Architecture and Framework

The Cloud Pak for Data SRE infrastructure is set up as shown below:

a diagram of the Cloud Pak for Data SRE infrastructure composed of the following parts: SRE Analytics, SRE ops, SRE backend and dashboard, SRE workload and chaos, other metrics, and Cloud Pak for Data clusters

It includes three major components:

Backend and dashboard

This component includes:

  1. a dashboard that shows all metrics for the Cloud Pak for Data cluster and instance health and status.
  2. a backend datastore that keeps all Cloud Pak for Data metrics and histories.

Analytics and Operation

This component handles the following actions:

  1. It collects metrics from the SRE environments and parses them to extract valuable information about Cloud Pak for Data and its services.
  2. It generates notifications, opens defects, and triggers recovery operations based on the metrics and analytics results (a rough sketch follows this list).
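
A rough sketch of that loop is shown below. All of the names here (the Metric record, the SLO thresholds, the notify and open_defect hooks) are hypothetical stand-ins for illustration, not the actual Cloud Pak for Data SRE implementation:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    service: str  # e.g. "streams"
    name: str     # e.g. "probe_success_rate"
    value: float

# Hypothetical SLO thresholds, for illustration only.
SLO_THRESHOLDS = {"probe_success_rate": 0.9995}

def notify(message: str) -> None:
    print("NOTIFY:", message)  # stand-in for a real alerting channel

def open_defect(service: str, metric: str, value: float) -> None:
    print(f"DEFECT: {service} {metric}={value:.4f}")  # stand-in for an issue tracker

def analyze_and_operate(metrics: list[Metric]) -> None:
    """Check collected metrics against SLOs and trigger notifications/defects."""
    for m in metrics:
        slo = SLO_THRESHOLDS.get(m.name)
        if slo is not None and m.value < slo:
            notify(f"{m.service}: {m.name}={m.value:.4f} is below SLO {slo}")
            open_defect(m.service, m.name, m.value)

analyze_and_operate([Metric("streams", "probe_success_rate", 0.9980)])
```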

SRE Analytics and Operation uses a real Cloud Pak for Data application with the Watson Studio, Watson Machine Learning, and Streams services, as the following graph shows:

Image shows (1) at the top, real-time log analytics using the IBM Streams service and (2) at the bottom, static data analytics using Watson Studio projects and Watson Machine Learning services

  • Streams focuses on real-time log analytics, giving end users (the SRE team) quick feedback about test failures, cluster status, and so on; a simplified stand-in sketch follows this list.
  • Watson Studio projects and notebooks focus on static data analytics. They use ML models to give end users full reports of SLIs (Service-Level Indicators) based on logs and metrics.
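
To make the real-time side concrete, here is a minimal stand-in for that kind of log analysis in plain Python. The real pipeline runs as an IBM Streams application; the log format below is invented for illustration:

```python
from typing import Iterable, Iterator

def failures(log_lines: Iterable[str]) -> Iterator[str]:
    """Yield failure events from a live log stream as they arrive."""
    for line in log_lines:
        if "FAIL" in line or "ERROR" in line:
            yield line.strip()

# Hypothetical log lines standing in for a live stream:
sample = [
    "2021-03-01T10:00:00 PASS streams submitJob",
    "2021-03-01T10:01:00 FAIL streams cancelJob timeout",
]
for event in failures(sample):
    print("alert the SRE team:", event)
```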

Workload

To simulate customer environments, the SRE Workload runs service test cases generated from customer scenarios, including the following (a simplified manifest sketch follows the list):

  1. Test cases to cover service-specific customer scenarios.
  2. Test cases to simulate customer end-to-end IT solutions following the AI ladder and using multiple Cloud Pak for Data services.
  3. Multi-tenancy scenarios: running test cases when deploying (1) multiple Cloud Pak for Data instances within the same cluster, (2) multiple provisioned instances of a service within the same Cloud Pak for Data instance, or (3) a service in a tethered namespace.
  4. Operational scenarios such as install, uninstall, upgrade, patch, and backup and restore.
  5. Chaos tests.
  6. High-scale tests.
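
As a rough illustration, such a workload could be described declaratively. The manifest schema below is invented; only the service names and scenario categories come from the list above:

```python
# A hypothetical, simplified workload manifest covering the categories above.
WORKLOAD = [
    {"kind": "service-scenario", "service": "watson-studio", "case": "train-and-deploy-model"},
    {"kind": "end-to-end", "services": ["datastage", "watson-machine-learning"], "case": "ai-ladder"},
    {"kind": "multi-tenancy", "case": "two-cpd-instances-in-one-cluster"},
    {"kind": "operations", "case": "upgrade-then-backup-and-restore"},
    {"kind": "chaos", "case": "kill-random-service-pod"},
    {"kind": "scale", "case": "high-concurrency-jobs"},
]

for item in WORKLOAD:
    print(f"scheduling {item['kind']} workload: {item['case']}")
```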

Cloud Pak for Data SRE Running Environments

A Cloud Pak for Data SRE team maintains three kinds of clusters (test, stage, and production) and deploys all Cloud Pak for Data services in them. These include the services I mentioned at the beginning of this blog (Watson Studio, Watson Knowledge Catalog, Watson Machine Learning, Watson OpenScale, Data Virtualization, DataStage, and Db2) and many more besides. These clusters run in different stages of the Cloud Pak for Data release pipeline:

Different stages of the Cloud Pak for Data release pipeline: local → dev → promoted → stable → stage → prod
  1. Test clusters: These clusters integrate into the “stable” stage of the Cloud Pak for Data release pipeline and run service workloads and end-to-end scenarios, giving service development teams quick, metrics-based feedback. These clusters generally have a short life cycle of hours.
  2. Stage clusters: These clusters integrate into the “stg” stage of the pipeline and run long-running service end-to-end scenarios, operational scenarios, chaos tests, and scaling workloads. These clusters generally have a life cycle of days.
  3. Production clusters: These clusters mimic the customer production environment and run many scenarios the way real customers do over a long period, including service upgrades, backup and restore, and migration. By design, these clusters run all the time, from release to release.

How does it work?

Following general SRE practices, Cloud Pak for Data SRE tracks everything using SLOs and SLIs (metrics).

Here, I use one important metric, availability, to show how Cloud Pak for Data SRE works.

First, how the availability metric is calculated:

Color | Description
Red | Low: not aligned with high availability; vulnerable to outages.
Yellow | Medium: partially exploiting high availability, but still vulnerable to outages.
Green | High: fully utilizing technology capability for high availability.

Availability is a metric that describes the percentage of time a service is functioning; it is also referred to as the “uptime” of a service. A good availability metric should:

  • reflect the end-user experience.
  • change in proportion to changes in user-perceived availability.
  • give the development team insight into why it does not meet the SLO.

The Cloud Pak for Data SRE team took the following journey of calculating service availability:

Initial Approach: Based on service pods’ uptime and downtime:

Initially, the SRE team collected service pods’ metrics and calculated availability as uptime / (uptime + downtime).
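
In code form, that initial calculation looks something like the sketch below; the (state, minutes) event format is hypothetical:

```python
def pod_availability(events: list[tuple[str, float]]) -> float:
    """Availability = uptime / (uptime + downtime), from per-pod
    (state, duration_in_minutes) records."""
    uptime = sum(minutes for state, minutes in events if state == "up")
    downtime = sum(minutes for state, minutes in events if state == "down")
    return uptime / (uptime + downtime)

# One pod down for 10 minutes in a week looks alarming here, even if
# replicas kept the service fully usable for end users the whole time.
print(pod_availability([("up", 10070), ("down", 10)]))  # ~0.9990
```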

The problem with this approach was that Cloud Pak for Data services are distributed and have replicas, so a service pod being down may not impact the end-user experience at all. Such metrics do not reflect availability as end users see it, and their changes are not proportional to changes in user-perceived availability.

Current Approach: Based on the success rate of simulated service API probes:

Since pod metrics cannot reflect service availability correctly, the SRE team started using service APIs to track service uptime.

Using IBM Streams as an example, here is the method by which the SRE team calculated Streams’ service availability:

Step 1: The team created probes using Streams’ service APIs, such as startInstance, submitJob, cancelJob, stopInstance, and so on. These kinds of probes stay very close to the end-user experience while remaining lightweight.

Step 2: The team ran these probes at fixed intervals: every 30 seconds or one minute initially, adjusted later based on experience.
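
Steps 1 and 2 might look like the following sketch. The probe endpoints and URLs here are hypothetical placeholders, not the actual Streams API:

```python
import time
import urllib.request

# Hypothetical endpoints standing in for startInstance, submitJob, etc.
PROBE_URLS = {
    "submitJob": "https://cpd.example.com/streams/v1/probe/submit",
    "cancelJob": "https://cpd.example.com/streams/v1/probe/cancel",
}

def run_probe(name: str, url: str, timeout: float = 10.0) -> bool:
    """Return True if the lightweight API probe succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def probe_loop(interval_seconds: int = 60) -> None:
    """Run every probe once per interval and record pass/fail per API."""
    while True:
        for name, url in PROBE_URLS.items():
            ok = run_probe(name, url)
            print(f"{name}: {'pass' if ok else 'FAIL'}")
        time.sleep(interval_seconds)
```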

Step 3: The team also collected metrics such as the pass/fail rate of the probe calls for each API.

  • If a call passes, assume the service is up until the next call.
  • If a call fails, assume the service is down until the next call.

Step 4: Calculate Streams’ service availability as total uptime / (total uptime + total downtime). For example, if probes run at a one-minute interval and there are two failed API calls in a weeklong period, the availability is 99.98%. With an availability SLO of 99.95%, there should be no more than five failures within a week.
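
The arithmetic in Step 4, checked in a few lines of Python:

```python
# One-minute probe interval over a week: 7 * 24 * 60 = 10,080 calls.
total_calls = 7 * 24 * 60
failed_calls = 2  # each failure counts as one minute of downtime
availability = (total_calls - failed_calls) / total_calls
print(f"{availability:.4%}")  # 99.9802%, i.e. roughly 99.98%

# A 99.95% SLO leaves 0.05% of 10,080 minutes of downtime per week,
# hence no more than five failed one-minute probes.
print(round((1 - 0.9995) * total_calls, 2))  # 5.04
```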

Second, the details of an SRE flow:

Cloud Pak for Data and its services support monthly on-demand releases. Let us continue using IBM Streams as an example to show how Cloud Pak for Data SRE works.

drawing of people interacting with different types of digital experiences

Assume IBM Streams plans a monthly release for March 2021 with a new feature. The Streams team plans to (1) develop the feature in the first couple of weeks, then (2) run an integration test in the third week, and finally (3) release in the fourth week.

When the Streams team finishes the feature development, completes all of the unit tests and the functional verification test, and delivers the build to the “Promoted” stage, an SRE Jenkins job is triggered to uninstall and reinstall the Streams service in existing test clusters, or to create new clusters when needed.

When the Streams installation finishes, it notifies the SRE backend service with the build version and timestamp, so the SRE service can track the service’s build history.
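
That notification can be as small as a single HTTP call to the SRE backend. A sketch, with an invented endpoint and payload schema:

```python
import json
import urllib.request
from datetime import datetime, timezone

def register_build(service: str, build_version: str) -> None:
    """Report a finished service installation to the SRE backend."""
    payload = json.dumps({
        "service": service,
        "build": build_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://sre-backend.example.com/v1/builds",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# register_build("streams", "2021.3.0-123")  # example call
```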

As all Cloud Pak for Data services, SRE workloads, and probes keep running in these test clusters, they reveal how well this Streams feature delivery integrates with the other Cloud Pak for Data services. The SRE Analytics and Operation component pulls all running services’ metrics, workload logs, and dashboard metrics to generate their SLIs and report them to the dashboard.

  1. If the SLIs meet the expected criteria (SLOs), notify the build team to promote the Streams service build to the next stage.
  2. If the SLIs do not meet the criteria (SLOs), notify the team to roll back the service to the prior level in the SRE clusters and reject the service build, as the sketch after this list illustrates.
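
A minimal sketch of that promotion gate, with invented SLI and SLO names:

```python
def gate(slis: dict[str, float], slos: dict[str, float]) -> bool:
    """Promote the build only if every SLI meets its SLO."""
    misses = {name: value for name, value in slis.items()
              if value < slos.get(name, 0.0)}
    if not misses:
        print("promote the build to the next stage")
        return True
    print("reject the build and roll back in SRE clusters:", misses)
    return False

gate({"availability": 0.9998, "probe_success_rate": 0.9990},
     {"availability": 0.9995, "probe_success_rate": 0.9995})
```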

Assuming the SLIs meet the criteria (SLOs), the Streams build goes into the “Stg” stage, which triggers Streams installation in the SRE stage clusters, where workload coverage is more comprehensive and longer-running. If any Streams bug is found in these test or stage clusters, the SRE team automatically opens defect reports for the Streams team.

If all goes well after several days of running, the Streams feature is released, and the SRE production clusters are upgraded to the latest Streams release.

Conclusion

As a cloud native platform with many services tightly integrated and functioning together, Cloud Pak for Data requires high availability and reliability across all services. SRE plays a critical role in helping Cloud Pak for Data and its services meet these goals and objectives at the platform level. This blog gave a deep dive into how Cloud Pak for Data SRE works.

To better understand how Cloud Pak for Data services integrate together to give the customer a better enterprise solution, the following links can help:
