Deployment risk as a Quality Gate

Published in EPAM Delivery Platform, Oct 24, 2022

Hi, my name is Sergiy Kulanov. I work as a Systems Architect at EPAM, Ukraine. We are building the EPAM Delivery Platform (EDP) to minimize the time and effort needed to establish CI/CD practices for products. A general overview of the platform is beyond the scope of today’s discussion, but we will address it in upcoming articles.

Abstract

Modern systems produce many logs, which we analyze when troubleshooting issues and incidents. Several solutions use a static approach based on searching records for specific words, such as “exception”, “failed”, or “error”. Manual log analysis is time-consuming, but it is still used in many cases. That’s why we need an automated process that highlights log anomalies and helps prevent potential system failures. For example, automatic analysis based on Artificial Intelligence and Machine Learning (AI/ML) can be applied to both phases of the Application Lifecycle: Development and Operations.

This article describes how we adopted the deployment risk analysis metric as a Quality Gate during the Application Development Phase using EDP and how we use this quality gate in our platform development process. It also covers the pros and cons and what might be improved in the future. We do not compare different techniques and models, such as supervised versus unsupervised learning; that topic is out of the scope of this article. Instead, we want to run this verification early during development and give feedback on whether the deployment risk analysis brings us value.

Products overview

EDP is an open-source cloud-agnostic SaaS/PaaS solution for software development, licensed under Apache License 2.0. It provides a pre-defined set of CI/CD patterns and tools, which allow a user to start product development quickly with established code review, release, versioning, branching, and build processes. These processes include static code analysis, security checks, linters, validators, and dynamic feature environment provisioning. EDP consolidates the top Open-Source CI/CD tools by running them on Kubernetes/OpenShift, enabling web/app development in isolated (on-prem) or cloud environments.

Logsight is the option we chose for the deployment risk analysis. The solution provides different features for log processing, such as Auto Logging, Stage Verification, and Incident Detection. This article covers only the Stage Verification module, which performs “log data analysis results between the previous and the next release to get insight into the deployment risk score.” We can treat a release as a particular state in the application release lifecycle. Using the more general term “state”, we can analyze deployment risk for an application before and after adding a feature, for example, during the end-to-end (e2e) test phase. In this case, we can compare the logs of both states and estimate possible risks.

Many products provide log analytics capabilities, such as Splunk+Autopilot. Another excellent example of applying AI/ML to CI/CD Pipelines is described in the Harness blog. Harness also has a good description of how they perform Continuous Verification using different providers.

Deployment approach

Logsight.ai can be used as a PaaS or SaaS solution and is available with two plans: Free Community and Enterprise. In the case of EDP, we use the PaaS approach and deploy Logsight with our Delivery Platform inside the Kubernetes cluster. On-premises deployment (PaaS) suits cases where logs must be kept inside an organization and must not leave a secured perimeter. The PaaS solution consists of six components plus an optional seventh, which run as containers inside the cluster and can be deployed using Helm charts:

Core service — the core of the system.
Result-API service — responsible for obtaining results, for example, during the verification.
Frontend — UI service.
Backend — SpringBoot Application — provides all necessary APIs and user management.
PostgreSQL — Persistence Layer.
ElasticSearch — Persistence Layer.
(Optional) Kafka — for scaling.

See the deployment diagram below:

As a part of the Observability stack, EDP supports two Logging solutions: EFK (Elastic, FluentBit, and Kibana) and PLG with multi-tenancy (Promtail, Loki, and Grafana). Although EFK ships with EDP, we recommend deploying a separate instance of ElasticSearch+Kibana for Logsight.ai. This approach adds support overhead but follows the “separation of concerns” principle and lets us manage the components independently, for example, when scaling them or controlling access. Modifications are applied at the log aggregator level, which is Fluent Bit in our case, to transform the logs into the expected format and route them for AI/ML analysis. See the deployment schema and integration with an existing EFK stack below:

Logs aggregation and analysis flow with two ElasticSearch clusters

To reduce the size of the logs under analysis, we recommend filtering/routing only the required records, such as specific namespaces or applications, to the Logsight core. Below is an example of the Fluent-Bit config for getting logs from the edp-sit namespace:
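The selection rule itself is simple and can be illustrated outside Fluent Bit. The Python sketch below is not Fluent Bit configuration; it only shows the filtering logic, assuming each record carries the `kubernetes.namespace_name` field that Fluent Bit's Kubernetes filter typically adds. The function name and the sample records are hypothetical.

```python
from typing import Iterable, Iterator

# Namespace named in the text above; extend the set to route more namespaces.
TARGET_NAMESPACES = {"edp-sit"}


def select_records(records: Iterable[dict], namespaces: set[str]) -> Iterator[dict]:
    """Yield only the records whose Kubernetes namespace is in `namespaces`."""
    for record in records:
        # Assumed record layout: {"log": "...", "kubernetes": {"namespace_name": "..."}}
        namespace = record.get("kubernetes", {}).get("namespace_name")
        if namespace in namespaces:
            yield record


# Hypothetical sample records, for illustration only.
sample = [
    {"log": "reconcile ok", "kubernetes": {"namespace_name": "edp-sit"}},
    {"log": "noise", "kubernetes": {"namespace_name": "kube-system"}},
    {"log": "no metadata"},
]
selected = list(select_records(sample, TARGET_NAMESPACES))
print(len(selected))  # 1
```

Records without Kubernetes metadata are dropped rather than forwarded, which keeps the volume sent for analysis as small as possible.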

Get Deployment Risk as a part of the EDP CD Pipeline

The Stage Verification module provides the deployment risk value. The official Logsight.ai documentation explains in detail how this stage works. There are two ways to get the Deployment Risk value: manually, using the Logsight User Interface, or programmatically, through the API. Using the API is common practice for CI/CD pipeline integration. Logsight also supports GitHub Actions for running verification steps from GitHub workflows.

EDP currently supports only two CI Tools:

Jenkins (in a stable version for EDP), which consists of the jenkins-operator and two libraries: for pipelines and stages.
Tekton (alpha version, planned to be released by the end of 2022), which consists of the EDP interceptor and Tekton Pipelines.

Today, EDP supports the deployment-risk step only in Jenkins, but the plan is to add the Tekton task. This step performs the following actions:

Consolidates information about the components currently and previously deployed in the Environment.
Provides information from the step above to the result-api-service from Logsight.
Preserves analysis results as an artifact.
Provides handy links to the reports.
Fails the pipeline in case of a violation. There is an option to ignore the results, which is helpful during early Logsight adoption on a project, when the Quality Gate runs in non-voting mode.
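The first and last actions in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Jenkins library code: the `baselineTags` and `candidateTags` key names are borrowed from the API response shown later in this article, the function names and the `voting` flag are hypothetical, and no call to Logsight is made.

```python
def build_comparison(namespace: str, container: str,
                     baseline_image: str, candidate_image: str) -> dict:
    """Consolidate what was deployed before and what is deployed now."""
    def tags(image: str) -> dict:
        return {"namespace": namespace, "container": container, "image": image}

    # Key names are an assumption, mirrored from the verification response.
    return {"baselineTags": tags(baseline_image), "candidateTags": tags(candidate_image)}


def gate_passes(risk: int, threshold: int, voting: bool = True) -> bool:
    """Return True if the pipeline may continue after the deployment-risk step."""
    if risk <= threshold:
        return True
    # In non-voting mode a violation is reported but does not fail the pipeline.
    return not voting


registry = "012345678910.dkr.ecr.region.amazonaws.com/edp-delivery"
comparison = build_comparison(
    "edp-delivery-eks-sit",
    "jenkins-operator",
    f"{registry}/jenkins-operator:2.13.0-SNAPSHOT.6",
    f"{registry}/jenkins-operator:2.13.0-SNAPSHOT.7",
)
print(gate_passes(79, threshold=70))                # False: the pipeline fails
print(gate_passes(79, threshold=70, voting=False))  # True: the result is ignored
```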

The image contains the EDP CD Pipeline with consecutive runs.

CD Pipeline for the EDP lower Environment with two Quality Gates (autotests and deployment-risk)

Examining Quality Gates on EDP

EDP consists of several CI/CD tools and Kubernetes Operators that manage and configure those tools. A typical Environment runs up to 17 deployments, and we run Integration Tests (Autotest step) as a Quality Gate for the lower Environment. Each time a new change is merged into any of these components, we automatically perform the Platform Deployment and Integration Testing. This approach ensures we change the platform incrementally, in small batches. When the first Quality Gate (Autotest) passes, we run the Stage Verification for each EDP component. If any of them has a Deployment Risk above 70%, we mark the pipeline as failed and block artifact promotion to the higher Environment. We intentionally run the deployment-risk step after the autotests to ensure our components produce enough logs from both the deployment and operational (business logic) phases. See the run of the pipelines below:
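The promotion rule described above (any component above 70% blocks the run) can be expressed as a short Python sketch. The function name is hypothetical, and the risk values are made up for illustration, except for the jenkins-operator value of 79 analyzed later in this article.

```python
RISK_THRESHOLD = 70  # percent; a component above this value fails the gate


def components_over_threshold(risks: dict[str, int],
                              threshold: int = RISK_THRESHOLD) -> list[str]:
    """Return the names of components whose Deployment Risk exceeds the threshold."""
    return sorted(name for name, risk in risks.items() if risk > threshold)


# Hypothetical per-component results of one pipeline run.
risks = {"jenkins-operator": 79, "keycloak-operator": 12, "gerrit-operator": 35}

blocked_by = components_over_threshold(risks)
promote = not blocked_by  # a single violating component blocks promotion
print(blocked_by, promote)  # ['jenkins-operator'] False
```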

Let’s investigate the root cause of both failures.

Autotests

EDP provides Allure Framework with Jenkins integration to perform test analysis. In the next EDP version, we plan to switch to ReportPortal.io as a unified platform for Tests Aggregation and Analysis. Follow the EDP documentation to install ReportPortal.io.

Allure Report provides all the details regarding the Pipeline (BUILD_ID 129) failure:

By checking the Environment and the detailed information about the failed test, we conclude that there is an issue with the Review Pipeline:

Deployment Risk

Now let’s analyze the Deployment Risk value for the Pipeline with BUILD_ID 127. We use Pipeline Logs and Logsight UI to investigate the root cause of the high Deployment Risk value. Examine the Logsight.ai API response from the Jenkins logs below:

{
  "link": "https://logsight.example.com/pages/compare?compareId=xxxxx",
  "baselineTags": {
    "namespace": "edp-delivery-eks-sit",
    "container": "jenkins-operator",
    "image": "012345678910.dkr.ecr.region.amazonaws.com/edp-delivery/jenkins-operator:2.13.0-SNAPSHOT.6"
  },
  "candidateTags": {
    "namespace": "edp-delivery-eks-sit",
    "container": "jenkins-operator",
    "image": "012345678910.dkr.ecr.region.amazonaws.com/edp-delivery/jenkins-operator:2.13.0-SNAPSHOT.7"
  },
  "compareId": "xxxxx",
  "risk": 79,
  "totalLogCount": 1115,
  "baselineLogCount": 630,
  "candidateLogCount": 485,
  "candidateChangePercentage": -14.0,
  "addedStatesTotalCount": 11,
  "addedStatesReportPercentage": 81.0,
  "addedStatesFaultPercentage": 19.0,
  "deletedStatesTotalCount": 37,
  "deletedStatesReportPercentage": 54.0,
  "deletedStatesFaultPercentage": 46.0,
  "recurringStatesTotalCount": 144,
  "recurringStatesReportPercentage": 98.0,
  "recurringStatesFaultPercentage": 2.0,
  "frequencyChangeTotalCount": 16,
  "frequencyChangeReportPercentage": {"decrease": 32, "increase": 62},
  "frequencyChangeFaultPercentage": {"decrease": 7, "increase": 0}
}
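When reading such a response in a script, a few fields are usually enough to see what was compared and what the verdict was. The Python sketch below parses a trimmed copy of the response above; the `image_tag` helper and the `summary` layout are hypothetical and only serve the illustration.

```python
import json

# Trimmed copy of the response above; only the fields used here are kept.
response_text = """
{
  "baselineTags": {"container": "jenkins-operator",
                   "image": "012345678910.dkr.ecr.region.amazonaws.com/edp-delivery/jenkins-operator:2.13.0-SNAPSHOT.6"},
  "candidateTags": {"container": "jenkins-operator",
                    "image": "012345678910.dkr.ecr.region.amazonaws.com/edp-delivery/jenkins-operator:2.13.0-SNAPSHOT.7"},
  "risk": 79,
  "baselineLogCount": 630,
  "candidateLogCount": 485
}
"""


def image_tag(image: str) -> str:
    """Return the part of an image reference after the last colon."""
    return image.rsplit(":", 1)[-1]


report = json.loads(response_text)
summary = {
    "container": report["candidateTags"]["container"],
    "baseline": image_tag(report["baselineTags"]["image"]),
    "candidate": image_tag(report["candidateTags"]["image"]),
    "risk": report["risk"],
}
print(summary)
```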

We are comparing two versions of jenkins-operator: 2.13.0-SNAPSHOT.6, which is the baseline and was previously deployed on this Environment, and the new candidate 2.13.0-SNAPSHOT.7. Most of the parameters are self-explanatory, and we are interested in the final “risk” value, which equals 79. Please refer to the Logsight.ai documentation to understand how this value has been calculated. To troubleshoot verifications, we use the UI part and perform the following steps:

Identify the states added to the State analysis table.
Identify the states characterized by Level = ERROR and/or Semantics = FAULT in the new version.
Identify the recurring states with Level = ERROR and/or Semantics = FAULT that have a HIGH frequency of changes.
Use the state description and locate the source file and the line number, which generated the state.
Fix the errors and faults in the codebase and restart a deployment.
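The selection part of these steps can be sketched in Python. This is one reading of the procedure, not Logsight code: the `State` fields, the status names, and the sample rows are assumptions that mimic the columns of the State analysis table, with messages shortened from the log entries discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class State:
    template: str                    # log message pattern shown in the table
    status: str                      # assumed values: "added", "recurring", "deleted"
    level: str                       # e.g. "INFO", "ERROR"
    semantics: str                   # "REPORT" or "FAULT"
    frequency_change: str = "NONE"   # assumed values: "NONE", "LOW", "HIGH"


def is_faulty(state: State) -> bool:
    return state.level == "ERROR" or state.semantics == "FAULT"


def states_to_review(states: list[State]) -> list[State]:
    """Pick faulty added states and faulty recurring states that changed a lot."""
    picked = [
        s for s in states
        if is_faulty(s)
        and (s.status == "added"
             or (s.status == "recurring" and s.frequency_change == "HIGH"))
    ]
    # Stable sort: added states first, then recurring ones.
    return sorted(picked, key=lambda s: s.status != "added")


# Hypothetical rows for illustration.
table = [
    State("Failed to get Keycloak Realm", "recurring", "ERROR", "FAULT", "HIGH"),
    State("Reconciling has been finished", "recurring", "INFO", "REPORT"),
    State("Getting Crumb failed", "added", "ERROR", "FAULT"),
    State("is forbidden: User cannot get resource", "deleted", "ERROR", "FAULT"),
]
for state in states_to_review(table):
    print(state.status, "|", state.template)
```

Deleted states are left out of the review list on purpose: they describe behavior that is gone in the candidate.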

Let’s start with the analysis of the states. Overall, the candidate produced fewer logs than the baseline: 485 lines versus 630 (the API reports a candidateChangePercentage of -14%). Eleven new states (new log entries) were added that were absent in the baseline version. Logsight classifies these new states as “Report” (81%) and “Fault” (19%). A high number of faulty new states increases the estimated risk. The “Recurring states” are stable, with only 2% of them classified as “Fault”. The good news is that we removed 37 states, and almost half of them (46%) are “Fault” by Semantics, so their removal does not add to the risk.
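To make these percentages more tangible, they can be converted into approximate state counts. The sketch below assumes that each Fault percentage is the share of states in its group; that reading is ours, and the exact definition should be checked against the Logsight.ai documentation. The helper name is hypothetical.

```python
# Values copied from the API response shown earlier.
report = {
    "addedStatesTotalCount": 11, "addedStatesFaultPercentage": 19.0,
    "deletedStatesTotalCount": 37, "deletedStatesFaultPercentage": 46.0,
    "recurringStatesTotalCount": 144, "recurringStatesFaultPercentage": 2.0,
}


def approx_fault_count(report: dict, kind: str) -> int:
    """Estimate how many states of a group are classified as Fault."""
    total = report[f"{kind}StatesTotalCount"]
    share = report[f"{kind}StatesFaultPercentage"] / 100
    return round(total * share)


counts = {kind: approx_fault_count(report, kind)
          for kind in ("added", "deleted", "recurring")}
print(counts)  # {'added': 2, 'deleted': 17, 'recurring': 3}
```

Under this assumption, roughly two of the eleven new states are faulty, while about seventeen faulty states disappeared with the candidate.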

The next step is to sort the states by Semantics to bring the “Fault” states to the top of the table and review them individually. These states caused the high Deployment Risk value.

In our example, the jenkins-operator makes more attempts to set up the keycloak-client (the frequency changes from 11% to 88%). It happens because the keycloak-operator hasn’t finished the Keycloak configuration yet:

{"level":"error","ts":<:NUM:>.<:NUM:>,"logger":"controller-runtime.manager.controller.jenkins","msg":"Reconciler error","reconciler group":"<:ID:>.edp.epam.com","reconciler kind":"Jenkins","name":"jenkins","namespace":"edp-delivery-eks-sit","error":"Integration failed: Failed to get Keycloak Realm for jenkins client!: unable to get keycloak realm owner: owner not found","errorVerbose":"owner not found<:URI:> to get keycloak realm owner<:URI:> to get Keycloak Realm for jenkins client!<:URI:> failed<:URI:>

The second row in the table, which contains the newly added state, shows that there are some issues with accessing Jenkins indicated by a Bad Gateway error:

{"level":"error","ts":<:NUM:>.<:NUM:>,"logger":"controller-runtime.manager.controller.jenkinsscript","msg":"Reconciler error","reconciler group":"<:ID:>.edp.epam.com","reconciler kind":"JenkinsScript","name":"jenkins-config-keycloak","namespace":"edp-delivery-eks-sit","error":"Getting Crumb failed! Response code: <:NUM:>, response body: <html><:URI:> Bad Gateway<<:URI:> Bad Gateway<<:URI:> Crumb failed! Response code: <:NUM:>, response body: <html><:URI:> Bad Gateway<<:URI:> Bad Gateway<<:URI:>

Let’s examine the Deleted states to identify issues that were present in the baseline but no longer appear in the new version. Many of them contain permission errors, such as:

{"level":"error","ts":<:NUM:>.<:NUM:>,"logger":"controller-runtime.manager.controller.jenkinsserviceaccount","msg":"Reconciler error","reconciler group":"<:ID:>.edp.epam.com","reconciler kind":"JenkinsServiceAccount","name":"gerrit-ciuser-sshkey","namespace":"edp-delivery-eks-sit","error":"jenkinsserviceaccounts.<:ID:>.edp.epam.com <:URI:> is forbidden: User <:URI:> cannot get resource <:URI:> in API group <:URI:> in the namespace <:URI:>

Our candidate hasn’t introduced any security fixes, so these errors likely point to an issue with the deployment of the baseline version.

The steps above show how to perform the analysis, interpret the deployment risk, and examine potential issues.

Conclusions

We described an example of how log analysis on a project can use a more sophisticated approach, in our case AI/ML. We can apply it not only to the operational phase of an application but also during its development. Today, many solutions that deal with log aggregation provide such capabilities, so it can be valuable to use them to understand the behavior of your application from its logs.

For the EDP project, around 40% of pipelines in the lower Environments fail with a high Deployment Risk value. It happens because we run Kubernetes operators that are stateless by design, resilient to failures, and constantly running a reconciliation loop. In Kubernetes, it is expected to have a lot of moving parts that eventually reach the desired state. Sometimes it takes 5 minutes, and sometimes 25, leading to a high number of States with Frequency change in the Logsight.ai models.

Based on the result of the Deployment Risk value, we plan to introduce the following changes on the platform:

Tune the risk calculation process to be more tolerant of the retries for the operators.
Create the specific CustomResources (CRs) after the dependent resources are provisioned. For example, it might take up to 15 minutes to deploy SonarQube, Keycloak, and Gerrit in Kubernetes.
Review reconciliation loops and exponential back-off behavior for some controllers.
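The last two items share one idea: check whether a dependency is ready, and wait progressively longer between checks instead of retrying at a fixed rate. The Python sketch below illustrates capped exponential back-off in general terms; it is not the EDP controller code, and the function names, default delays, and readiness check are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 8) -> Iterator[float]:
    """Yield delays that grow by `factor` each time and never exceed `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def wait_until_ready(is_ready: Callable[[], bool], delays: Iterable[float],
                     sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Return the attempt number on which `is_ready` succeeded, or None."""
    for attempt, delay in enumerate(delays, start=1):
        if is_ready():
            return attempt
        sleep(delay)  # each failed check waits longer than the previous one
    return None


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]

# Simulated dependency that becomes ready on the third check; sleeps are recorded, not executed.
checks = iter([False, False, True])
slept: list[float] = []
attempt = wait_until_ready(lambda: next(checks), backoff_delays(), sleep=slept.append)
print(attempt, slept)  # 3 [1.0, 2.0]
```

Fewer retries while a dependency is still starting means fewer repeated error entries in the logs, which is what drives the Frequency change in the analysis above.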

References

EPAM Delivery Platform — https://epam.github.io/edp-install/
Logsight.ai — https://docs.logsight.ai/