SRE

What if you had an opportunity to build a Core Banking System from scratch? (Part 3)

Marcus Leandro

Published in

BTG Pactual Developers

6 min read · Jun 15, 2020

Release Engineering

More than just CI/CD

Our Core Banking application comprises at least a hundred microservices, deployed as Docker containers running on different Kubernetes clusters and as serverless applications running as AWS Lambda functions. In addition, we have a cloud-based infrastructure composed of S3, API Gateway, CloudFront, SNS, SQS, DynamoDB, Elasticsearch, EC2 resources, etc.

Imagine the challenge of promoting an application to production in the middle of such a large and complex environment. To help us, we based our processes on the principles of Release Engineering presented in Google’s Site Reliability Engineering (SRE) book.

In the next seven topics, I’ll walk you through our path to an automated, fast, secure, and regulation-compliant Release Engineering process.

Adopting a common versioning model

The first step was to define how we version our applications.

The application’s version is the key information used in the delivery process.

Whether adopting a canary release, a blue-green, or any other deployment strategy, the application’s version is the information used to guide the entire process.

The version also gives developers and architects the information they need to make decisions about software engineering and solution architecture.

After reaching a team consensus, we decided to use Semantic Versioning 2.0.0 to version all of our applications and shared libraries.

Following the SemVer rules, our applications increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backward-compatible manner, and
PATCH version when you make backward-compatible bug fixes.
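As a minimal sketch of the three rules above, the following Python snippet parses a `MAJOR.MINOR.PATCH` string and increments the requested part. The function name `bump` and its `part` argument are illustrative assumptions rather than our pipeline code, and pre-release and build metadata are left out for brevity.

```python
import re

# Matches only the MAJOR.MINOR.PATCH core of a SemVer 2.0.0 string
# (no leading zeros, no pre-release or build metadata).
_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def bump(version: str, part: str) -> str:
    """Return the next version after incrementing 'major', 'minor', or 'patch'.

    Hypothetical helper for illustration only.
    """
    match = _CORE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":
        # Incompatible API change: lower-order parts reset to zero.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward-compatible functionality: patch resets to zero.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backward-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For example, `bump("1.4.2", "minor")` returns `"1.5.0"`, because a MINOR increment resets PATCH to zero.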

After adopting the versioning model, we decided to automate version tagging in our CI process.

Because the SemVer logic is well defined, we can ensure that every build containing code changes generates a correct new application version. The only manual step left to developers is tagging the application repository when they want to release a new stable version, which triggers the entire CI/CD process.

Self-Service Model

“In order to work at scale, teams must be self-sufficient…” — Google SRE book

That is a fact here at BTG Pactual.

To meet this requirement, we decided to build a new CI/CD process from scratch on the Azure DevOps platform.

Azure DevOps is already our project management system, so it offers complete integration between user stories and requirements and the software development life cycle. It also offers a complete toolset for building a CI/CD process. It’s a perfect match.

Our CI process supports our own development process (based on GitHub Flow), the original GitHub Flow, and Git Flow. Every squad is free to choose the most suitable development process, and the CI automation will generate the correct application version according to the SemVer 2.0.0 rules.

Depending on the application version, the CD process knows whether an artifact is a feature test, a beta test, or a stable release and routes it to the correct environment.
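One way to picture this routing is to classify an artifact by the pre-release part of its SemVer string. The sketch below assumes that pre-release labels start with `feature` or `beta` and that a version without a pre-release part is stable. The label convention and the function name `classify_artifact` are assumptions made for illustration; a real tagging scheme may differ.

```python
def classify_artifact(version: str) -> str:
    """Classify a SemVer string as 'feature', 'beta', or 'stable'.

    Assumption: pre-release labels start with 'feature' or 'beta'.
    This is a hypothetical convention, not a documented one.
    """
    # Build metadata (after '+') never affects precedence, so drop it.
    core_and_pre = version.split("+", 1)[0]
    # Everything after the first '-' is the pre-release part, if any.
    _, _, pre_release = core_and_pre.partition("-")
    if not pre_release:
        return "stable"
    # Look only at the first dot-separated identifier, e.g. 'beta' in 'beta.3'.
    label = pre_release.split(".", 1)[0].lower()
    if label.startswith("feature"):
        return "feature"
    if label.startswith("beta"):
        return "beta"
    raise ValueError(f"unrecognized pre-release label: {label!r}")
```

With this convention, `classify_artifact("2.1.0-beta.3")` returns `"beta"` and `classify_artifact("2.1.0")` returns `"stable"`, and a release pipeline could map each result to its target environment.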

Our releases are truly automatic, and only require engineer involvement if and when problems arise.

High Velocity

The Core Banking application evolves continuously to deliver new features to our end users as fast as possible.

To accomplish that, we trigger the Build Pipeline on every repository push and keep the Continuous Deployment Trigger enabled in the Release Pipelines.

Each squad maintains its own QA subprocess to ensure software quality and backward compatibility.

To increase delivery velocity, we use self-hosted agent pools. Each stage of the CI/CD process runs in a different pool, and each pool contains multiple agents, which makes it possible to run jobs in parallel.

Hermetic Builds

“Build tools must allow us to ensure consistency and repeatability…” — Google SRE book.

We use Azure Pipelines Templates in our Continuous Integration process, and we developed build templates for each flavor of application and library. These templates are also versioned, which keeps builds repeatable.

Every build run points to a commit hash and generates versioned artifacts.

Configuration Management

Every Release saves its state, including its jobs, tasks, artifacts, and variables.

Some variables are common to all applications. Others are common to all environments of a single application, and others hold values for a specific environment of a specific application, such as external API endpoints.

To manage that, we use the Variable Groups feature of Azure DevOps.

With Variable Groups, we can centralize the configuration management of the Core Banking application, which gives us more agility to change common configuration values and even to build another environment from scratch.
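The three variable scopes can be pictured as layered dictionaries in which the most specific scope wins. The Python sketch below is only an analogy for that layering and does not reproduce how Azure DevOps resolves Variable Groups internally. The function name `resolve_variables` and every variable name and value in the usage example are hypothetical.

```python
from collections import ChainMap
from typing import Mapping


def resolve_variables(
    global_vars: Mapping[str, str],
    app_vars: Mapping[str, str],
    env_vars: Mapping[str, str],
) -> dict:
    """Merge three scopes; the most specific scope wins on key conflicts."""
    # ChainMap searches its maps left to right, so env_vars has top priority,
    # then app_vars, then global_vars.
    return dict(ChainMap(env_vars, app_vars, global_vars))


# Hypothetical values, for illustration only.
resolved = resolve_variables(
    global_vars={"LOG_LEVEL": "info", "AWS_REGION": "sa-east-1"},
    app_vars={"LOG_LEVEL": "debug"},
    env_vars={"RATES_API_URL": "https://rates.example.test"},
)
```

Here `resolved["LOG_LEVEL"]` is `"debug"` because the application scope overrides the global one, while `AWS_REGION` falls through from the global scope unchanged.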

Deployment automation

If a human operator needs to touch your system during normal operations, you have a bug. — Carla Geisser, Google SRE

Keeping this concept in mind, we automated our deployment process.

Just as we use Azure Pipelines Templates in our CI process, we adopted the Task Groups functionality in the CD process.

With Task Groups, we can standardize how we deploy different kinds of applications, such as containers and Lambdas, and even the infrastructure, because we write our Infrastructure as Code using CloudFormation in YAML notation.

Task Groups can be configured using variables, which gives developers the power to change any aspect of the deployment they need.

Bug fixes and process improvements are easy with Task Groups because all the management is centralized.

Enforcement of Policies and Procedures

As a full bank, we must comply with many Central Bank regulations, and as a listed company, there are other regulatory rules we have to obey.

The IT Governance team is responsible for ensuring that all the mandatory steps in the Change Management Process are followed. This team is also responsible for generating and safeguarding the evidence that the squads are following these rules.

The entire approval chain and the evidence generation for the User Acceptance Tests happen in the governance software.

We use the Gates feature available in Azure DevOps Release Pipelines to communicate with the IT Governance software API.

Before running the deployment in the Production Environment, we open a Change Request through this API using a Gate.

A Gate is nothing more than an API call and continuous polling until a success condition is reached.

A second Gate will check the API endpoint every x minutes until all the pre-deployment checks are satisfied.

The pre-deployment checks consist of the Product Owner’s approval and the attachment of the user acceptance test evidence to the Change Request.
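The polling behavior of a Gate can be sketched in a few lines of Python. In Azure DevOps the platform runs this loop itself, so the code below is only a conceptual model: `wait_for_gate` is a hypothetical name, and `check` stands in for the call to the governance API that reports whether the pre-deployment checks are satisfied.

```python
import time
from typing import Callable


def wait_for_gate(
    check: Callable[[], bool],
    interval_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True if the success condition was reached, False on timeout.
    The sleep function is injectable so the loop can be exercised without waiting.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if check():
            return True
        # Stop if waiting one more interval would go past the deadline.
        if time.monotonic() + interval_seconds > deadline:
            return False
        sleep(interval_seconds)
```

A caller would pass a `check` function that returns `True` only when the approval and the test evidence are both present on the Change Request, together with the sampling interval and an overall timeout after which the release is considered blocked.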

After the Gate passes, the deployment is executed, and once it completes successfully we mark the Change Request as Implemented and Closed.

If the deployment fails, the Change Request is marked as Failed.
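The Change Request lifecycle around a deployment can be summarized as a small sketch. The function name `deploy_with_change_request` is hypothetical, and the status strings simply mirror the wording used here; the real governance software exposes its own API and status values, which are not shown.

```python
from typing import Callable


def deploy_with_change_request(deploy: Callable[[], None]) -> list[str]:
    """Run deploy() and return the Change Request statuses recorded, in order.

    Illustrative only: statuses are kept in a list instead of being sent
    to a governance API.
    """
    statuses = ["Open"]  # the Change Request is opened before deployment
    try:
        deploy()
    except Exception:
        # Any deployment error marks the Change Request as failed.
        statuses.append("Failed")
        return statuses
    # A successful deployment marks it Implemented, then Closed.
    statuses.extend(["Implemented", "Closed"])
    return statuses
```

A successful `deploy` yields `["Open", "Implemented", "Closed"]`, while a `deploy` that raises yields `["Open", "Failed"]`.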

Conclusion

By applying the Google SRE concepts and making use of the features available in Azure DevOps, we reduced the time between a code commit and the release of that commit to the Production Environment.

This process is under continuous improvement: we keep adding new features and reviewing, improving, and fixing the current implementation.

You will probably not apply the full Release Management process from the beginning (we didn’t either), but the important thing is to keep the base concepts in mind and evolve the process over time.
