Accelerating Software Delivery By Unifying CI/CD and Observability

Wayne Greene
Published in DevOps Dudes
5 min read · Jan 7, 2021

Co-written with Seetharam Param, CEO of ReleaseIQ.io

I was reminiscing recently with a friend and former colleague about product and service releases we had both experienced: releases that were under tremendous pressure to deliver a new feature or capability and were already behind schedule. In that highly charged environment, people were already tired, and quality issues started creeping into the builds. Look, our teams had the best of intentions, but being behind on a critical release ultimately took a toll on quality. And when we finally managed to get these challenging releases “out the door,” too many defects escaped with them and made just about everything worse. Now, not only were we delaying revenue and missing opportunities, but the high-priority need to fix critical defects also flattened our capacity for new projects, compounding frustration and opportunity costs.

Building and maintaining great CI/CD pipelines is hard, but doable

These formative (and painful) experiences instilled in me a passion to commit my professional career to resolving two critical issues:

  1. How to make the engineering team more productive, so we can accelerate releases.
  2. How to efficiently identify and resolve offending commits early in the release cycle, gating them from reaching production.

Fortunately, early in my career I had the opportunity to learn from folks who were very good at building efficient release processes that delivered consistently high-quality releases. In particular, I learned a lot from BEA WebLogic’s Engineering team, which built one of the most efficient and automated release processes, with a strong emphasis on quality. I was also part of a talented team at VMware that was responsible for setting up the release processes and CI/CD pipelines for on-prem and cloud-native applications. It was painful, but we successfully integrated the CI/CD pipelines with a few monitoring tools to help developers debug pipeline issues.

Why doesn’t everyone build great CI/CD pipelines to accelerate software delivery?

We weren’t the only ones with both the scars of painful releases and the experience of well-executed release processes. If we had learned these lessons, lots of other folks had as well. So why wasn’t everyone running release processes with best practices?

For one thing, it can be pretty challenging to keep pre-production environments stable and fully functional. One very simple and insidious reason is that when we run automated tests in pre-production environments, we invariably face the dilemma of what to do when the tests fail. Is it a “real” failure, or is it an artifact of the test environment? If testing in pre-production environments is perceived to be flaky, it becomes easy for people to start ignoring the test failures and keep promoting commits to the next stage in the pipeline. This behavior, in turn, lets regressions surface in production environments, where they are easily 10x more expensive to fix.
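One common way to approach the “real failure or environment artifact?” question is to rerun a failing test a few times before deciding. The sketch below is a minimal illustration under stated assumptions, not any particular tool's behavior: the names `triage_test` and `promotion_gate`, the rerun count, and the policy itself (block on consistent failures, surface flaky tests for triage instead of silently ignoring them) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Triage:
    test_name: str
    verdict: str   # "pass", "flaky", or "consistent_failure"
    failures: int  # number of failed runs
    runs: int      # total number of runs


def triage_test(test_name: str, run_test: Callable[[], bool], reruns: int = 3) -> Triage:
    """Run a test once; if it fails, rerun it to separate flaky from consistent failures.

    `run_test` is a hypothetical callable that returns True when the test passes.
    """
    if run_test():
        return Triage(test_name, "pass", 0, 1)
    failures = 1
    for _ in range(reruns):
        if not run_test():
            failures += 1
    runs = reruns + 1
    # Failing every single run suggests a real regression; mixed results suggest flakiness.
    verdict = "consistent_failure" if failures == runs else "flaky"
    return Triage(test_name, verdict, failures, runs)


def promotion_gate(results: List[Triage]) -> Tuple[bool, List[str], List[str]]:
    """Return (allowed, blocking, needs_triage).

    Assumed policy: consistent failures block promotion; flaky tests do not
    block, but they are reported so that someone actually triages them.
    """
    blocking = [r.test_name for r in results if r.verdict == "consistent_failure"]
    needs_triage = [r.test_name for r in results if r.verdict == "flaky"]
    return (not blocking, blocking, needs_triage)
```

Whether flaky tests should block promotion is a judgment call for each team; the point of the sketch is only that every failure ends up in one of two visible buckets and none is quietly dropped.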

The sad truth is that it is all too common to detect major regressions in production that had already been caught in pre-production, where nobody examined and triaged the failure before the code was promoted. Not only do these production regressions consume time and money; think about all the time we wasted automating and running tests whose results are being ignored.

So, why does this happen over and over again?

  • Not because we don’t understand what’s going on
  • Not because we don’t know how to minimize these failures
  • Not because industry analysts haven’t been telling us how important effective automation is

It turns out that what keeps most companies from investing in a fully automated CI/CD pipeline is:

  • Historically, it has been hard and resource intensive
  • The CI/CD ecosystem is constantly evolving, and the existing tools have simply not kept up with rapidly evolving developer requirements
  • Production outages are highly visible because of their business impact, but pre-production issues are ignored because the tangible benefit of stable pre-production environments is not explicitly visible to decision makers

Existing CI/CD tools are failing developers (and businesses that depend on them)

Developers, and the businesses that depend on them, need a platform that helps them diagnose build, deploy and test issues quickly AND that encourages consistent “high standards of governance and efficiency” across an organization.

Succinctly, developers need an aggregated view of all the sources of issues to help them find the root cause of a problem quickly. That will reduce the MTTR for pre-production issues and keep the pre-production environments functional and stable.

In a typical CI/CD environment, builds, tests and deployments run at full throttle, and a quick turnaround to troubleshoot, fix and redeploy is key to optimizing the CI/CD process. But a DevOps environment generates an overload of logs, and that information overload can make Root Cause Analysis (RCA) a nightmare. Hours, if not days, are spent manually troubleshooting issues, especially when one manages to creep into the production environment. Often the problem is not a lack of logs, metrics, traces or information in general; it is a problem of excess. Quickly sifting through the noise, the maze of alerts, warnings and irrelevant logs, to focus on the ‘errors that matter’ is crucial to converging on the root cause.
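To make “sifting through the noise” a little more concrete, here is a minimal sketch of one simple technique: keep only error-level lines, collapse the variable parts (numbers, hex IDs) into placeholders, and count how often each resulting signature occurs. The log format, the `ERROR`/`FATAL` level names and the function names are assumptions for illustration; real pipelines will need their own patterns.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed convention: the severity level appears as a standalone word in the line.
LEVEL_RE = re.compile(r"\b(ERROR|FATAL)\b")


def signature(line: str) -> str:
    """Collapse variable parts of a log line so repeated errors group together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)  # hex addresses or IDs
    line = re.sub(r"\d+", "<n>", line)               # any remaining numbers
    return line.strip()


def errors_that_matter(lines: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Return the most frequent error signatures, ignoring lower-severity lines."""
    counts = Counter(signature(line) for line in lines if LEVEL_RE.search(line))
    return counts.most_common(top)
```

Fed a few hundred lines of build output, a function like this reduces dozens of near-identical timeout messages to a single entry with a count, which is usually a better starting point for root cause analysis than the raw log.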

Complexity is costly: draining productivity & quality

Side note: In many cases, pre-production environments only require monitoring when there are code promotion events. Personally, I don’t see a need to keep collecting logs 24×7 if those environments are used only once or twice per day to validate code commits. To do this, we need an observability solution that collects the relevant logs and metrics ONLY when we really need them to debug CI/CD pipeline issues, including test issues.
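As a rough illustration of collecting logs only around a promotion event, the sketch below uses Python's standard `logging` module to attach a handler when a promotion starts and detach it when the promotion ends. The logger name and the `capture_during_promotion` helper are hypothetical, and a real observability solution would do this at the infrastructure level instead of inside a single process.

```python
import io
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def capture_during_promotion(logger_name: str, level: int = logging.DEBUG) -> Iterator[io.StringIO]:
    """Collect log records for `logger_name` only while the `with` block runs."""
    logger = logging.getLogger(logger_name)
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setLevel(level)
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(level)  # turn on detailed logging for the promotion window
    try:
        yield buffer
    finally:
        # Stop collecting and restore the previous verbosity.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)
```

A pipeline step could wrap its deploy-and-test phase in `with capture_during_promotion("preprod.tests") as buf:` and attach `buf.getvalue()` to the pipeline run, so that nothing is collected while the environment sits idle.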

So what does this mean for practitioners who want to systemically improve their software delivery pipelines? First of all, we believe there is an opportunity for a new approach to these issues, one that brings together CI/CD and observability. Let’s find tools and systems that actually help developer and pipeline productivity. Automation is important, but having the right mix of automation and log monitoring in the release pipeline is critical to shifting left.

Learn more about ReleaseIQ.io’s approach to this issue here.
