Avoid becoming the team that can’t release

Steef Janssen · Published in The Qonto Way · 6 min read · Jul 18, 2023

Our release strategies and CI

Qonto’s Web team consists of 60 engineers delivering fantastic improvements and features to our customers every day. We provide multiple Ember apps for different experiences; in the biggest app, customers can manage transactions, order new cards, create spending limits, and more.

We practice continuous integration to break features down into smaller parts that can be tested, reviewed, and merged quickly. I recommend these insightful articles by Atlassian if you’re interested in the topic.

This method stays true to our Tech team’s motto: build quality fast!

The team that couldn’t release

Story time. In a previous role, I was part of a team of ~150 engineers. We were releasing features and improvements on a monolith app containing both backend and frontend code.

A large number of tests supported the codebase’s integrity and its legacy features. These tests failed regularly but inconsistently, which made them hard to fix. Almost all teams kept ‘replaying’ them until they passed, which could delay releases by hours or even days, resulting in waste and frustration.

This ‘replay until it passes’ culture undermines the value and quality of the tests: a scenario that only passes sometimes may be covering an unrealistic or genuinely buggy experience for our end users.

The build-up to an incident

During the 4th quarter of 2022, we started receiving reports from engineers about failing test runs in our biggest app that were unrelated to their changes. It wasn’t the same tests failing every time, so the problem was hard to isolate.

Initially, the impact was negligible. Engineers would encounter a flaky test and retry until it passed. We all want to ship on time, so we, too, slipped into a ‘replay until it passes’ culture.

Let’s fast forward to the end of March 2023:

A snapshot of Slack messages: engineers reporting test failures over the course of a single day

Our test runner reported a similar error in most scenarios.

To understand the impact, we wrote a script to fetch all failed GitLab pipelines where a test job had failed because of this error.
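To give a flavour of what such a script can look like, here’s a minimal sketch against the GitLab REST API. The GitLab host, the PROJECT_ID and GITLAB_TOKEN variables, and the check on the job name are placeholders rather than our actual setup; the real script also inspected the job trace for the specific error message.

```typescript
// Sketch of the kind of script we used, not the actual one.
// PROJECT_ID and GITLAB_TOKEN are placeholder environment variables.
const GITLAB_API = "https://gitlab.example.com/api/v4";
const PROJECT_ID = process.env.PROJECT_ID!;
const HEADERS = { "PRIVATE-TOKEN": process.env.GITLAB_TOKEN! };

type Pipeline = { id: number; web_url: string };
type Job = { name: string; status: string };

async function fetchJson<T>(url: string): Promise<T> {
  const response = await fetch(url, { headers: HEADERS });
  if (!response.ok) throw new Error(`GitLab API error: ${response.status}`);
  return (await response.json()) as T;
}

async function main(): Promise<void> {
  // Most recently failed pipelines (first page only, for brevity).
  const pipelines = await fetchJson<Pipeline[]>(
    `${GITLAB_API}/projects/${PROJECT_ID}/pipelines?status=failed&per_page=100`
  );

  let affected = 0;
  for (const pipeline of pipelines) {
    const jobs = await fetchJson<Job[]>(
      `${GITLAB_API}/projects/${PROJECT_ID}/pipelines/${pipeline.id}/jobs`
    );
    // The real script also grepped the job trace for the specific error;
    // here we simply flag pipelines that contain a failed test job.
    if (jobs.some((job) => job.status === "failed" && job.name.includes("test"))) {
      affected += 1;
      console.log(`Affected pipeline: ${pipeline.web_url}`);
    }
  }
  console.log(`${affected} of ${pipelines.length} failed pipelines had a failing test job`);
}

main().catch(console.error);
```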

What we thought was a minor inconvenience turned out to be a major blocker for our team.

Our team invested a lot of time finding potential root causes and applying fixes. But after 2 weeks, we still didn’t see a drop in the number of failures.

While we were trying to resolve the root causes, we weren’t fixing the problem for the team. So my manager decided it was time for a different approach:

Mitigation time

As a mitigation target, we aimed for zero occurrences of this test failure on our ‘master’ pipeline. Since all our work flows through that branch, not a single engineer should see tests failing for reasons unrelated to their work.

We use ember-exam to have more control over how we run our tests in CI. For most of our apps, we configure the test runner to split the tests equally across multiple browser instances by passing the --parallel=<num> and --load-balance options.

We had already experimented with splitting our test scenarios across multiple runners. With fewer tests to run per instance, there’s less risk of an instance running into issues, just like with our other apps. We split the test job across 10 instances, reducing the number of tests per job to ~1,200.
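To illustrate the idea, here’s a simplified sketch of what such a split can look like in a .gitlab-ci.yml file. The job name, the yarn invocation, and the partition count are assumptions on my part, not our exact configuration; GitLab’s parallel keyword exposes CI_NODE_TOTAL and CI_NODE_INDEX, which ember-exam’s --split and --partition options can consume directly.

```yaml
# Simplified sketch, not our actual pipeline configuration.
test:
  stage: test
  parallel: 10 # GitLab spawns 10 instances of this job
  script:
    # Each instance runs only its own partition of the test suite.
    # CI_NODE_TOTAL and CI_NODE_INDEX are set by GitLab when `parallel` is used.
    - yarn ember exam --split=$CI_NODE_TOTAL --partition=$CI_NODE_INDEX
```

Within each of those jobs, options like --parallel and --load-balance can still be applied to spread that partition across browser instances.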

So far, we haven’t observed a single occurrence of the test runner running into the same error. 🎉

The immediate aftereffects

However, the mitigation didn’t come without unexpected side effects, issues, and necessary improvements. Let’s highlight a couple of interesting cases:

Running tests in a less deterministic order uncovered a lot of flaky tests

Since we introduced an arbitrary split (the total number of tests divided by 10), some scenarios now ran in one job while others ran in a different one.

We discovered that many tests ran fine in their usual sequence, but turned out to be flaky when run in a different order or combination.

Switching from 2 to 10 jobs created massive queues

Every job consumes a runner, which is an AWS instance. From one day to the next, we needed 5 times as many runners for our test jobs. This resulted in a big queue, with wait times of up to 1 hour. To mitigate this, we initially increased the number of runners.

Given the cost, this was not a long-term solution. But thanks to our SRE team’s awesome work, we were able to use lighter and cheaper EC2 instances to increase the number of runners and reduce the queue, while decreasing the cost by around $75,000 a year!

Setting up the right alerting

Going forward, we had to make sure engineers didn’t reintroduce flakiness into our test runs. As a solution, we set up an alert that sends a message to our team’s code-review channel so the problem can be addressed as soon as it occurs.
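The exact wiring of our alert isn’t the interesting part; as a rough illustration (not our exact setup), it can be as simple as a script that posts to a Slack incoming webhook when a test job fails on ‘master’. The webhook URL below is a placeholder, while CI_JOB_NAME and CI_JOB_URL are predefined GitLab CI variables.

```typescript
// Sketch: notify a Slack channel when a test job fails on 'master'.
// SLACK_WEBHOOK_URL is a placeholder for an incoming-webhook URL.
async function notifyTestFailure(jobName: string, jobUrl: string): Promise<void> {
  const response = await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `:rotating_light: Test job \`${jobName}\` failed on master: ${jobUrl}`,
    }),
  });
  if (!response.ok) {
    throw new Error(`Slack webhook returned ${response.status}`);
  }
}

// CI_JOB_NAME and CI_JOB_URL are exposed by GitLab inside the job.
notifyTestFailure(
  process.env.CI_JOB_NAME ?? "test",
  process.env.CI_JOB_URL ?? ""
).catch(console.error);
```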

Wrapping things up

After a month or so of mitigating, monitoring, tweaking, and improving, we can proudly say our CI now looks better than ever. Let’s wrap it up by sharing some key learnings and our overall improvements.

Fix the problem before implementing a permanent solution

When we stopped focusing on solving the root cause and instead doubled down on fixing the problem, we saw direct results in a couple of days. With the problem fixed, we were able to focus on resolving a variety of root causes.

Not every problem is worth solving

We solved a bunch of underlying root causes through our investigations, making our test runner perform better. We still have memory leaks within our test runs that we decided not to address. Taking into account the time it would have taken to resolve them, it has been more cost-effective to reduce the number of tests run in a single instance, so that we can keep scaling.

Start tracking your CI and alert when things fail

We learned from our mistake and are now tracking pipeline durations, test job runs, queues, and unexpected failures. With the right alerting, we can make sure we don’t introduce new issues.
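As an example of the kind of data we track, here’s a sketch of how these numbers can be pulled from the GitLab API. The host, PROJECT_ID, and GITLAB_TOKEN are placeholders, and in reality the metrics would be shipped to a dashboard and alerting tool rather than logged.

```typescript
// Sketch: collect per-pipeline metrics from the GitLab API.
// PROJECT_ID and GITLAB_TOKEN are placeholders, as is the GitLab host.
type PipelineDetails = {
  id: number;
  status: string;
  duration: number | null;        // seconds spent running
  queued_duration: number | null; // seconds spent waiting for a runner
};

async function collectPipelineMetrics(pipelineId: number): Promise<void> {
  const response = await fetch(
    `https://gitlab.example.com/api/v4/projects/${process.env.PROJECT_ID}/pipelines/${pipelineId}`,
    { headers: { "PRIVATE-TOKEN": process.env.GITLAB_TOKEN! } }
  );
  const pipeline = (await response.json()) as PipelineDetails;

  // In practice these numbers feed a dashboard and alerting tool;
  // logging them is enough to show what gets tracked.
  console.log({
    pipeline: pipeline.id,
    status: pipeline.status,
    durationSeconds: pipeline.duration,
    queuedSeconds: pipeline.queued_duration,
  });
}

collectPipelineMetrics(Number(process.argv[2])).catch(console.error);
```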

Avoid becoming the team that can’t release

Having experienced it twice, I hold this lesson dear to my heart. It’s all too easy for a ‘replay until it passes’ culture to creep in. With the right monitoring and alerts, we can help our team deliver quality software, fast. We can and will prevent this issue from happening again!

About Qonto

Qonto is a finance solution designed for SMEs and freelancers, founded in 2016 by Steve Anavi and Alexandre Prot. Since our launch in July 2017, Qonto has made business financing easy for more than 350,000 companies.

Business owners save time thanks to Qonto’s streamlined account set-up, an intuitive day-to-day user experience with unlimited transaction history, accounting exports, and a practical expense management feature.

They stay in control while being able to give their teams more autonomy via real-time notifications and a user-rights management system.

They benefit from improved cash-flow visibility by means of smart dashboards, transaction auto-tagging, and cash-flow monitoring tools.

They also enjoy stellar customer support at a fair and transparent price.

Interested in joining a challenging and game-changing company? Consult our job offers!
