How to release a monolith that 150+ developers from different offices contribute to, and do it effectively

Lyudmila Maleeva
Published in Miro Engineering
Jan 11, 2021

I work as a Software Engineer at Miro on the team responsible for release process improvements. In the last year, we opened a new office in Amsterdam, the engineering team doubled in size, and six months ago our company temporarily switched to remote work. All the while, our product's user base has kept growing at a rapid pace.

Against the backdrop of these changes, it was important for us to retain quality and speed, so we revamped the server release process with a few key changes that ultimately increased our success rate.

Server releases

Our backend is a monolithic Java application that can be run with different roles to perform different tasks. We run it on AWS instances (quad-core CPU, 16 GB RAM). Most of our backend servers keep an active WebSocket connection to the client so that users always see the current state of their boards in Miro. These servers run the Board server role; users access them as they work on boards. For business logic and API requests, we use the API server role.

We make our releases seamless via graceful deployment, which we try to perform with the lowest load on the service. On average, we have 60,000 online users and 50 running board servers during a scheduled release.

We consider a release successful if it ships on time and contains all the resolved issues that were ready for release at the time of its launch. Conversely, a release is unsuccessful if something goes wrong, since bugs that require stopping or rolling back the release increase time to market.

We evaluate any changes in our release process based on how close they bring us to a successful release.

Release workflow:

  1. We’ve made a tool that analyzes modified files in a pull request and chooses which end-to-end tests are to be executed based on the mapping of automated tests to the actual product code. This way, all relevant tests have to be passed successfully to merge changes into the master branch.
  2. Each master branch is subjected to a full regression test. The release is possible after all tests are successful. Unpassed tests are sorted out by the teams responsible for the functionality that failed.
  3. We use Allure Enterprise Edition and mark the false positive tests as Resolved to do automatic releases.
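
The test-selection idea in step 1 can be sketched as a mapping from source-code path prefixes to the end-to-end suites that cover them. This is an illustrative sketch; the actual tool and its mapping are internal to Miro, and all path and suite names below are hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of selecting e2e tests based on the files changed in a PR.
public class TestSelector {
    // Maps a source-code path prefix to the e2e suites that cover it.
    private static final Map<String, List<String>> COVERAGE = Map.of(
        "server/board/", List.of("BoardE2ESuite"),
        "server/api/",   List.of("ApiE2ESuite"),
        "server/auth/",  List.of("ApiE2ESuite", "LoginE2ESuite")
    );

    // Returns the set of suites that must pass for the given modified files.
    public static Set<String> suitesFor(Collection<String> modifiedFiles) {
        Set<String> suites = new TreeSet<>();
        for (String file : modifiedFiles) {
            COVERAGE.forEach((prefix, covered) -> {
                if (file.startsWith(prefix)) suites.addAll(covered);
            });
        }
        return suites;
    }
}
```

A change that touches only documentation selects no e2e suites, while a change to shared code selects every suite mapped to it.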

Release process:

  1. Find a build with 100 percent of tests passing and a higher version number than the one currently in production.
  2. Launch a canary release.
  3. Monitor release metrics for four hours.
  4. Set the release status to Approved or Broken once the canary release is over. The Approved status triggers an automatic release launch the next morning; the Broken status doesn't.
  5. To release on API and Board servers, we create instances with the new version. The number of instances is calculated as the current load plus 20 percent, to prevent high load during or immediately after the release.
  6. As users gradually move to the new servers, we shut down and remove the old ones.
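
The capacity rule in step 5 is simple arithmetic. A minimal sketch, assuming the 20 percent headroom is applied to the current instance count and rounded up:

```java
// Illustrative sketch of the capacity rule: new-version instance count is the
// current fleet size plus 20% headroom, rounded up. Integer arithmetic is used
// to avoid floating-point rounding surprises.
public class ReleaseCapacity {
    // currentInstances: servers serving the old version at release time.
    public static int instancesForNewVersion(int currentInstances) {
        // ceil(currentInstances * 1.2)
        return (currentInstances * 120 + 99) / 100;
    }
}
```

With the ~50 running board servers mentioned earlier, this yields 60 new-version instances.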

The release takes an hour and a half from creating the instances to the complete transition to the new version.

Canary release

A canary release lets us validate changes on a small, random subset of users. We bring up several servers with the new version and monitor the situation. If everything goes well on the canary, we roll the changes out to all servers.

Canary release process

A canary release is not a way to test new code in production; it is meant to provide an additional layer of protection for newly finished code. It allows you to reduce the number of users who may encounter a bug that would only be seen in complex cases or one that could only be reproduced on the production infrastructure.

To respond quickly to bugs in a canary release, we introduced an on-duty server developer role, which every developer fills in turn. During the four hours of a canary release, the on-duty developer reacts to new bugs in Sentry and warnings from Grafana, and can stop the release manually if necessary. After the canary release is complete, they set the release entity's status in Bamboo to Approved or Broken.

In case of urgent releases outside of the schedule, teams can manually run a release through deployment in Bamboo. Each team has engineers with the corresponding permissions to do so.

Users are routed to the canary release by random selection at the load balancer. Such a random sample lets us validate releases across a variety of users, but it also has drawbacks: it doesn't let us balance user types and accounts without changing code, nor does it let us check functionality on specific accounts or boards.
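
A minimal model of that random selection: each incoming connection lands on a canary server with probability equal to the canary fleet's share of all servers. This is an illustrative sketch under that assumption, not Miro's actual balancer code.

```java
import java.util.Random;

// Illustrative model of random canary selection at the load balancer.
public class CanaryBalancer {
    private final double canaryShare;
    private final Random random;

    public CanaryBalancer(int canaryServers, int totalServers, long seed) {
        // Fraction of the fleet running the canary version.
        this.canaryShare = (double) canaryServers / totalServers;
        this.random = new Random(seed); // seeded only to keep this sketch reproducible
    }

    // True if the next incoming connection should go to a canary server.
    public boolean routeToCanary() {
        return random.nextDouble() < canaryShare;
    }
}
```

With 2 canary servers out of 50, roughly 4 percent of new connections see the new version; the random draw is per-connection, which is exactly why specific accounts or boards can't be targeted this way.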

We can only roll out a canary release to a certain subset of users if the functionality was written using a feature toggle, implemented through code, not releases.
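
A feature toggle of the kind described can be as simple as an allowlist checked in code. The class and field names below are hypothetical, a minimal sketch rather than Miro's real toggle system:

```java
import java.util.Set;

// Minimal feature-toggle sketch: unlike load-balancer randomness, a toggle in
// code can target specific accounts (or boards) before full rollout.
public class FeatureToggle {
    private final boolean enabledForAll;
    private final Set<String> enabledAccounts;

    public FeatureToggle(boolean enabledForAll, Set<String> enabledAccounts) {
        this.enabledForAll = enabledForAll;
        this.enabledAccounts = enabledAccounts;
    }

    // The new code path runs only for allow-listed accounts until fully rolled out.
    public boolean isEnabledFor(String accountId) {
        return enabledForAll || enabledAccounts.contains(accountId);
    }
}
```

Flipping `enabledForAll` to true completes the rollout without another release, which is what makes the toggle a code-level mechanism rather than a release-level one.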

Hot Fix in a canary release

Previously, when we found a bug in a canary release, we would block the merge to the master as well as the entire release. This was inconvenient, as it blocked the work of other teams and delayed the release schedule.

We wanted to find an approach where we could minimize these delays. We studied the existing approaches (Trunk-Based Development, GitFlow, etc.) and chose GitLab Flow.

How we work with Hot Fix by GitLab Flow:

  1. Create a release branch from the canary release version.
  2. Merge the fix into the master branch.
  3. Cherry-pick the fix into the release branch with `git cherry-pick`.
  4. Launch the canary release from the release branch.
  5. Launch the next scheduled canary release from the master branch, at the version of the fix or higher.

This approach helped us halve the maximum number of non-release days and cut the number of canary release restarts from four to two.

Predictability and transparency of the release process

To improve the quality of a release, we maintain its transparency and predictability by using automatic notifications and dashboards with key metrics.

Previously, we published a single large changelog for all teams in the general channel, covering every change in the release. It was difficult and painful for teams to navigate. So, in addition to the general changelog, we added per-team changelogs: each team now sees the statuses of its own tasks alone, along with the release version in which those tasks shipped.

To validate unscheduled releases quickly, we use dashboards in Grafana. During scheduled releases, we have enough alerts from Grafana based on metrics from Prometheus.

We use Looker to collect and visualize all release statistics from Jira and Bamboo to make decisions about the quality of processes based on historical data and improve these processes.

Data on bugs and the number of created and closed tasks

We are currently implementing a feature that lets teams block manual and automatic releases when there is a bug in master. This will let us automatically collect statistics on how often master was broken and how long it took to fix, and understand which bugs blocked a release.
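
A sketch of what such a gate could look like, assuming it only needs to block releases while master is broken and record how long each breakage lasted. All names here are illustrative, not the feature's actual implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a master release gate: releases are blocked while a
// bug is open in master, and each blocked interval is recorded for statistics.
public class MasterReleaseGate {
    private Instant blockedSince;                  // null while master is green
    private final List<Duration> blockHistory = new ArrayList<>();

    public void reportMasterBug(Instant now) {
        if (blockedSince == null) blockedSince = now;
    }

    public void reportMasterFixed(Instant now) {
        if (blockedSince != null) {
            blockHistory.add(Duration.between(blockedSince, now));
            blockedSince = null;
        }
    }

    public boolean releasesAllowed() { return blockedSince == null; }

    // How many times master was broken, and for how long in total.
    public int brokenCount() { return blockHistory.size(); }
    public Duration totalBlockedTime() {
        return blockHistory.stream().reduce(Duration.ZERO, Duration::plus);
    }
}
```

The recorded intervals are exactly the statistics the paragraph above describes: breakage count and time to fix.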

Changes that have increased the success rate of our releases

  1. Canary releases have helped reduce the number of release rollbacks by 95%.
  2. Separate changelogs for each team have increased overall process transparency. Now each team is notified on time and in a convenient way when their functionality is released.
  3. Monitoring of the canary release by the server’s on-duty developer has reduced the team’s response time to found bugs.
  4. The GitLab Flow hotfix approach lets us delay a release as little as possible and fix bugs without blocking the work of other teams. Automatic releases encourage teams to keep the master branch always ready for release.
  5. Collecting and analyzing the entire release history in Looker helps us test hypotheses and continually improve the process.

What’s next

Our ultimate goal is to build such a good process that all our releases will be successful, and users will never encounter any problems. To achieve this, we are planning the following changes.

  1. Splitting the monolith into microservices. We have started moving in this direction, but it is a separate large project outside the scope of this article, so I won't go into details.
  2. Increasing the release speed. Right now, it takes an hour to release on our board servers and about half an hour to release on API servers. We want this to be faster.
  3. Giving our teams a tool to manage releases on their own. It is currently possible to launch a canary hotfix release, but teams cannot use GitLab Flow entirely on their own; for example, they cannot create a release branch. We have the "Branch merging enabled" feature set by default, so branches also pick up code from master during the build, and teams need external help to disable this feature manually for release branches.
  4. Reducing the time from the appearance of a bug to its fix in the canary release. Right now, it can take up to six hours of working time to fix a bug in the worst cases due to difficulties in communications or processes.
  5. Managing the load on canary releases, so that we can increase the speed of the release run without changing the fraction of participating users as our user base grows.
  6. Adding custom metrics to release validation. For now, we only use technical and bug-related metrics.

I would be glad to hear about your experiences with increasing your releases’ success rate in the comments, especially if you have already surmounted the above-mentioned challenges.

Join our team!

Would you like to be an Engineer at Miro? Check out opportunities to join the Engineering team.
