Do you suffer from the Staging Bottleneck Problem?

After finishing our Y Combinator (Winter 19) program, I spent a good deal of time talking to CTOs and engineers at various companies to understand their development workflows. Most teams suffer from a common problem even though they have standard CI/CD pipelines in place: they have a fixed number of staging environments, which becomes a bottleneck and slows them down. Let’s look at the symptoms and effects of this problem, and how to tell whether your team is already losing precious time to it.

Look at your project tracking board for symptoms

Open Trello, Jira, or whatever board you use. Do most of the user stories accumulate in one list or column by the end of a sprint? Is the pile-up in a column that doesn’t indicate “Done”?

One of our users reported that 80% of their user stories were still pending approval at the end of every sprint, even though developers were done with coding. Another user with ~20 developers could ship only five stories per sprint. This became fifteen after they removed the staging bottleneck using Dockup.

Reviewing a change requires a context switch, and people tend to minimize context switches. If a staging environment is not available when reviewers are ready to verify a change, they park it for later and move on. Staging environments therefore need to be ready when reviewers are ready. In other words, teams need on-demand staging environments. When teams have only one staging environment, pull requests hang around for too long or user stories pile up in the “pending approval” state.
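To make the pile-up concrete, here is a minimal sketch of the queueing effect described above. It is a toy model, not a measurement: the function name, the review duration, and the assumption that each change holds an environment for a fixed number of hours are all made up for illustration.

```python
import heapq


def total_wait_hours(ready_times, review_hours, environments):
    """Total hours that changes spend waiting for a free staging environment.

    Toy model (assumption): every change occupies one environment for
    exactly `review_hours`, and changes are served in the order they
    become ready.
    """
    # Min-heap holding the time at which each environment becomes free.
    free_at = [0.0] * environments
    heapq.heapify(free_at)

    waited = 0.0
    for ready in sorted(ready_times):
        earliest_free = heapq.heappop(free_at)
        start = max(ready, earliest_free)  # wait if nothing is free yet
        waited += start - ready
        heapq.heappush(free_at, start + review_hours)
    return waited


# Four changes become ready for review at the same moment.
ready = [0, 0, 0, 0]
one_env = total_wait_hours(ready, review_hours=2, environments=1)
per_change = total_wait_hours(ready, review_hours=2, environments=len(ready))
```

In this model the waiting time comes entirely from the fixed number of environments. Once every change gets its own environment, nothing waits.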

Another effect of the bottleneck is that feedback gets delayed. When several dependent changes are all waiting for approval, a single piece of feedback on one of them can force rework of every dependent change that was already developed.

Search for “staging” in your Slack team

Do you see many messages from people asking for staging or announcing that they are done with it? This is a common sight when teams have only one staging environment. Before developers can deploy their changes for testing, they need to acquire a lock on staging and release it once they are done, usually by announcing it in their chatrooms. This wastes hours of developer time every day as teams grow and spread across time zones.

Engineers hate repetitive manual work. Asking for staging access, preparing the environment (which sometimes depends on devops), rolling back dirty test data, and finally deploying their code using scripts can easily become frustrating. As I learned from user interviews, it can even make developers quit their jobs!

Solution

The easiest solution to this problem is to have enough staging environments that you never run out of them. One user used to spend $500/month on their staging infrastructure. When their team grew and they realized staging had become a bottleneck, they took this route: they added more staging servers and raised their monthly budget to $2,500. This is not only expensive, but also does not scale with team size.

A better solution is to create staging environments with these properties:

  1. On-demand: create them when reviewers are ready and delete them after use.
  2. Automated: code changes should trigger deployments automatically, and cleanup should happen automatically as well.
  3. Immutable: data changes from one test should not affect other tests.
  4. Integrated with your tools: it should be easy to trigger deployments from other CI/CD tools that the team already uses. It should also be possible to send deployment statuses to GitHub PRs and Slack channels so the rest of the team can easily access these environments. It would be nice if it could also notify webhook endpoints to run automated tests on the newly created environments.
  5. Flexible: it should be easy to create new staging environments when you change your stack, for example, when you add a new microservice or introduce a queue.
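As a rough illustration of the first four properties, here is a minimal in-memory sketch of a per-pull-request environment lifecycle. It is not how Dockup works and not its API. The event shape, the `EnvironmentManager` class, the `notify` callback, and the URL pattern are all hypothetical, and a real system would provision containers or servers where this sketch only stores a record.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    pr_number: int
    commit: str
    url: str
    # Test data private to this environment (hypothetical stand-in for a database).
    data: dict = field(default_factory=dict)


class EnvironmentManager:
    """Creates and destroys one staging environment per pull request."""

    def __init__(self, base_domain, notify=print):
        self.base_domain = base_domain
        # Stand-in for posting a status to a PR, chat channel, or webhook.
        self.notify = notify
        self.environments = {}

    def handle_event(self, event):
        """React to a pull request event (hypothetical event shape)."""
        action = event["action"]
        pr_number = event["pr_number"]
        if action in ("opened", "updated"):
            return self._deploy(pr_number, event["commit"])
        if action == "closed":
            return self._destroy(pr_number)
        raise ValueError(f"unknown action: {action}")

    def _deploy(self, pr_number, commit):
        # Every deploy gets a fresh environment with empty data, so leftovers
        # from an earlier test run cannot leak into the next one.
        url = f"https://pr-{pr_number}.{self.base_domain}"
        env = Environment(pr_number, commit, url)
        self.environments[pr_number] = env
        self.notify(f"PR #{pr_number} ({commit}) deployed at {url}")
        return env

    def _destroy(self, pr_number):
        # Cleanup happens automatically when the pull request is closed.
        env = self.environments.pop(pr_number, None)
        if env is not None:
            self.notify(f"PR #{pr_number} environment removed")
        return env
```

A CI job or webhook receiver could call `handle_event` with something like `{"action": "opened", "pr_number": 42, "commit": "abc123"}`. Passing a different `notify` function is where a GitHub or Slack integration would plug in.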

We built Dockup around these ideas. If you’re looking to remove the staging bottleneck in your team, you should check it out.