At Groupon, our engineering teams generally follow a 2-week Sprint Scrum model, but we have made adjustments over time. For example, the team I work with used 1-week Sprints for a while. We faced a lot of challenges while moving from 2-week Sprints to 1-week Sprints, and fortunately we eventually worked through them.
I will share with you the story of how our team moved from 2-week Sprints to 1-week Sprints, what challenges we encountered, and how we eventually solved them.
How do we work with the 2-week Sprints?
Let me give you a brief overview of how we work with 2-week Sprints. The entire development process is shown in the figure below.
- Sprint Planning
Each Sprint begins with Sprint Planning, where the team gets together to decide which tickets will be pulled from our main backlog into the Sprint backlog.
Our development process is similar to GitHub flow. We create a new branch before working on a ticket and open a Pull Request during development. Teammates review the changes and may leave questions or comments. Once the Pull Request has been reviewed and the branch passes all the tests, it is merged into the main branch.
After a week of development and coding, we create a tag based on the latest code from the main branch and deploy it to the test environment.
QA tests the release, and any bugs found are filed as tickets for tracking. The developers fix the reported bugs, and the Pull Requests for the fixes are merged into the main branch.
- Code Freeze
Because we deploy from the main branch, the release becomes unstable if we keep merging code into it. So we freeze the main branch for the last two days of the Sprint: apart from important bug fixes, new PRs are not merged until the next Sprint.
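The freeze rule can be sketched as a small merge gate. This is only an illustration, not our actual tooling: the 10-working-day Sprint length, the `PullRequest` type, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass

SPRINT_WORKING_DAYS = 10  # assumption: a 2-week Sprint has 10 working days
FREEZE_DAYS = 2           # from the text: the last two days of the Sprint


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, minimal view of a PR for this sketch."""
    title: str
    is_important_bug_fix: bool = False


def is_frozen(sprint_day: int) -> bool:
    """Return True if the 1-based working day falls in the freeze window."""
    if not 1 <= sprint_day <= SPRINT_WORKING_DAYS:
        raise ValueError(f"sprint_day must be in 1..{SPRINT_WORKING_DAYS}")
    return sprint_day > SPRINT_WORKING_DAYS - FREEZE_DAYS


def can_merge_to_main(pr: PullRequest, sprint_day: int) -> bool:
    """During the freeze, only important bug fixes may be merged."""
    frozen = is_frozen(sprint_day)
    return (not frozen) or pr.is_important_bug_fix
```

With these assumptions, a regular PR can be merged on day 8 but has to wait on days 9 and 10, while an important bug fix can be merged on any day.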
Once QA verifies all the changes, we will deploy the release to production at the beginning of the next Sprint.
How do we work with the 1-week Sprints?
Why did we want to change 2-week Sprints to 1-week Sprints?
The 2-week Sprints ran smoothly in our project. These were the main advantages:
- We are able to work on big features within one sprint
- We have enough time to test every iteration
- The release is stable
There are also a few drawbacks to the 2-week Sprints.
- We can’t get feedback very fast.
- Each release contains too many updates, so it’s not easy to find out where a problem lies.
We decided to move to weekly Sprints. The main reason was to deliver updates more frequently and get feedback faster, which helps us validate ideas quickly.
The development process of 1-week Sprints
At the very beginning, the development process for 1-week Sprints was similar to the previous 2-week process, with the Dev and QA time shortened, as shown in the figure.
The challenges encountered after changing to 1-week Sprints
After changing to 1-week Sprints, we noticed a drop in release quality. Several times we found critical issues after deploying to production, and we had to roll back the deployment or deploy a new patch.
So in a Sprint Retrospective meeting, our QA proposed: “Avoid last-minute changes”. The background was that our service had just broken in production, and the cause was that a developer had made some last-minute changes right before deploying to production.
We thought the changes were minor and would not be a problem. Our teammates didn’t find any problems during Code Review, and QA didn’t find any problems with simple testing after deploying to the test environment. But a corner case brought down production after the deployment.
Could we make our service stable if we avoided last-minute changes?
“Avoid last-minute changes” is a good suggestion, so in the following Sprints we tried our best to avoid last-minute changes before deploying to production. Any such change required a stricter code review and QA process.
However, it’s almost impossible to avoid last-minute changes completely; there are always issues that need to be fixed before a release. And the service was still not stable after we tried to avoid them. That suggested that “last-minute changes” were not the root cause of the instability. So what was?
What is the root cause of the unstable service?
I didn’t have an answer to this question until the annual “Holiday Readiness” period a few months later, which is also known as the US shopping season. Groupon has high stability requirements during the shopping season, because even a short period of downtime can cause huge losses.
How to ensure the stability of the service?
Based on what we’ve learned from the blood and tears of the past few years, the simplest and most effective approach is: no updates during the holiday season, except for necessary minor updates or hot-fixes!
So we have a “Soft/Hard Moratorium” strategy during the shopping season. In other words, we don’t deploy new features to the production environment during the shopping season, but only deploy necessary minor updates or hot-fixes, and a strict approval process is required to deploy.
In response to the “Soft/Hard Moratorium” strategy, our team also made some adjustments:
- First, we created a holiday branch, which only received bug fixes or necessary updates for production. All other regular development remained on the main branch.
- Second, we tested the updates in the test environment for at least one week, making sure there were no problems before deploying to the production environment.
At the end of “Holiday Readiness”, I found that our service had been extremely stable: although there were several updates, no service incidents occurred. That gave me a lot of insight. I realized that there were two main reasons why the service had been unstable before.
Reason 1: There is no stable, ready-to-release branch
First of all, our releases are created from the main branch, and the PRs for new features and bug fixes are merged into the main branch, which means that the main branch is always unstable.
The problem of not having a stable branch existed when we had 2-week Sprints, but it wasn’t exposed at that time since we had 1 week to test. The problem was exposed after we moved to a weekly sprint, and it led to a lot of last-minute changes since the branch was not stable.
Reason 2: Not enough time for testing
We used to have a week to test when we had 2-week Sprints, and problems would be exposed in the test environment. After we changed to 1-week Sprints, we only had about 2 days to test, which was not enough to test fully, so many issues were exposed only after we deployed to production.
During Holiday Readiness, there was about a week of testing in the test environment before we released, so issues were fully exposed rather than found in production.
How can we improve the development process of 1-week Sprints?
Once we had identified the root causes, it was easy to see how to improve the process. I proposed two improvements to the current process.
Improvement 1: Each Sprint has a stable branch, which is ready to release at any time.
The problem of not having a stable branch is easy to solve: create a new release branch for each Sprint when the Sprint finishes.
Once the release branch is created, similar to what we’ve done on the holiday branch, we only do bug fixes on it; no new features are added to the release branch.
As for the last-minute changes mentioned earlier, a change is merged into the release branch only if it is urgent and necessary. Otherwise it is merged into the main branch only, so it won’t affect the release branch.
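The release-branch rule above can be written down as a small decision function. This is a minimal sketch rather than our real tooling: the `Change` type, the branch names, and the idea of a single `urgent_and_necessary` flag (which here also covers bug fixes that QA reports against the release) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MAIN_BRANCH = "main"  # assumption: the main branch is literally named "main"


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a change waiting to be merged."""
    description: str
    is_new_feature: bool
    urgent_and_necessary: bool = False


def merge_targets(change: Change, release_branch: str) -> List[str]:
    """Return the branches a change should be merged into.

    New features never enter the release branch. Other changes enter it
    only when they are urgent and necessary; they are also merged into
    main so the fix is not lost in the next release.
    """
    if change.is_new_feature or not change.urgent_and_necessary:
        return [MAIN_BRANCH]
    return [release_branch, MAIN_BRANCH]


# Usage with a hypothetical release branch name.
fix = Change("fix checkout crash", is_new_feature=False, urgent_and_necessary=True)
print(merge_targets(fix, "release/sprint-1"))
```

Keeping the decision in one place makes the policy explicit: anything that is not both a non-feature and urgent goes to the main branch and waits for the next release.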
Improvement 2: Leave enough time for testing.
A simple way to get enough time for testing is to go back to 2-week Sprints. But everyone is used to the pace of 1-week Sprints, especially product managers, who prefer to keep 1-week Sprints so that they can deliver new features as soon as possible.
So how can we ensure enough time for testing with 1-week Sprints?
I proposed a simple and feasible solution: postpone the release of the first Sprint after Holiday Readiness by one week.
Here’s how we did it:
- We spent a full week developing the first Sprint, then deployed it to the test environment and tested it for a week. Meanwhile, we started developing the second Sprint.
- The second Sprint was also developed for a full week, while we fixed bugs from the first Sprint in parallel.
- By the time the second Sprint finished development, the first Sprint had been tested. The first Sprint version was then deployed to production, the second Sprint version was deployed to the test environment, and development of the third Sprint began at the same time.
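The staggered schedule described in these steps can be expressed as a simple offset: in any given week, the Sprint in test is one behind the Sprint in development, and the Sprint in production is two behind. The sketch below only illustrates that arithmetic; the `PipelineState` type and the week numbering (week 1 is the first Sprint after Holiday Readiness) are assumptions for the example.

```python
from typing import NamedTuple, Optional


class PipelineState(NamedTuple):
    in_development: int
    in_test: Optional[int]        # None: nothing from the new process in test yet
    in_production: Optional[int]  # None: nothing from the new process released yet


def pipeline_state(week: int) -> PipelineState:
    """Return which Sprint is in each stage during the given week (1-based)."""
    if week < 1:
        raise ValueError("week must be >= 1")
    in_test = week - 1 if week >= 2 else None
    in_production = week - 2 if week >= 3 else None
    return PipelineState(week, in_test, in_production)


# Print the first four weeks of the schedule.
for week in range(1, 5):
    state = pipeline_state(week)
    print(f"week {week}: dev=Sprint {state.in_development}, "
          f"test={state.in_test}, production={state.in_production}")
```

From week 3 onward every stage is occupied, which is why the process still ships one release per week even though each Sprint takes two weeks to reach production.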
The new development process of 1-week Sprints
The new development process is shown in the figure.
That means we have a whole week to test and fix bugs for each sprint, and we can still deploy one release to production every week.
Our service stabilized right after that change. We no longer had to worry about serious production issues after deployment, and we rarely needed to roll back or patch after a release.
The problems raised by the new process
The changes above have significantly improved the quality of our releases, but they also raised some new problems.
Problem 1: There are two sprints in parallel
We have to fix bugs from the previous Sprint while developing the current Sprint in parallel. The good news is that bug fixes are usually simple, so this parallel work doesn’t cause many problems, and our team members are comfortable with this model.
Problem 2: Multiple branch management
Since every sprint has a release branch, the changes for a bug fix may have to be merged into multiple branches at the same time.
For example, suppose we find an issue in production and the production version is v1.0. The PR for the bug fix is merged into the release branch for v1.0, which serves production. Then it needs to be merged into the release branch for v1.1, which serves the test environment, and it also needs to be merged into the main branch.
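The propagation in this example follows one rule: a fix goes into the release branch where the bug was found, into every newer release branch that is still live, and into the main branch. Here is a minimal sketch of that rule; the `release/vX.Y` naming and the function name are hypothetical, and the list of live branches is assumed to be ordered from oldest to newest.

```python
from typing import List

MAIN_BRANCH = "main"  # assumption: the main branch is literally named "main"


def propagation_targets(live_release_branches: List[str], found_on: str) -> List[str]:
    """Return every branch a bug fix must be merged into.

    live_release_branches is ordered oldest to newest, for example the
    production branch followed by the branch in the test environment.
    """
    if found_on not in live_release_branches:
        raise ValueError(f"{found_on!r} is not a live release branch")
    start = live_release_branches.index(found_on)
    # The affected branch, all newer release branches, then main.
    return live_release_branches[start:] + [MAIN_BRANCH]


live = ["release/v1.0", "release/v1.1"]  # production, test environment
print(propagation_targets(live, "release/v1.0"))
# prints ['release/v1.0', 'release/v1.1', 'main']
```

A bug found only in the test environment branch would, under the same rule, skip the production branch and go to `release/v1.1` and the main branch.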
It’s a real hassle, but this inconvenience is completely acceptable compared to the improvement of stability.
Problem 3: The whole development process is not easy to understand
This development process works well inside our team, but it’s not easy to understand for people outside the team. All they care about is when their requirements will be deployed to the test environment and when they will be deployed to production.
So we developed a couple of tools that show at a glance which version is currently in development, which is in the test environment, and which is in production.
With these tools, everyone can know our current development status intuitively.
In general, although the new development process has some minor problems, it runs very well and the service is much more stable.
- The release branch provides a stable version which can be released at any time.
- One week’s testing is a good guarantee of quality.
- The frequency of weekly releases also allows us to get feedback quickly.
Recently, our team moved back to the 2-week Sprint Scrum model. We are currently more focused on the quality of delivery than on speed, so 2-week Sprints work better for us at this stage.
In the book “The Mythical Man-Month: Essays on Software Engineering”, Brooks insists that there is no one silver bullet.
“there is no single development, in either technology or management technique, which by itself promises even one order of magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity.”
No single development model can meet all requirements in all periods, and different periods may call for different models. You may face new challenges when moving from one development model to another; the most important thing is to find the root cause of each problem and the right solution for it.