Matching the pace of delivery with product demands
In the context of a business that is driven by telling the world’s stories as and when they happen, there is an expectation that its technology should move at a similar pace. When that happens, you typically accumulate technical debt at a corresponding rate. Over time, this slows down the pace of delivery, impedes innovation and increases bugs. So finding an approach that meets product demands and maintains software integrity is hard.
This was exactly the problem we had on The Sun. For two years we motored, trying to deliver features and keep up with our competitors in the digital publishing space. Eventually we ended up with a product we had low confidence in delivering, almost no automated tests and a build pipeline so fragile that it could take up to three days to get a change into production.
This meant technology was taking longer to deliver, which in turn compounded cost and effectively reduced the amount of value being delivered back to the business.
Stop, look and listen!
To find out how we got there and how to improve things, the engineering team did a retrospective on the current state of health of our technology stack.
We spent a lot of time analysing our current state of health, including:
- Measuring and monitoring build times:
  - the time for each build step
  - the time to build and deploy to each environment
  - the overall time from developer machine to production
- Conducting engineering lead retrospectives to discuss the good and bad parts of our system
- Looking at historical and current feature initiatives and identifying blocking trends that were preventing us from moving faster
- Attributing a cost (as hard as this is) to technical debt
The key conclusions we drew were:
- The build pipeline was fragile, so it was taking progressively longer to deploy code to all our environments.
- A lack of automated end-to-end tests meant that engineers spent a long time doing manual regression tests, slowing down delivery.
- Technical debt was causing bugs, diverting our effort away from innovation and into bug fixing.
But there was some good stuff too…
It’s also important to note what was good:
- Our components had an excellent amount of unit test coverage.
- The engineers had a lot of passion and excitement about working on a platform with traffic at such scale and interesting problems to solve.
- We were still — just about — able to meet product demands.
Convincing the business
Having built a solid picture of where we were, we presented it to the product team and stakeholders. We needed their buy-in to the changes required to keep innovating at speed.
This would not only change the way the business works with technology, but also improve the culture and overall happiness of our engineering teams, while letting us deliver a high-quality product at pace and with confidence.
It was clear that if we did not make things better we would not only prevent the business from achieving their goals, but our engineering talent would become unhappy, unmotivated and ultimately leave.
Make some easy rapid decisions
We made a few decisions really quickly:
- Invest in maximising our automation test coverage to remove the need to run through manual regression packs every time we release.
- In line with the overall engineering strategy at News UK, migrate our build pipeline from TeamCity to Jenkins.
- Put in place a strategy to chip away at our technical debt regularly.
Set some KPIs
To help us measure progress we set ourselves some KPIs:
- Release to production at least once per day.
- Reduce build time from developer machine to production to less than two hours.
- Zero bugs introduced into production.
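The second KPI needs a consistent definition of lead time to be measurable. Below is a minimal sketch of that calculation, assuming you can obtain the commit and production-deploy timestamps as epoch seconds; the function names and helper are hypothetical, not our actual tooling:

```shell
# Hypothetical lead-time helpers - illustrative only, not our real tooling.

# Minutes elapsed between a commit and its production deploy,
# given both timestamps as epoch seconds.
lead_time_minutes() {
  local commit_epoch="$1" deploy_epoch="$2"
  echo $(( (deploy_epoch - commit_epoch) / 60 ))
}

# PASS if the lead time beats the two-hour (120 minute) target.
within_target() {
  local minutes="$1"
  if [ "$minutes" -lt 120 ]; then echo "PASS"; else echo "FAIL"; fi
}
```

Tracking this number per release makes the trend visible, which matters more than any single data point.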
Build a plan
The only way to move fast and with confidence is to automate as much as possible and put in the right quality gates. This means automating any step in the development, testing and deployment process that is currently manual.
Automate manual regression
We started by focusing on automating our manual regression test pack, which covered test scenarios on a variety of browsers and devices. Automating it saved significant time, as it had been an intensive, time-consuming manual process.
Another huge pain point was that broken pull requests would often get merged to master because insufficient testing was done on them.
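The usual fix for this is a blocking quality gate on every pull request: no green checks, no merge. A rough sketch of the shape such a gate can take is below; the check names and npm scripts are illustrative assumptions, not the actual checks we ran:

```shell
#!/usr/bin/env bash
# Illustrative pull-request gate - the npm script names are assumptions,
# not the actual checks run on The Sun.
set -uo pipefail

# Run a named check; report and fail if it does not pass.
run_check() {
  local name="$1"; shift
  echo "Running ${name}..."
  if ! "$@"; then
    echo "${name} failed - blocking merge"
    return 1
  fi
}

# The CI job for a pull request would invoke this function; merging to
# master is only allowed when every check passes.
gate() {
  run_check "lint"       npm run lint     &&
  run_check "unit tests" npm test         &&
  run_check "e2e smoke"  npm run test:e2e &&
  echo "All checks passed - safe to merge"
}
```

The key property is that the gate fails fast and loudly, so a broken branch never reaches master silently.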
Our principal engineer Marie Drake has written an in-depth article on how we improved quality at The Sun which you can read here.
The challenge in most engineering teams is balancing resources, trying to continue to deliver value to the business whilst also focusing effort on technical improvements. There are a couple of ways to go about this:
- Reduce innovation: reprioritise your roadmap to allow for time and resource to be allocated to technical improvements. This does mean less feature development but a more robust platform going forwards.
- Add resources to increase fix rate: invest in extra short term resources to deliver technical improvements.
We decided to invest in extra resource to ensure we could still meet our KPIs, knowing full well that an important part of this would be to up-skill current team members and share knowledge of what was being done across the team, to bring about long-term change rather than just short-term relief.
Rebuild the delivery pipeline
A core part of being able to work as an efficient engineering team is to have a reliable, well maintained build pipeline that validates your code and business rules before it gets deployed to production.
I can remember quite vividly a conversation with the engineering team during one of the first releases that I was part of which went something like this:
Engineer 1: “We have an issue with the release”
Engineer 2: “Let’s roll it back”
Engineer 1: “We can’t, the rollback doesn’t work”
This conversation came about because of a lack of automated testing, a large manual testing regression pack and a fragile delivery pipeline. Our build pipeline was the culmination of two years of technical debt that had become so complex people were afraid to change it. This is a dangerous state to get into. It can lead to costly mistakes and it’s not great for morale. With the pipeline in this state, we were being held hostage by our own tools and this prevented us from being able to deliver at pace.
Migrate to Jenkins
We decided to migrate from TeamCity to Jenkins (in line with our overall engineering strategy) with a focus on scripting absolutely everything so that the amount of manual intervention was minimal.
We opted to write our pipeline steps as bash scripts, which kept the process agnostic of the CI tool and avoided tying ourselves to Groovy. Engineers from across the team could pick up the scripts and contribute, driving productivity and knowledge sharing.
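The resulting pattern looks roughly like the fragment below: each Jenkins stage is a thin wrapper that shells out to a script versioned in the repository, so the logic stays in bash. The script paths here are illustrative, not our actual layout:

```groovy
// Illustrative Jenkinsfile: each stage only shells out, keeping the
// pipeline logic in bash scripts rather than Groovy.
pipeline {
    agent any
    stages {
        stage('Build')  { steps { sh './ci/build.sh' } }
        stage('Test')   { steps { sh './ci/test.sh' } }
        stage('Deploy') { steps { sh './ci/deploy.sh' } }
    }
}
```

Because the scripts are plain bash, they can also be run locally, which makes pipeline changes much easier to test before they hit Jenkins.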
The other key change we made was to move to a mono-repo. A mono-repo is a repository that stores code for multiple projects. Our engineers constantly worked across two interdependent projects and this proved challenging when developing and releasing. The mono-repo also allowed us to set conventions, standards and best practices across multiple projects and geographically dispersed teams.
We also adopted Lerna to manage our multi-package repository. Lerna helped us:
- Detect which application package has changed
- Build the application package that has changed
- Push the package to a repository
- Deploy the package
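In day-to-day use, the flow above maps onto Lerna's standard CLI roughly as sketched below; the exact flags and scripts in our pipeline differed:

```shell
# Sketch using Lerna's standard commands - illustrative, not our exact pipeline.

# List the packages that have changed since the last release tag
lerna changed

# Run each changed package's build script, filtered by what changed
lerna run build --since origin/master

# Version and publish the changed packages
lerna publish
```

Building and deploying only what changed is what keeps a mono-repo pipeline fast as the number of packages grows.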
Moving the entire Sun software landscape into the mono-repo took place over the course of six months and was a collaboration between engineering and our operations team.
The efforts and investment here enabled us to release multiple times a day versus two to three times a week!
Deal with technical debt on a constant basis
Almost all software that is built has some level of technical debt. Technical debt is typically code that has grown complex, unmanageable and unmaintainable over time and is hindering the development of new and existing features.
The amount of technical debt you have, and where it sits in your codebase, will have an impact on your delivery timelines. Developers will usually pad their estimates when dealing with complex sections of the codebase. This can lead to longer-than-expected builds and, in some cases, prove extremely costly.
Our codebase had passed through several pairs of hands over time. When this happens, one team of developers builds the functionality a certain way, then another team rewrites it in a slightly different way without fully understanding the ramifications of their changes, leaving behind a mishmash of old and new code that doesn't work well together.
This, together with a lack of good unit testing, compounds over time to the point that you end up with spaghetti code that your current engineers cannot understand, debug or extend.
The 20% rule
To unravel this we put a basic rule in place:
- Teams will spend twenty percent of each sprint on technical debt tickets.
This was sold to the business as a valuable piece of work required in order to unlock the ability to develop features faster. More importantly it was a culture change for the product and engineering teams. By dealing with technical debt, engineers felt empowered to improve the quality of the product as they now had the time and focus to deal with technical debt that had frustrated them for months. Product managers agreed that in order to be able to deliver a high quality product and deliver features faster to the business we must focus on dealing with technical debt.
To help us explain the problem to the business, the team spent a number of hours building a technical debt backlog that was prioritised by items that were impacting the development experience the most. By the end of these sessions, the team had identified thirty to forty tickets. Our goal was to use our twenty percent tech debt time to knock these on the head as fast as possible.
Below is an example of some of the work we undertook:
- Refactoring complex functions using a Test Driven Development approach to ensure the code was of high quality and stable.
- Removing legacy code that was not in use but still kicking around in the codebase.
Let’s be honest, none of this was plain sailing. Every organisation is constantly working to improve its products and become a market leader in its area. Technology plays a large part in these objectives, and if it isn’t managed correctly it will hamper a business’s ability to grow and can, in the worst case, lead to failure.
Where are we now?
The engineering team continue to evolve having put in place good engineering standards and practices as well as culturally changing the way we work.
Based on the KPIs that we set out, we achieved the following:
- Release to production at least once per day ✓
- Reduce build time to production from developer machine to < 2 hours ✓
- Zero bugs introduced into production ✓
- Engineers can maintain the build pipeline ✓
By making the changes above we have:
- Improved our ability to deliver business value faster
- Improved the developer experience, which keeps engineers happy
- Made onboarding new engineers easier: the code is less complex and easier to read and understand
- Raised confidence when releasing to an all-time high
- Reduced the manual testing effort to a minimum
- Kept integration testing up to date and automated as much as possible
We continue to educate our teams on DevOps principles and practices, chip away at the remaining technical debt, and focus on site performance.