Reducing cost and latency of change for legacy services

Just Tech · Ministry of Justice Digital & Technology · 6 min read · Jun 17, 2019

by James Abley (Technical Architecture Profession)

Context

Like a lot of parts of government, the Ministry of Justice has a large technology estate. Ours includes many systems which pre-date GDS spend controls. How government buys technology has changed over the last 6 years, but we still run systems from before that time. In this article, I'm calling those systems legacy services.

Many of these legacy services are business-critical core systems without which we could not operate. It would cause massive amounts of pain/frustration if they weren’t available. Our CTO Dave Rogers has written about some of the problems faced when trying to transform these ageing services.

Delivering substantive changes to services can take time. But just because something is hard doesn’t mean it’s not worth doing. And if you do it right, you can end up in a much better place, making continuous small improvements.

Our problems

Some of the properties of these systems are less than desirable. For example, changes are very infrequent. We have services that are only updated once per year. These changes are risky affairs. Downtime with weekend work is the norm. Then there is usually the unofficially accepted post-release cleanup period. This involves further unplanned releases to address problems introduced by the planned release.

Infrequent releases are not great for the people using the system, who may be experiencing frustration in the meantime. Batching up fixes in this way means we are delaying improving people's quality of life. That's quite a blunt way of putting it; it's meant to be.

So the cost of change was too large. Changing these services involved the work of many people over a long period of time.

The book Accelerate by Dr Nicole Forsgren, Jez Humble, and Gene Kim contains a wealth of evidence (science FTW!) about high-performing technology organisations. It talks about 4 key metrics that are useful to measure when trying to understand Software Delivery Performance:

  1. Lead time — the time it takes from a customer making a request to the request being fulfilled
  2. Deployment Frequency — this is a proxy for batch size; deployment frequency is easier to measure than batch size and typically has low variability
  3. Mean Time to Restore (MTTR) — when a service incident occurs, how quickly is service restored?
  4. Change Fail Percentage — what percentage of changes to production fail?

If we’re doing one release a year, we can see that the average lead time is 6 months. Deployment frequency is annual. The mean time to restore was a week, but sometimes fixes were expedited and done mid-week rather than waiting for another weekend to do a release.

And the Change Fail Percentage was very high. Maybe not quite 100%, but not far off.
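
To make those four metrics concrete, here is a rough Python sketch of how you might calculate them from a deployment history. The record structure and the example data are invented for illustration (they are not from our pipeline), but they show why an annual, failure-prone release cycle scores so badly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    requested: datetime   # when the change was asked for
    deployed: datetime    # when it reached production
    failed: bool = False  # did it cause a problem in production?

@dataclass
class Incident:
    started: datetime
    restored: datetime

def lead_time(changes: list[Change]) -> timedelta:
    """Average time from a request being made to it being live in production."""
    waits = [c.deployed - c.requested for c in changes]
    return sum(waits, timedelta()) / len(waits)

def deployment_frequency(changes: list[Change], period: timedelta) -> float:
    """Deployments per day over the period being measured."""
    deployment_days = {c.deployed.date() for c in changes}
    return len(deployment_days) / period.days

def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average time from an incident starting to service being restored."""
    outages = [i.restored - i.started for i in incidents]
    return sum(outages, timedelta()) / len(outages)

def change_fail_percentage(changes: list[Change]) -> float:
    """Percentage of production changes that caused a problem."""
    return 100 * sum(c.failed for c in changes) / len(changes)

# A toy legacy-style history: one big annual release that broke things,
# followed a week later by the unplanned clean-up release.
history = [
    Change(requested=datetime(2018, 7, 1), deployed=datetime(2019, 1, 6), failed=True),
    Change(requested=datetime(2018, 12, 1), deployed=datetime(2019, 1, 13)),
]
incidents = [Incident(started=datetime(2019, 1, 6), restored=datetime(2019, 1, 13))]

print(lead_time(history))                                  # roughly 116 days
print(deployment_frequency(history, timedelta(days=365)))  # ~0.005 deployments per day
print(mean_time_to_restore(incidents))                     # 7 days
print(change_fail_percentage(history))                     # 50.0 in this toy history
```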

Our journey

We recently migrated all the services for the Legal Aid Agency from traditional hosting. A business case had established that moving some of the services to the public cloud would give us financial benefits.

But we also managed to eke out some other benefits as part of the migration. We have made huge improvements to the metrics mentioned above.

We had some guiding principles for the migration to public cloud:

  1. We will move out of the current data centre
  2. We will not provide the service in a way that is any worse than the current service
  3. Applications delivered into public cloud will have a stable, automated deployment pipeline
  4. Applications in public cloud should be capable of redeployment during normal working hours

These were enough to get the organisation (and our users!) to a much better place:

  • We have got the lead time down to the same day for small changes and fixes.
  • The deployment frequency averaged around once every 3 days* (previously it was once a year).
  • Mean Time To Restore is currently untested, but the lead time and deployment frequency numbers suggest that service would be restored on the same day.
  • Change Fail Percentage has dropped to 0%.

How we did this

Deployment pipeline

A deployment pipeline is an automated manifestation of your process for getting changes into an environment. Adopting a deployment pipeline means that all changes have to go through version control: a change is committed to version control, and that commit starts a build. This change in how people worked ensured that every change was visible.

The application/service gets built once and then deployed everywhere with the relevant configuration for that environment. This gave a nice separation between things that tend to have different rates of change — the application code, and the configuration for each environment.
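
As an illustration of "build once, deploy everywhere", here is a minimal, hypothetical Python sketch. The environment names and settings are invented; the point is that the artifact is produced a single time and only the configuration applied at deploy time differs between environments.

```python
import hashlib
from pathlib import Path

# Per-environment configuration: the artifact never changes, only this does.
# The environment names and settings here are invented for illustration.
ENVIRONMENTS = {
    "dev":        {"db_host": "dev-db.internal",  "log_level": "DEBUG"},
    "staging":    {"db_host": "stg-db.internal",  "log_level": "INFO"},
    "production": {"db_host": "prod-db.internal", "log_level": "WARNING"},
}

def build() -> Path:
    """Build the application exactly once, producing a single versioned artifact."""
    artifact = Path("app.zip")
    # Real build steps (compile, package, tag with a version) would go here.
    artifact.write_bytes(b"pretend this is the packaged application")
    return artifact

def deploy(artifact: Path, environment: str) -> dict:
    """Deploy the same artifact anywhere, applying that environment's
    configuration at deploy time rather than baking it into the build."""
    config = ENVIRONMENTS[environment]
    checksum = hashlib.sha256(artifact.read_bytes()).hexdigest()[:12]
    # A real pipeline would push the artifact and configuration to the target here.
    return {"environment": environment, "artifact": checksum, **config}

if __name__ == "__main__":
    artifact = build()
    # The same build is promoted through each environment in turn.
    for env in ("dev", "staging", "production"):
        print(deploy(artifact, env))
```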

Trunk-based development

Trunk-based development has helped:

  • ensure small batch sizes
  • increase the frequency of deployments
  • reduce the risk of deployments
  • reduce the time spent merging long-lived branches.

And we can try to break down work so that useful increments can be delivered regularly.

Stabilisation periods are no longer necessary, and we no longer spend time on difficult merges of long-lived branches that have diverged a long way from trunk.

Adding automated tests

Many of the legacy services were created when Test-Driven Development and other practices were not as widespread. They do not have a vast suite of automated tests with a high level of coverage that would give us the confidence that we aren't breaking things.

The most valuable investment we decided we could make was adding functional tests which exercised the way our users would typically use the applications. If we could replicate the happy path of how people used the system, we could manage the risk and be confident that:

  1. The migration would not break anything
  2. We could continue to improve and fix things post-migration

These sorts of tests are slower to execute, since they typically need a few distributed components, but they gave us the best level of coverage when working with code bases that had not been designed to be testable.

As part of the deployment pipeline, these tests are a quality gate that runs for every potential change which might go in front of our users.
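
To give a flavour of what these tests look like, here is a simplified pytest sketch of a happy-path journey. The base URL, endpoints and fields are entirely hypothetical, and it assumes the requests library and a deployed test environment; the point is that the test drives the application the way a user would, rather than testing individual code units.

```python
# test_happy_path.py -- a sketch of a functional, user-journey test.
# Run with: pytest test_happy_path.py
# Assumes the requests library is installed and APP_BASE_URL points at a
# deployed test environment. The endpoints and fields below are hypothetical.
import os
import requests

BASE_URL = os.environ.get("APP_BASE_URL", "https://app.test.example")

def test_user_can_submit_and_then_view_an_application():
    session = requests.Session()

    # Step 1: the landing page loads.
    response = session.get(f"{BASE_URL}/")
    assert response.status_code == 200

    # Step 2: the user submits an application (hypothetical endpoint and fields).
    response = session.post(
        f"{BASE_URL}/applications",
        data={"applicant_name": "Test User", "case_type": "example"},
    )
    assert response.status_code in (200, 201)
    reference = response.json()["reference"]

    # Step 3: the application they just submitted can be retrieved again.
    response = session.get(f"{BASE_URL}/applications/{reference}")
    assert response.status_code == 200
    assert response.json()["applicant_name"] == "Test User"
```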

Eliminating waste

From a Lean sense, we have also eliminated various wasteful activities.

When we had long-lived branches, we used to have a different environment per branch, and people had to track which version was in which environment.

Now that we’re doing trunk-based development with the ability to deploy to production on the same day, we don’t need to do that. If someone asks what version is in environment X, we always know:

  1. The deployment pipeline keeps track of it now, rather than a person, so check what the deployment pipeline says is deployed in environment X (a sketch of that kind of check is below)
  2. But mostly the answer will be “trunk”
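
As a small illustration of asking the machinery rather than a person, here is a hypothetical sketch that asks each environment what it is running. The URLs and the /version endpoint are invented; in practice the deployment pipeline's own records or a build-info endpoint would play this role.

```python
# A sketch of "ask the machinery, not a person" for what is deployed where.
# The URLs and the /version endpoint are invented; in practice the deployment
# pipeline's own records or a build-info endpoint would play this role.
import json
from urllib.request import urlopen

ENVIRONMENT_URLS = {
    "staging": "https://app.staging.example",
    "production": "https://app.example",
}

def deployed_version(environment: str) -> str:
    """Return the version (for example, a git commit hash) running in an environment."""
    with urlopen(f"{ENVIRONMENT_URLS[environment]}/version") as response:
        return json.load(response)["commit"]

if __name__ == "__main__":
    for env in ENVIRONMENT_URLS:
        print(env, deployed_version(env))
```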

We have also got rid of several environments; we no longer need to pay for them or manage them, thanks to our new ways of working.

Conclusion

We started the migration wanting to realise the cost-savings available to us from using public cloud. In truth, we know we are still very early on in the journey for this part of the organisation. And it’s not evenly distributed. Some of our colleagues have to work with bits that are still harder than they should be to change and improve.

But we have achieved lots of small improvements. Small batch sizes:

  • reduce risk
  • give opportunities for faster feedback and learning
  • deliver value to the organisation and our users sooner

That is such a massive difference for the organisation, and something we should be immensely proud of.

*Deployment frequency was quite high for a while because, for the first time, we had better insight into how the systems were running in production. We could access the logs and see the graphs. We could see the unreported frustrations that our users were tolerating or working around every day. So we fixed them. For most of the teams now, it's a capability that they can choose to exercise as part of their normal sprint cadence (typically fortnightly).

If you enjoyed this article, please feel free to hit the 👏 clap button and leave a response below. You can also follow us on Twitter, read our other blog or check us out on LinkedIn.

If you’d like to come and work with us, please check current vacancies on our job board!

