DevOps: Small Changes, Big Impact

Bharat Reddy
Accurx
Feb 24, 2023 · 9 min read

We’re big believers in regular retrospectives here at Accurx. We think they’re a great way to learn from past experiences to inform better outcomes in the future.

Having joined Accurx’s fledgling DevOps function in November 2020, I’m lucky to have lots to ‘retro’ on. A recent talk I gave at DevOps Exchange London had me reflecting on the past couple of years of improving our DevOps practice, and in particular on the smaller interventions we were able to make that had an outsized impact in helping our engineers serve patients and healthcare staff.

The Platform team at our Away Day at the National Computing Museum, circa October 2021

Each of these interventions originated from user research that we conducted to arrive at a vision for engineering productivity and system resiliency at Accurx, and I’ll briefly take you through that process next, before we dive in.

Making time to listen to our users

When joining a company with a clear preference for action, it can prove difficult to slow the hands-on work down and trust that listening to users is worth doing. If there’s one thing you take away from this post, it’s that this listening is paramount. If one thinks of the target for “DevOps improvement” as an experiment, trampling all over the experiment with one’s ideas before having measured anything is very likely to muddy the waters, making it difficult to prove value.

Asking the questions was a little bit like turning on a firehose — people had so much to say. But this was a very positive development in my book as it showed that people cared and trusted us to get things done.

The output from our first round of user research
Our initial areas of focus for 2021, outside the interventions I’ll talk about

After a fortnight of user research, we had a big old list of problems to tackle. The first step was to pull these into a set of aspirational outcomes which could help guide us. These outcomes not only steered our efforts and gave clear markers of where we should focus, but also helped us set out our grand vision for DevOps in its early stages.

While coming up with this aspirational list of relatively high-effort items, we saw that there were some quick wins we could deliver.

Low effort, high return investments

So, here are the three interventions we made that had an outsized impact for the amount of time and effort put in.

1. Shorten feedback loops

Previewing changes before they hit trunk

Our product teams consist of a cross-functional mix of skills across engineering, product management, design and user research. Something I’ve been consistently impressed by in the product teams is how closely engineering works with other functions within a team. This includes everything from attending GP practice visits together to helping product management to quickly validate hypotheses about user needs.

One of the main complaints we heard in user research regarding this process was the lack of support for previewing changes before committing them to the trunk branch and releasing to the demo environment for feedback. We saw a couple of hacky workarounds being used, such as running a frontend and services locally and asking folks to connect to a random IP on the internal network, which changed often and required the engineer to keep their machine up.

Having recently moved to Kubernetes, a container orchestration platform that has allowed us to standardise hosting and deployment patterns across our applications, we found it trivial to support this workflow natively in our build pipelines. We released “branch previews”, where an engineer can supply a custom hostname prefix, e.g. “experiment-1”, and have their code deployed to our development environment, e.g. “experiment-1-web.dev.accurx.com”.
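To make the workflow concrete, here’s a minimal sketch of what a branch-preview deploy step could look like. The PREVIEW_PREFIX variable, the manifest shape and the preview label are illustrative assumptions rather than our actual pipeline code; the idea is simply to template a hostname from the supplied prefix and apply it to the development cluster.

```python
#!/usr/bin/env python3
"""Hypothetical branch-preview deploy step (illustrative, not Accurx's pipeline)."""
import os
import subprocess

# The prefix the engineer supplies in the pipeline, e.g. "experiment-1".
prefix = os.environ["PREVIEW_PREFIX"]
host = f"{prefix}-web.dev.accurx.com"

# Minimal Ingress routing the preview hostname to the branch's service.
# The label lets a cleanup job find stale previews later.
manifest = f"""
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: preview-{prefix}-web
  labels:
    preview: "true"
spec:
  rules:
  - host: {host}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: preview-{prefix}-web
            port:
              number: 80
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
print(f"Preview available at https://{host}")
```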

We were reasonably confident the feature would be well used, given that people were already hacking around the lack of it. It was nice to see success in the metrics though: we saw close to eighty preview deployments in a week. That’s a lot for a team of (at the time) five frontend engineers.

Though there’s an argument for using an internal developer platform for something like this, a custom implementation leaves us free to experiment with more deployment strategies, such as canary deployments with automatic rollbacks based on application-specific logic.

Some tips:

  1. Understand where the most painful part of your product validation process is before making improvements.
  2. Run it by security: you’re essentially allowing people to deploy whatever they want from a branch to a domain that you control.
  3. Ensure you have a plan for automatically removing stale deployments, lest your resource usage go through the roof (see the sketch after this list)!
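To flesh out tip 3, here is one way a scheduled cleanup job could work, assuming previews are labelled as in the earlier sketch. The label, the seven-day threshold and the use of kubectl are assumptions for illustration, not our actual implementation.

```python
#!/usr/bin/env python3
"""Hypothetical scheduled job that removes stale branch previews (tip 3)."""
import datetime
import json
import subprocess

MAX_AGE = datetime.timedelta(days=7)  # assumed threshold
now = datetime.datetime.now(datetime.timezone.utc)

# Find all preview Ingresses by the label used when they were created.
result = subprocess.run(
    ["kubectl", "get", "ingress", "-l", "preview=true", "-o", "json"],
    capture_output=True, text=True, check=True,
)

for item in json.loads(result.stdout)["items"]:
    created = datetime.datetime.fromisoformat(
        item["metadata"]["creationTimestamp"].replace("Z", "+00:00")
    )
    if now - created > MAX_AGE:
        name = item["metadata"]["name"]
        # A real job would also remove the preview's Deployment and Service.
        subprocess.run(["kubectl", "delete", "ingress", name], check=True)
        print(f"Removed stale preview {name}")
```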

2. Make sure they build it, and they run it

This one got a few nervous laughs when I mentioned it at DevOps Exchange, but I promise it’s worked well for us.

2a. When building infrastructure

Organising infrastructure-as-code to encourage engineers to take ownership

My earliest memory of infrastructure as code (IaC) at Accurx was that, thanks to a prescient engineer and our CTO, we had comprehensively captured and were able to replay large parts of our infrastructure. However, we saw that new infrastructure was sometimes missed, and that owning teams weren’t engaging with this code due to its complexity.

To help, we slowed our pace of infrastructure development down when needed (e.g. after incidents) to capture any ad-hoc changes that had been made. We also re-organised the code along product and service lines to create a more familiar environment for our product engineers to engage with. With these changes we found increased engagement, and engineers started thinking about infrastructure changes as a code-first activity.
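As an illustration of what “organised along product lines” can look like, here’s a minimal sketch of a per-product stack using Pulumi’s Python SDK with Azure resources. The post doesn’t name the IaC tool, the products or the cloud resources involved, so everything below is an assumption for illustration; the point is that each product area gets its own small, self-contained stack that the owning team can read and change.

```python
# infra/patient-messaging/__main__.py
# Hypothetical per-product stack (Pulumi and Azure are illustrative choices,
# not necessarily what Accurx uses).
import pulumi
from pulumi_azure_native import resources, storage

# Everything this product owns lives in its own resource group, so the
# owning team only sees infrastructure relevant to them.
group = resources.ResourceGroup("patient-messaging")

# An example resource the product team might own and change themselves.
account = storage.StorageAccount(
    "patientmsgdata",
    resource_group_name=group.name,
    sku=storage.SkuArgs(name=storage.SkuName.STANDARD_LRS),
    kind=storage.Kind.STORAGE_V2,
)

pulumi.export("storage_account_name", account.name)
```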

Recently, we were able to replace all our production gateways and clusters in a couple of pipeline steps and half a day of work, proving that this has been a great investment.

Some tips:

  1. Split infrastructure by product area, and along existing lines, so that engineers feel comfortable making changes.
  2. Work with one or two key engineers in the early days so that they evangelise infrastructure as code to the rest of the engineers.

2b. When building product

An early declaration of a culture of DevOps

Early on in Accurx’s history, engineering leadership ensured that product teams agreed to take on every aspect of ownership of their applications. On top of this, the expectation was that DevOps teams would act as satellite teams, providing guidance to product teams to help resolve issues or try new technology.

This can definitely be taken too far. I’ve heard a few horror stories about companies ending up with a ton of redundant pipelines doing the same things in slightly different and very insecure ways. So be careful, and ensure that the team acting as the “subject matter expert” for shipping code has built enough trust that product teams call on them first when they need something.

We’re lucky at Accurx in that our product engineering teams care deeply both about user experience and about working with us to address issues in partnership. This allows us to be confident in delegating ownership of a task when dealing with issues or trialling new technology. We only take primary ownership of such tasks when delegating them would require a product team to take on so much additional context that it would get in the way of building and shipping products.

Some tips:

  1. Ensure you’ve built trust with engineers, and that they’ll come to you before building their own (possibly redundant) improvements.
  2. Take ownership of an issue only as a last resort. It’s tempting to jump in and own things because we want to help, but doing so is frequently detrimental to the engineers you’re trying to help: the context they build on their own application and infrastructure reduces the chance of external teams becoming a blocker later on.

2c. When needing elevated privileges

Making it easier to break the glass

As a former security architect, I would have had alarm bells ringing at the command above. But I think engineers often forget that good security practices can actually make workflows easier rather than harder.

We were lucky in that our cloud provider released a preview feature with native support for granting temporary privileges just as we started to feel this pain due to hiring more engineers in early 2021. We agreed with our security team that we could start with a small group of senior engineers approving these privileges. This ended up being more secure than the practice at the time of moving engineers into an administrative group (and inevitably forgetting them in there).

We’ve now enlarged the group to the senior engineers most likely to have context on privilege elevation requests within their teams, and so far this has been working well. Perhaps even too well: before making this change, we were worried that making it too easy for engineers to elevate privileges might mean we end up doing lots of things manually, never writing down procedures or automating them. So as a compromise, we ensure that engineers and approvers both provide plenty of context when using this workflow, which we hope to revisit over time to automate some of the more dangerous actions.

Some tips:

  1. Ensure every action taken leaves an audit trail.
  2. Ensure engineers cannot approve their own requests.
  3. Require both the engineer and the approver to provide a reason for needing access (a toy sketch of these rules follows below).
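Our workflow relies on the cloud provider’s native temporary-privilege feature, so there’s no custom code behind it, but as a toy illustration of the three rules above, here is what enforcing them might look like in a home-grown approval service. The names, data shape and logic are entirely hypothetical.

```python
"""Toy sketch of the break-glass rules above; purely illustrative,
not how the cloud provider's feature (or ours) is implemented."""
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ElevationRequest:
    requester: str
    approver: str
    requester_reason: str   # rule 3: the engineer explains why access is needed
    approver_reason: str    # rule 3: the approver records their context too


def approve(request: ElevationRequest, audit_log: list[dict]) -> None:
    # Rule 2: engineers cannot approve their own requests.
    if request.requester == request.approver:
        raise PermissionError("Requests must be approved by someone else")

    # Rule 3: both sides must provide a reason.
    if not request.requester_reason.strip() or not request.approver_reason.strip():
        raise ValueError("Both requester and approver must give a reason")

    # Rule 1: every action leaves an audit trail.
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "requester": request.requester,
        "approver": request.approver,
        "reasons": [request.requester_reason, request.approver_reason],
    })


# Example usage with hypothetical names.
log: list[dict] = []
approve(
    ElevationRequest(
        requester="alice",
        approver="bob",
        requester_reason="Investigating a production incident",
        approver_reason="Confirmed with the on-call lead",
    ),
    log,
)
```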

3. Hold blameless postmortems

A lot of the credit for this one must go to the first few engineers at Accurx, who set up a blameless culture and process around learning from incidents early on. Very rarely are specific people called out; instead, we refer to specific commits or combinations of commits as having caused issues. This culture has thankfully persisted as we’ve scaled to more teams and more engineers.

Can an increasing number of incidents over time be a good thing?

As a DevOps engineer, I would start to worry if I looked at the incident chart above without any context, given the increasing number of incidents. But with context, I can assure you that an increase in reported incidents has largely been a positive outcome for us. A lot of these incidents would previously have been solved in team channels without the knowledge of anyone outside the team. But thanks to our work culture, product teams know they can ask for help to fix things. This means that our incidents tend to last for less time and have clearer processes around them that new engineers can follow successfully.

But measuring all this is extremely important, and I’m happy to plug our friends over at incident.io, who make it super easy to collaborate on incidents and measure them. Lately, we’ve found that the initial “issue thread” is just a few messages and a lot of the conversation happens in dedicated incident channels.

A retro on this retro

Anec-data-lly, a learning

You may have noticed a distinct lack of data in this post to support any of my hypotheses, and if there’s one thing I’d want to tell me-from-two-years-ago, it’s to measure twice and cut once.

This is difficult to do, of course (else we would have done it), especially when getting healthcare communication products into the hands of our users during a pandemic. But I wish we’d taken a leaf out of our product managers’ books and started to collect and preserve data so that we could say more accurately whether the interventions we were making were having the desired effects.

Tell us what helped you

Hopefully you’ve found this post helpful and can take away some small changes you can make to disproportionately improve the lives of your engineers, or at least know that the big tickets aren’t always the best ones to do first.

A lot of DevOps literature tends to be written as “here’s what you should do” rather than “here’s what we did”, and I think it can build a lot of confidence for your fellow engineers to hear about others’ experiences. So please share as we’ve done here; we’d love to hear what worked and what didn’t for you!

Join us!

We’re always looking for great engineers for our growing teams, so if you’re interested in joining our engineering community, head on over to our careers page for the latest opportunities!
