Rapid Software Development in the NHS: a response to Coronavirus

Benjamin Osborne
Accurx
16 min read · Mar 19, 2020

I work for accuRx, a company that provides software to help staff in the NHS communicate with Patients. The last 2 weeks (at the time of writing) have been, it would be fair to say, unprecedented: firstly for the NHS, but also for us as a health-tech company in that space, and this is not going to change any time soon.

On Friday 6th of March, concerns around Coronavirus were ramping up in the UK, and it was clear that, as a software provider in this space, we should act quickly to do our best to provide the kind of functionality our user-base would need to increasingly care for Patients remotely.

As a business, we were able to deliver some simple but very effective product improvements and features in a weekend, and some bigger changes in the week which followed, deploy them, and see rapid usage & adoption with a very engaged user-base.

This short piece is less about the details of what those improvements were, and more about the technical infrastructure and principles which were in place to make a rapid response possible; the Product context is simply a timely example to help make what is often abstract more concrete. The stark nature of the times we are in, especially in health-tech, means we should focus now more than ever on our ability as technical practitioners & Software Engineers to respond quickly to our user-base, and to deliver change rapidly and, most of all, safely & reliably.

In this piece, I want to briefly summarise what we, as a small Engineering Team, were able to build and deploy over a weekend, and more in the days which followed — but then focus in on the set of Practices and Principles we employ in the Engineering Team which have enabled us to respond quickly and safely to the needs we have met in our user-base.

Who is this aimed at?

  • Software Engineers, Product Managers & Technical Practitioners looking to deliver change quickly
  • Leaders in healthcare excited to find examples of the industry responding quickly
  • Frontline NHS staff who want to know that they’re being backed up
  • Conspiracy theorists convinced the outbreak is a government plan (if you record yourself reading every other word here out loud, then play it backwards, there’s a message for you…)

Brief Product Context

The product we provide to our Primary Care user-base enables Clinicians & staff both to send messages to Patients, and to get responses to some pre-built health surveys (e.g. questions about the severity of their Asthma), streaming the results back to their Medical Records. This is all largely delivered through an installed desktop application (what!? why!? It’s 2020 I hear you say… sadly the systems we integrate with require that integrations come through locally installed applications… <bangs head on desk>).

(Note: as a business, we also provide a growing web product aimed more at Hospital & community staff, and other members of our team have made some very exciting innovations here as well in recent days, but I’m just going to focus on our GP user-base improvements as the examples for this piece).

The rapid product build

So, on the afternoon of Friday 6th, looking to respond to the growing crisis, a few of us (3 Engineers & a Product Manager) decided to build out a couple of product improvements and features to have ready to try at start-of-play the following Monday…

  • Remote Video Consult (partnering with the fantastic whereby.com), where, in brief, we could create “rooms” and share a link with both Clinician and Patient, so they can meet with video & audio, securely, via a browser, straight from their phones or computers, without needing to install any App (a rough sketch of this flow follows the list below).
  • A Coronavirus Survey Clinicians could send to Patients to fill in online, with the results streamed straight back into their Medical Record Systems (building on top of our existing remote survey feature).
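
To make the shape of that Video Consult flow concrete, here is a minimal TypeScript sketch (not our actual code): the provider endpoint, the sendSms helper and the payloads are all invented for illustration, and in reality the Patient link would go out through the existing Patient messaging pipeline.

```typescript
// Minimal sketch of the video-consult flow (illustrative only).
// The room-provider endpoint, payloads and SMS helper are invented for illustration.

interface VideoRoom {
  roomUrl: string;      // link shared with the Patient
  hostRoomUrl: string;  // link used by the Clinician
}

async function createVideoRoom(apiKey: string): Promise<VideoRoom> {
  // Hypothetical provider endpoint: create a short-lived meeting room.
  const response = await fetch("https://api.example-video-provider.com/v1/meetings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ endDate: new Date(Date.now() + 60 * 60 * 1000).toISOString() }),
  });
  if (!response.ok) throw new Error(`Room creation failed: ${response.status}`);
  return (await response.json()) as VideoRoom;
}

// Hypothetical SMS sender; in reality this would go through the existing
// patient-messaging pipeline described above.
async function sendSms(phoneNumber: string, message: string): Promise<void> {
  console.log(`SMS to ${phoneNumber}: ${message}`);
}

export async function startVideoConsult(apiKey: string, patientPhone: string): Promise<string> {
  const room = await createVideoRoom(apiKey);
  // The Patient gets the plain room link; no app install required, it opens in the browser.
  await sendSms(patientPhone, `Your GP would like a video consultation: ${room.roomUrl}`);
  // The Clinician joins via the host link from the desktop application.
  return room.hostRoomUrl;
}
```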

On Friday evening, when our users went home, no code had been written; by Monday morning, when they came back in, these product improvements were deployed, installed & ready to go. Over the following 7 days, usage ramped up to over 10K Video Consults and 20K Coronavirus surveys per day.

In the week which followed, a team of 4 Engineers, a Product Manager & a Clinical Lead were additionally able to take some features which were already in a basic prototype phase fully live, enabling:

  • A simple workflow for Clinicians to get a single response from a Patient to a question, without needing to spend precious time waiting on the end of a phone line, or worse, making vulnerable Patients travel in.
  • Simple Document Sharing, so Clinicians & admin staff can send documents to Patients digitally, without needing the Patient to come in to collect paper or use an expensive (and slow) postal solution.

Being transparent: none of the above was, in isolation, hugely technically complicated, given that the product we had already built had access to Patient contact details, integrations with GP Medical Record Systems, a push notification system, a Patient-facing portal to enable Patients to answer questions, and a reliable deployment and upgrade service for installed applications. But as anyone who has tried to build software at scale, with many integrations, will tell you: it’s hard to move very fast and not break things, even if you already have, in isolation, the pieces of the puzzle you need.

What I want to explore here is what simple principles need to be in place to enable a small technical team to rapidly iterate, with both existing services and integrations & new ones, with confidence that their existing system will not regress, and that new workflows and features can be deployed safely and quickly to end users.

Technical Principles & Practices

In many respects, our ability to respond rapidly to our healthcare user-base in the face of Coronavirus has not actually been about what we have done in the last few days; it has been much more about what we have done in the months and years before it, creating the platform & infrastructure on top of which it was possible for us to accelerate when we needed to. You can only deliver rapid change if you are very confident about the foundations you are standing on; if you are not, you are at risk of hurting your user-base with regressions and service quality issues. And if you don’t trust what you’re standing on, your only option is to move very slowly.

This list, whilst very far from complete, is a summary of the set of principles we have adopted as an Engineering team which have helped us build quickly, and which have put us in a position where we can respond well in times like this. Any one of these items demands a lot of unpacking to be put into practice well; the aim here is to spark interest if there are ideas you’re not familiar with, or perhaps wouldn’t think to put in this list. Some of these you will probably disagree with, and you may have a set of experiences which carry more wisdom than I have, so please comment and let me know! I’m sure I will look back and disagree with some of these in time to come, as I know we’ve got lots to improve on, but here we go…

Trunk is always releasable

At any time, when an Engineer commits a change, the affected components or services should be able to be deployed immediately, releasing that value rapidly. Whether it’s a new Feature or a bug fix, there should, by default, be no reason not to push that change to Production. You shouldn’t need to wait for the quarterly Release train, or to put a branch through days or weeks of manual testing. If it’s in Trunk, it’s already been unit tested and signed off; all the automated tests are Green, and it can go out the door. There are some exceptions, of course: sometimes there are breaking changes which need to be released carefully. But, as a default, when code is committed, it should be able to be deployed, releasing value immediately. If you do need to make a breaking change which will stop components being releasable: plan it well, communicate even better, and get back to an always-releasable state rapidly.

This principle need not just apply to trunk-based development; the same principle can (and ideally should!) be in play with git-flow and other branching models… although this can get more murky if you have long-lived Release branches…

In short, everyone wins if your code-base is always deployable and not frequently in a blocked state or needing weeks of manual QA to get the green-light.

Tests: the foundation of your code base

In order for any Engineer to be able to commit a change and release rapidly, you have to rest your weight on your tests. Follow the test pyramid (Martin Fowler summarises it here, helpfully as ever): have a wide, deep base of Unit Tests, some integration tests and a handful of judicious End to End & regression tests. Any Engineer needs to be able to modify or extend functionality, relying on the existing tests to guarantee existing behaviours are not regressed, and on new ones to capture the new behaviours. At that point, when the commit lands, it can be shipped without expensive and unscalable manual QA. If there are bug fixes, they come with tests. If an Engineer makes a mistake when making a change (pre or post review) and the Team thinks there’s a way tests could enforce some system invariant which will stop another Engineer making the same mistake, add the tests, even if they are reflection tests over the code base. Yes, there should be responsible manual testing & acceptance criteria review in the feature branch, but once the code is merged, it’s the automated tests which ensure there are no regressions and that new behaviours keep working from there on.
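
To give a flavour of what resting your weight on your tests looks like, here is a small, entirely hypothetical TypeScript example in a Jest style: a single-purpose message-building helper with unit tests that pin down its behaviour, so any future change is checked against it automatically.

```typescript
// messageBuilder.ts - a small, single-responsibility helper (hypothetical example).
export function buildPatientMessage(practiceName: string, body: string, link?: string): string {
  const parts = [body.trim()];
  if (link) parts.push(link);
  parts.push(practiceName); // sign off with the sending practice
  return parts.join("\n");
}

// messageBuilder.test.ts - Jest-style unit tests that pin down the existing behaviour.
import { buildPatientMessage } from "./messageBuilder";

describe("buildPatientMessage", () => {
  it("appends the practice name as a sign-off", () => {
    expect(buildPatientMessage("Example Surgery", "Please book your review."))
      .toBe("Please book your review.\nExample Surgery");
  });

  it("includes a link on its own line when one is provided", () => {
    const message = buildPatientMessage(
      "Example Surgery",
      "Join your video consult:",
      "https://example.org/room/abc"
    );
    expect(message.split("\n")).toEqual([
      "Join your video consult:",
      "https://example.org/room/abc",
      "Example Surgery",
    ]);
  });
});
```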

As a concrete example, 1 Engineer built out our Coronavirus survey over that weekend, needing to touch 1 installed and 3 cloud components. He had only worked a little on the Surveys feature before, but the previous Engineers who built it had developed excellent unit and end-to-end tests which gave 100% confidence that existing functionality was maintained. So an Engineer unfamiliar with the area could deliver a change quickly, without worrying about regressions, and with a well-established testing pattern which gave confidence that the new tests he added covered the surface area they should.

If tests are not a first-class concern in your ecosystem, delivering change quickly and reliably is very difficult.

Code Review: strength in numbers

Feature changes, bug fixes, refactors… by default, put it through peer code review. Why? Code Quality, Knowledge Sharing (both ways!), ensuring test coverage, logging & monitoring concerns, building relationships with colleagues, checking ACs (Acceptance Criteria), celebrating good work… the list goes on. Developing software is a Team activity; we should embrace collaboration, and code review is a key way to do that. The process should be very responsive & it doesn’t need to be painful: build it into your team culture and work at it. A younger (slightly less bald) version of me wrote something more substantive on this here, so I’ll say no more…

Deployment: make it fast & simple

Even if Trunk is always deployable, it is essential that it’s simple for a Team to deliver changes into Production quickly and with minimal fuss and ceremony. Whether it’s an exciting new feature, a behavioural tweak or a bug fix, the time from commit to running in Production should be as small as possible. That’s not to say every change has to be deployed immediately, but it should be the case that, as much as possible, it can be.

If there are custom scripts or steps which need to be performed, embody them in a process which is ideally automated, and if not, at least clearly protocol-ised. There are many excellent commercially available CI/CD tools to help here, make use of them and drive the time to deployment, and the complexity of the process, down.
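
As a sketch of what “automate it, or at least protocol-ise it” can mean in practice, here is a hypothetical release script in TypeScript; the individual commands are placeholders, and the point is simply that the steps live in version-controlled code rather than in someone’s head.

```typescript
// release.ts - hypothetical release script; the actual commands are placeholders.
// Run from CI (or locally) so the deployment steps live in code, not in someone's head.
import { execSync } from "child_process";

const steps: { description: string; command: string }[] = [
  { description: "Run the full automated test suite", command: "npm test" },
  { description: "Build production artifacts",        command: "npm run build" },
  { description: "Deploy cloud components",           command: "npm run deploy:cloud" },
  { description: "Publish the desktop installer",     command: "npm run deploy:installer" },
];

for (const step of steps) {
  console.log(`==> ${step.description}`);
  // Fail fast: execSync throws if any step exits non-zero, aborting the release.
  execSync(step.command, { stdio: "inherit" });
}

console.log("Release complete.");
```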

We are a small team of around 10 Engineers, and have worked hard to make sure any Engineer, even one who has just joined the team, can release all components without any special knowledge or input: just a simple guide, and some good dev-ops tools. We may need more dedicated DevOps in the future, but it’s good to aim for a process simple and quick enough that even new hires can do it without support. It won’t come for free, but the dividends from investing in it, both for the Team and your users, are very large.

We are in the interesting position where some of our core product is delivered through an installed Desktop application (due to restrictions on how we access key 3rd party integrations), so our deployment story doesn’t stop at simply having code and binaries deployed in the cloud; it is complete when our users are running the latest code on their machines. We deploy to O(100k) machines in the NHS (and growing fast), running with various user permission levels, a multitude of anti-virus behaviours and framework versions (joy of joys… 😖), so we have put a huge amount of effort into making sure those machines upgrade quickly and reliably when we release a newer version (which is multiple times a week, sometimes every day). We aim to have 90%+ of logged-in machines upgraded within the first few hours, with near 100% upgraded by the start of the next day. The details of how we do this, with various different steps and fall-backs when Windows installers don’t play ball, are a topic for another time, but suffice to say it’s not an after-thought: it’s a central part of our deployment story, and we keep trying to improve it.

Monitoring: you’re shipping a service, not just code

From being actively notified about bugs, to detecting service quality issues: monitoring is completely essential to the service you provide your users. Whenever we make a Product change, one of our overriding concerns is “how will we know when (not if!) this goes wrong?”. At any sort of scale, your users will put your application into a set of states you simply did not imagine when building and testing it. Sure, better Engineers will foresee more, but it’s hard to drive this to zero. If your application gets into a state you don’t expect it to be in, make sure your logging tells you that you are there, and, as best as possible, gives you some context which will help you reproduce it! When those logs go off, make sure your notification channel is clear and you have an established mechanism as a team for actioning and resolving these issues. There will be a tendency for the quality of this channel to degrade over time if logs are not actioned, fixed, downgraded or filtered. This absolutely must be actively owned by the team, or you will find yourself increasingly flying in the dark about the real experience your users are getting. A big proportion of the bug fixes we ship as a team come from logs highlighting obscure edge cases, race conditions and sometimes silly logical errors which escape code review and unit testing; we try to fix-and-ship well before users notice and report. (We use sentry.io to process our error logs, with dedicated Slack channels for notifications per component; this may not scale forever, but it does us very well for now.)
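
As a small illustration of logging unexpected states with context, here is a sketch using the @sentry/node SDK; the handler, the survey types and the extra fields are invented for illustration.

```typescript
// Sketch of "unexpected state" logging with the @sentry/node SDK.
// The handler and context fields here are invented for illustration.
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

type SurveyResponse = { patientId: string; surveyId: string; answers: unknown[] };

function writeToMedicalRecord(response: SurveyResponse): void {
  /* ...integration with the record system... */
}

export function handleSurveyResponse(response: SurveyResponse): void {
  try {
    if (response.answers.length === 0) {
      // A state we never expected to reach: log it loudly, with enough
      // context to reproduce, rather than failing silently.
      Sentry.withScope((scope) => {
        scope.setExtra("surveyId", response.surveyId);
        Sentry.captureMessage("Survey response received with no answers");
      });
      return;
    }
    writeToMedicalRecord(response);
  } catch (err) {
    // Unhandled errors go to Sentry too, which notifies the team's Slack channel.
    Sentry.captureException(err);
    throw err;
  }
}
```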

Effective service monitoring (both of your own services, and the 3rd party integrations you depend on), from high-level server health checks to request-level performance monitoring, is a core activity of a Team delivering a Product to end users. A team should not just ship the code and drop it over the wall; they should also deliver the dashboards and alerts which tell them (and the wider business) how their components are performing in the wild. This topic clearly demands a lot more unpacking on its own, and there are a range of opinions about how responsibilities transition between Developers and DevOps, and at what team size that happens. Suffice to say, however you distribute the responsibilities, your users don’t care who wears the various hats; they do care that your Service is highly reliable and that you are preemptive in how you respond to issues. So however your organisation delivers it, invest well in your monitoring and make sure it’s not an after-thought!
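
At the simplest end of the spectrum, a health-check endpoint that a dashboard or uptime monitor can poll. This is a generic, hypothetical Express sketch rather than anything from our stack:

```typescript
// Hypothetical health-check endpoint for a cloud component, using Express.
// An uptime monitor or dashboard polls this to confirm the service (and its
// critical dependencies) are reachable.
import express from "express";

const app = express();

async function databaseIsReachable(): Promise<boolean> {
  // Placeholder: in a real service this would ping the database or another key dependency.
  return true;
}

app.get("/health", async (_req, res) => {
  const dbOk = await databaseIsReachable();
  if (!dbOk) {
    res.status(503).json({ status: "degraded", database: "unreachable" });
    return;
  }
  res.status(200).json({ status: "ok" });
});

app.listen(3000, () => console.log("Health endpoint listening on :3000"));
```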

Security: you owe it to your users to do it right

The endless stream of embarrassing and damaging security breaches which hits the press on an almost daily basis is, in the main, entirely avoidable. Sure, there is Stuxnet, but that’s probably not the kind of attack your products and services are going to encounter. Perhaps more so than anything else on this list, doing security well is a vast discipline, and your business will either need dedicated staff, or input from outside professionals. If you haven’t got that, start getting it sorted. But while you are waiting for them to reply to the request for help you just sent… check out the OWASP top 10 (SQL injection still top, of course; some things never change…) and at least get familiar with these. Some key steps we take as a business include: regular package, dependency & framework upgrades (so we never find ourselves months behind a key security patch), detailed 3rd party penetration tests multiple times per year (with immediate response to any discovered vulnerabilities!), a heavy reliance on off-the-shelf security libraries to manage authentication, automated testing of endpoint security, and active (and growing) monitoring to highlight suspicious user behaviour. What is appropriate for your Products and Services will depend a lot on your context, but whatever that is, you owe it to your users to do it right and to care about it.
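
Taking just the top OWASP item as an example: the defence against SQL injection is as unglamorous as using parameterised queries everywhere. A minimal sketch with the node-postgres (pg) library; the table and query are made up purely to illustrate the pattern.

```typescript
// Parameterised queries vs string concatenation, using node-postgres (pg).
// The table and query here are invented to illustrate the pattern.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// DON'T: concatenating user input into SQL invites injection, e.g.
// pool.query(`SELECT * FROM patients WHERE nhs_number = '${nhsNumber}'`)

// DO: let the driver bind the value as a parameter.
export async function findPatient(nhsNumber: string) {
  const result = await pool.query(
    "SELECT id, name FROM patients WHERE nhs_number = $1",
    [nhsNumber]
  );
  return result.rows[0] ?? null;
}
```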

Feature Flags: first class citizen not after-thought

I used to think Feature flags were quite messy and a bit hacky… but I’ve increasingly come to see that, when built as a central concept in your system, they are incredibly powerful. They let you test product changes with key engaged users before wider roll-out, keep branch life short (as new features can be in your code-base and sitting in Production before they are activated), and do staged roll-outs and performance tests with your real systems, all while maintaining complete control of code which is running remotely. You can largely avoid risky “big-bang” releases where big new changes go out the door to everyone at once, and it’s much easier to keep Trunk always releasable. They should not be used as a blank cheque to ship badly tested code live and deal with it later, but don’t underestimate how helpful they can be in helping your Team to move quickly and safely.
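
Here is a minimal sketch of a feature flag with a staged (percentage-based) roll-out; the in-memory flag store and the bucketing logic are hypothetical stand-ins for whatever flag service or config your system actually uses.

```typescript
// Hypothetical feature-flag check with a simple percentage roll-out.
// The flag store is in-memory here; in reality it would be a service or config
// the team can change without redeploying.

interface FlagConfig {
  enabled: boolean;
  rolloutPercentage: number; // 0-100: staged roll-out to a slice of practices
}

const flags: Record<string, FlagConfig> = {
  "video-consult": { enabled: true, rolloutPercentage: 10 },
};

// Deterministic hash so a given practice always lands in the same bucket.
function bucketFor(id: string): number {
  let hash = 0;
  for (const char of id) hash = (hash * 31 + char.charCodeAt(0)) % 100;
  return hash;
}

export function isFeatureEnabled(flagName: string, practiceId: string): boolean {
  const flag = flags[flagName];
  if (!flag || !flag.enabled) return false;
  return bucketFor(practiceId) < flag.rolloutPercentage;
}

// Call site: new code ships to Production "dark" and is activated per practice.
if (isFeatureEnabled("video-consult", "practice-123")) {
  // show the Video Consult button
}
```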

If it’s not S.O.L.I.D — it will crumble

Software Design principles really matter. You can sort of get away without them if you want to build a small, single-use application, where you know the spec up front and it’s never gonna change. However, the best products are built to change & respond as the Teams which deliver them learn more and more about their user-base & their needs. So if you are looking to keep iterating, to be able to re-cut your components and services and recombine them to deliver an ever-more useful product, whilst always being releasable and having a growing Engineering Team, you will need to pay attention to the hard-learned lessons of those who have gone before.

As a concrete example, the reason we could drop Video Consult into our Product in a couple of days was that the existing components and services we relied on were Single Responsibility (the S of SOLID), and so could easily be re-used and plumbed together quickly in new ways as well-tested, reliable blocks. The adoption of Dependency Inversion (the D of SOLID) meant that a few new interfaces could be added and mocks in existing unit tests could be extended with new assertions, and using an IoC container meant registering new concrete implementations required very little grunt work or plumbing.
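
To show the shape of that seam (with invented interfaces and names, not our real code), here is a stripped-down TypeScript sketch of Dependency Inversion: the message sender depends on an abstraction for link generation, so adding Video Consult links means registering a new implementation rather than rewriting the sender.

```typescript
// Invented illustration of Dependency Inversion: the sender depends on an
// abstraction, so new link types (e.g. video consult rooms) plug in without
// touching existing, well-tested code.

interface ILinkGenerator {
  generateLink(patientId: string): Promise<string>;
}

class SurveyLinkGenerator implements ILinkGenerator {
  async generateLink(patientId: string): Promise<string> {
    return `https://example.org/survey?patient=${patientId}`;
  }
}

// Added for Video Consult: a new implementation, not a change to the sender.
class VideoConsultLinkGenerator implements ILinkGenerator {
  async generateLink(patientId: string): Promise<string> {
    return `https://example.org/room/${patientId}`;
  }
}

class PatientMessageSender {
  // The dependency is injected (normally via an IoC container registration).
  constructor(private readonly linkGenerator: ILinkGenerator) {}

  async send(patientId: string, body: string): Promise<string> {
    const link = await this.linkGenerator.generateLink(patientId);
    return `${body}\n${link}`; // hand off to the existing messaging pipeline
  }
}

// In tests, ILinkGenerator is trivially mocked; in production the container
// decides which concrete implementation is registered.
const sender = new PatientMessageSender(new VideoConsultLinkGenerator());
```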

You won’t necessarily know you need these things until you want to move quickly with a steady footing, and without them you will find introducing change painful, slow and error-prone. Applying these (and other) principles need not be complicated: simple concepts applied diligently go a long way. (There are plenty more Design Principles than SOLID, of course; if you’re not that familiar, start here and keep exploring…)

Diligent Team Players over Rogue Superstars

Building a software Product is a team activity more than many people realise. And the fabric of that team needs to be nurtured and defended as much as the Product itself. There are lots of aspects here, from how you do Team ceremonies constructively and inclusively, to how you on-board people and hold people to account. But above all, it is important that each and every person in the team respects and supports the agreed processes and practices. For example, it can’t be the case that your most experienced or senior Engineers don’t write tests, or your most junior Engineers don’t engage in code review. Whilst there are different roles and different focuses, the core invariants in your Team need to be respected by all. I’d take a diligent team player with slightly less “raw talent” over a rogue prodigy any day.

The Golden Rule: leave the code cleaner than when you started

Leave the code in a better place than when you found it. Little in the universe seems to suffer entropy and a general decline into disorder quite like a code base where people don’t keep improving it. As a team, all agree that any block of code should be in a better state after a change than before, stick to it, and take pride in it. Code review is a key time to check this — if a file/class/method is touched, make sure it ends up easier to understand and cleaner.

As a concrete example: a few weeks ago, routine code review highlighted that how we were building messages from Clinicians to send to Patients could be centralised (it was unhelpfully split across a couple of components). That refactor was executed (and rolled out) as part of a related change, following the principle of leaving the code in a better place than where it started. This was completely essential, though we didn’t know it at the time, to building Video Consult quickly, as it meant that dropping in some new link generation was entirely trivial. Before this routine refactor, it would have been quite a bit more expensive and prone to inconsistencies between different layers.

It’s not always possible to pay down technical debt in every change, but it should be routinely part of what you do… you never know when it will pay dividends!

That’s all for now — there’s lots more you could put on this list, but these are my top ones.

I hope that’s useful to you and the teams you work in…
