Disasters & Platform teams — How to deliver applications you can be confident in.

Harry Morgan
Engineering at Birdie
6 min read · Jan 11, 2023
Photo by NASA on Unsplash

No one likes to think about the worst case scenario.

Bringing an umbrella in case it rains, clicking save on your document every 5 minutes, or looking both ways before crossing the road. These behaviours become natural over time; they’re learned from years of experience and from knowing what might happen if we don’t do them.

Disaster Recovery is a practice that attempts to mitigate the impact of disasters on a system. At Birdie, a disaster could be any kind of event that disrupts a modern distributed system.

What is a “disaster”?

At Birdie, a disaster is something that could stop us providing our service to our customers — care providers, care managers and, ultimately, the vulnerable people who rely on our system. That could be anything from a natural disaster damaging data centre infrastructure, to a malicious attack by a hacker, to simple human error in development. These events are rare, but one of sufficient scale could bring most businesses in its path to a halt.

We cannot tolerate the consequences of these events. They would mean loss of data, leaving carers at a loose end when arranging their visits or logging important information. Ultimately, that would have a negative impact on the lives of carers and care managers, which is the very opposite of our mission.

Recovering from disasters

Disaster Recovery (DR) is the practice of having tools in place that we can rely on to ensure Birdie will still exist if the worst happens. It’s a goal that engineering organisations build towards.

What’s important to note here is that Disaster Recovery is very much a moving target: no Platform team at any business can say it has “completed disaster recovery”. It’s a set of practices that move with your organisation and evolve as your engineering organisation grows and progresses.

At Birdie these practices originate with The Owls (the Infrastructure & Security team). We develop and advocate for tooling to achieve our best-effort Disaster Recovery process. These considerations can take many forms and span the whole development lifecycle. A few technical areas to note:

Application code

  • Can we rely on our application code to deploy reliably onto infrastructure it doesn’t usually run on?
  • Are there any hardcoded parameters in our production environment that we’d miss, or configuration that was applied manually and never documented? (A small sketch of environment-driven configuration follows this list.)
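
For illustration, here is a minimal sketch of what “no hardcoded parameters” can look like, assuming a Node/TypeScript service. The variable names are placeholders rather than anything Birdie actually uses; the point is that every deployment-specific value comes from the environment, and the service fails fast at startup if one is missing.

```typescript
// config.ts: a minimal sketch of environment-driven configuration.
// DATABASE_URL, AWS_REGION and LOG_LEVEL are illustrative names, not real settings.

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    // Failing fast surfaces missing configuration at startup,
    // rather than halfway through a recovery into an unfamiliar environment.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

export const config = {
  databaseUrl: requireEnv("DATABASE_URL"),
  awsRegion: requireEnv("AWS_REGION"),
  // Optional values get an explicit, documented default instead of
  // hardcoded fallbacks scattered through the codebase.
  logLevel: process.env.LOG_LEVEL ?? "info",
};
```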

CI/CD

  • Will our deployment practices slow us down in the event of a disaster? Are there any secrets or variables being injected into code that can’t be replicated to other environments or regions?
  • Can we rely on our pipelines to be successful in the event of a disaster? How confident are we in builds succeeding?
  • Are our deployment practices reproducible locally? What if our CD runners aren’t available? (See the pre-flight sketch after this list.)
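
One lightweight way to keep deployments reproducible is a pre-flight script that lives in the repository and runs the same way on a laptop as in CI. This is a hypothetical sketch; the variable names are placeholders, not our actual pipeline configuration.

```typescript
// scripts/preflight.ts: a hypothetical pre-deploy check, runnable locally or in CI.
// AWS_REGION, IMAGE_TAG and DEPLOY_ROLE_ARN are example names only.

const REQUIRED = ["AWS_REGION", "IMAGE_TAG", "DEPLOY_ROLE_ARN"];

const missing = REQUIRED.filter((name) => !process.env[name]);

if (missing.length > 0) {
  console.error(`Pre-flight failed. Missing variables: ${missing.join(", ")}`);
  process.exit(1);
}

console.log("Pre-flight passed: all deployment variables are present.");
// Because the check is a plain script rather than logic buried in a CI vendor's
// configuration, it still works from a laptop if the runners are unavailable.
```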

Infrastructure

  • Is our infrastructure maintained in code, and can it be replayed at will?
  • Are all resources in our infrastructure available in regions we plan to deploy into?
  • Are there regular backups available, and are we confident in our ability to restore from them? (A sketch of an automated backup check follows this list.)
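
On the backups question, confidence comes from checking rather than hoping. Below is a sketch of a scheduled check that recent automated snapshots exist, assuming an RDS database and the AWS SDK for JavaScript v3; the instance identifier is a placeholder.

```typescript
// A sketch of a scheduled backup check. Assumes an RDS database; "production-db"
// is a placeholder identifier.
import { RDSClient, DescribeDBSnapshotsCommand } from "@aws-sdk/client-rds";

const client = new RDSClient({ region: process.env.AWS_REGION });

async function assertRecentSnapshot(dbInstanceIdentifier: string): Promise<void> {
  // List the automated snapshots RDS has taken for this instance.
  const { DBSnapshots = [] } = await client.send(
    new DescribeDBSnapshotsCommand({
      DBInstanceIdentifier: dbInstanceIdentifier,
      SnapshotType: "automated",
    })
  );

  // Alert if nothing has been taken in the last 24 hours.
  const dayAgo = Date.now() - 24 * 60 * 60 * 1000;
  const recent = DBSnapshots.some(
    (snapshot) =>
      snapshot.SnapshotCreateTime && snapshot.SnapshotCreateTime.getTime() > dayAgo
  );

  if (!recent) {
    throw new Error(`No automated snapshot in the last 24h for ${dbInstanceIdentifier}`);
  }
}

assertRecentSnapshot("production-db").catch((error) => {
  console.error(error);
  process.exit(1);
});
```

Checking that snapshots exist is only half the story, of course; real confidence comes from regularly restoring one into a scratch environment and verifying the data.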

End-User

  • What will the impact on the end-user be if we have to recover from a disaster? Will their data persist? Will our DNS records change? (See the DNS sketch after this list.)
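
On the DNS question, much of the end-user impact comes down to whether records can be repointed quickly. Below is a sketch of a failover update, assuming Route 53 and the AWS SDK v3; the hosted zone ID and hostnames are placeholders.

```typescript
// A sketch of repointing a record at a recovery region. Assumes Route 53;
// the zone ID and hostnames below are placeholders.
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

// Route 53 is a global service; the region mainly determines the API endpoint.
const client = new Route53Client({ region: "us-east-1" });

async function failover(hostedZoneId: string, recordName: string, target: string) {
  await client.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: hostedZoneId,
      ChangeBatch: {
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: recordName,
              Type: "CNAME",
              // A short TTL means clients pick up the new target quickly,
              // so end users see a brief blip rather than a prolonged outage.
              TTL: 60,
              ResourceRecords: [{ Value: target }],
            },
          },
        ],
      },
    })
  );
}

failover("Z0000000EXAMPLE", "api.example.com", "api.eu-west-2.example.com").catch(
  console.error
);
```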

Again, it’s important that all of these aspects are considered when developing processes for disaster recovery. But these responsibilities don’t sit with the platform team alone; there needs to be buy-in across the wider organisation. Which raises another question for developers: how can we ensure the burden of DR is shared?

Sharing the love

Everyone is very busy.

We’ve all got roadmaps filled to the brim until Q1 2050 and requirements flying in left, right and centre. How can an engineering team be expected to plan for heavy, slow investigations and possible redevelopment? This is the kind of work where the benefit can seem ambiguous. Building and tearing down a deployment, untying hard constraints in pipelines, ensuring container images are backed up and updated regularly and that environment variables aren’t hard-coded — this is work that drags, and once you’re done, you don’t have a feature in front of you or a carer’s need satisfied.

This perceived problem, I believe, comes down to communication and the interactions between platform and development teams. If a platform team considers a development team its customer, the problem gets reframed slightly — and this reframing can be really important in delivering a robust disaster recovery setup.

On one side of this relationship, a platform team can focus on delivering tooling and abstractions that help development teams understand their role in disaster recovery. In a developer-is-the-customer frame, these products are delivered to engineering teams and, as with any client relationship, feedback can be given and time can be spent pairing on implementations and communicating expectations.

This means engineering teams and platform teams aren’t expected to “solve” disaster recovery by themselves; they work with shared tooling and requirements from which they can plan the steps to make services resilient. Following this, the onus of prioritisation is on the product managers and developers within a team (another hard thing), but the work is less abstract — it’s no longer “DR for 2 sprints”; it can be thought of more as “continuous improvement” work.

And that, I think, is an important pattern to adopt in engineering.

We’ve already seen the rapid rise of shift-left methodologies, with practices like DevOps and DevSecOps moving more of the operations side of engineering into the development lifecycle. Concepts like deployments, infrastructure and security are now commonplace within development teams — and their maintenance is considered when planning, because it is work owned and operated by that team. I simply believe building disaster-resilient code (application or infrastructure) is another concept that should be shifted left.

I feel it is important to note at this point that Birdie haven’t nailed this either. We’re currently in the throes of jamming disaster recovery tooling into already-built systems and already-planned backlogs — but with a vision to come out the other side. These tangles can be avoided.

Building for confidence

Disasters aren’t edge cases. Any incident can snowball into a disaster without the proper foresight. But if you’re lucky enough to never see your system fall away from you, what exactly is in it for you?

Well, a lot. But the headlines are:

Confidence and portability.

That doesn’t just mean confidence internally. Being able to answer customers’ questions about recovery time with certainty, should your system be compromised, is a reassurance we must afford people in a world increasingly governed by and dependent upon complex distributed systems.

While portability doesn’t quite strike the same vital chord, it’s a very underrated aspect of a modern system. Having the ability to prop up your application in new environments or regions gives a product agility: complex problems like data residency requirements or international expansion can be navigated with ease, because the bulk of the technical problem has already been solved.

Ultimately, I believe instilling resilience as a core value in an engineering organisation is just as vital as considering security, fast-paced deployments and robust testing when beginning to build infrastructure or applications. It encourages a culture of confidence within an organisation — confidence in deploying consistently and in large volumes, confidence in building out on-call practices, confidence in growing a user base at pace.

So to bring it back to our clickbait-y headline: how do we deliver applications we can be confident in? As far as I’m concerned, it’s a team effort. It’s on everyone to consider resilience and recoverability at every stage of the development lifecycle, to have platform teams deliver tooling and products that demystify and untangle complexity for application engineers, to make deployments lightweight and portable, to practice recovering from disasters, and to understand that disaster recovery isn’t a box you tick, but a ladder you climb, one you’re constantly adding rungs to.

That’s a lot to consider and change. Organisations that aren’t already building for disasters will find this work hard to prioritise and fit in, but if nothing else is taken from this article, I hope engineers and managers might ask themselves: “am I confident that my system could recover from a disaster?”

We’re hiring for a variety of roles within engineering. Come check them out here:

https://www.birdie.care/join-us#jobs-live
