Disaster Recovery in a Serverless World — Part 1
Nuatu Tseggai | July 12, 2018
This is part one of a multi-part blog series. In this post we’ll discuss Disaster Recovery planning when building serverless applications. In future posts we’ll highlight Disaster Recovery tests, exercises and the engineering preparation necessary for success.
‘Eat Your Own Dog Food’
Nearly all of Stackery’s backend microservices run on AWS Lambda. That’s not shocking; after all, the entire purpose of our business is to build a cohesive set of tools that enable teams to build production-ready serverless applications. It’s only fitting that we eat our own dog food and use serverless technologies wherever possible.
Which leads to the central question this blog post is highlighting: How should a team reason about Disaster Recovery when they build software atop serverless technologies?
(Spoiler Alert) Serverless doesn’t equate to a free lunch! The important bits of DR revolve around establishing a cohesive plan and exercising it regularly, and both remain important when utilizing serverless infrastructure. But there’s good news! Because engineers aren’t dealing with the minutiae of administering a platform (patching server software, etc.), they have more breathing room to focus on higher-level concerns such as Disaster Recovery, Security, and Technical Debt.
Before we get too far — let’s define Disaster Recovery (DR). In simple terms, it’s a documented plan that aims to minimize downtime and data loss in the event of a disaster. The term is most often used in the context of yearly audit-related exercises wherein organizations demonstrate compliance in order to meet regulatory requirements. It’s also very familiar to those who are charged with developing IT capabilities for mission-critical functions of the government.
Many of us at Stackery used to work at New Relic during a particularly explosive growth stage of the business. We were exposed to DR exercises that took months of work (from dozens of managers/engineers) to reach the objectives set by the business. That experience influenced us as we embarked on developing a DR plan for Stackery, but we still needed to work through a multitude of questions specific to our architecture.
What would happen to our product(s) if any of the services we run in AWS region XYZ (S3, RDS, DynamoDB, Cognito, Lambda, Fargate, etc.) experienced an outage?
- How long before we fully recover?
- How much data loss would we incur?
- What process would we follow to recover?
- How would we communicate status and next steps internally?
- How would we communicate status and next steps to customers?
These questions quickly reminded us that DR planning requires direction from the business. In our case, we looked to our CEO, CTO, and VP of Engineering to set two goals:
- Recovery Time Objective (RTO): the maximum acceptable length of time to restore service, which for us means the time it would take to swap to a second, hot production service in a separate AWS region.
- Recovery Point Objective (RPO): the maximum acceptable amount of data loss, measured in time.
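To make the two objectives concrete, here is a minimal Python sketch that checks a single incident against them. The function name, the timestamps, and the one-hour and fifteen-minute targets are all hypothetical, chosen only for illustration; they are not Stackery’s actual goals.

```python
from datetime import datetime, timedelta


def evaluate_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Compare one incident against the RTO and RPO set by the business.

    All arguments are illustrative: three datetimes describing the incident
    and two timedeltas holding the targets.
    """
    downtime = service_restored - outage_start    # how long recovery took
    data_loss = outage_start - last_good_backup   # writes after the last backup are lost
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical incident and targets.
result = evaluate_objectives(
    outage_start=datetime(2018, 7, 12, 9, 0),
    service_restored=datetime(2018, 7, 12, 9, 45),
    last_good_backup=datetime(2018, 7, 12, 8, 40),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
)
```

In this made-up incident the 45-minute recovery fits within the RTO, but 20 minutes of writes were lost, which misses the RPO.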
To determine these goals, our executives had to consider the financial impact of downtime on the business (both lost business and damage to our reputation). Not surprisingly, the dimensions of this decision will be unique to every business. Your executive team needs to understand why they should be the ones defining the RTO and RPO, and they need to stay engaged in the ongoing development and execution of the DR plan. It’s a living plan and as such will require improvements as the company evolves.
Based on our experience, we developed the outline below, which you may find helpful as your team develops a DR plan.
Disaster Recovery Plan
1. Goals (RTO and RPO)
2. Process
- Initiating DR
- Assigning Roles
  - Incident Commander
  - Technical Lead
3. Communication
- Engineering Coordination
- Leadership Updates
4. Recovery Steps
5. Continuous Improvement
- Lessons Learned
The plan opens by stating our RTO and RPO (see above).
The next section describes the process to follow in the event that it becomes necessary to initiate Disaster Recovery. This is the same process followed during Disaster Recovery Exercises.
The Disaster Recovery procedure may be initiated at the CEO’s request in the event of a major, prolonged outage. If the CEO cannot be reached, DR can be initiated by another member of the executive team.
Roles will be assigned by the executive initiating the DR process.
Incident Commander (IC):
The Incident Commander is responsible for coordinating the operational response and communicating status to stakeholders. The IC is responsible for designating a Technical Lead and engaging additional employees necessary for the response. During the DR process the IC will send hourly email updates to the executive team. These updates will include the current status of the DR process, a timeline of events since DR was initiated, and any requests for help or additional resources.
Technical Lead (TL):
The Technical Lead has primary responsibility for driving the DR process towards a successful technical resolution. The IC will solicit status information and requests for additional assistance from the TL.
Communication is critical to an effective and well coordinated response. The following communication channels should be used:
- Engineering Coordination: The IC, TL and engineers directly involved with the response will communicate in the #disaster-recovery-XYZ Slack channel. In the event that Slack is unavailable, the IC will initiate a Google Hangout and communicate instructions for connecting via email and cell phone.
- Leadership Updates: The IC will provide hourly updates to the executive team via email. See details in separate Incident Commander doc.
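As one way to keep those hourly updates consistent, the Python sketch below assembles the three items the IC is expected to report (current status, timeline, and requests) into a plain-text email body. The function name, the layout, and the sample values are hypothetical; the separate Incident Commander doc remains the authority on what an update should contain.

```python
from datetime import datetime


def format_status_update(current_status, timeline, requests, sent_at):
    """Build the plain-text body of an hourly update for the executive team.

    timeline is a list of (datetime, description) tuples and requests is a
    list of strings. The layout here is an assumption, not a prescribed format.
    """
    lines = [
        f"DR status update ({sent_at:%Y-%m-%d %H:%M})",
        "",
        f"Current status: {current_status}",
        "",
        "Timeline since DR was initiated:",
    ]
    for when, what in sorted(timeline):  # oldest event first
        lines.append(f"  {when:%H:%M} {what}")
    lines.append("")
    lines.append("Requests for help or additional resources:")
    # Say so explicitly when there is nothing to request.
    lines.extend(f"  - {item}" for item in requests or ["None at this time"])
    return "\n".join(lines)


# Hypothetical values for illustration.
body = format_status_update(
    current_status="Restoring datastores in prodY",
    timeline=[
        (datetime(2018, 7, 12, 9, 10), "Status page updated"),
        (datetime(2018, 7, 12, 9, 5), "DR initiated"),
    ],
    requests=[],
    sent_at=datetime(2018, 7, 12, 10, 0),
)
```

Sending the resulting text is left out on purpose, since the mail tooling will differ from team to team.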
The high-level steps to be performed during DR:
1. Update Status Page
2. Restore Datastore(s) in prodY from latest prodX
- Blob Storage
3. Restore backend microservices
- Bootstrap services with particular focus on upstream and downstream dependencies
4. Swap CloudFront distribution(s)
5. Swap API endpoint(s) via DNS
- Update DNS records to point to prodY API endpoints
6. Verify recovery is complete
- Redeploy stack from user account to verify service level
7. Update Status Page
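Because each step depends on the ones before it, the list lends itself to a small runner that executes steps in order and stops at the first failure. The Python sketch below is only an illustration of that idea: `run_recovery` and the placeholder actions are hypothetical, and real actions would call whatever tooling your team uses for restores, CloudFront, and DNS.

```python
from datetime import datetime, timezone


def run_recovery(steps):
    """Run recovery steps in order, stopping at the first failure.

    steps is a list of (name, callable) pairs; each callable returns True on
    success. Returns (completed, timeline), where timeline records when each
    attempted step finished and whether it succeeded.
    """
    timeline = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception as exc:  # a crashed step counts as a failed step
            ok = False
            name = f"{name} (error: {exc})"
        timeline.append((datetime.now(timezone.utc), name, ok))
        if not ok:
            return False, timeline  # later steps depend on earlier ones
    return True, timeline


def placeholder():
    """Stand-in for a real recovery action; always reports success."""
    return True


# Step names mirror the list above; every action here is a placeholder.
STEPS = [
    ("Update Status Page", placeholder),
    ("Restore Datastore(s) in prodY from latest prodX", placeholder),
    ("Restore backend microservices", placeholder),
    ("Swap CloudFront distribution(s)", placeholder),
    ("Swap API endpoint(s) via DNS", placeholder),
    ("Verify recovery is complete", placeholder),
    ("Update Status Page", placeholder),
]

completed, timeline = run_recovery(STEPS)
```

The recorded timeline doubles as raw material for the IC’s hourly updates, and a failed run shows exactly which step needs attention.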
The Continuous Improvement section captures TODO action items and next steps, lessons learned, and how often we’ll revisit the plan and work through the TODO action items.
In the next post, we’ll dig into the work it takes to prepare for and perform DR tests and exercises. To learn how Stackery can make building microservices on Lambda manageable and efficient, contact our sales team or get a free trial today.
Originally published at www.stackery.io.