How To Fail Gracefully

We added multi-region support to event ingest

Rodrigo da Silva (Rigo)
When I Work Data
Oct 15, 2021 · 5 min read


Photo by Nachelle Nocom from Pexels

When I Work data ingest receives around 20,000 events per minute. Ingesting that many events requires a robust pipeline. We built that infrastructure in 2017, and it has done a great job with no significant issues. Well, until AWS went down.

Data ingest is the When I Work Data Team’s most critical infrastructure. We had run it since 2017 without multi-region support and without a single problem. We were pretty comfortable with the way it worked, until November 25, 2020. At 9 PM we started to get non-stop alerts. At 10 PM we looked at the logs and found nothing. The AWS status page did not budge. At 1 AM we still had no idea that our data ingest would end up being down for eight hours; nothing like that had ever happened before. “Rare events are rare,” my coworker Kevin Schiroo once said. “The probability of rare events will always be unknown.”

We found that we were not the only company having issues: vacuum cleaners started to malfunction and Amazon doorbells did not ring. The news ran headlines like “AWS outage takes down a big chunk of the internet.” In the end, it was one AWS service having issues: AWS Kinesis. Kinesis is how we scale our data ingest up and down.

Data Ingest Architecture Without Multi-Region Support

Data Ingest architecture (before multi-region support)

Here is how our data ingest worked before the AWS outage. The When I Work app posts events to our ingest endpoint, which points to our API Gateway in region us-east-1. About 32 Kinesis shards receive these events, and the shards scale up or down as volume changes. The Event Validator Lambda reads the events and puts them in the Good Events Bucket; malformed events go to the Quarantined Events Bucket. Our team can then query that data using AWS Athena. This flow is simple and worked well, until AWS us-east-1 went down and we lost our ingest for over eight hours.
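To make that flow concrete, here is a minimal sketch of what an Event Validator Lambda of this shape could look like. The bucket names and the validation rule are illustrative assumptions, not our exact implementation:

```python
# Minimal sketch of an Event Validator Lambda reading from the Kinesis
# stream. Bucket names and the validation rule are illustrative assumptions.
import base64
import json
import uuid

import boto3

s3 = boto3.client("s3")

GOOD_BUCKET = "good-events-bucket"               # hypothetical name
QUARANTINE_BUCKET = "quarantined-events-bucket"  # hypothetical name


def handler(event, context):
    """Sort each Kinesis record into the good or quarantined bucket."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        key = f"{uuid.uuid4()}.json"
        try:
            body = json.loads(payload)
            # Assumed rule: every event needs a type and a timestamp.
            if "event_type" not in body or "timestamp" not in body:
                raise ValueError("missing required fields")
            s3.put_object(Bucket=GOOD_BUCKET, Key=key, Body=payload)
        except ValueError:
            s3.put_object(Bucket=QUARANTINE_BUCKET, Key=key, Body=payload)
```

Quarantining malformed events instead of dropping them is what lets us inspect them later with Athena.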

Why Should We Add Multi-Region Support?

When you go through an event like this, you ask yourself whether you should have avoided us-east-1 in the first place. It is not as if we did not know that most AWS customers use that region; some customers avoid it altogether. If you asked me today whether I would recommend us-east-1 when starting from scratch, I would say yes, but with multi-region support. And that is what we have done!

The Journey to Multi-Region Support

Data Ingest architecture (after multi-region support)

The flow is the same for us-east-1, but now we have us-west-2, with a lever: we can control how much traffic flows through each region. That is worth its weight in gold when issues arise, because you can switch all traffic to us-west-2. The lever is a Route 53 DNS weight, and we keep it at 5% at all times. It is imperative to exercise the process so that when we need it, we know it works.
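For illustration, here is a rough sketch of what flipping that lever can look like with boto3 and Route 53 weighted records. The hosted zone ID, record name, and regional targets are placeholders:

```python
# Sketch of the traffic "lever": two weighted Route 53 records for the
# ingest hostname. Zone ID, record name, and targets are placeholders.
import boto3

route53 = boto3.client("route53")


def set_region_weights(east_weight: int, west_weight: int) -> None:
    """Shift ingest traffic between regions by updating DNS weights."""
    changes = []
    for region, target, weight in [
        ("us-east-1", "ingest-east.example.com", east_weight),
        ("us-west-2", "ingest-west.example.com", west_weight),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "ingest.example.com",
                "Type": "CNAME",
                "SetIdentifier": region,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",  # placeholder hosted zone ID
        ChangeBatch={"Changes": changes},
    )


# Normal operation: 95% east / 5% west. During an outage: 0 / 100.
set_region_weights(95, 5)
```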

The flow for us-west-2 is the following: events flow from API Gateway to four Kinesis stream shards. Each shard sends the events to the Kinesis Receiver Lambda, which puts them into a Kinesis Firehose backed by an S3 bucket. Firehose is a great way to accumulate the events in a safe place and process them after us-east-1 is back to normal. When the events arrive in the Firehose S3 bucket, an SQS queue triggers the Bridge to us-east-1 Lambda. That Lambda pulls the events from the Firehose bucket and delivers them to the Event Validator Lambda in us-east-1, and from there each event follows the normal path. If the Event Validator fails, the SQS queue keeps retrying.
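Here is a minimal sketch of the Bridge to us-east-1 Lambda, assuming illustrative function and bucket wiring; the real payload handoff to the Event Validator is more involved than shown here:

```python
# Minimal sketch of the "Bridge to us-east-1" Lambda running in us-west-2.
# Function name and payload handoff are illustrative assumptions.
import json

import boto3

s3 = boto3.client("s3")
lambda_east = boto3.client("lambda", region_name="us-east-1")

VALIDATOR_FUNCTION = "event-validator"  # hypothetical function name


def handler(event, context):
    """Triggered by SQS when Firehose lands a new object in the S3 bucket."""
    for sqs_record in event["Records"]:
        notification = json.loads(sqs_record["body"])
        for s3_record in notification.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            batch = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            response = lambda_east.invoke(
                FunctionName=VALIDATOR_FUNCTION,
                Payload=batch,
            )
            # Raising here leaves the SQS message on the queue, so it retries
            # if the Event Validator in us-east-1 fails.
            if response.get("FunctionError"):
                raise RuntimeError(f"validation failed for s3://{bucket}/{key}")
```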

AWS Limitations

What was the biggest issue in implementing this new architecture? AWS CloudFront does not allow two edge-optimized API Gateways to share the same address. For that reason, we had to use a regional API in us-west-2. This is unfortunate, as an edge-optimized API performs better when your users are spread around the world.
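For reference, this is roughly how the regional endpoint type is specified when creating the us-west-2 API with boto3; the API name is a placeholder:

```python
# Sketch of creating the us-west-2 API with a regional endpoint instead of an
# edge-optimized one. The API name is a placeholder.
import boto3

apigw = boto3.client("apigateway", region_name="us-west-2")

api = apigw.create_rest_api(
    name="event-ingest-west",
    endpointConfiguration={"types": ["REGIONAL"]},
)
```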

Closing Thoughts and Next Steps

It has been a year since we implemented multi-region support for our data ingest. AWS has not had another significant us-east-1 outage, and everyone is happy about it. Vacuums and Ring cameras are operational. We have not once had to shift 100% of traffic to the alternative region. That raises the question: was it worth adding multi-region support? My answer is yes. We can now sleep well knowing that our data will reach the lake even if us-east-1 goes down again. As Kevin said, “rare events will happen,” and being ready for them is worth every penny.

Our next step is to have Route 53 detect that us-east-1 is down and route the traffic to us-west-2 automatically. It will make things much smoother when us-east-1 goes down; we will not even notice it was down. That may be the best type of architecture: the kind that works so well you forget it exists.
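For a sense of what that could look like, here is a rough sketch using a Route 53 health check on the primary ingest endpoint plus PRIMARY/SECONDARY failover records. The zone ID, record names, and targets are placeholders, and this would replace the weighted-routing lever described above:

```python
# Sketch of the planned failover: a health check on the us-east-1 ingest
# endpoint plus failover records. All IDs, names, and targets are placeholders.
import boto3

route53 = boto3.client("route53")

health = route53.create_health_check(
    CallerReference="ingest-east-health-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "ingest-east.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "ingest.example.com",
                "Type": "CNAME",
                "SetIdentifier": "primary-east",
                "Failover": "PRIMARY",
                "TTL": 60,
                "HealthCheckId": health["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "ingest-east.example.com"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "ingest.example.com",
                "Type": "CNAME",
                "SetIdentifier": "secondary-west",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "ingest-west.example.com"}],
            },
        },
    ]},
)
```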


The When I Work app helps hourly employees check their schedules, switch shifts, and even get paid. It also makes it easy for a manager to make their schedule. If you do not know When I Work, check us out at:


Rodrigo da Silva (Rigo)
When I Work Data

Lead DataOps Engineer at Camber Partners, a learner, a creator, a techie, a health enthusiast, and a believer that consistency leads you to greatness.