S3 Region Failover and Alerting
Once upon a time, there was an AWS S3 outage for the US-EAST-1 region which lasted ~5 hours. This led to cascading outages for many online services. Sites became unavailable. Financial services were interrupted causing significant monetary loss. In Invoca’s case, some calls were negatively impacted since some IVR voice prompts were not cached across all servers and could not be downloaded from S3 in realtime. However, we were able to quickly remedy the issue by syncing our IVR voice prompt cache across all relevant servers. Calls are our lifeblood and any negative impact is unacceptable.
Official outage response: https://aws.amazon.com/message/41926/
The moral of the story was: use cross-region replication. If one region fails, fall back to another region which has the same assets replicated to it.
Sounds easy enough.
In the event of an S3 request failure, retry the request against a different region from a defined set of available regions.
How do you know if your backup regions are working?
Rather than wait for a request failure to try a different region, it is better to rotate through regions across requests. This can be done by persisting the last region utilized and rotating the region used for each request, or always starting at a random index in your region array.
This allows for region variance across requests even if no failure occurs. Put another way, there should be no distinction between primary and backup regions. The problem with emergency backups is that a faulty backup might not be discovered until it is actually needed, which is precisely the wrong time to make that discovery.
What happens if more than one region fails?
Keep track of the individual retry attempts for a particular request. Once the retry attempts equal the total number of configured regions, it’s time to give up and raise an exception.
retry_attempts = 0
rescue *NETWORK_ERRORS => ex
retry_attempts += 1
Pro tip: The modulo operator is rather handy for iterating through a circular array as it allows the index to automatically reset to 0 once the end is reached.
(s3_region_index + 1) % s3_region_configs.count
Alerting when S3 requests are infrequent
For a certain use case of ours, individual servers attempt to download and cache an asset from S3 intermittently. There’s no set pattern for how often these cache attempts occur, it is based on variable load and customer configuration.
Since these cache attempts are not reliably frequent, we use a gauge metric for alerting. Gauges persist the last reported value which allows them to be used as a toggle-switch. A failure gauge is considered OFF when set to zero and ON when set to one.
Currently, we have a failure gauge configured for each region.
To prevent connectivity problems seen by individual machines from triggering false alarms, we check to see if 3 or more servers have experienced a failure for a particular region before firing a warning alert.
Also, if a single server has experienced failures for ALL configured regions, then we fire a critical alert.
In the case that all regions fail for a request, we end up triggering both an exception (to help with debugging) and an alert (to proactively investigate the issue).
Use Cross-Region replication on your S3 buckets, test your backups, and have an explicit alerting strategy.