Dealing with a production incident after midnight
When then-junior software engineer Dominic Fraser spotted a late-night Slack message about a serious problem with Skyscanner’s Flight Search Results page, he was drawn into an investigation which stress-tested — and ultimately showcased — company culture
Skyscanner maintains 24/7 support for many of its services, and, while some of this is covered by having international offices, it is common for engineers to be on call during the night.
This post walks through the first out-of-hours production incident I was involved in as a junior software engineer. The responsibility was not new, but taking on complex problem-solving during the night was!
It covers some specifics of the incident and, by describing the events sequentially, shows how Skyscanner’s culture of autonomy and ownership, regardless of level, is critical to ensuring our travellers receive the experience they deserve from Skyscanner.
On a recent Wednesday evening at 21:22 UTC, Skyscanner’s Flight Search Results page showed a sustained increase in HTTP 500 error responses. In the early hours of the following morning, 9% of our Flight Search traffic was being served a 500 response.
Time to detection: ~2 minutes
The ‘GAP’ (Global Availability and Performance) squad received an automated VictorOps page relating to site availability at 21:22 UTC, alerting their on-call engineer to an issue. The engineer acknowledged the page within 2 minutes and got online to start investigating.
Investigations get underway
Upon seeing that this was an ongoing issue, a message was posted (at 21:44 UTC) in a public support channel. This created visibility and signalled that an investigation was underway, while providing a central point of communication for any other engineer receiving alerts or seeing problems that might be related.
The message was seen by two people: an engineer from the Flight Search Squad (FSS), which owns the Flights Search Results page, who had also received an alert relating to 500 responses for the service and was already investigating; and me, a member of a core web infrastructure squad.
Below is a simplified request flow that shows some elements that will be referenced later. The key point right now, however, is the ‘readiness’ endpoint of the Dockerized Flights Search Results Express application. This is used to determine whether Request Handler should send a request to an instance in a specific AWS region or fail over to another region.
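To make the readiness-based routing concrete, here is a minimal Python sketch. It is not Request Handler's actual logic: the `pick_region` function, the region names, and the idea of a fixed preference order are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_ready: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, whose readiness check passes."""
    for region in regions:
        if is_ready(region):
            return region
    # No region reports ready: the caller has nowhere healthy to send the request.
    return None


# Hypothetical readiness results: the preferred region is failing its check.
readiness_by_region = {"region-a": False, "region-b": True}

chosen = pick_region(["region-a", "region-b"], lambda r: readiness_by_region[r])
print(chosen)  # region-b
```

The last branch is the interesting one for this incident: if every region tried reports "not ready", there is nothing left to fail over to.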
FSS had been alerted to the ‘readiness’ health check endpoint in two AWS regions showing increased failure rates for the Flights Search Results service, with traffic automatically failing over to secondary regions.
Initial investigations validated that the HTTP 500s were all from this single service. Establishing this as early as possible was important: it helped us narrow down possible root causes and decide whether other engineers needed to be paged.
At this point, our Flight Search Squad was also aware of problems occurring in their AWS Redis ElastiCache; its number of current connections had dropped, and extreme fluctuations were clearly visible.
Drilling down into lower-level metrics now had more direction, and showed that the Travel API (depended upon for geo-data) was returning non-2xx responses.
At this point this information was all being posted in the Slack thread, giving full visibility in real time.
Seeing that the Travel API rate limits were indeed being reached, I noted this in the ongoing thread at ~22:00 UTC. It also seemed to me that manually failing over more traffic to secondary regions was inadvisable, as rate limits would then be exceeded in those regions as well if no other changes were made.
This brings us to the first non-process, non-technical point that stands out to me: I felt safe to contribute.
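The reasoning about failover and rate limits can be sketched with some arithmetic. All numbers here are made up, and the assumption that the limit applies per region is mine; the point is only that moving traffic does not reduce it.

```python
from typing import Dict


def load_after_failover(
    loads: Dict[str, float], failed: str, target: str
) -> Dict[str, float]:
    """Move all of `failed`'s request rate onto `target`; other regions are unchanged."""
    shifted = dict(loads)
    moved = shifted.pop(failed)
    shifted[target] += moved
    return shifted


# Hypothetical figures, in requests per second.
RATE_LIMIT_PER_REGION = 1000.0
loads = {"region-a": 700.0, "region-b": 600.0}

after = load_after_failover(loads, failed="region-a", target="region-b")
over_limit = {region: rps > RATE_LIMIT_PER_REGION for region, rps in after.items()}

print(after)       # {'region-b': 1300.0}
print(over_limit)  # {'region-b': True}
```

Both regions were under the limit before the shift; afterwards the single remaining region is over it.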
By joining the dialogue I was certainly opening myself up to future questions I might not be able to answer, but I knew that the engineers involved exemplified Skyscanner’s blame-free and inclusive culture, and so I felt psychologically safe enough to get involved.
Priority 1 status assigned
As it was evident that there were several layers to be investigated, a video call was started at 22:15 UTC.
In that call, the incident was classified as a P1 (Priority 1) because the sustained HTTP 500 errors were adversely affecting more than 1% of Skyscanner’s users. This prioritisation was important because all P1 incidents at Skyscanner require 24h support from engineering teams regardless of where they are located in the world, i.e. more people might need to get out of bed.
Mitigating the incident
We initiated multiple threads of investigation with the aim of either discovering and fixing the root cause or, at a minimum, mitigating the ongoing issue. Each of the three engineers independently walked through different logs and dashboards, screen-sharing as appropriate. The Travel API’s on-call engineer was paged and asked if their rate limit could be increased.
We saw that the increase in load on Travel API was not completely proportional to the 500s: there were more 500s visible in Request Handler than downstream at Travel API.
The root cause of the cache failure remained elusive, with no logs showing any obvious pattern to indicate what had caused it to go down.
Logs did show writes to the cache failing, and the AWS dashboard showed that the master (primary) and slave (replica) nodes of the failed caches were out of sync with what was specified in the Flights Search Results service. This meant the service would be trying to write to a replica node, which is not possible.
It was not known at this point how they could have gone out of sync, but a redeploy of the Flights Search Results service was initiated with updated values. Thanks to Skyscanner’s internal tooling, once a pull request has been merged an automated chain of events takes over to deploy to production, but this is not instantaneous.
At ~23:50 UTC the Flights Search Results service showed full recovery as the redeploy with an updated cache configuration completed 🎉
However, we were not out of the woods yet… About 10 minutes later multiple regions again showed cache failures, and 500s abruptly spiked.
With additional eyes looking for anything that had been missed, or assumed, the main investigation now uncovered a new piece of information: the Flight Search Results service readiness check included a check for cache health. This provided sudden clarity on the number of 500s seen!
With the cache connections so unstable, traffic was flip-flopping between regions, alternating between instances marked as up or down over short periods of time. This made hitting two ‘downs’ in a row much more likely.
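As a rough illustration of why this mattered, here is a minimal sketch of a readiness check that aggregates dependency checks. The real service is an Express application and its check is not shown in this post, so the function names and the 200/503 status codes below are assumptions.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; report 200 only if all of them pass."""
    results = {name: check() for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results


def app_ok() -> bool:
    return True


def cache_ok() -> bool:
    return False  # simulate the unstable cache connection


with_cache_check = readiness({"app": app_ok, "cache": cache_ok})
without_cache_check = readiness({"app": app_ok})

print(with_cache_check)     # (503, {'app': True, 'cache': False})
print(without_cache_check)  # (200, {'app': True})
```

With the cache included, an otherwise healthy instance reports itself as not ready whenever the cache misbehaves, even though it could still serve requests more slowly.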
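A back-of-the-envelope model shows how quickly this compounds. It assumes, purely for illustration, that each region is independently marked down for some fraction of the time and that a request is tried against two regions before giving up; neither assumption is confirmed by the incident data.

```python
def p_all_down(p_down: float, regions_tried: int = 2) -> float:
    """Probability that every region tried is marked down, assuming independence."""
    return p_down ** regions_tried


# Hypothetical fractions of time that a region fails its readiness check.
stable = p_all_down(0.0)      # healthy cache: a region is effectively never marked down
flapping = p_all_down(0.25)   # unstable cache: each region is down a quarter of the time

print(stable)    # 0.0
print(flapping)  # 0.0625
```

Under these assumptions, regions that each flap a quarter of the time would leave roughly one request in sixteen with nowhere to go.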
Decision time. The cache failure was still unknown so we had to decide whether to continue investigating in the late evening/early morning hours, or simply mitigate traveller impact by surviving without a cache. Latency would of course increase, but responses would at least return to normal and our travellers would no longer be affected.
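"Surviving without a cache" can be pictured with the sketch below, which treats a broken cache like a cache miss. This is one common pattern rather than a description of the Flights Search Results service's code; `CacheUnavailable`, `get_with_fallback`, and the stand-in functions are all hypothetical.

```python
from typing import Callable, List, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache cannot be reached."""


def get_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    fetch_origin: Callable[[str], str],
) -> str:
    """Serve from the cache when possible; treat a broken cache like a miss."""
    try:
        cached = cache_get(key)
    except CacheUnavailable:
        cached = None  # degrade to "no cache" instead of failing the request
    if cached is not None:
        return cached
    value = fetch_origin(key)  # slower path: every miss lands on the dependency
    try:
        cache_set(key, value)
    except CacheUnavailable:
        pass  # a failed cache write should not fail the request
    return value


origin_calls: List[str] = []


def broken_get(key: str) -> Optional[str]:
    raise CacheUnavailable(key)


def broken_set(key: str, value: str) -> None:
    raise CacheUnavailable(key)


def fetch_origin(key: str) -> str:
    origin_calls.append(key)
    return f"geo-data-for-{key}"


first = get_with_fallback("EDI", broken_get, broken_set, fetch_origin)
second = get_with_fallback("EDI", broken_get, broken_set, fetch_origin)
print(first, len(origin_calls))  # geo-data-for-EDI 2
```

The trade-off is visible in `origin_calls`: responses succeed, but every request now reaches the downstream dependency, which is why its capacity and rate limits become the constraint.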
We decided to remove the cache health check in one region, while preparing to realign the cache nodes across all the others. The effect on deploy was immediate: requests spiked by over 15x and, for 15 minutes, ran well over the still-increasing limit.
With the cache nodes realigned and cache check removed from the readiness endpoints, this spike normalised and our global HTTP 500 errors reverted to below alerting limits.
It was now ~03:30 UTC. Having seen Travel API and other downstream dependencies handle the increased load without a cache, it was determined that any further steps could wait until the UK morning. A summary of progress was posted in the Slack support channel and we did a handover to the APAC GAP squad member starting their morning in Singapore.
While not fully resolved, the traveller-facing impact was removed and we could step down from P1 status. Back to sleep. Total time to mitigation: 5h 40 minutes.
The following morning
The following day investigations continued, with a temporary Slack channel set up to keep clear and publicly visible communication organised.
For the engineers who had been up the previous night it was their choice to come in later, leave earlier, or take time off in lieu to make up for the unsociable extra hours worked. Because a clear summary had been left before signing off for the night, work could be picked up immediately by other squad members in the morning, with no time spent waiting for information buried in someone’s head.
Applying the common ‘five whys’ diagnostic approach gradually uncovered the root cause of the problem:
Why were users seeing 500 pages?
- The Flight Search Results service readiness check relies on a healthy cache, and the cache was failing.
Why did the cache fail?
- The CPU limit on a single core was reached. As Redis runs primarily on a single thread, it is bound by a single core’s capacity. When the CPU limit was reached, Redis was no longer able to hold a stable connection with clients.
Why was the CPU limit reached?
- Total load on the Redis process was too great.
Why was the load on the Redis process too great?
- There had been a major increase in the number of set commands, to the extent that the cache node was full. Items therefore needed to be evicted from the cache, and eviction adds to the CPU time needed by the Redis process. As the set commands were continuing at a high rate, the maximum load was continually exceeded.
Why was there an increase in set commands?
- A component with high cache key cardinality (including the date of a flight search in the key) had been released from behind an A/B experiment to 100% of travellers and was hammering the cache. This release was done by an external squad, so was not seen in the Flight Search Results service’s release logs.
We also uncovered why the primary and replica cache nodes were seen to go out of sync. A mistake in the Flight Search Results service’s configuration had the master endpoint set directly to the primary node’s endpoint, rather than the cache cluster’s primary endpoint. When pointing at the cluster’s primary endpoint rather than the individual node, ElastiCache automatically handles changes within the cluster.
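The effect of putting a date in a cache key can be sketched as follows. The key format, routes, and 30-day window are invented for illustration; the real keys are not shown in this post.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical inputs: three routes searched across a 30-day window.
routes = [("EDI", "LHR"), ("EDI", "BUD"), ("GLA", "SOF")]
travel_dates = [date(2019, 1, 1) + timedelta(days=offset) for offset in range(30)]

# Key that (unintentionally) includes the travel date: one entry per route per day.
keys_with_date = {
    f"geo:{origin}:{dest}:{day.isoformat()}"
    for (origin, dest), day in product(routes, travel_dates)
}

# Key without the date: one entry per route, reused for every search date.
keys_without_date = {f"geo:{origin}:{dest}" for origin, dest in routes}

print(len(keys_with_date), len(keys_without_date))  # 90 3
```

Each extra distinct key is another set command and another entry competing for memory, so higher cardinality translates directly into more writes and more evictions.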
- Immediate roll back of the offending component, with changes subsequently made to remove the unintended inclusion of a traveller’s travel dates from the cache key, reducing cardinality.
- Ensure all downstream dependencies can scale to take all Flight Search Results traffic in a single region at peak hours, understand the expected scale-up period, then remove the cache health check from all regions.
Update the Flight Search Results service cache configuration to:
- Point at the cluster’s primary endpoint
- Add additional alerting around CPU Usage, SWAP Usage, Evictions, and Current Connections as recommended by AWS
- Increase reserved memory % to 25 to handle background process as recommended by AWS
- Investigate running Redis with cluster mode enabled
The squad owning the Flights Search service then organised a postmortem. These are blameless retrospective and review meetings open to all engineers within Skyscanner. A debrief is prepared in advance, and iterated on afterwards (based on feedback and supportive contributions) for future reference and learning. This gave the squad an opportunity to share the knowledge they had gained from the incident.
I found it very encouraging to be treated as a peer regardless of seniority — both during and after the incident. Good input was recognised, while my requests for deeper explanations were fielded without judgement in a psychologically safe environment. This definitely set me on a course of jumping on new problems without fear!
About the author: Dominic Fraser
Hi! My name is Dominic Fraser, a traveller turned Product Manager turned Software Engineer. No day is the same and, as this post shows, sometimes no night is either!
I’m based in Edinburgh, but have been lucky enough to visit colleagues in Sofia, London, Glasgow, and Budapest!