EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Learning from Incidents at Expedia Group

Stories of failures leading to improvement

Ben Rogers
Expedia Group Technology

--

An Expedia Group employee in a black T-shirt and headphones studies their laptop screen
Photo: Expedia Group

Incidents at Expedia Group® (EG) provide a window into the inner workings and health of services across a sprawling technology stack. Whether the culprit is legacy infrastructure in specific areas or one of the newer platforms built to consolidate older architectures, problems regularly escalate to the level of an incident, requiring a rapid response from multiple teams. After the dust settles, teams do their best to assess the root causes of what happened and implement corrective actions to reduce the risk of recurrence. From all of these activities, EG has a wealth of learnings that we can share. We’ll summarize the areas where incidents happen most often, with suggestions on how to mitigate the associated risk. More in-depth analysis is provided in the Details section.

Traffic spikes can take the form of malicious bot attacks, good bots (e.g., Google crawlers), and plain old organic traffic growth during mass advertising events (e.g., major event sponsorship). To combat these spikes, the following measures can help reduce the risk of incidents occurring:

  • Ongoing testing in various forms (e.g., pipeline performance testing), along with broad knowledge of the bot mitigation strategies that are already available and in place.
  • Within the space of testing, capacity testing for EG’s write infrastructure does not exist today; with it, Expedia could potentially detect many data store issues early.
  • Quick auto-scaling (fast app startup time) can assist in quickly ramping up capacity to handle increased traffic.
  • Load shedding capabilities built into your services and infrastructure are a useful fail-safe when the measures above have not helped.

Good alerting and monitoring for resources such as CPU, disk, and memory are absolutely critical to ensuring your service’s resiliency and stability.

Sustained efforts to migrate away from legacy services and unmaintained open source projects that sit in the critical path of any workflow, with little operational support and no active development, are critical to lessening the risk of incidents.

Organizational awareness of ongoing incidents in both external and internal contexts through easily accessible channels helps reduce the length of incidents and also minimizes duplication of detection efforts.

Details

Traffic spikes

Spikes in traffic fall into four broad categories:

  • Bot attacks — These are coordinated efforts by actors with varying motivations to either deny service to legitimate users of EG’s services or to scrape data from EG’s valuable set of data related to travel.
  • Crawlers — Legitimate bots, such as Google’s crawlers, can sometimes create spikes in traffic (they are usually well behaved, but not always).
  • Misconfigured/unintentional traffic redirection — Sometimes, services are misconfigured, causing unnecessary traffic loops or spikes. It’s worth noting that services should be able to plan for this and handle these occasional miscues until the issues are resolved.
  • Organic traffic growth — Spikes in these cases typically are legitimate and can be tied back to mass advertising events where users hit EG’s sites or mobile apps simultaneously through a call to action.

For more context on traffic spikes and bot attacks, we can turn to some of the incident analysis performed by Fabian Piau. At the end of 2020, Site Reliability Operations (SRO) and Problem Management (PMT) completed a walkthrough of incident themes from 2020 for Hotels.com™ (part of Expedia Group) with Fabian, who in turn analyzed this data and created a presentation outlining the findings. One of the main takeaways was that bot attacks have caused an increasing number of incidents on the Hotels.com brand, and you can find a lot more detail in Fabian’s Medium post.

In addition to some of the strategies outlined above, the following ideas and recommendations outline other best practices that can help your applications scale to meet these various flavors of traffic spikes.

An Expedia Group employee in a brown T-shirt works on their laptop next to a colleague.
Photo: Expedia Group

Ongoing production capacity testing

One of the most impactful incidents in the history of EG’s vacation rental brand, Vrbo™ (part of Expedia Group), resulted from a major spike in traffic. In January 2019, Vrbo (back when it was still VRBO) became the sponsor of the Citrus Bowl, a mid-to-upper-tier college football bowl game played on January 1st every year. The 2019 edition featured Penn State and Kentucky, drawing over 7.7 million viewers. Viewers were encouraged to download the Vrbo app, and they did so in droves. This caused a massive surge in traffic, overwhelming much of the Vrbo tech infrastructure at the top of the traveler funnel and causing significant outages.

After the day was over and the dust had settled, teams began to review where tuning and fixes needed to be implemented. The tech organization as a whole doubled down on its commitment to perform Peak Capacity Testing in production on a routine basis, which uncovered many issues that were mitigated before they became problems during traffic surges. The results have paid off. Most notably, in 2020, Vrbo remained the title sponsor of the Citrus Bowl; even though the bowl drew 14.3 million viewers, almost double the 2019 edition, the infrastructure performed quite well, with no outages reported (and the same was true in 2021). The main lesson is that regularly subjecting production infrastructure to peak loads is key to ensuring that traffic surges do not overwhelm your applications or infrastructure. It is also critical to stay in touch with your partners in Marketing, especially if advertising windows can be planned for in advance; campaigns such as the recent Expedia brand relaunch can also be major traffic drivers, so it’s always good to stay aware, as best as possible, of major advertising events.
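For teams getting started with this kind of testing, the sketch below shows the basic idea: sustain a fixed request rate against an endpoint and watch the failure count. The target URL and rate are placeholders, and in practice a dedicated load testing tool is usually a better fit than hand-rolled code; this is just a minimal illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    /** Minimal synthetic load driver: sustains a fixed request rate and counts failures. */
    public class PeakLoadSketch {

        public static void main(String[] args) {
            // Placeholder target and rate -- substitute your own service and expected peak.
            String targetUrl = "https://example.com/health";
            int requestsPerSecond = 200;

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(targetUrl)).GET().build();
            AtomicLong failures = new AtomicLong();

            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

            // Fire a burst of async requests once per second to approximate the target rate.
            scheduler.scheduleAtFixedRate(() -> {
                for (int i = 0; i < requestsPerSecond; i++) {
                    client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                          .whenComplete((response, error) -> {
                              if (error != null || response.statusCode() >= 500) {
                                  failures.incrementAndGet();
                              }
                          });
                }
            }, 0, 1, TimeUnit.SECONDS);

            // Report the running failure count once a minute so regressions show up mid-test.
            scheduler.scheduleAtFixedRate(
                    () -> System.out.println("Errors so far: " + failures.get()),
                    1, 1, TimeUnit.MINUTES);
        }
    }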

It’s worth noting that one gap EG has identified in its current capacity testing is within environments that require writes to persistent storage. We have seen a number of incidents where having some type of capacity or stress test would have helped identify the bottlenecks that ended up causing incidents with data stores. However, load and capacity tests in production for services that persist data can have significant adverse effects (e.g., test data that lives on forever, since there is often no good, safe way to delete it from production), and a comparable test/lab environment is cost prohibitive. Efforts continue to close this gap, but EG is always looking for ideas and new talent to come help us solve complex problems like this.

Quick auto-scaling and failover

Have you recently checked how quickly you can scale your app, or maybe even fail over to a different region? Traffic has come back massively for EG in the first half of 2021 as travelers have begun booking again, and we’re starting to see a theme of scalability issues cropping up across tech stacks at EG. In March 2021, a large incident occurred with significant impact on EG’s Hotels.com brand. While the true root cause could not be pinned down, it appears that a network glitch rerouted a significant amount of traffic that would normally go to a caching layer to the backing data stores instead, overwhelming them. Normally, in such a scenario, a failover from one AWS region to another would take place to mitigate the issue. Due to a number of issues, the infrastructure for failing over from the current region to its backup was not fully spun up, and the failover had not been practiced recently. This led to a much longer time to restore (almost 13 hours in total). At EG, our Site Reliability Operations (SRO) team holds Post Incident Reviews (PIRs) for our highest impact incidents. During the PIR for this particular incident, it was determined that teams should practice failovers on a more regular basis in order to recover more quickly when similar failures happen in the future.

Another major incident in March happened in the Vrbo vacation rental brand’s infrastructure. With the rise in traffic in early 2021, Vrbo’s traffic has been spiking and setting new records on a regular basis. Some of the backing APIs and data stores that had handled lesser loads for years started to reach a breaking point, causing intermittent outages up and down the traveler funnel. Lessons learned from the PIR for this incident include performing regular app health checks and reassessing the amount of resources given to an app to make sure you have some headroom. Additionally, cost considerations might come into the picture depending on where your traffic and company’s budget are, so loop in finance partners sooner rather than later to ensure all options can be explored to mitigate the issues. One last point for this particular type of overloading scenario: always be mindful that you are not overwhelming the portion of your infrastructure that cannot scale up quickly. In other words, if your data store runs on fixed assets and has limited possibilities for scaling up, you won’t want to overwhelm it when your apps are quickly spinning up new instances.

Not all incidents highlight shortcomings in the technical infrastructure; sometimes they highlight how well things like auto-scaling are working. A series of incidents in May 2021 with the Hotels.com shopping app was quickly resolved without any human intervention; none of the short incidents during traffic spikes lasted longer than 20 minutes. A quick auto-scaling infrastructure and an app with minimal dependencies and startup time allow impact to be mitigated rapidly. This also gives the development team supporting the application time to fine-tune the application’s resource utilization as part of their normal development cycle, rather than being constantly interrupted to hop on an incident teleconference to diagnose and triage issues.

A person in a blue jacket and dark hoodie works on a laptop in a cafe setting
Photo by Brooke Cagle on Unsplash

Proper resiliency and defensiveness built in

There have been a few examples of incidents where proper resiliency was not built into services to handle traffic spikes. In February 2021, one of Expedia® Partner Services’ (EPS) application teams rolled out a new feature, causing a significant increase in traffic to lodging services and thereby degrading hotel and vacation rental listings on various EG brands and sites. Mitigation of impact was observed after disabling calls to a specific lodging service. In the PIR, we took a deeper dive into timelines, causes, and outstanding risks from the incident. Both the Lodging and EPS teams came to the review with well-prepared documentation and corrective actions already identified, including:

  • Quarterly resiliency exercises
  • Improved documentation for restoring from backups of the affected data store as well as general disaster recovery scenarios
  • Investigation of why some applications were not respecting DNS updates
  • Application-to-application timeout review to ensure synchronization
  • Plans/brainstorming sessions around mitigation of one client causing impact on additional clients
  • Ability to fine-tune rate limiting per client application (see the sketch after this list)
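
To make the last item concrete, per-client rate limiting can be sketched with a library such as Resilience4j (which also comes up later in this post). The limits and the idea of keying limiters by client id below are illustrative assumptions, not EPS’s actual implementation.

    import io.github.resilience4j.ratelimiter.RateLimiter;
    import io.github.resilience4j.ratelimiter.RateLimiterConfig;
    import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
    import java.time.Duration;

    /** Sketch of per-client rate limiting: each client id gets its own limiter instance. */
    public class PerClientRateLimiting {

        // Illustrative default: 100 requests per second per client, reject immediately when exhausted.
        private static final RateLimiterConfig DEFAULT_LIMITS = RateLimiterConfig.custom()
                .limitForPeriod(100)
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .timeoutDuration(Duration.ZERO)
                .build();

        private final RateLimiterRegistry registry = RateLimiterRegistry.of(DEFAULT_LIMITS);

        /** Returns true if the request from this client should be allowed through. */
        public boolean allowRequest(String clientId) {
            // The registry creates (or reuses) a limiter keyed by client id, so limits can
            // later be tuned per client without affecting the others.
            RateLimiter limiter = registry.rateLimiter(clientId);
            return limiter.acquirePermission();
        }
    }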

One other piece of defensive technology is a load shedding strategy. In December 2020, a major incident in lodging across multiple brands exposed this need for the applications maintained by the Property Content Distribution team. A very good primer on how load shedding can work can be found in another post on Expedia Group’s blog. Other good references include Netflix’s blog as well as the Amazon Builders’ Library. While those links contain some sophisticated strategies, even having a simple way to shut traffic off briefly to relieve pressure on systems is a good starting point. This is exactly what the Property Content Distribution team realized, and they took the time to update their runbooks and instructions for shedding load. For teams that already have some form of load shedding strategy in place, enhancing these patterns will lead to increased resiliency.
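
As a rough illustration of that simple starting point, the sketch below sheds load by capping the number of in-flight requests with a semaphore and failing fast on the rest; the cap and the response returned are placeholders, not the team’s actual setup.

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;

    /** Minimal load shedding: cap concurrent work and reject anything above the cap. */
    public class LoadSheddingSketch {

        // Placeholder cap -- size this to what the downstream dependencies can actually absorb.
        private final Semaphore inFlight = new Semaphore(200);

        /** Runs the request if capacity is available, otherwise sheds it immediately. */
        public String handle(Supplier<String> request) {
            if (!inFlight.tryAcquire()) {
                // Shed: fail fast (or serve a cached/degraded response) instead of queueing
                // and letting pressure build up on the systems behind this service.
                return "503 Service Unavailable";
            }
            try {
                return request.get();
            } finally {
                inFlight.release();
            }
        }
    }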

Monitoring for CPU, disk, and memory

CPU, disk, and memory…how close are your services to their limits? In recent incidents, we have seen services run out of resources that previously were not thought to be anywhere near their limits. For example, in one incident in early May, long-running jobs that pull and push indexes from Amazon S3 and copy data over began failing because the destination nodes ran out of disk space. Two simple fixes in this case were adding monitoring for disk space consumption and increasing the amount of disk space available on the nodes.
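
As a minimal sketch of the first fix, free disk space can be exposed as a gauge with a metrics library such as Micrometer; the registry, path, and metric name below are illustrative, and a real service would report to whatever monitoring backend it already uses.

    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
    import java.io.File;

    /** Sketch: expose free disk space as a gauge so an alert can fire before it runs out. */
    public class DiskSpaceMonitoring {

        public static void main(String[] args) {
            // Illustrative registry -- a real service would use the registry tied to its metrics backend.
            MeterRegistry registry = new SimpleMeterRegistry();
            File volume = new File("/");

            Gauge.builder("disk.free.bytes", volume, File::getUsableSpace)
                 .description("Usable space remaining on the volume holding the index data")
                 .baseUnit("bytes")
                 .register(registry);

            // An alert on this gauge (e.g., below 10% of capacity) gives warning well before jobs fail.
        }
    }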

In some of the Vrbo scalability issues discussed previously, and in many subsequent incidents related to one of the major databases in Vrbo, CPU utilization has been running in the 60% range on the server for some time. This has been a known issue, but recently any major influx of transactions from various sources has sparked outages ranging from minutes to hours, impacting a good bit of Vrbo’s infrastructure. The simplest temporary fix in this case was to add more hardware. Beyond that, for each of these incidents, plans are in place to redirect load to read replicas, and longer-term efforts are underway to either decommission the underlying data stores in favor of newer platforms or, at the very least, take even more load away from them. Both cases highlight the need to check where you stand today on resource utilization for your services, especially as traffic continues to come back and increase while travel opens up across the world. Give yourself a little bit of extra room if you can.

Finally, it’s worth noting that each of these ties into items that our Site Reliability Engineering (SRE) team publishes as standards (these standards tie directly back to the AWS Well-Architected Framework and monitoring guidance, suited to EG’s specific needs).

Knowledge of ongoing incidents

While there are many communication vehicles that EG’s Site Reliability Operations (SRO) organization uses to spread the word about incidents that impact EG, it’s worth noting a couple of additional vehicles that really make a difference for incident awareness.

Three people in a conference room look at content or a speaker in front of them and take notes
Photo by Christina @ wocintechchat.com on Unsplash

AWS outages

If you are a consumer of AWS services, the AWS status page is a vital resource for understanding whether outages are happening in the services you rely on. Major incidents such as outages in us-east (with an excellent write-up from AWS available here) have been tough to pinpoint initially as EG teams dive into the problems and assess the state of their services, only to find out later that an AWS outage was the main cause. EG’s SRO group took it upon themselves to create a Slack integration with the relevant AWS status page RSS feeds to notify internal EG stakeholders when outages are happening. The really nice thing about the RSS feed is that it carries status updates in near real time: a message is posted in Slack each time the incident status changes, right up until resolution. SRO also took this a step further and created a separate Slack channel that aggregates the AWS blog and What’s New feeds, giving teams a one-stop shop for updates coming from AWS (it turns out there are a whole lot!).
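
To make the shape of this integration concrete, here is a rough sketch of one way to poll an RSS feed and forward unseen items to a Slack incoming webhook. The feed URL, webhook, and polling interval are placeholders, not the actual SRO implementation.

    import java.io.ByteArrayInputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.HashSet;
    import java.util.Set;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /** Sketch: poll a status RSS feed and forward items we have not seen before to Slack. */
    public class StatusFeedToSlack {

        // Placeholders -- substitute the AWS status feed(s) you care about and your own webhook.
        private static final String FEED_URL = "https://example.com/status.rss";
        private static final String SLACK_WEBHOOK = "https://hooks.slack.com/services/REPLACE/ME";

        private static final HttpClient HTTP = HttpClient.newHttpClient();
        private static final Set<String> seenGuids = new HashSet<>();

        public static void main(String[] args) throws Exception {
            while (true) {
                pollOnce();
                Thread.sleep(60_000); // check for new status updates once a minute
            }
        }

        static void pollOnce() throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(FEED_URL)).GET().build();
            byte[] body = HTTP.send(request, HttpResponse.BodyHandlers.ofByteArray()).body();
            Document feed = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(body));

            NodeList items = feed.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Post each status update only once; the feed keeps updating until resolution.
                if (seenGuids.add(text(item, "guid"))) {
                    postToSlack(text(item, "title") + "\n" + text(item, "description"));
                }
            }
        }

        private static String text(Element item, String tag) {
            NodeList nodes = item.getElementsByTagName(tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
        }

        private static void postToSlack(String message) throws Exception {
            String payload = "{\"text\":\"" + message.replace("\"", "\\\"").replace("\n", "\\n") + "\"}";
            HttpRequest request = HttpRequest.newBuilder(URI.create(SLACK_WEBHOOK))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HTTP.send(request, HttpResponse.BodyHandlers.discarding());
        }
    }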

EG Contact Center Services agent outage notification

As you can imagine, the COVID-19 pandemic put a significant strain on EG’s Contact Center Services, where customers can call in to request trip modifications and cancellations and have many of their urgent queries answered. It is vital that the agents staffing these centers are kept up to date on what is happening within EG’s technical infrastructure. One significant win implemented at the beginning of 2021 was a ticker display that gives agents an overview of ongoing incidents on their desktop. This allows them to understand whether an ongoing incident might be impacting one of the services they use, so that they do not waste additional time reporting it in duplicate. Also, incident bridges in Slack and voice are open to the entire company, so employees can join to receive ongoing updates during an outage if they so choose. Time not spent on these activities is time they can spend helping EG’s customers, travelers, and partners.

End of life for legacy services and open source projects

Success and stability can be a double-edged sword, both for internally developed services and for open source projects. In the Vrbo technology space, an aging infrastructure of core API services has led to a significant effort to bring autonomy and ownership to teams, as opposed to having a common set of RESTful APIs maintained by a single team and contributed to by others. This has been necessitated by the switch to a streaming platform, as mentioned in previous EG Medium posts such as Fully Reactive Stream Processing with Apache Kafka and Project Reactor and The Need for a Stream Registry — Intro. Client teams have been working steadily on this switch, and one of the main goals is to have each team maintain a coherent API for the entity types and operations in its domain.

Dependencies on dormant open source projects

EG’s technology teams have incorporated open source projects into internal projects for some time. Like much of the rest of the tech industry, EG would not be successful without open source code and projects to build domain-specific technology upon. One gotcha, and something to watch out for in your own infrastructure, is open source projects that have gone dormant, especially ones with many known issues that you may not have hit yet but that may someday manifest in your environment. In a recent incident, we saw another instance of the Hystrix open source circuit breaker library contributing to a portion of the impact. Many of the teams that use it are actively migrating to open source projects that are more actively maintained (most notably, Resilience4j). As teams have been implementing Resilience4j, they have been sharing great advice (a configuration sketch follows the list below), such as:

  • Disable writeable stack traces to avoid log spam.
  • If you do not want queues when using a thread pool bulkhead, make sure to set the queue size to 1.
  • Bind respective registries to meterRegistry/metricRegistry to publish metrics information about the circuit breaker health and thread pool usage.
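
Here is a minimal configuration sketch reflecting the first and third points, plus the bulkhead queue sizing, using the Resilience4j and Micrometer APIs; the pool sizes and the simple meter registry are illustrative, not EG’s actual settings.

    import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
    import io.github.resilience4j.bulkhead.ThreadPoolBulkheadRegistry;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
    import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
    import io.github.resilience4j.micrometer.tagged.TaggedThreadPoolBulkheadMetrics;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    /** Sketch of Resilience4j setup reflecting the advice above; values are illustrative. */
    public class Resilience4jSetup {

        public static void main(String[] args) {
            // Disable writable stack traces so each rejected call does not spam the logs.
            CircuitBreakerConfig breakerConfig = CircuitBreakerConfig.custom()
                    .writableStackTraceEnabled(false)
                    .build();
            CircuitBreakerRegistry breakers = CircuitBreakerRegistry.of(breakerConfig);

            // Thread pool bulkhead: queueCapacity(1) keeps queueing to the minimum allowed.
            ThreadPoolBulkheadConfig bulkheadConfig = ThreadPoolBulkheadConfig.custom()
                    .coreThreadPoolSize(10)
                    .maxThreadPoolSize(10)
                    .queueCapacity(1)
                    .build();
            ThreadPoolBulkheadRegistry bulkheads = ThreadPoolBulkheadRegistry.of(bulkheadConfig);

            // Bind both registries to the meter registry so circuit breaker health and
            // thread pool usage are published as metrics.
            MeterRegistry meterRegistry = new SimpleMeterRegistry(); // illustrative backend
            TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(breakers).bindTo(meterRegistry);
            TaggedThreadPoolBulkheadMetrics.ofThreadPoolBulkheadRegistry(bulkheads).bindTo(meterRegistry);
        }
    }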

In summary, EG takes the opportunity to learn from incidents and grow. We look forward to sharing more from these opportunities in the future!

Learn more about technology at Expedia Group
