EXPEDIA GROUP TECHNOLOGY — PLATFORM

How to Manage a Platform Handling Millions of Requests per Day

Supporting the Flights platform fueling close to a billion transactions each day



“Expedia Group™ wants to be a place where exceptional people who share our passion for technology and travel want to do their best work”

I am part of the Flights team at Expedia Group, which powers the platform for everything from searching to booking and ticketing flights.

Managing and supporting a platform that processes millions of transactions each day across more than 50 points of sale, covering flight search, pricing, booking and ticketing, is a huge and complex task. Here are some examples from my experience of how we achieved this, reducing the metrics below for any issue coming our way and “being the first ones to know of any outage”:

MTTD — Mean time to detect

MTTK — Mean time to know

MTTR — Mean time to resolve
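To make these three metrics concrete, here is a minimal Python sketch of how they could be computed from incident timestamps. The `Incident` fields and the interval each metric measures (fault start to detection, detection to known cause, fault start to resolution) are assumptions made for illustration, not definitions taken from our tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # All four fields are hypothetical; real incident records may differ.
    started: datetime    # when the fault began affecting traffic
    detected: datetime   # when an alert or a person first noticed it
    diagnosed: datetime  # when the cause was known
    resolved: datetime   # when service health was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Compute the three means under the interval assumptions stated above."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTK": mean_minutes([i.diagnosed - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.started for i in incidents]),
    }


# Two made-up incidents, used only to exercise the function.
t0 = datetime(2020, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=14), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26), t0 + timedelta(minutes=60)),
]
metrics = incident_metrics(incidents)
print(metrics)
```

Tracking the three numbers separately shows whether time is being lost in noticing a problem, in understanding it, or in fixing it.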

We have 24x7 support provided by two dedicated teams that take the support rotation for two weeks at a time, with members from Asia-Pacific (APAC) and North America (NA). During its support rotation, each team's primary job is to resolve any Flights-related issues that arise. The idea is to rotate support responsibilities across teams instead of having one dedicated team (Tier 1) responsible only for support operations. This approach serves two purposes:

  1. Building overall Flights platform competency across teams (each team specializes in one of the Flights domain components)
  2. Making everyone responsible for cleaning the house and keeping the platform bug-free, which distributes responsibilities evenly across teams

The Air stack comprises over 50 different services, including Search, Pricing, Details, Offer, Booking, Ticketing and GDS (Global Distribution System) specific APIs. For each service, we have a primary and a secondary owner (Tier 2), based across APAC and NA, to provide 24x7 support coverage.

Major responsibilities of a Tier 1 support team are:

  1. Handling Live-site issues
  2. Resolving all tickets logged with the Air Support team
  3. Site Availability Analysis

Handling Live-site issues (PagerDuty)

We have Network and Operations teams at Expedia Group whose primary job is to monitor traffic and outages across all points of sale. Whenever there is a production issue, these teams immediately engage the Flights team that is handling support that week. To enable that, we have set up PagerDuty accounts that notify the on-duty support team of the production issue, so that they can join the discussion and help resolve it. Support team members need to accept the notification (phone call, SMS, email) sent by PagerDuty and join the discussion bridge. We have created a daily roster so that team members take turns attending to PagerDuty each day, which builds competency across the team.
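The daily roster can be pictured as a simple round-robin assignment. The sketch below is an illustration only: the member names, the start date and the `build_daily_roster` helper are hypothetical, and it does not call PagerDuty, where schedules are configured in the product itself.

```python
from datetime import date, timedelta
from itertools import cycle


def build_daily_roster(members, start, days):
    """Assign one on-call member per day, cycling through the team in order."""
    rotation = cycle(members)
    return {start + timedelta(days=offset): next(rotation) for offset in range(days)}


# Hypothetical team and a two-week rotation window.
team = ["asha", "ben", "chen"]
roster = build_daily_roster(team, date(2020, 1, 6), 14)

for day, member in roster.items():
    print(day.isoformat(), member)
```

Cycling through the team in a fixed order means everyone handles live-site pages several times within one rotation, which is the competency-building effect described above.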

Resolving all tickets logged with Air Support

The Tier 1 team is responsible for resolving all the bugs logged by internal and external teams during its rotation. This way, we keep our stack clean and error-free.

Availability reports

We also run automated availability reports on various points of sale throughout the day to check for any errors on the shopping stack. This helps ensure that the stack is up and running with 24x7 availability.

Splunk Alerts

Every service is integrated with and monitored via Splunk. We have set up Splunk alerts that are triggered if any service goes down or if its success rate drops below a certain threshold (which varies by service). These alerts enable a quick reaction and ensure we are the first to know about changes in service health.
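The alert condition itself is simple to express. The following Python sketch shows the per-service threshold logic in isolation; it is not a Splunk search or a call to any Splunk API, and the service names, threshold values and the choice to treat zero traffic as "down" are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values vary by service and are not given here.
DEFAULT_THRESHOLD = 0.99
THRESHOLDS = {"search": 0.98, "booking": 0.995}


@dataclass
class ServiceStats:
    name: str
    total: int      # requests seen in the evaluation window
    succeeded: int  # requests that completed successfully


def should_alert(stats):
    """Alert if the service looks down or its success rate is below threshold."""
    if stats.total == 0:
        # Assumption: no traffic in the window is treated as the service being down.
        return True
    threshold = THRESHOLDS.get(stats.name, DEFAULT_THRESHOLD)
    return stats.succeeded / stats.total < threshold


print(should_alert(ServiceStats("search", 1000, 985)))   # 98.5% against a 98% threshold
print(should_alert(ServiceStats("booking", 1000, 990)))  # 99.0% against a 99.5% threshold
```

Keeping thresholds per service matters because the same success rate can be healthy for one service and a real problem for another.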

Service Ownership

Feature development and technical debt go hand in hand, so we allow 20% of the bandwidth in each sprint to be spent on technical debt and service ownership tasks. Each service owner works with relevant stakeholders and ensures that the service tech backlog is groomed and prioritized.

Service Categorization

All Flights-related services have been categorized into tiers, with multiple service owners located across the globe, enabling us to respond quickly to any incident or outage.

Silver Services

  • These services do not have a direct booking impact.
  • Silver Service Owner: These services typically have one or more owners, who are accountable for the service’s code quality, architecture, deployment and automation practices, and general technical excellence, and who provide the Business as Usual (BAU) support required for the service.
  • Silver services are generally owned by an individual or a single team, and there is no regional single point of contact (SPOC) for them.

Gold Services

  • These are Tier 1 services whose failure can cause booking loss for Expedia Group.
  • Gold Service Owner: These services typically have one or more owners, who are accountable for the service’s code quality, architecture and deployment.
  • Regional SPOC: When a service needs 24x7 support coverage, there are dedicated regional SPOCs whose sole responsibility is to provide expertise on the service during live-site incidents, so that its health is restored within the defined service level agreement (SLA).

Platinum Services

  • A few Gold services that are too complex for a single owner to support, and/or are worked on globally by multiple teams within Flights, are tagged as Platinum services.
  • Pod Leads own the specific Platinum service to which they are aligned and operate in a co-owner model, with at least one person from each pod working on the service (a minimum of 5 co-owners in total for each service).
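The tier rules above could be captured in a small data model, for example to sanity-check a service catalog. This is a sketch under assumptions: the `Service` fields, the `validate` helper and the example names are hypothetical, and only the rules stated in this post (at least one owner, no regional SPOC for Silver, at least 5 co-owners for Platinum) are encoded.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    SILVER = "silver"
    GOLD = "gold"
    PLATINUM = "platinum"


@dataclass
class Service:
    name: str
    tier: Tier
    owners: list
    # Maps a region such as "APAC" or "NA" to a contact; empty when there is no SPOC.
    regional_spocs: dict = field(default_factory=dict)


def validate(service):
    """Return a list of rule violations for the service's tier."""
    problems = []
    if not service.owners:
        problems.append("needs at least one owner")
    if service.tier is Tier.PLATINUM and len(service.owners) < 5:
        problems.append("platinum services need at least 5 co-owners")
    if service.tier is Tier.SILVER and service.regional_spocs:
        problems.append("silver services have no regional SPOC")
    return problems


# Hypothetical catalog entries.
gold = Service("booking-api", Tier.GOLD, ["owner-a"], {"APAC": "spoc-1", "NA": "spoc-2"})
platinum = Service("search-core", Tier.PLATINUM, ["lead-a", "lead-b"])
print(validate(gold))
print(validate(platinum))
```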

Together, these practices helped us achieve operational excellence: our services stay healthy and scale well, and overall customer impact is reduced.

Learn more about technology at Expedia Group
