EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Lightning Fast Flight Searches on Expedia using Apache Ignite

How we made Expedia’s flight search 10x faster with Apache Ignite

Rohit Goel
Expedia Group Technology


Photo by Vincentiu Solomon on Unsplash

Expedia Group is a leading global travel platform. Flight searches and bookings make up a large chunk of user traffic on our platform. For any route, there can be thousands of possible flight combinations; for instance, it’s common to see up to 10k unique combinations for return travel between New York and Miami on any given date. To create a unique yet coherent experience, we slice and dice these combinations and show the best-differentiated subset to users based on their current position in the shopping funnel. The combinations are also affected by the individual’s preferences and interactions in their current session.

Flight search is traditionally a read-heavy system, with peak search traffic regularly reaching thousands of transactions per second. Each request carries a latency overhead that goes all the way down to the supplier, so it is natural to use some form of short-duration caching to provide the best experience to our users. This results in a significant cache hit percentage. Our caching solution was powered by Apache Cassandra and had been serving our requirements for many years.

Here’s a high-level summary of how our Cassandra setup looked:

  1. All the flight combinations for a unique search are wrapped in a Java type.
  2. This huge blob is then serialised using a Fast Infoset (compressed XML) serialiser.
  3. It is then gzip-compressed before being sent over the wire and stored in Cassandra.

Cassandra is super fast: the 90th percentile for fetching this blob is ~30ms. So far so good.
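As a rough illustration of this write path, the flow looked something like the sketch below. The table name, TTL and class names are placeholders we made up for the example, and the real serialiser is Fast Infoset rather than the opaque byte array shown here.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch of the legacy write path: serialise the full result,
// gzip the blob, and store it in Cassandra keyed by the search criteria.
public class LegacyFlightCacheWriter {

    private final CqlSession session;

    public LegacyFlightCacheWriter(CqlSession session) {
        this.session = session;
    }

    public void store(String searchKey, byte[] serialisedOffers) throws Exception {
        // Step 3: gzip the serialised payload before sending it over the wire.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(serialisedOffers);
        }
        // Store the compressed blob against the search key (table and TTL are illustrative).
        session.execute(SimpleStatement.newInstance(
            "INSERT INTO flight_cache (search_key, offers_blob) VALUES (?, ?) USING TTL 600",
            searchKey, ByteBuffer.wrap(bos.toByteArray())));
    }
}
```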

Where is the problem?

Even though the Cassandra fetch times were swift, the 90th percentile of the overall flight search service response time was more than 3 seconds to display results from the cache to the user. This is not ideal for a cached response and somewhat defeats the purpose of caching. As we work towards our goal of becoming the world’s best travel platform, we continuously strive to improve the shopping experience for our customers.

As we started investigating the reasons for the latency, we could immediately identify the following problems with the current setup:

  • Fetching the gzipped blob was fast, but decompressing the complete response took ~1100ms.
  • Deserialising the response took even longer, ~1500ms.
  • Internal processing in the flight search service took ~100ms.

Together, these steps accounted for the bulk of the multi-second response time.

Flight search service p95 processing time

We started exploring different ways to optimise this and some of the first few solutions we implemented were:

  • Reducing the duplicate data stored in the blob in Cassandra — a 20% improvement
  • Using a better serialisation mechanism, Protocol Buffers — an additional 30% improvement
  • Trying different serialisation libraries like Kryo — no change

We did manage to reduce the overall latency with these improvements, but it still took more than a second to get results end to end. We wanted to improve further, so we decided on a complete rewrite of the caching layer with the goal of achieving sub-second service latency.

Architectural Requirements:

  1. The cache should return filtered flight combinations so that we reduce network payload and deserialisation time.
  2. Similarly, compression of the complete response should not be mandatory.
  3. Read operations from the cache should take less than 50ms.
  4. Cost should be reasonable considering the scale of flight search traffic.
  5. The data model should be as close to the final response as possible so we don’t need to loop over flight combinations for mapping.

Based on these requirements, it was clear that we needed a new data model and new ways of querying the cache. This was also a good time to evaluate some alternative NoSQL databases.

Among the options we evaluated, two Apache Ignite features caught our attention: the compute grid, for offloading heavy computations to the server side, and continuous queries, for server-side push events.

Solution to the identified problem

Apache Ignite is a distributed in-memory database for high-performance computing. Because of its in-memory nature, read operations are lightning fast. Apache Ignite implements the JCache specification, the standard caching API for Java, so it offers a simple yet powerful API for data access.
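As a taste of how simple that API is, here is a minimal sketch; the cache name and key/value types are placeholders, not our production model.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

// Minimal Ignite quick start using the JCache-style API.
public class IgniteQuickStart {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // getOrCreateCache returns an IgniteCache, which implements javax.cache.Cache.
            IgniteCache<String, String> cache = ignite.getOrCreateCache("flightOffers");

            cache.put("JFK-MIA|2021-06-01", "serialised offers go here");
            System.out.println(cache.get("JFK-MIA|2021-06-01"));
        }
    }
}
```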

Ignite provides various features that fulfil all our requirements:

  • Stores all keys and values as BinaryObject — this gives us the flexibility to change keys and values in the future without any breaking changes.
  • Lazy deserialisation using BinaryMarshaller — this lets us read only specific fields from an object; application code can break the value object into multiple fields and handle each one individually (see the sketch below).
  • Coherence using AffinityCollocation — we can instruct Ignite to keep a specific set of entries together on a single machine across caches.
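A minimal sketch of the lazy, field-level reads mentioned above; the cache and field names are illustrative.

```java
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;

// Reads a single field from a cached value without deserialising the whole object.
public class BinaryReadExample {
    static Object readPriceOnly(IgniteCache<String, Object> offers, String key) {
        // withKeepBinary() keeps values as BinaryObject instead of deserialising them.
        IgniteCache<String, BinaryObject> binaryView = offers.withKeepBinary();
        BinaryObject offer = binaryView.get(key);
        // Only the requested field is materialised; the rest of the object stays binary.
        return offer == null ? null : offer.field("totalPrice");
    }
}
```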

Over and above these, it supports remote computation: we can run our Java code to do post-filtering and processing on the Ignite servers where the data resides, ensuring that only a minimal payload travels across the network.
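A hedged sketch of what such a colocated computation can look like with Ignite’s affinity-aware compute API; the cache name, key format and filtering logic are placeholders, not our production job.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

// Runs the filtering closure on the node that owns the data for the given key.
public class RemoteFilterExample {
    static String cheapestOffer(Ignite ignite, String searchKey) {
        // affinityCall routes the closure to the node owning searchKey, so only the
        // filtered result travels back over the network.
        return ignite.compute().affinityCall("flightOffers", searchKey, () -> {
            IgniteCache<String, String> offers =
                Ignition.localIgnite().cache("flightOffers");
            // A real job would deserialise only the needed fields and rank the offers;
            // here we simply return whatever is stored locally for the key.
            return offers.localPeek(searchKey);
        });
    }
}
```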

It is also distributed under the Apache license, so no licensing cost is involved. The only downside is that there is no AWS managed offering for it yet, so we have to manage deployments, patching, and security ourselves.

Data Model

Apache Ignite flight search data model (cache key and value)
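To illustrate the colocation idea behind this data model, a key class along these lines keeps every offer for one search on the same node. The field names are ours, and the real key carries more search attributes.

```java
import java.io.Serializable;

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

// Illustrative cache key: @AffinityKeyMapped tells Ignite to compute the partition
// from searchId only, so every offer for the same search lands on the same node.
public class FlightOfferKey implements Serializable {

    /** Identifies one search (route, dates, passengers); drives data placement. */
    @AffinityKeyMapped
    private final String searchId;

    /** Distinguishes individual flight combinations within that search. */
    private final String offerId;

    public FlightOfferKey(String searchId, String offerId) {
        this.searchId = searchId;
        this.offerId = offerId;
    }
}
```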

Service — Cache interactions

Flight search service to Apache Ignite interactions

We’ll talk about the infrastructure and best practices for Ignite in the next post of this series, so suffice it to say that we set up a distributed Ignite cluster to hold in-memory flight combinations using the data model above. We made sure that all the flight combinations (remember, there can be thousands of these) related to a particular search were colocated on a single data node across caches. The service submits remote computation jobs to the cluster. Since the service runs an embedded thick client, it knows exactly which server node holds the data, so the remote job is submitted to the data-owning node. The job fetches all the offers, deserialises only what’s needed, runs the computations to slice, dice and filter the best results, and returns a subset to be sent back to the end user. Because the data model closely mirrors the final response, we also avoid any unnecessary mapping.
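For context ahead of the infrastructure post, starting the embedded thick client is roughly this; discovery and other configuration details are omitted here.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Rough sketch: the service embeds a thick client node, so it is topology-aware
// and can route compute jobs straight to the node that owns the data.
public class EmbeddedClientStartup {
    public static Ignite start() {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true); // join the cluster as a client; no data is stored here
        return Ignition.start(cfg);
    }
}
```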

Circling back to the three major hotspots identified in the previous solution: this time, decompression and service processing completed in ~40ms and the remote jobs finished in ~100ms, leading to a final response latency of less than 150ms.

Average Remote Compute Job Execution time

Apache Ignite Job Execution p95 Time

So, this was our story of revamping the caching layer for the flight search service to achieve super-fast results. We would like to stress that this is not a comparison of Cassandra vs Ignite: Cassandra is great, but Ignite was better suited to our needs. Please stay tuned for the next part in this series, where we’ll talk about the configuration, deployment, maintenance, pitfalls and best practices of setting up an Apache Ignite cluster.

Authors — Bhanu Choudhary, Rohit Goel, Varun Bhatia, Ramasubbu Subbareddy
