Lightning Fast Flight Searches on Expedia using Apache Ignite — Part II
The infrastructure, configurations, design choices & setup
In Part 1 of this series, we talked about our problem, scale, data model, and the challenges that led us to choose Apache Ignite to power the caching layer for flights. As promised, in this post we’ll talk about the configuration, deployment, maintenance, pitfalls, and best practices in our journey of migrating to an Enterprise Apache Ignite cluster.
Our goal was to set up a dedicated cluster of Ignite server nodes on the cloud, one that could scale seamlessly with the traffic at Expedia Group™️. Server nodes are the ones that store data, and to optimize throughput & latency we wanted all data to be stored in memory without introducing any native or third-party persistence.
Flights data is very dynamic: the availability of flight seats & pricing information could potentially change every minute (or even every second), so the cached information can go stale pretty quickly. Moreover, for any route there can be thousands of possible flight combinations, meaning the data size is huge. For instance, it’s common to see up to 10k unique flight combinations for return travel between New York and Miami on any given date.
Some of the key things that we looked at whilst coming up with the high-level configurations for the cluster were:
- cache read/write patterns
- the data size for each entry
- the time to live for each entry
- overall memory requirements
- network constraints and
- providing adequate buffer
Since we wanted to deploy this on the cloud, choosing the right-sized memory-optimized EC2 instance was very important because all cache data had to stay strictly in memory. Moreover, the server-side remote computations needed to load complete data from different caches into memory before processing, so a sizeable heap was also required.
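To make the sizing exercise a little more concrete, here is a back-of-the-envelope sketch. It assumes a steady state in which the number of live entries is roughly the write rate multiplied by the TTL, and then adds a safety buffer on top. The class name and every number in it are hypothetical placeholders rather than our production figures, and the estimate ignores Ignite's per-entry and per-page overhead.

```java
/** Back-of-the-envelope sizing for an in-memory cache (all inputs are illustrative). */
final class CacheSizingEstimate {

    /**
     * Estimates the bytes needed to hold all live entries at steady state.
     *
     * @param writesPerSecond new entries written per second (assumed constant)
     * @param ttlSeconds      time to live of each entry
     * @param avgEntryBytes   average serialized size of one entry
     * @param bufferFraction  extra headroom, e.g. 0.25 for 25%
     */
    static long requiredBytes(long writesPerSecond, long ttlSeconds,
                              long avgEntryBytes, double bufferFraction) {
        long liveEntries = writesPerSecond * ttlSeconds; // entries alive before they expire
        long rawBytes = liveEntries * avgEntryBytes;     // payload without any headroom
        return (long) Math.ceil(rawBytes * (1.0 + bufferFraction));
    }

    public static void main(String[] args) {
        // Hypothetical workload: 500 writes/s, 10-minute TTL, 200 KiB per entry, 25% buffer.
        long bytes = requiredBytes(500, 600, 200 * 1024, 0.25);
        System.out.printf("Estimated memory need: %.1f GiB%n", bytes / (1024.0 * 1024 * 1024));
    }
}
```

An estimate like this only gives a starting point for picking an instance type; it would still need to be validated with load tests against real entry sizes.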
At the time of developing this, we used the latest version of Ignite which was 2.9.1. There are a lot of new features & improvements in Ignite 3 and we’d like to move to that version in the future and follow it up with another post on the migration journey.
Next, instead of directly running Ignite binaries and managing configuration natively, we embedded Ignite within a Java service bootstrapped with Spring Boot. This enabled us to easily add our own REST endpoints for various debugging & housekeeping operations, such as custom health checks. It also enabled us to easily manage hierarchical environment-specific configurations and integrations like Vault.
Ignite is highly configurable & provides a plethora of configurations to tweak the system based on your needs. We used multiple caches, created at the server, specifying the configuration of each one at the time of creation. Here’s a set of configurations that we think are the most useful:
Cache Mode: PARTITIONED
Cache atomicity: ATOMIC
Data Storage: Only off-heap memory
Thread Pool Sizes (Public Thread Pool, Query Thread Pool, System Thread Pool):
- Default: max (8, total cores)
- Configured: thrice the default limit
Data Region Size (initial & maximum): 20% of available physical memory
We configured the min and max to be the same as we didn’t want to add any overhead for new memory chunk allocation. Running remote computations on the server needs on-heap memory so we ended up allocating a sizeable chunk of physical memory. We ran some quick performance tests and saw that the ratio of 60:40 (off-heap:on-heap) worked best for us.
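The thread pool and memory numbers above boil down to simple arithmetic, sketched below. The class and method names are made up for illustration, and the memory budget is a hypothetical figure; only the max(8, cores) default, the factor of three, and the 60:40 split come from the text.

```java
/** Derives pool sizes and the off-heap/on-heap split described above (illustrative only). */
final class NodeSizing {

    /** Default pool size: max(8, total cores). */
    static int defaultPoolSize(int cores) {
        return Math.max(8, cores);
    }

    /** The size we configured: three times the default. */
    static int configuredPoolSize(int cores) {
        return 3 * defaultPoolSize(cores);
    }

    /** Bytes given to off-heap data regions under a 60:40 off-heap:on-heap split. */
    static long offHeapBytes(long memoryBudgetBytes) {
        return memoryBudgetBytes * 60 / 100;
    }

    /** Bytes left for the JVM heap under the same split. */
    static long heapBytes(long memoryBudgetBytes) {
        return memoryBudgetBytes - offHeapBytes(memoryBudgetBytes);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long budget = 100L * 1024 * 1024 * 1024; // hypothetical 100 GiB budget for Ignite
        System.out.println("pool size = " + configuredPoolSize(cores));
        System.out.println("off-heap  = " + offHeapBytes(budget) + " bytes");
        System.out.println("on-heap   = " + heapBytes(budget) + " bytes");
    }
}
```

The off-heap figure would be used for both the initial and maximum data region size, and the heap figure for the JVM heap settings.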
Time To Live: The TTL is specified via ExpiryPolicy and this could be configured per cache. After the specified time has passed, the entry becomes stale & the memory is marked to be used again. Our different caches use different TTLs depending on the use case.
Page Eviction Mode:
- Default: Disabled
- Configured: RANDOM_2_LRU
Note: you may run into OOMs in production if this bit is not configured correctly.
Eviction Threshold:
- Default: 90% for maximum utilization of off-heap space
- Configured: 80% to guard against JVM Out Of Memory errors (OOM) due to spikes in traffic which are not uncommon for the scale of Expedia Group™️.
Page Size:
- Default: 4096 bytes
- Configured: This needs to be configured based on your use case & entry/row size.
Empty Pages Pool Size: This specifies the minimum number of pages to be present in the reuse list and, coupled with page size, makes up a tricky configuration and could lead to production outages if not done correctly.
It ensures that Ignite will be able to successfully evict old entries when the size of an entry is larger than page size/2. The total size of pages in this pool should be enough to contain the largest cache entry. The default is only 100.
Also, as a general rule of thumb, this configuration should be less than maxOffHeapSize/Page Size/10.
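The two constraints on the empty pages pool size can be checked with a few lines of arithmetic. The sketch below is only an approximation under stated assumptions: it treats every page as fully usable for entry data (in practice page headers take some space), and the class and method names are hypothetical.

```java
/** Checks the lower and upper bounds for the empty pages pool size (approximate). */
final class EmptyPagesPoolSizing {

    static final long DEFAULT_POOL_SIZE = 100; // the default mentioned above

    /** Smallest pool whose pages can together hold the largest entry. */
    static long minPoolSize(long largestEntryBytes, int pageSizeBytes) {
        return (largestEntryBytes + pageSizeBytes - 1) / pageSizeBytes; // ceiling division
    }

    /** Rule-of-thumb upper bound: maxOffHeapSize / pageSize / 10 (exclusive). */
    static long maxPoolSize(long maxOffHeapBytes, int pageSizeBytes) {
        return maxOffHeapBytes / pageSizeBytes / 10;
    }

    /** Picks a pool size that satisfies both bounds, or fails loudly if none exists. */
    static long choosePoolSize(long largestEntryBytes, long maxOffHeapBytes, int pageSizeBytes) {
        long chosen = Math.max(minPoolSize(largestEntryBytes, pageSizeBytes), DEFAULT_POOL_SIZE);
        long limit = maxPoolSize(maxOffHeapBytes, pageSizeBytes);
        if (chosen >= limit) {
            throw new IllegalArgumentException(
                "need " + chosen + " pages but the region only allows fewer than " + limit);
        }
        return chosen;
    }

    public static void main(String[] args) {
        long largestEntry = 2L * 1024 * 1024;      // hypothetical 2 MiB entry
        long offHeap = 10L * 1024 * 1024 * 1024;   // hypothetical 10 GiB data region
        System.out.println("empty pages pool size = " + choosePoolSize(largestEntry, offHeap, 4096));
    }
}
```

If the check fails, the options are a larger data region, a larger page size, or smaller cache entries.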
Reads from Backup:
- Default: Read from the backup node
- Configured: Turned off backup node reads, since we have a complicated use case in which the read clients are continually waiting for data from multiple caches. Turning this off prevents those clients from reading from a stale read replica.
Affinity Key: We use an affinity key across caches to route grid computations to the nodes where the data for the computation is cached. This co-location leads to zero hopping across nodes on the server side and has given us great results in reducing latency.
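Pulling the settings above together, a server node could be configured programmatically along the following lines. This is a minimal sketch that needs the ignite-core dependency on the classpath, not our actual production setup: the cache name, region name, region size, TTL, choice of CreatedExpiryPolicy, empty pages pool size, and the route key are all placeholders picked for illustration.

```java
import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataPageEvictionMode;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

final class ServerNodeSketch {

    static final String CACHE = "flightOffers";   // hypothetical cache name
    static final String REGION = "flightsRegion"; // hypothetical data region name

    static IgniteConfiguration serverConfig() {
        long regionBytes = 1L * 1024 * 1024 * 1024; // placeholder size: 1 GiB off-heap

        DataRegionConfiguration region = new DataRegionConfiguration();
        region.setName(REGION);
        region.setInitialSize(regionBytes);  // initial == max, so no chunk allocation later
        region.setMaxSize(regionBytes);
        region.setPersistenceEnabled(false); // keep everything in memory
        region.setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU);
        region.setEvictionThreshold(0.8);    // start evicting at 80% utilization
        region.setEmptyPagesPoolSize(512);   // placeholder, size it for your largest entry
        region.setMetricsEnabled(true);      // region metrics are off by default

        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setPageSize(4096);           // the default, tune for your entry size
        storage.setDataRegionConfigurations(region);

        CacheConfiguration<String, byte[]> cache = new CacheConfiguration<>(CACHE);
        cache.setCacheMode(CacheMode.PARTITIONED);
        cache.setAtomicityMode(CacheAtomicityMode.ATOMIC);
        cache.setDataRegionName(REGION);
        cache.setReadFromBackup(false);      // always read from the primary copy
        cache.setExpiryPolicyFactory(        // placeholder TTL: 5 minutes after creation
            CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.MINUTES, 5)));
        cache.setStatisticsEnabled(true);    // cache metrics are off by default

        int poolSize = 3 * Math.max(8, Runtime.getRuntime().availableProcessors());

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storage);
        cfg.setCacheConfiguration(cache);
        cfg.setPublicThreadPoolSize(poolSize);
        cfg.setQueryThreadPoolSize(poolSize);
        cfg.setSystemThreadPoolSize(poolSize);
        return cfg;
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start(serverConfig())) {
            String routeKey = "JFK-MIA:2030-01-15"; // hypothetical key used for affinity
            IgniteCache<String, byte[]> offers = ignite.cache(CACHE);
            offers.put(routeKey, new byte[] {1, 2, 3});

            // Run the job on the node that owns routeKey, so the read stays local.
            int size = ignite.compute().affinityCall(CACHE, routeKey, () -> {
                byte[] local = Ignition.localIgnite().<String, byte[]>cache(CACHE).localPeek(routeKey);
                return local == null ? 0 : local.length;
            });
            System.out.println("bytes read on the owning node: " + size);
        }
    }
}
```

The affinityCall at the end illustrates the co-location idea: the job is shipped to the node that owns the key instead of the data being pulled across the network.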
On the client side, the first thing we needed to decide was the flavor of client to use. Our client apps are JVM-based, and Ignite offers both thin clients and thick clients. Here’s a quick brief on these options:
A thin client is very lightweight and initiates a single TCP socket connection with one of the nodes in the cluster, communicating via a simple binary protocol. Thin clients are very similar to “client libraries” in the traditional sense.
A thick client is a regular Ignite node running in client mode. Thick clients become a part of the cluster and receive all of the cluster-wide updates, such as topology changes, etc., and can direct a query/operation to a server node that owns a required data set.
For our use case, we chose thick clients due to various reasons:
- One of our primary use cases was for a lot of client calls to wait for the server to notify them of an event. This is powered by Continuous Queries, which are only available with thick clients.
- Compute grid functionality is very limited with thin clients & this is one of the pillars of using Ignite as explained in the previous post.
- As mentioned above, since a thick client is topology-aware, whenever a thick client sends a request to read or update a cache entry, it goes directly to the node where this entry is stored. This co-location is very efficient from a scalability and latency standpoint.
- Features like near-caches & data streaming are much less likely to be built for thin clients.
So, we ended up using an embedded Ignite node running in client mode (a.k.a thick client) in our service.
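As a rough illustration of what this looks like in code, the sketch below starts an Ignite node in client mode and registers a continuous query so that the client gets notified when entries change. It needs ignite-core on the classpath, the cache name and key/value types are hypothetical, and discovery configuration is omitted for brevity, so as written it would rely on Ignite's default discovery.

```java
import javax.cache.Cache;
import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.configuration.IgniteConfiguration;

final class ThickClientSketch {

    public static void main(String[] args) throws InterruptedException {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true); // join the cluster as a thick client, storing no data

        try (Ignite client = Ignition.start(cfg)) {
            // Look up a cache that the servers created; never create it from the client.
            IgniteCache<String, String> offers = client.cache("flightOffers"); // hypothetical name
            if (offers == null) {
                throw new IllegalStateException("cache has not been created by the servers yet");
            }

            ContinuousQuery<String, String> query = new ContinuousQuery<>();
            // Invoked on this client whenever a matching entry is created or updated.
            query.setLocalListener(events -> {
                for (CacheEntryEvent<? extends String, ? extends String> event : events) {
                    System.out.println("updated: " + event.getKey());
                }
            });

            // Notifications keep flowing for as long as the cursor stays open.
            try (QueryCursor<Cache.Entry<String, String>> cursor = offers.query(query)) {
                Thread.sleep(60_000); // placeholder for the lifetime of the waiting call
            }
        }
    }
}
```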
Some of the things to consider while configuring the client would be:
- A thick client is much heavier than a thin client in terms of CPU & memory requirements. We needed to revisit and vertically scale the client service after embedding the thick client.
- A thick client requires full connectivity with every single server node. This connectivity needs to be established both ways between client and server. So, if there is a firewall, NAT, or a load balancer between the client and the server, it will be really hard or even impossible to use thick clients in some cases. We’ll shed more light on this in the next section on Network Discovery.
- Our clients never create new dynamic caches. They always use the getCache API to get hold of cache objects created by the server. This helped keep the client configurations in a single, manageable place with the right expectations.
- Thick clients do not provide out-of-box support for resilience when the cluster is down. We needed to add custom code to make sure that Ignite server outages do not block the clients from bootstrapping and serving other use cases.
- Make sure to provide the timeouts for async queries & remote compute jobs.
- Wherever possible, make sure to provide separate dedicated thread pools to handle callbacks to queries & compute results.
- Integration with Reactive applications is not seamless and needs some dedicated time and effort, since all Ignite async APIs return a custom future object (IgniteFuture) which is not directly compatible with Mono/Flux.
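The last three points (timeouts, dedicated callback pools, and bridging to reactive types) can be combined into one small adapter. The sketch below deliberately uses a hypothetical CallbackFuture interface as a stand-in for a listener-style future such as IgniteFuture, so that it compiles without Ignite on the classpath. It assumes Java 9 or later for orTimeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Hypothetical stand-in for a listener-style future such as IgniteFuture. */
interface CallbackFuture<T> {
    /** Registers a callback that receives either the result or the failure. */
    void listen(BiConsumer<T, Throwable> onDone);
}

final class FutureBridge {

    /**
     * Adapts a callback-style future to a CompletableFuture that completes on a
     * dedicated pool and fails with a TimeoutException if no result arrives in time.
     */
    static <T> CompletableFuture<T> toCompletable(CallbackFuture<T> source,
                                                  Executor callbackPool,
                                                  long timeout, TimeUnit unit) {
        CompletableFuture<T> bridged = new CompletableFuture<>();
        // Hand completion over to our own pool so downstream work does not
        // run on the thread that delivered the callback.
        source.listen((result, error) -> callbackPool.execute(() -> {
            if (error != null) {
                bridged.completeExceptionally(error);
            } else {
                bridged.complete(result);
            }
        }));
        return bridged.orTimeout(timeout, unit);
    }
}
```

With a real IgniteFuture, the listen call would be replaced by IgniteFuture.listen, reading the outcome with get() inside a try/catch. The resulting CompletableFuture can then be handed to a reactive pipeline, for example through Reactor's Mono.fromFuture.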
Network discovery is the process by which nodes discover each other and form the cluster. Server nodes need to discover other server nodes, whilst clients need to discover servers. They all exchange information and arrange themselves in a ring. Note that since we used thick clients, clients also become a part of the cluster, but they neither store data nor participate in any compute tasks.
Ignite provides a lot of ways (both static and dynamic) to configure this discovery. We operate in a cloud environment with continuous delivery so we chose a dynamic way to configure discovery. We started with an ALB based discovery finder where healthy clients and servers would register themselves with a TargetGroup attached to an ALB. Each node would then query the ALB using AWS APIs to get information on the current set of nodes in the cluster.
We faced multiple problems with this approach:
- Our clients were containerized on Amazon ECS with dedicated EC2 machines as the workhorse, using bridge network mode. However, bridge mode results in port translation, so clients could connect to servers but servers could never connect back to clients. We resolved this by moving to ECS on Fargate, which uses the awsvpc network mode. This mode gives Amazon ECS tasks the same networking properties as an EC2 instance with a dedicated ENI, which allowed clients and servers to discover each other.
- AWS rate limits update and query API calls on Target Groups. We used exponential backoffs which meant the time for nodes to discover each other increased leading to connectivity timeouts and fallback measures kicking in. This put extreme stress on the cluster which was evident from the spikes in latencies, cascaded node failures, etc.
- Default discovery creates a ring-shaped cluster. The total number of nodes (clients + servers) easily went into the 100s and as the ring size grew, it took multiple seconds for various housekeeping messages, like node joining, to traverse through all the nodes. This, again, affected the overall cluster performance and stability.
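To see why rate limiting hurt so much, it helps to add up the waiting time that exponential backoff introduces. The sketch below uses made-up numbers (a 200 ms base delay and a 10 s cap) and hypothetical names; it is not the retry policy we ran, only the arithmetic behind it.

```java
/** Arithmetic behind capped exponential backoff (illustrative numbers only). */
final class BackoffMath {

    /** Delay before the given retry (0-based): base * 2^attempt, capped. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 30); // left shift doubles per attempt
        return Math.min(delay, capMillis);
    }

    /** Total time spent waiting across the first {@code attempts} retries. */
    static long totalWaitMillis(int attempts, long baseMillis, long capMillis) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            total += delayMillis(i, baseMillis, capMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int retries = 1; retries <= 8; retries++) {
            System.out.println(retries + " throttled calls -> "
                + totalWaitMillis(retries, 200, 10_000) + " ms of waiting");
        }
    }
}
```

With these placeholder numbers, six throttled calls in a row already add up to 12.6 seconds of waiting before a node even learns who its peers are, which is how backoff can push discovery past connectivity timeouts.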
With these problems, we finally moved to ZooKeeper-based discovery and would strongly recommend it as the primary discovery mechanism for everyone in production, irrespective of the number of nodes in the cluster. We stood up a ZooKeeper cluster and used ignite-zookeeper on both clients and servers as the discovery SPI. The ZooKeeper cluster acted as a single point of synchronization and rearranged the ring-shaped cluster into a star-shaped one, where the ZooKeeper cluster sits in the middle and all Ignite nodes exchange discovery events through it. The ZooKeeper IPs were put behind Route 53 so that clients wouldn’t need to be configured again if the ZooKeeper cluster went down. You could also use dedicated ENIs with static IPs and attach them to the new node if a node goes down. The ZooKeeper cluster was minimal but covered all Availability Zones for resiliency.
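For reference, wiring ZooKeeper discovery into a node takes only a few lines once the ignite-zookeeper module is on the classpath. The sketch below is a minimal example under assumptions: the DNS name, root path, and timeout values are placeholders rather than our settings.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

final class ZookeeperDiscoverySketch {

    /** Builds a node configuration that discovers peers through ZooKeeper. */
    static IgniteConfiguration withZookeeper(boolean clientMode) {
        ZookeeperDiscoverySpi discovery = new ZookeeperDiscoverySpi();
        // Hypothetical DNS record that resolves to the ZooKeeper ensemble.
        discovery.setZkConnectionString("zookeeper.example.internal:2181");
        discovery.setZkRootPath("/ignite");  // placeholder root znode for this cluster
        discovery.setSessionTimeout(30_000); // placeholder, in milliseconds
        discovery.setJoinTimeout(10_000);    // placeholder, in milliseconds

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(clientMode);       // same SPI for servers and thick clients
        cfg.setDiscoverySpi(discovery);
        return cfg;
    }
}
```

Pointing the connection string at a DNS name rather than fixed IPs is what lets nodes keep their configuration when ZooKeeper instances are replaced.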
Deployment and Maintenance
Nodes are initialized using the Spring Boot extension. This allowed us to easily bundle our custom code (like swagger endpoints for read/write, etc.) along with the server. The Spring Boot properties also allowed us to easily specify different configurations per environment (production, staging, stress-testing).
We use Jenkins for continuous integration and Spinnaker for continuous delivery. Releases follow Blue-Green deployment practices since we’re primarily using Ignite as an L2 cache. We don’t have to add nodes progressively and wait for replication to happen since the data is per user and not long-lived. The relatively short TTLs have allowed us to simply discard the data during releases, since the new servers will be hydrated with fresh data.
The cluster is deployed in AWS cloud, so autoscaling is managed inherently using CPU monitors and memory usage monitors on nodes.
Ignite natively exposes a lot of metrics, both for local nodes and for the cluster as a whole. Not all metrics are enabled by default, so some cache-level and data region metrics need to be enabled on an as-needed basis.
We are using Datadog for metrics dashboards, monitoring, and alerting. Datadog already comes with an out-of-box integration for Ignite. The Datadog Agent can be deployed on the Apache Ignite boxes, where it collects node and cluster-wide memory, cache, and storage metrics and logs. Apache Ignite exposes JMX metrics, and the Datadog Agent connects to Ignite’s JMX server to capture performance data from both Ignite and the underlying JVMs.
So, this is how we were able to set up a complete Apache Ignite cluster in production, serving more than 3k transactions per second with new caches and use cases continually being added.