Authors: Mike Moore, Nikash Ray, Venkata Naresh Bathula
What is Redis?
Redis is a high-throughput, low-latency in-memory data structure store (supporting Lists, Sets, Sorted Sets, and more), often used as a database, cache, and message broker. We run Redis Labs' Redis Enterprise software in all of our data centers. Redis is used broadly across Groupon, both synchronously (blocking calls) and asynchronously (non-blocking calls that enqueue tasks into Resque).
Consistency vs. Availability
For most flows at Groupon we slightly favor availability (“always-on”) over consistency (see CAP Twelve Years Later: How the “Rules” Have Changed). We determine the consistency needed per endpoint from a business standpoint: for example, a deal in the cache is consistent enough if updated within 1 minute of a content change, while an inventory change must be far more consistent to avoid oversell/undersell, so we take steps to ensure that consistency there is sub-second.
Goods: Online “Deal-Show” for Goods Inventory Service (GIS)
www.Groupon.com is a high-traffic e-commerce site that puts a premium on providing a low-latency web and mobile customer experience. All calls from the UX tier come to the API gateway (Gateway Aggregation pattern), which for each deal page synchronously and in parallel calls dozens of downstream services (e.g. deal information, option pricing, option available inventory, taxonomy) in what is called a “Deal-Show” call. The gateway then aggregates the returned results and passes a unified response to the UX tier for display on a web page or mobile application. Because these calls are synchronous, every service being called must adhere to a strict performance Service Level Agreement (SLA). For the Goods Inventory Service (GIS) this is on the order of 10,000 RPS @ P99 20MS latency. This SLA must be met regardless of whether a given deal has 1 option (~10KB payload) or 100 options (~1MB payload).
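As a rough illustration, the fan-out-and-aggregate shape of a Deal-Show call can be sketched in Python with asyncio. The service names and payloads below are illustrative stand-ins, not Groupon's actual APIs:

```python
import asyncio

# Stub downstream calls; in production these would be parallel HTTP
# requests to services such as deal information, pricing, and
# inventory (names and payloads here are illustrative only).
async def fetch_deal(deal_id):
    return {"deal": {"id": deal_id, "title": "Sample Deal"}}

async def fetch_pricing(deal_id):
    return {"pricing": {"price": 19.99}}

async def fetch_inventory(deal_id):
    return {"inventory": {"available": 42}}

async def deal_show(deal_id):
    """Gateway Aggregation: fan out to the downstream services in
    parallel, then merge the partial results into one response."""
    partials = await asyncio.gather(
        fetch_deal(deal_id), fetch_pricing(deal_id), fetch_inventory(deal_id)
    )
    response = {}
    for part in partials:
        response.update(part)
    return response

print(asyncio.run(deal_show("d-123")))
```

Because the downstream calls run concurrently, the gateway's latency tracks the slowest single service rather than the sum of all of them.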
Example Apparel Deal with 80+ Options
Prior to leveraging Redis to cache deals and their respective options, Deal-Show calls returned within SLA for 1-option deals but took 1000MS+ for 100-option deals. As our supply variety blossomed, the average # of options per deal dramatically increased, as did total Deal-Show latency. Our code was directly calling the RDBMS read replica in the local data center (avoiding the further latency of cross-data-center calls). To combat latency that grew linearly with the # of options, we changed the application query logic to run the option queries in parallel and then aggregate all the results into a single response, reducing total latency. However, retrieving deals with a large # of options (~100) still took 100MS+, which was over our SLA.
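A minimal sketch of that parallelized query logic, using Python threads and a stub standing in for the real per-option RDBMS query:

```python
from concurrent.futures import ThreadPoolExecutor

def query_option(option_id):
    # Stand-in for one read-replica query per option; real code
    # would issue a SELECT against the local RDBMS replica.
    return {"option_id": option_id, "inventory": 100}

def load_deal_options(option_ids, max_workers=16):
    """Run the per-option queries in parallel and aggregate the
    results, so total latency tracks the slowest single query
    rather than the sum of all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_option, option_ids))
```

Note that even with perfect parallelism, each request still pays the cost of the slowest query plus connection overhead, which is why this alone did not get a 100-option deal under the SLA.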
Redis Cache Solves our Latency Issues with 100 Option Deals:
To meet our 10,000 RPS @ 20MS P99 latency SLA, we turned to clustered Redis, which uses configurable internal sharding (6 shards met our needs) to serve data requests at exceptionally low latency. With a classic normalized SQL database, our large-deal (100-option) response had lengthened P99 from 20MS to 250MS+. Using Redis MGET, a 1-option deal had a P100 of .1MS. Impressively, returning a deal with 100 options added only a tiny .01MS, for a total P100 of .11MS, comfortably beating our performance SLA. We are now positioned to support deals with hundreds of options in the future.
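A sketch of the MGET access pattern that makes this possible: all of a deal's option keys are fetched in a single round trip. A tiny in-memory class stands in for the Redis client here (a real deployment would use a cluster-aware client such as redis-py), and the key layout is illustrative:

```python
import json

class FakeRedis:
    """In-memory stand-in for a clustered Redis client; production
    code would use redis-py, where mset/mget have the same shape."""
    def __init__(self):
        self._store = {}
    def mset(self, mapping):
        self._store.update(mapping)
    def mget(self, keys):
        return [self._store.get(k) for k in keys]

r = FakeRedis()

# One key per option (illustrative layout). MGET pulls every option
# of a deal in one round trip, which is why a 100-option deal costs
# barely more than a 1-option deal.
r.mset({f"deal:123:option:{i}": json.dumps({"option": i, "qty": 5})
        for i in range(100)})

keys = [f"deal:123:option:{i}" for i in range(100)]
options = [json.loads(v) for v in r.mget(keys)]
```

The single round trip is the point: per-key cost inside Redis is tiny, so batching 100 keys adds almost nothing over fetching one.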
Balancing Consistency with Availability (CAP)
For READ operations, we use a classic Cache-Aside pattern: we first read from the Redis cluster co-located in the same data center as the application host serving the request, and only if no value exists do we call the underlying data store. The service then materializes a view of all the options on a deal and synchronously writes the update to BOTH of our North America (NA) data centers. In this way, we not only save the round-trip time between the 2 NA data centers for most Redis MGET operations but also achieve predictable consistency, with a write operation completing within ~25MS P99. When an option gets updated, GIS likewise writes to BOTH data centers' Redis clusters (write-through pattern) so that subsequent reads get the updated data.
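The read and write paths can be sketched as follows, with in-memory dicts standing in for the two data centers' Redis clusters and the RDBMS (names and key layout are illustrative):

```python
import json

class FakeRedis:
    """In-memory stand-in for one data center's Redis cluster."""
    def __init__(self): self._store = {}
    def get(self, k): return self._store.get(k)
    def set(self, k, v): self._store[k] = v

local_dc, remote_dc = FakeRedis(), FakeRedis()   # the two NA DCs
db = {"deal:42": {"options": [1, 2, 3]}}         # stand-in RDBMS

def read_deal(key):
    """Cache-Aside: try the local cluster first; on a miss,
    materialize the options view from the data store and write
    it through to BOTH data centers."""
    cached = local_dc.get(key)
    if cached is not None:
        return json.loads(cached)
    view = db[key]                    # materialize options view
    payload = json.dumps(view)
    local_dc.set(key, payload)        # populate both DCs
    remote_dc.set(key, payload)
    return view

def write_deal(key, view):
    """Write-through: update the RDBMS and both Redis clusters
    synchronously so subsequent reads see fresh data."""
    db[key] = view
    payload = json.dumps(view)
    local_dc.set(key, payload)
    remote_dc.set(key, payload)
```

Serving reads from the local cluster avoids the cross-data-center round trip on the hot path, while the synchronous dual write bounds how stale either data center's cache can be.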
One key question all distributed systems should answer: “How long is eventual in eventual consistency?” Before we began synchronously updating our cache copies as part of every WRITE, we simply evicted keys, and each eviction opened a ~400MS “hole in the cache wall” during which all incoming customer requests for the evicted deal experienced overlapping cache misses. This produced an intense surge of duplicate RDBMS queries, creating latency spikes that pushed us far beyond SLA and dropped our cache hit ratio below 99%. Switching to a write-through pattern that synchronously updates cache keys to match the RDBMS raised our cache hit ratio above 99.9%. Because the synchronous cache update keeps cache keys remarkably consistent with the RDBMS, it became practical to lengthen the Redis key TTL to 12 hours + random (12 hours), so each key is evicted once per day, ensuring 100% of the cached keys are refreshed daily. Extending the TTL further alleviated RDBMS query spikes and the resulting P99 latency increases. The diagram below shows the Redis architecture in use.
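The TTL-with-jitter scheme is simple to express; the helper name below is our own, not part of any Redis API:

```python
import random

BASE_TTL = 12 * 3600  # 12 hours, in seconds

def jittered_ttl():
    """12 hours + random(12 hours): each key expires roughly once
    a day, and expirations are spread across the day instead of
    stampeding the RDBMS all at once."""
    return BASE_TTL + random.randint(0, BASE_TTL)

# With redis-py this would be used as:
#   r.set(key, value, ex=jittered_ttl())
```

The random component matters as much as the length: without it, keys written together expire together, recreating the duplicate-query surge the write-through pattern was meant to eliminate.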
Getaways is the online travel shopping line of business (LOB) for Groupon. Customers can book a stay at a destination from hundreds of hotels. On the deal page (depicted below) the customer can choose a rate plan/option along with the preferred check-in/check-out dates in the calendar. Once the user clicks the “continue” button, the UI/UX calls the Groupon API Gateway, which in turn calls POST /inventory/v1/products/<option id>/availability against the Getaways inventory service (GtIS).
Availability API and Redis
Once GtIS receives the /availability request, it checks whether inventory is available for the option for the selected check-in and check-out dates. It also checks for any applicable restrictions, price variations, cancellation policy variations, and changes in forex conversion. It then creates a dynamic entity called HotelProduct, which gets saved into Redis with a TTL of 1 hour. In simple terms, we can think of this process as adding a product to a shopping cart that the user sees on the checkout page (depicted in the diagram).
The HotelProduct gets written to Redis with the key being productDetailUuid and the value being the HotelProduct serialized as JSON.
Sample HotelProduct formatted in JSON
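A hedged sketch of the write path described above, with an in-memory stand-in for the Redis client and illustrative HotelProduct fields (not the exact schema):

```python
import json
import uuid

class FakeRedis:
    """In-memory stand-in; production code would call redis-py's
    setex against the Redis cluster."""
    def __init__(self): self._store = {}
    def setex(self, key, ttl, value): self._store[key] = (ttl, value)
    def get(self, key):
        entry = self._store.get(key)
        return entry[1] if entry else None

r = FakeRedis()

# Field names below are illustrative, not the real HotelProduct schema.
hotel_product = {
    "optionId": "opt-789",
    "checkIn": "2023-06-01",
    "checkOut": "2023-06-03",
    "ratePlan": "flexible",
    "price": {"amount": 149.00, "currency": "USD"},
}

# Key is the productDetailUuid; value is the JSON-serialized
# HotelProduct; TTL is 1 hour (3600 seconds).
product_detail_uuid = str(uuid.uuid4())
r.setex(product_detail_uuid, 3600, json.dumps(hotel_product))
```

The 1-hour TTL mirrors shopping-cart semantics: if the customer never completes checkout, the materialized HotelProduct simply expires.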
Unlike the Goods Inventory Service, GtIS uses a regional-consistent Redis cluster, which gives application hosts a consistent view of the data irrespective of which data center (we have two data centers for NA) they belong to. In the regional-consistent Redis cluster architecture, clients connect to a local Redis proxy, and the proxy then connects to the appropriate “leader” shard, which could be in either data center. Because proxies always connect to “leader” shards, consistency is ensured regardless of client location. In this use case we accept additional latency (about 5MS more when the proxy is in one data center and the corresponding shard is in the other) in exchange for consistency, because the subsequent reserve call (when the customer clicks “Place Order” on the checkout page) can reach either data center, and the application host must fetch the HotelProduct from Redis by productDetailUuid for further processing. The diagram below depicts the architecture, including the GtIS interaction.
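The reserve-time read can be sketched like this; a single shared dict stands in for the regional-consistent cluster, since reads from either data center route through a proxy to the same leader shard:

```python
import json

class FakeRegionalRedis:
    """Stand-in for the regional-consistent cluster: clients in
    both data centers reach the same leader shard via a proxy,
    modeled here as one shared store."""
    def __init__(self): self._store = {}
    def setex(self, key, ttl, value): self._store[key] = value
    def get(self, key): return self._store.get(key)

regional = FakeRegionalRedis()

# Written by the /availability call (illustrative key and payload).
regional.setex("pd-uuid-1", 3600, json.dumps({"optionId": "opt-789"}))

def reserve(product_detail_uuid):
    """Place Order handler: whichever data center receives the
    call fetches the same HotelProduct via the leader shard."""
    raw = regional.get(product_detail_uuid)
    if raw is None:
        raise KeyError("HotelProduct expired or missing")
    return json.loads(raw)
```

This is the trade the section describes: a few extra milliseconds on cross-data-center reads in exchange for never serving a stale or missing HotelProduct at order time.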
Overall, Redis has played a key role at Groupon in building high-throughput, low-latency systems. It is now one of Groupon's core building blocks and is being considered for many more new use cases.