How Redis stole Discounts
By the time this gets published, it will be some good time after Christmas, but some stories are timeless. Incidents are never a cause for joy, but they hit in a different way when they affect something that normally is associated with ‘wins’. We’ll go over how one such source of joy was ‘killed’ by Redis (spoiler alert: it’s us, hi, we’re the problem) and learnings we took from it such that 20% off burgers are never off the menu again — literally.
Who doesn’t love discounts?
‘Discounts’ is one of the most important and, dare we say, beloved services in our current web of microservices, by customers and staff alike(hello, free deliveries perk!). It’s tasked with providing the available discounts in various screens of the application. Following a user’s journey inside the app, we see discounts displayed: on the front page in the feed, where stores are incentivised to offer attractive discounts that contribute to their ranking and customers choose accordingly, inside a store, where discounts incentivise users to choose certain products and then on checkout, where discounts are being recomputed based on changing product selections and quantities. Since these flows are generated in different parts of the system, they have pretty different reliability and availability characteristics. Checkout(pricing) is, of course, the most business critical, so a place where we can employ retries and have slightly more lenient timeouts to maximize the probability of customer conversion, whereas Feeds sees the most traffic being on the front page — low timeouts are more appropriate here for a fast response.
How do we spread the Discounts love through Redis?
Inside Discounts, we mostly use Redis to cache discounts offered by stores (think a 2x1 for a hamburger, or a 20% discount on a pizza), since the promotion campaigns generating these discounts are generally long-running and don’t change very often.
This rather static nature of discounts, with a very high ratio of reads to writes, makes them a perfect candidate for caching.
We use a managed version of Redis, more precisely Elasticache by AWS, running in cluster mode.
Since discounts are offered at a store level, the key must include the store identifier.
Our first scaling challenge was around serving discounts of up to 4k stores in a single request, since discounts are being taken into account in the feeds ranking algorithm and big cities like Barcelona and Madrid can have thousands of stores. We opted for a Redis multiget here to get all the necessary discounts in one go. This came with its own challenge of still having the client-library make multiple network calls to different shards as not all stores would be found on the same node.
We had to find a way to group stores so they would all be mapped to the same shard. Since the list of stores a user is presented with is dynamic in nature — based on the user’s position, sponsor running promotions etc — the only logical grouping/common denominator for them would be the city. As such, our key became a composite of city and store, with the city representing the key’s ‘hash-tag’, meaning all discounts for stores in a certain city would be found on the same shard and even a 2k multiget would be fetched with a single network call. Easy!
Too much of a good thing can turn into a bad thing
Things worked fine for a while and we were happy with our service performance — a p90 of <20ms with 1k rps at peak times.
Discounts was one of the first services to be extracted out of the monolith, but our consumers were quickly following suit and, as part of their migration, they were shadowing monolith traffic, causing temporary increases(2x) in calls. Since our consumers could now deliver faster, they also saw more adoption from other clients, further contributing to Discounts network calls increases. Finally, we were experiencing Glovo’s own organic traffic increase through more users adoption. Since on every HTTP call, we were making around 4 Redis calls, the celebrations came to an end and went straight into mourning when our service flat-lined.
Our latency had slightly increased over time, but during most business hours, we were still doing great. However, a strange thing happened on a Sunday evening when our high-latency monitors went off, our clients’ circuit breakers opened and discounts stopped being served in any of the normal flows.
No 2-for-1 pizzas to shut off Sunday Blues with a Netflix binge. No 20% off on the healthy shake prepped for starting the week off right at 7AM in the gym. No free delivery for staff :( .. The feed looked empty, devoid of color, much like the town of Whoville after the Grinch stole the decorations. Even worse, previously promised discounts were not actually present (pun intended) on checkout. The Grinch had hit us hard.
Following our runbooks, we temporarily increased our number of database readers, as that had been our bottleneck in the past before caching, but saw no relief. Finally, after tinkering with some feature toggles and redeploying an older version of the application, things seemed to stabilise.
Next week rolled around and looking at our metrics, there was no obvious culprit.
Hunting down for the joy-killer
Database connection pools were not saturated, latencies and CPUs on both database and Redis looked normal. What happened? We saw an increase in the number of threads around the time of the incident, but could not reason whether that was the cause or the consequence of the incident. Application instances were renewed so no way to get thread dumps anymore.
We suspected a connection leak in our service’s SDK after not closing the response in a non-200 response so tweaked our code around it and adopted the new version in all our clients.
During the week we saw no other incidents, nor big increases in latency, but on Sunday, around the same time — but not exactly the same to suspect a rogue poorly batched scheduled task for example — we started seeing the same symptoms.
With the experience from the previous week, the team went for the same sequence of actions to at least mitigate the issue, hoping we would now find the issue and fix it.
With a relatively young team and service, we already knew we had a lot of space for improvement when it came to observability so we spent the next week adding metrics and debugging logs around all our integration points. We were still blind to the root cause, but knew more or less how to mitigate it, so we anxiously waited for next Monday to dig into it again.
The Grinch is Red(is)
Finally, we saw that we were experiencing Redis timeouts, without any big latency increases being reported on the server side, so we focused our efforts on fine-tuning the settings in our client library, Jedis.
Changing from our mostly default Jedis settings, we increased the number of connections in the pool to allow for more traffic at peak times, plus reduced the timeouts and retries to allow for more graceful recovery, should the same thing happen again.
We also added an action to our runbook to bulkhead our most critical business flows, keeping discounts on checkout page and individual store pages, but temporarily stopping from serving them in the feeds screen.
Too many celebrations and one full house
This was still not enough, even with an extra step of upgrading Jedis, but we were reacting faster and zooming closer and closer to our real bottlenecks.
Looking at the Cloudwatch metrics of our AWS Elasticache instances, we saw that in addition to having the ‘Network allowance bandwidth exceeded’ spike multiple times (that was also happening during week days), one of our two shards was also doing a lot more work than the second.
Our problem was two-fold: first we had to reduce the network traffic and traffic spikes, then to better distribute the load between the two shards.
Could we make nice with our Red(is) Grinch?
We approached the first problem at multiple levels: with the added observability in our clients, we saw there were business flows calling our clients that did not need discounts, so we kindly asked the teams owning those flows to update their calls to more specialised endpoints of our clients.
We also added a jitter to our TTLs, to avoid and limit en-masse cache refreshes that would spike our allowance exceeded metrics.
We removed some data that was frequently requested but was well-indexed in our database and didn’t need to be cached on Redis.
We then decided to change our client library from Jedis to Lettuce for two main reasons:
- We could now read from replicas (eventual consistency was fine) in our use-cases and relieve some pressure from our main Redis
- Connections could now be shared between threads, allowing more throughput and limiting the blast radius for one problematic request
Finally, we changed our Elasticache server instance type from a burstable to a non-burstable one to have a better and stable network bandwidth. Word of caution here — always read the fine print on Network bandwidths from AWS docs, as ‘Up to xx Gbps’ is a very different thing from actual baseline network bandwidth.
With all these changes in place to attack the first issue, we then looked at our shard imbalance. Remembering that we were co-locating discounts on the same shard based on cities to be able to use multigets, we predictably ran into a ‘hot-keys’ issue, made worse by the fact that some of our most popular cities, Barcelona & Madrid, ended up on the same shard.
Since we still had limited time and basically running in one-week-sprints, we first went for the shorter mitigation on this issue — logically replicating our hot-keys on all shards.
Basically, for keys we explicitly identified as ‘hot-keys’, say Barcelona in our case, with any write commands coming in, we would write them with their original key, same as before, but then also with a slightly modified key, so the same content for one entry would be written in all shards. On reading a key from Barcelona, our application would randomly choose between the original and modified key(s) mapped to Barcelona, and thus read from any shard, instead of just the one. This did increase the number of writes slightly, but our writes were already the only operations still performed by the main nodes since our separation of reads and writes with Lettuce.
Christmas, err Discounts, is saved!
Finally, with all these changes in place, we saw a reduction towards zero of our ‘allowance exceeded metrics’, 100% of our reads were now being handled by our replica nodes, and the reads between the shards was now closer to 50:50, see graphs below. Next Sunday (and Sundays after it) we saw no more timeouts and the latency remained steady around peak times too.
Our Red(is) Grinch came around and partook in the town celebrations while we saw steady traffic increases with no more side effects on availability. Our work Fridays were enjoyable again and 9PM discounted ice-creams were back in fashion (they really are some of the more popularly ordered products on Sundays in Spain!).
Switching to a more serious note to better drive the points home, here is a recap of some of the most important lessons we got through this long drawn out incident:
Know your tools
We were mostly treating our Redis server as a faster, distributed store than the database and using our client library with default settings. In a closer read of Redis documentation, we found out Redis server keeps connections open forever, so for rogue clients where a connection remains stuck, the only possibility to drop it from server side would be to set a server timeout.
We finally learned which were the default settings for timeouts and retries in our client library, then found out the Spring connector for Jedis we were using to abstract the Redis operations via RedisTemplates did not support replica reads, which led us to another popular client-library, Lettuce, that could do it.
Know your metrics
Digging more into AWS Elasticache docs and the suggested monitoring for it, we found out the CPU% should rather be read(and monitored!) as multiplied by the number of cores, so a 30% utilisation on a 4 core machine would definitely be problematic.
As mentioned before, the ‘Up to xx Gpbs’ AWS network bandwidth from the main docs should be treated as an upper limit that can be sustained for limited time, so the real measure of available network bandwidth is the baseline instead, which is a bit trickier to find.
AWS Cloudwatch metrics tell one side of the story, but equally important is the client-side of things, which could be telling a different story altogether. On this, one should look at adding and monitoring on client-library metrics. This was done through exposing the apache-commons pool Jedis implementation is using in JMX and then exporting these metrics to Datadog. For Lettuce, one can either use the built-in metrics or collect them through Micrometer.
Finally, we can add custom metrics that shed some light on the patterns of usage to answer questions like: how many keys are we usually requesting in our multigets? Are the bigger ones coming from certain business flows?
To this end, your clients are also your clients’ clients, so double-check (with new metrics, if needed) if all business flows reaching your service actually make sense and are needed.
Continuously review data structures & partitioning
Watch out for hot-keys and continuously monitor the distribution in your shards. We solved our imbalance for the moment by duplicating the writes to all shards, but the solution we will be working towards is preventing keys from becoming hot, either by splitting them into smaller chunks or using better Redis data-structures like hash-maps.
Don’t become cache addicted!
A final word of caution is we learned the hard way that one should not cache prematurely and balance the workload put on the caching and database layer to prevent an application from becoming ‘cache-addicted’. The consequences are ensured downtime on any issues to do with the caching layer, from network issues that are impossible to prevent, to human errors like serialization problems that can lead to a spike of cache misses.