Too Fast For Our Own Good

Technically, our site as a whole isn’t as fast as it used to be according to our New Relic monitoring, but certain parts of our new architecture at Bluestem Brands are substantially faster than they used to be. Specifically, our broad use of Redis as a cache, session store and all-around very fast hammer… except when it isn’t.

Redis And Grails

I have been a contributor to the Grails Redis plugin and am now its primary owner and maintainer, having converted it to support Grails 3. During that conversion I had a fleeting thought that I should refactor the usage of the KEYS command in the redis service, which looks like the code below. Take note of the comment on the method. Apparently someone could see into the future.
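For context, the method in question is roughly the sketch below (reconstructed from memory of the plugin’s RedisService, so the name and the exact comment wording are approximate; withRedis is the plugin helper that hands you a redis.clients.jedis.Jedis connection):

    // Approximation of the plugin's delete-by-pattern helper. The comment
    // on the real method already warns that KEYS is an expensive operation.
    def deleteWithPattern(String keyPattern) {
        withRedis { Jedis redis ->
            // KEYS walks the entire keyspace looking for matches (blocking, O(N))...
            Set<String> keys = redis.keys(keyPattern)
            // ...then everything that matched is removed in a single DEL
            if (keys) {
                redis.del(keys as String[])
            }
        }
    }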

Session & Cache Invalidation

We put a lot of keys in cache for a user, and from time to time they do something that requires invalidating a whole bunch of data. We were very clever and designed our key structure so that clearing all of a customer’s cache was as simple as saying “delete all keys matching the pattern [systemId:customerId:*]”. That meant we could just use the method above to purge all of a customer’s keys. I felt pretty proud of this design and it worked really well… for a time.
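To make that concrete, the usage looks something like this (the key suffixes and variable names here are illustrative, not our actual schema):

    // Keys are namespaced as "<systemId>:<customerId>:<suffix>", e.g.
    //   SITE1:123456:cart
    //   SITE1:123456:profile
    //   SITE1:123456:recentlyViewed
    // so purging everything for one customer is a single pattern delete:
    redisService.deleteWithPattern("${systemId}:${customerId}:*")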

Our System Started Slowing Down

The more features we rolled into our new site and converted from our legacy system, the slower the system got. At first we chalked this up to some inherent slowdown in using Grails and Groovy versus our previous Spring and Java stack. That is, until we started seeing massive slowdown spikes on the site that we couldn’t really attribute directly to the framework.

We did notice that from time to time, and not on any predictable interval, there would be spikes in our calls to get content from Redis. I connected directly to the read-only slaves and all my GET commands returned in sub-millisecond times. Basically on a dare, I also ran against the master, and all of a sudden every Nth call would take 200, 300 or even 1000 ms. Silver bullet, but WTF?!
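The ad hoc check I was doing amounted to something like the probe below, using Jedis directly (host names and the key are made up; this is just a quick latency probe, not our monitoring code):

    import redis.clients.jedis.Jedis

    // Hammer one key with GETs and print any calls that come back slow.
    def probe = { String host, String key, int iterations = 1000 ->
        Jedis jedis = new Jedis(host, 6379)
        try {
            iterations.times { i ->
                long start = System.nanoTime()
                jedis.get(key)
                long elapsedMs = (System.nanoTime() - start).intdiv(1000000)
                if (elapsedMs > 100) {
                    // these were the 200, 300, even 1000 ms spikes on master
                    println "call ${i} took ${elapsedMs} ms"
                }
            }
        } finally {
            jedis.close()
        }
    }

    probe('redis-slave.internal', 'SITE1:123456:cart')    // sub-millisecond, stays silent
    probe('redis-master.internal', 'SITE1:123456:cart')   // prints the occasional spike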

When Redis Is Slow

Several months after our initial launch, after many, many site visits and broad use of the cache, we now have millions and millions of keys in our Redis cluster. Based on all my understanding and usage of Redis, this seems like a small drop in the bucket compared to other people’s usage (see Instagram).

I spent a few minutes googling things like “slow redis”, “bad redis performance” and “wtf redis”. It was actually when I was RTFM that I found the SLOWLOG GET command. I ran it and all the results were things like KEYS FOO:*, KEYS BAR:*, etc., and they were taking many hundreds of milliseconds to run. On top of that, KEYS blocks the server while it scans the entire keyspace. It may be fast enough when there isn’t a huge load, but we are running about 5,000 commands/sec, and those key scans put a huge strain on the system.
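For what it’s worth, the slow log can also be pulled through Jedis rather than redis-cli; something along these lines (a sketch against the Jedis 2.x slowlog API, connection details made up) is what surfaced all of those KEYS entries:

    import redis.clients.jedis.Jedis

    // Ask the server for its ten slowest recent commands.
    // SLOWLOG reports execution time in microseconds.
    Jedis jedis = new Jedis('redis-master.internal', 6379)
    try {
        jedis.slowlogGet(10).each { entry ->
            long millis = entry.executionTime.intdiv(1000)
            println "${millis} ms -> ${entry.args.join(' ')}"   // e.g. a KEYS FOO:* call taking hundreds of ms
        }
    } finally {
        jedis.close()
    }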

Short Term Fix

The easy solution was to only allow deletes by actual key, not by pattern, and only one key at a time. So at the cost of iterating over a pre-generated set of keys, which takes a few milliseconds, we now do key deletes using the actual Jedis delete command rather than scanning for and iterating over the matching keys first.
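The short-term version ends up looking roughly like this (a sketch; the service, key suffixes, and helper names are hypothetical, with redisService being the plugin’s injected bean):

    import redis.clients.jedis.Jedis

    class CustomerCacheService {
        def redisService   // injected by the grails-redis plugin

        // Build the customer's keys from the known layout instead of
        // asking Redis to find them with a KEYS scan.
        private List<String> customerKeys(String systemId, String customerId) {
            ['cart', 'profile', 'recentlyViewed'].collect { suffix ->
                "${systemId}:${customerId}:${suffix}".toString()
            }
        }

        void purgeCustomer(String systemId, String customerId) {
            redisService.withRedis { Jedis redis ->
                customerKeys(systemId, customerId).each { key ->
                    redis.del(key)   // DEL on a known key is O(1); no keyspace scan
                }
            }
        }
    }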

Long Term Fix

The deleteWithPattern operation is actually pretty useful with a good key pattern design, so I will be moving the Grails plugin to use a SCAN operation instead of KEYS so that pattern deletes no longer block the rest of our Redis traffic.
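The SCAN-based replacement will look something like the sketch below (written against the Jedis 2.x scan API; cursor handling differs slightly across Jedis versions):

    import redis.clients.jedis.Jedis
    import redis.clients.jedis.ScanParams
    import redis.clients.jedis.ScanResult

    // Delete keys matching a pattern using SCAN, which walks the keyspace
    // incrementally in small batches instead of blocking the server the
    // way a single KEYS call does.
    def deleteWithPattern(String keyPattern) {
        withRedis { Jedis redis ->
            ScanParams params = new ScanParams().match(keyPattern).count(500)
            String cursor = ScanParams.SCAN_POINTER_START   // "0"
            while (true) {
                ScanResult<String> batch = redis.scan(cursor, params)
                List<String> keys = batch.result
                if (keys) {
                    redis.del(keys as String[])   // delete this batch in one DEL
                }
                cursor = batch.stringCursor
                if (cursor == ScanParams.SCAN_POINTER_START) {
                    break   // the server hands back "0" when iteration is complete
                }
            }
        }
    }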

The System Today

We saw system performance improve by 50% (300 ms down to 150 ms) once we cleared up the KEYS issue. We are still about twice as slow as the old system (~75 ms vs ~150 ms), but in my opinion the benefits we gain in extensibility and the ability to serve any number of multi-tenant sites from a single platform are worth some cost in JVM performance, especially when the difference is measured in tens of milliseconds. The real performance issue is the ~10 seconds it takes to render the DOM with all those 3rd-party scripts we have on our site. That, however, is a completely different issue that we are making progress on every day.