Our first Redis Nose Day

Comic Relief Engineering
Comic Relief Technology
Apr 3, 2017

Before the big day this year, we changed a few of the services that support our Red Nose Day giving pages apps.

One of the more interesting challenges was finding a good home for all our caches and session data. In many cases, we found a Redis service worked well with our Cloud Foundry apps, especially when data has to be shared.

The move to Redis wasn’t entirely seamless, but it worked well for us on the night, and has taught us lots. This is a quick summary of what we’ve found so far.

A service within Cloud Foundry has great automatic benefits

We got some easy wins by using the Redis service that Pivotal kindly gave us use of.

  • There’s no service for us to deploy. Provisioning user-provided services for your Cloud Foundry apps is inevitably more work.
  • Services fit our Cloud Foundry spaces automatically. We get an instance per space without juggling extra configuration, with credentials handled automatically and less scope for human error (there’s a short sketch of reading those credentials after this list).
  • By living inside PWS the Redis service has been privately networked from the start, simplifying the process of keeping our data secure. With external services, we have to give encryption and authentication additional thought.
  • Because the services are all managed and deployed in an identical way, with configuration kept close to the platform, we can be more confident that each instance will behave the same.
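
To make those last two points concrete, this is roughly what consuming a bound instance looks like from inside an app. It’s a minimal Python sketch rather than our actual code, and the service label and credential field names are assumptions: Cloud Foundry injects credentials through the VCAP_SERVICES environment variable, but the exact shape depends on the broker.

```python
import json
import os

import redis  # redis-py client; any language's Redis client works the same way


def redis_from_vcap():
    """Build a Redis client from the credentials Cloud Foundry injects.

    VCAP_SERVICES is the JSON blob the platform sets for every bound
    service instance. The 'host'/'port'/'password' field names below
    follow common Redis brokers and may differ for other offerings.
    """
    services = json.loads(os.environ["VCAP_SERVICES"])
    for label, instances in services.items():
        if "redis" not in label.lower():
            continue
        creds = instances[0]["credentials"]
        return redis.Redis(
            host=creds["host"],
            port=int(creds["port"]),
            password=creds.get("password"),
        )
    raise RuntimeError("no bound Redis instance found")
```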

Shared services have caveats too

Similar behaviour from every instance has a flip side. Every Redis deployed in the standard way currently has the same default virtual machine size and set of configuration options — although these are both things which will become more flexible soon with improvements to the platform.

We know our traffic has huge spikes coming up to a campaign, so we take load testing pretty seriously. It was when we threw a few tens of thousands of simulated donations at our new apps that we realised Redis was configured to keep everything forever, and as a result didn’t handle running out of space very well.
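
For illustration, this is roughly how that failure shows up from a client’s point of view: with maxmemory-policy left at its noeviction default, Redis rejects writes once it reaches its memory limit rather than evicting older cache entries. A small Python sketch using the redis-py client (our apps aren’t written in Python, and the connection details are placeholders):

```python
import redis
from redis.exceptions import ResponseError

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

# The setting that bit us: 'noeviction' keeps everything forever.
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'noeviction'}

try:
    r.set("cache:page:giving-home", "...")
except ResponseError as err:
    # Once used memory passes 'maxmemory', writes fail with an OOM error
    # instead of old cache entries being evicted.
    print("write rejected:", err)
```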

The bad part of this is we couldn’t easily change the service configuration for our Redis instances. We talked about a custom deployment of the service, but this was hard to do in time and would mean more unpredictable deviation from the standard service. We also couldn’t expect instances to always be long-lived, as they can get swapped around during platform maintenance. Again, this is something which will become more predictable soon with further PWS improvements.

Even before these platform changes, a great thing about Cloud Foundry is that adding apps is ‘cheap’, and this can make working around this kind of thing in the short term much simpler. We were easily able to put together a single-purpose utility app that updates our configuration whenever it doesn’t match the cache eviction policy we need, i.e. whenever the service instance is new.

Every 2 hours we just spin up this app and bind it to every Redis we care about. This way we know we’ll never go more than 2 hours without our cache evicting as we’d like. We wrote the bulk of this app in less than a day. You can see its source here to get an idea what’s involved in bootstrapping something like this — the key logic is only a few lines.
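
For a flavour of what that key logic involves, here’s a minimal Python sketch of the same idea rather than an excerpt from the real app: walk every Redis instance bound to the utility app via VCAP_SERVICES, and reset the eviction policy if it has drifted back to the default. The allkeys-lru choice and all the names here are illustrative assumptions.

```python
import json
import os

import redis

# Illustrative choice: evict least-recently-used keys across the whole keyspace.
DESIRED_POLICY = "allkeys-lru"


def bound_redis_clients():
    """Yield a client for every Redis instance bound to this utility app."""
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for label, instances in services.items():
        if "redis" not in label.lower():
            continue
        for instance in instances:
            creds = instance["credentials"]
            yield redis.Redis(
                host=creds["host"],
                port=int(creds["port"]),
                password=creds.get("password"),
            )


def main():
    for client in bound_redis_clients():
        current = client.config_get("maxmemory-policy")["maxmemory-policy"]
        if current != DESIRED_POLICY:
            # Only a freshly provisioned (or rebuilt) instance should hit this.
            client.config_set("maxmemory-policy", DESIRED_POLICY)


if __name__ == "__main__":
    main()
```

Binding the utility app to each instance is what gives it the credentials it needs; how you trigger it every couple of hours is up to whatever scheduling your platform or CI tooling provides.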

Finally, essential maintenance timing wasn’t fully under our control. Pivotal worked with us to avoid major disruption around Red Nose Day, but some disturbance from maintenance is hard to avoid with this kind of shared infrastructure. In addition to Pivotal’s other options like single-tenant Cloud Foundry on PWS-E, there are changes planned on PWS Redis that will also reduce unexpected downtime. So we’re confident we can find a reliable option within the Cloud Foundry ecosystem for Sport Relief 2018, where giving pages are likely to be used even more.

Coping without it

Maintenance on the Redis service did get us thinking more carefully about what we actually need from our Redis instances, and how we should react if they’re down. This got us a few technical wins and more resilience than we had even with our old infrastructure and memcache.

  • We now use more local caching (e.g. APCu) where it does the job and is faster than a shared cache.
  • We got rid of sessions where possible, like in our payment handling app. Session logic here was saving us a few lines of code, but also added a big service dependency with no real functional gain.
  • We handle outages much better now. Previously even pages with no need for sessions would crash spectacularly if Redis couldn’t bootstrap.
  • We built a session-free fallback for our main donation journey. We want sessions here when things are working smoothly, because they help us reload data that sponsors would otherwise lose after a payment failure. But now if Redis is out, we automatically fail over to a version that can’t provide this extra feature but still works the same for most donations (there’s a sketch of this fallback below).
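
A rough sketch of that last fallback, again in Python rather than our own code, with stub journey functions standing in for the real ones:

```python
import json

import redis
from redis.exceptions import ConnectionError, TimeoutError

# Placeholder connection details; on Cloud Foundry these come from VCAP_SERVICES.
session_store = redis.Redis(host="localhost", port=6379, socket_timeout=0.5)


def handle_donation(form_data):
    """Serve the donation journey, degrading gracefully if Redis is down."""

    def with_session():
        # Happy path: keep a short-lived session so sponsors can recover
        # their details after a payment failure.
        session_store.setex("session:" + form_data["id"], 3600, json.dumps(form_data))
        return "session-backed journey"

    def stateless():
        # Degraded path: no recovery after a failed payment, but the
        # donation itself still goes through for the common case.
        return "session-free fallback journey"

    try:
        session_store.ping()  # cheap availability check
        return with_session()
    except (ConnectionError, TimeoutError):
        return stateless()
```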

Moving forward

Using Redis for this Red Nose Day has been a great step towards better understanding the relative benefits of Cloud Foundry native services vs. external ones. It’s important to think about how much control you want, and for future apps we’ll be thinking about our service requirements in more detail, and earlier.

We’re also excited that Pivotal Cloud Foundry Redis will soon be getting major improvements in configuration flexibility and how updates are deployed. This will mean increased reliability, and also make some of our workarounds redundant, allowing us to keep our apps simpler.

Having successfully used both types of service this time around, we’re confident that Cloud Foundry gives us the flexibility to keep making the right call for each app as we progress.
