AWS Summit London case study: Going Serverless at Comic Relief

This Serverless deep dive talk from the AWS Summit London was led by Danilo Poccia, an AWS evangelist for Serverless, followed by the Comic Relief case study of our experience implementing Serverless. I wanted to share the video along with some additional notes covering things I either forgot to say at the time or that Twitter questions revealed weren’t clear enough — hope you enjoy!

This talk will be covering the journey that Comic Relief has undertaken to adopt Serverless technologies as our default.

First, some context. Comic Relief is a charity with the aim of creating a just world free from poverty. Our flagship event is Red Nose Day — a sort of national takeover each March, when the country comes together to raise money and awareness for the projects we fund, which tackle some of the biggest issues of our time.

The enthusiasm of the public for our campaigns each year is incredible, but it also comes with some challenges. Our traffic is predominantly driven by broadcast moments on the BBC, and as a team we can never be sure what the response to an on-air call to action will be. Where other e-commerce products have to manage considerations like stock availability, there is no limit to the number of donations we can take. Conversely, where a festival ticket site can keep users refreshing for hours to pay for their product, users who wish to donate do so in a highly emotive and impulsive moment — if their donation experience is disrupted, that impulsive moment may be lost.

Our annual campaign culminates in a 7-hour TV show, which is the main driver for donations from the public. During the show, requests to our platforms can go from zero to tens of thousands in a minute — this year we peaked at 350 donations per second, but our systems are designed to handle significantly more than that.

Scaling challenges — this graph shows our spikes in transactions over 4 hours of the TV show. Across our 3 donation mechanisms (online, SMS and call centres) we saw a peak of 350 donations in a single second.

So, when asked “why serverless” — there’s an obvious answer around our inconsistent and unpredictable traffic, scaling when we need to scale and only ever paying for the hosting we use. But that wasn’t actually our motivation for trying out serverless.

To set the scene, let’s start with our 2016 campaign legacy tech stack — Before Serverless.

2016 consumer facing tech stack — one monolith and two independent products.

Our products could be relatively neatly divided into “the website” — a Drupal 7 monolith which handled almost everything an end user could want to do with Comic Relief — static web content, taking payments, SMS gift aid declarations, collecting images of fundraisers and a bunch of other stuff. The other 2 products which escaped the Drupal 7 platform were our Giving Pages — where users could set up accounts and fundraising pages — and our “Night of TV” donation platform which was a separate application due to its high availability and resilience requirements which you’ll hear a bit more about later.

2017 consumer facing tech stack — introduction of the Drupal 8 platform for CMS and microservices for pay-in fundraising and SMS giftaid. First serverless application the fundraiser gallery.

After the 2016 campaign, our team started on a Drupal 8 platform — I know, you’re at a serverless talk, why do I keep saying Drupal? Please bear with me — this platform was designed to be a flexible and modular CMS. One critical point to make here is that we decided to decouple products which collect user data from the Drupal platform. We built separate applications for paying in fundraising and gift aid declarations — both of which were built in Symfony — so still not serverless.

In the 2017 campaign we delivered our first serverless application to end users. It came about due to the requirement for a fundraiser gallery, which had been continually de-prioritised until suddenly we had a very real requirement to deliver — fast. Luckily, we had a frontend engineer who had been playing around with React at the weekends and an engineering lead hankering to try out this new serverless thing. The application was simple: a form where users could add up to 5 images, enter their email address and tick a checkbox giving us permission to use the images in a gallery on our website. We used Cloudinary as the quickest way to get a product spun up, and within 2 weeks we had a viable application. We popped it in an iframe on a content page of the Drupal site, and thousands of users shared their photos through it.

2018 consumer facing tech stack — serverless applications for discrete one-off product requirements.

The next step for us was to fully retire the Drupal 7 platform and, in doing so, we needed a new way to deal with the contact us form, another piece of our legacy monolith that needed upgrading. It also meant getting more of our team adopting the Serverless Framework, React for our frontend and NodeJS for the backend. The speed of delivery was great, and this combination became our go-to for delivering one-off discrete products that we didn’t feel too precious about. So, by the 2018 campaign, we had a fundraising gallery, a competition to design the next red nose, the contact form and a calculator for teachers to add their school’s steps, all running serverless.

Shared services across products allowed for rapid reusability.

By this point, we’d managed to identify quick wins in reusability, so functions like the serverless mailer for triggering transactional emails and our postcode lookup service were already being shared across products — even products which weren’t built on the serverless framework. Our frontend components were all being developed in Storybook — another way to save your engineering team from repeating themselves.

Identifying elements of your ecosystem which never interact with each other is a good place to start when moving to serverless. Understanding what can function independently is as important as understanding your dependencies.

You may have noticed that what we’d built at this point weren’t really our business-critical products. I think that’s one of the most important lessons we’d hope to pass on — these weren’t our flagship products. They weren’t the products that if something went wrong we’d risk losing millions of pounds for our charity. But they were the right products for us to learn and test out different tooling before setting our sights on the higher profile products.

2019 consumer facing tech stack — Donate and SMS Gift Aid moved into serverless and more reusable services such as the marketing preferences and payment service layer.

By Red Nose Day 2019, we had brought the big money products into our serverless stack too — the gift aid form and the online donation application. I wish I could say that in 2016 we had what now looks like a well-thought-through strategy to move our tech stack to serverless — but if you look back at our technology strategy from that time, the terms you’ll actually see are “Reusability” and “Reduced Costs” — and the best technology solution we’ve found to meet those strategic goals has been serverless.

Architecture diagram of the new donation platform (link for Cloudcraft diagram here)

To give an idea of just how fast serverless development is, the first code for our new serverless payments system was committed to GitHub on April 29th 2018, and we had the new donation platform live in production in the week commencing September 4th 2018. With a team of 2–3 engineers, we were able to take the learnings from the previous platform and build parity with it in terms of donations per second, with a much leaner, cleaner codebase.

Getting the product released early meant we could extensively test and iterate on the solution. When working with abstracted functions and services, you’re also able to refactor your code as quickly as you write it, so no one gets precious about their masterpiece. We worked with the Armakuni team, NCC and AWS Well-Architected to challenge our approach and identify any vulnerabilities which weren’t covered in our new system.

Old vs new architectures for our donation platform

Now, onto the donate product. To understand our platform, it’s good to understand the system it was replacing. Our previous donation platform was built by Armakuni, a third party that specialises in high resilience and high redundancy systems. To ensure 100% uptime, Armakuni built an application that was hosted multi-cloud and multi-region and could take up to 300 donations per second. It had also been one of our largest tech investments and never failed us in 6 years of Night of TV. So, needless to say, we had some big shoes to fill.

We knew we wanted to strive for functional parity with the Armakuni platform. Resilience and redundancy were areas we really needed to focus on — it was uncharted water for us. To handle resilience in the past, the Armakuni platform had an always-on suite of Redis datastores across multiple clouds to handle high-volume moments, but this wouldn’t be an option for us in the serverless world.

Instead, our resilience was built in through redundancies across multiple services — if RDS failed then we had SQS; if SQS failed then we had S3 backups. We built circuit breakers into every step of the application, so if a function fails, the transaction isn’t lost and the function can automatically retry.
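As a rough illustration of that pattern — the names and structure here are hypothetical, not our actual codebase — the redundancy chain can be sketched as a list of stores tried in order, so a donation is only ever at risk if every layer fails at once:

```javascript
// Hypothetical sketch of the redundancy chain described above: try the
// primary store first (e.g. RDS), fall back to a queue (SQS), then to a
// last-resort backup (S3). Each store is injected as an async function.
async function persistWithFallback(record, stores) {
  const failures = [];
  for (const [name, store] of stores) {
    try {
      await store(record);
      return { storedIn: name, failures };
    } catch (err) {
      // Record the failure and fall through to the next layer.
      failures.push({ store: name, message: err.message });
    }
  }
  // Every layer failed — surface the errors so a retry can be scheduled.
  throw new Error(`All stores failed: ${JSON.stringify(failures)}`);
}
```

Because the stores are injected rather than hard-coded, the same function covers the RDS→SQS→S3 chain in production and simple stubs in tests.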

AWS costs from 2013–2019 — comparing March 2015 (peak cost) with March 2019 (serverless cost)

One of the biggest wins for us was reducing our AWS hosting costs by 93% compared with 2015 — and that figure includes our remaining non-serverless architecture.

Something I didn’t get to say on the day: it’s worth pointing out that these aren’t necessarily apples-to-apples comparisons — every campaign will have different hosting requirements. One Twitter comment I enjoyed after the talk suggested that we couldn’t have been well optimised in the first place. That’s probably true, but it also clearly shows I didn’t emphasise the requirement for resilience well enough. Comic Relief receives almost all public donations within a 7-hour window once a year — downtime for our applications during that period would cost us millions of pounds in donations. What’s changed is our attitude towards, and trust in, technology partners like AWS for our big events!

Actual costs for our donation serverless platform during the TV show for 2019 — $92

The best thing about it, in my opinion, is that we’ve got a system that could be slammed with those levels of traffic any day of the week — not just for a one-off event each year.

What you can take away from this talk

4 things we learnt along the way

#1: Identify your limiting factors.

When you move to serverless infrastructure, you’ll be pushing problems downstream. An example of this for us is Payment Service Providers (PSPs) — we integrated with multiple PSPs for resilience in our platform. Each provider has limits which peak broadcast moments can exceed. We built simple functions for each PSP, so if a PSP struggles under load, we can limit the number of transactions that go through it. Your serverless products will inevitably live in a hybrid ecosystem — recognise where your limits are and find solutions that mitigate those risks.
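To make the idea concrete, here’s a minimal sketch — the provider names and limits are invented for illustration, not our real configuration — of routing each transaction to the first PSP that still has headroom:

```javascript
// Hypothetical sketch: route a transaction to the first provider that is
// still under its per-second limit, spilling over to the next one.
function choosePsp(providers, countsThisSecond) {
  for (const provider of providers) {
    const used = countsThisSecond[provider.name] || 0;
    if (used < provider.maxPerSecond) {
      return provider.name;
    }
  }
  // Every provider is at its limit — queue the transaction and retry.
  return null;
}
```

The null case matters as much as the happy path: when every provider is saturated, the transaction goes onto a queue rather than being rejected, so that impulsive donation moment isn’t lost.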

#2 Take advantage of rapid reusability.

When you change your mindset to functions-first, you can create a microservice out of anything. I’ve already covered some of the reusability of our services and Storybook, but I also want to call out the Lambda Wrapper, which allows boring, repetitive code to be automatically and consistently applied to all functions so you can deliver new functions faster. Our movement towards services was really born out of engineers identifying when work was duplicated and taking responsibility for isolating and abstracting reusable code.
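The wrapper idea boils down to a higher-order function. This is only a minimal sketch of the pattern — the real Lambda Wrapper’s API will differ — but it shows how shared concerns get applied to every handler once, so each new function is just its business logic:

```javascript
// Minimal sketch of the "lambda wrapper" pattern (not the real API):
// input parsing, error handling and response shaping are applied
// consistently to every handler we wrap.
const wrap = (handler) => async (event) => {
  try {
    const body =
      typeof event.body === 'string' ? JSON.parse(event.body) : event.body;
    const result = await handler(body, event);
    return { statusCode: 200, body: JSON.stringify(result) };
  } catch (err) {
    console.error(err); // a real wrapper would also report to monitoring
    return { statusCode: 500, body: JSON.stringify({ error: 'Internal error' }) };
  }
};

// A new endpoint then needs only its business logic:
const hello = wrap(async (body) => ({ message: `Hello ${body.name}` }));
```

Change the wrapper once — say, to add structured logging — and every function in the stack picks it up on the next deploy.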

#3 Use monitoring for optimization, not just catching problems.

Just a quick point on observability — we chose IOpipe and Sentry, which enable us to identify and quantify the impact of issues really quickly. We know, down to the function, what the problem is and who it’s impacting. Combined with Grafana dashboards, this gives us a really clear view of the application’s performance, and we can make changes to optimize experiences in real time — for example, pushing more traffic through the highest-converting PSP.

#4 Use serverless for load testing.

This can be done no matter which stage you’re at in your serverless discovery. Serverless has been really cost-effective for us in general, but I think one of the places where those savings are best realised is in our load testing. Previously, we would spend a couple of thousand pounds each campaign just on load testing — now it’s a totally negligible cost. We’re able to run drip, slam and ramp tests using Serverless Artillery, and load testing is democratized throughout the team so anyone can easily test their work under load. It’s also a tool you could use without any of your infrastructure in serverless, so why not give it a go?
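For a sense of what those test shapes look like, here’s an illustrative Artillery-style script — the target URL and numbers are made up, not our real configuration — covering a drip, a ramp and a slam in one run:

```yaml
# Illustrative Artillery script (the format Serverless Artillery runs
# from Lambda). Target and rates are examples only.
config:
  target: "https://staging.example.com"
  phases:
    - duration: 600
      arrivalRate: 1        # drip: a slow, steady trickle
    - duration: 120
      arrivalRate: 10
      rampTo: 250           # ramp: steadily increasing load
    - duration: 60
      arrivalRate: 500      # slam: a sudden spike, like a TV moment
scenarios:
  - flow:
      - get:
          url: "/"
```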

We couldn’t have done it without…

I’m just going to finish up on a couple of things that we really wouldn’t have succeeded without.

Inquisitive engineers. Working with a team who are happy to learn new skills and try out new tools has been fundamental to our success. Not everyone has this inquisitive nature, but we’re really lucky that our relatively compact team were willing to work outside their comfort zones.

Great tech partners. AWS were a natural choice for us moving into the serverless world — not only because they’re market leaders, but also because we’re able to pick and choose which parts of their offering work for us. Special thanks to Jason Thoday, who is always able to open conversations for us about what’s new at AWS and how it can fit with Comic Relief’s goals and direction.

Trust from your organisation. This is something our team has built up over the past few years. Building trust from senior management, and capability within our team, on the smaller products first allowed us to make pretty drastic changes over a relatively short period of time. It sounds simple, but the ability to experiment and explore is actually one of the most valuable things you can give to your product teams.

Thanks so much for listening (well, reading) — if you want to learn more details about what we’ve done and how we’ve done it, check out the Comic Relief tech blog, and if you want to see our much-boasted-about donation platform, please head straight to comicrelief.com/donate and test us out! Thank you.