The Journey to 90% Serverless at Comic Relief

The 90% figure is plucked out of thin air — it’s just me trying to pick a really big number without accurately calculating a percentage

Adam Clark
Mar 21, 2019

Since its launch in 1988, Red Nose Day has become something of a British institution. It’s the day when people across the land can get together and raise money at home, school and work to support vulnerable people and communities in the UK and internationally.

Now that Red Nose Day 2019 is over, I want to give you a rundown of my learnings, the tooling we used and the path we took on what turned out to be nearly a full cloud migration from a containerised/EC2 ecosystem into the world of serverless.

Upon starting at Comic Relief in April 2017, I joined a well-developed team that was already working in the cloud-native microservice world, with some legacy products (such as the website) operating on a fleet of EC2 servers and the rest either running on Pivotal Web Services (Cloud Foundry) or outsourced (the donation platform).

Comic Relief was generally using Concourse CI (an awesome tool) to deploy to Cloud Foundry, and Jenkins for its legacy infrastructure. We frequently used Travis CI to run unit tests and some code style checks, but have since moved over to CircleCI, thanks to a thorough comparison by Carlos Jimenez.

Nearly all of the code was written in PHP, using either Symfony or the Slim Framework. I had quite a bit of experience with NodeJS from my previous role at Zodiac Media, and was a big fan of running the same language on the frontend as on the backend and not having to context-switch when swapping between the two.

Pretty much the first job that I was tasked with on arrival was to bring the contact form, which was part of the comicrelief.com Drupal site, into the serverless world.

The project was my first introduction to Lambda and the Serverless Framework, and also my first lesson in what works well in serverless. I created a serverless backend in NodeJS that accepted a POST request and forwarded it on to RabbitMQ.
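To make that concrete, here is a minimal sketch of that kind of handler: an API Gateway-triggered function that takes the POSTed body and pushes it onto a RabbitMQ queue using amqplib. The environment variable names and the 202 response are my own illustration, not the actual contact service code.

```javascript
// handler.js — sketch of a contact-form style backend: accept a POST from
// API Gateway and forward the body to a RabbitMQ queue.
// AMQP_URL and QUEUE_NAME are illustrative environment variable names.
const amqp = require('amqplib');

module.exports.submit = async (event) => {
  const payload = JSON.parse(event.body || '{}');

  const connection = await amqp.connect(process.env.AMQP_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue(process.env.QUEUE_NAME, { durable: true });

  channel.sendToQueue(
    process.env.QUEUE_NAME,
    Buffer.from(JSON.stringify(payload)),
    { persistent: true }
  );

  await channel.close();
  await connection.close();

  // Tell the frontend the message has been accepted for processing.
  return {
    statusCode: 202,
    body: JSON.stringify({ status: 'queued' }),
  };
};
```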

My colleague Andy Phipps (Senior Frontender) created a frontend in React, which we dropped into S3 with a CloudFront distribution in front of it. We then built a Concourse CI pipeline that ran sls deploy, ran some basic Mocha tests against the staging deploy and then deployed to production, and did much the same for the frontend.
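The staging tests were nothing exotic; something along the lines of this Mocha smoke test is enough to gate the production deploy (the STAGING_URL variable, the endpoint path and the expected status are hypothetical):

```javascript
// test/contact.staging.test.js — a smoke test the pipeline might run against
// the staging deploy before promoting to production.
const assert = require('assert');
const fetch = require('node-fetch');

describe('contact service (staging)', () => {
  it('accepts a valid submission', async () => {
    const res = await fetch(`${process.env.STAGING_URL}/contact`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        name: 'Test User',
        email: 'test@example.com',
        message: 'Hello',
      }),
    });

    // Any non-error response means the deploy is alive and wired up.
    assert.ok(res.status < 400, `unexpected status ${res.status}`);
  });
});
```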

It was a liberating experience and became the foundation for everything that we would do after.

The contact service backend

The next thing we wanted to do was to send an email from the contact service to our email service provider. I created another Serverless service that accepted messages either from RabbitMQ or via HTTP, formatted them for our email service provider and forwarded them on.
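In sketch form, the mailer pattern looks something like this; the message shape and the sendToProvider stub are placeholders for whatever the email service provider's client actually needs:

```javascript
// mailer.js — take a message (from a queue event or an HTTP POST), normalise
// it into the shape the email service provider expects, and forward it on.
module.exports.send = async (event) => {
  // Queue-triggered invocations carry Records; HTTP invocations carry a body.
  const messages = event.Records
    ? event.Records.map((record) => JSON.parse(record.body))
    : [JSON.parse(event.body || '{}')];

  for (const message of messages) {
    const email = {
      to: message.email,
      templateId: message.template,
      substitutions: message.data || {},
    };
    await sendToProvider(email);
  }

  return { statusCode: 200, body: JSON.stringify({ sent: messages.length }) };
};

// Stand-in for the real email service provider client.
async function sendToProvider(email) {
  console.log('forwarding to ESP', JSON.stringify(email));
}
```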

We then realized that we were using the same mailer code in our Symfony fundraising paying-in application, so we stripped out the code from payin and pointed it to our new mailer service. At this point, I started to realize that the majority of web development was some mix of presenting data and handling forms.

The mailer pipeline

We then halted active development and steamed into Sport Relief 2018; this allowed us to test our assumptions about Serverless and gain some real-world experience under heavy load.

The revelations included:

  • CloudWatch is a pain to debug things with quickly, but it is a good final source of truth.

We then went into a significant business restructure, which meant that a lot of our web-ops team ended up leaving. The result was a remit to simplify the infrastructure that we were using so that a smaller group could manage it, bringing ownership, responsibility and control to the development team. The approach was championed by our Engineering Lead Peter Vanhee; without that, where we are now would never have happened.

The next obvious target was our Gift Aid application; the most basic description of it is a form where users submit their details so that we can claim Gift Aid on SMS donations. The traffic hitting this application is generally very spiky off the back of a call to action on a BBC broadcast channel, ramping up from zero to tens of thousands of requests in a matter of seconds.

We traditionally had a vast fleet of application and Varnish servers backing this (150+ servers on EC2). As one of our most significant revenue sources, it gets a lot of traffic in the five hours that we are on mainstream TV and in the lead-up to it, so there was very little wiggle room to get it wrong.

At this point, we diverged from RabbitMQ and started deploying SQS queues from serverless.yml. My colleague Heleen Mol built a React app using create-react-app and react-router, hosted again on S3 with CloudFront in front of it. This was, and still is, the foundation of every public-facing application we make: it can handle copious levels of load and takes zero maintenance.
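Swapping RabbitMQ for SQS keeps the handler just as small. A rough sketch, assuming the queue URL is injected as an environment variable from serverless.yml (the variable name here is made up):

```javascript
// handler.js — same "accept a POST, queue the payload" pattern as before,
// but pushing onto an SQS queue instead of RabbitMQ.
const AWS = require('aws-sdk');

const sqs = new AWS.SQS();

module.exports.submit = async (event) => {
  await sqs
    .sendMessage({
      QueueUrl: process.env.QUEUE_URL, // illustrative; wired up via serverless.yml
      MessageBody: event.body || '{}',
    })
    .promise();

  return { statusCode: 202, body: JSON.stringify({ status: 'queued' }) };
};
```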

The Gift Aid Site

At this point, it was apparent that we needed a good way to document our APIs alongside the code. We had previously been using Swagger on our giving pages; however, it seemed a bit of a pain to set up, and I wanted something static that could be chucked into S3 and forgotten. We settled on apiDoc as it looked quick to integrate and was targeted at RESTful JSON APIs. It mainly fulfilled my requirement of being able to set up a proof of concept within an hour.

Comment Block from Payments API
apiDoc documentation from Payments API
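For anyone who hasn't used it, apiDoc builds its static documentation from comment blocks that sit directly above the handlers. A hypothetical example (not the real Payments API) looks like this:

```javascript
/**
 * @api {post} /payments Create a payment
 * @apiName CreatePayment
 * @apiGroup Payments
 * @apiVersion 1.0.0
 *
 * @apiParam {Number} amount   Donation amount in pence.
 * @apiParam {String} currency ISO 4217 currency code.
 * @apiParam {String} provider Payment service provider to use.
 *
 * @apiSuccess {String} id     Identifier of the created payment.
 * @apiSuccess {String} status Current payment status.
 */
module.exports.create = async (event) => {
  // ...handler implementation
};
```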

The primary donation system was previously outsourced to a company called Armakuni, who had built an ultra-resilient multi-cloud architecture across AWS and Google Cloud Platform.

With our clear remit to consolidate our stack, and the success we had already had building a donation system in Symfony for our Sport Relief app (a fork of our paying-in application), it really seemed like the next logical step was to bring the donation system in-house. This allowed us to share components and styling from our Storybook and Pattern Lab across our products, drastically reducing the amount of duplication.

Our Footer in Storybook

It should be noted that at this time we already had a payment service layer, built in previous years in the Slim Framework, which ran the Sport Relief app donation journey, our giving pages and our shop.

As Peter (our engineering lead) was heading away on paternity leave, the suggestion arose that, if there was any time left after moving the Gift Aid backend over to Serverless, I could create a proof of concept for the donation system in the Serverless Framework. We agreed that as long as I had my main tasks covered, I could give it a go with any time I had left. I then went about smashing out all of my tasks quicker than I was used to, to get onto the fun stuff ASAP!

After talking to the super-knowledgeable guys at Armakuni during the wrap-up from Sport Relief, it was clear that we needed to recreate the highly redundant and resilient architecture that Armakuni had created, but in a serverless world.

Users would trigger deltas as they passed through the donation steps on the platform; these would go into an SQS queue, and an SQS fan-out on the backend would read the number of messages in the queue and trigger enough Lambdas to consume them, while (most importantly) not overwhelming the backend services and database.
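A rough sketch of that fan-out, assuming a scheduled function, a consumer function name and batch/concurrency limits supplied via made-up environment variables:

```javascript
// fanout.js — check the queue depth and invoke just enough consumer Lambdas
// to drain it without overwhelming the downstream services.
const AWS = require('aws-sdk');

const sqs = new AWS.SQS();
const lambda = new AWS.Lambda();

module.exports.fanOut = async () => {
  const { Attributes } = await sqs
    .getQueueAttributes({
      QueueUrl: process.env.QUEUE_URL,
      AttributeNames: ['ApproximateNumberOfMessages'],
    })
    .promise();

  const backlog = parseInt(Attributes.ApproximateNumberOfMessages, 10);
  const needed = Math.ceil(backlog / parseInt(process.env.BATCH_SIZE, 10));
  const consumers = Math.min(needed, parseInt(process.env.MAX_CONSUMERS, 10));

  // Fire-and-forget invocations of the consumer function.
  await Promise.all(
    Array.from({ length: consumers }, () =>
      lambda
        .invoke({
          FunctionName: process.env.CONSUMER_FUNCTION,
          InvocationType: 'Event',
        })
        .promise()
    )
  );

  return { backlog, consumers };
};
```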

The API would load balance across the payment service providers (Stripe, Worldpay, Braintree and PayPal), giving us redundancy and the required 150 donations per second that would safely get us through the night of TV (it can handle much more than this).
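The load balancing itself can be as simple as a weighted pick across the configured providers; the weights below are invented for illustration:

```javascript
// providerBalancer.js — pick a payment service provider according to
// configured weights, so traffic can be shifted (or a provider drained)
// just by changing its weight.
const providers = [
  { name: 'stripe', weight: 40 },
  { name: 'worldpay', weight: 30 },
  { name: 'braintree', weight: 20 },
  { name: 'paypal', weight: 10 },
];

function pickProvider() {
  const total = providers.reduce((sum, p) => sum + p.weight, 0);
  let roll = Math.random() * total;

  for (const provider of providers) {
    roll -= provider.weight;
    if (roll <= 0) return provider.name;
  }
  return providers[providers.length - 1].name;
}

module.exports = { pickProvider };
```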

I initially used AWS Parameter Store to hold the payment service provider configuration. It is free, and therefore very attractive in a serverless world, but it proved woefully incapable under load and was swapped out for configuration stored in S3.
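The S3 approach is cheap to sketch: fetch the config JSON once per container and cache it across invocations, which is exactly why it holds up under load so much better than hitting Parameter Store on every request. Bucket and key names here are illustrative:

```javascript
// config.js — load provider configuration from S3, cached per container.
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

let cachedConfig = null;

async function getProviderConfig() {
  if (cachedConfig) return cachedConfig;

  const object = await s3
    .getObject({
      Bucket: process.env.CONFIG_BUCKET,
      Key: 'payment-providers.json',
    })
    .promise();

  cachedConfig = JSON.parse(object.Body.toString('utf-8'));
  return cachedConfig;
}

module.exports = { getProviderConfig };
```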

I then created a basic frontend that would serve up the appropriate payment service provider frontend based on which provider the backend had selected, imported all of the styles from the Comic Relief Pattern Lab, and was ready to demo it to Peter and the team on his return.

The donation platform

Upon Peter's return, we went through the system, discussed its viability and ran the necessary load tests using Serverless Artillery, concluding that we could do what we thought we couldn't!

A business case was put together by Peter and our Product Lead Caroline Rennie, and away we went. At this point, Heleen Mol and Sidney Barrah came on board and added meat to the bones, getting the system ready for go-live and the fast-approaching night of TV.

Serverless Artillery load testing reporting to InfluxDB and viewed in Grafana

Due to the nature of Red Nose Day, you don't get many chances to test the system under peak load, and we were struggling to get observability into what was going on in our functions using CloudWatch.

At this point, Peter recommended that we try a tool that he had come across, which was IOPipe. IOPipe gave us unbelievable observability over our functions and how a user is interacting with them; it changed how we used Serverless and increased our confidence levels substantially. For more information, check out our article on Serverless Monitoring.

IOPipe function overview

At this point we also integrated Sentry, which alongside IOPipe gave us the killer one-two punch of a 360-degree view of errors within our system, allowing us to quantify bugs for our QA team (led by Krupa Pammi) and trace the activity that caused them quickly and efficiently. I can't think of a time when I have had such an overview of everything going wrong: pretty scary, but excellent.

A Sentry bug from the Donation Platform

The next big part of the puzzle was the realisation that we were copying way too much code between our Serverless projects. I had a look at Middy based on a recommendation from Peter, but at the time there wasn't a vast number of plugins for it, so we decided to spin out our own lambda wrapper rather than learning and writing plugins for a new framework and possibly running into Middy's limitations (probably none).

I am still not sure how bad an idea this was; however, it works at scale, is easy to develop with and is simple to onboard new developers onto, which is enough for it to stay for the time being.

The lambda wrapper encapsulates all of the code needed to handle API Gateway requests, connect and send messages to SQS queues, and a load of other common functionality. It resulted in a massive code reduction across all of our Serverless projects, and it also meant that the integration of Sentry and IOPipe was common and simple across all of them.
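A heavily trimmed sketch of the idea (the real wrapper does a lot more than this): a higher-order function that wires up IOPipe and Sentry once, parses the API Gateway body and turns the handler's result or error into a response.

```javascript
// wrap.js — minimal lambda-wrapper sketch. Token/DSN environment variable
// names are illustrative.
const iopipe = require('@iopipe/iopipe')({ token: process.env.IOPIPE_TOKEN });
const Sentry = require('@sentry/node');

Sentry.init({ dsn: process.env.SENTRY_DSN });

function wrap(handler) {
  return iopipe(async (event, context) => {
    try {
      const body = event.body ? JSON.parse(event.body) : {};
      const result = await handler({ event, body, context });
      return { statusCode: 200, body: JSON.stringify(result) };
    } catch (error) {
      // Report the error, then give Sentry a moment to flush before the
      // function freezes.
      Sentry.captureException(error);
      await Sentry.flush(2000);
      return { statusCode: 500, body: JSON.stringify({ error: 'Internal error' }) };
    }
  });
}

module.exports = { wrap };
```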

The donation pipeline in ConcourseCI

To add extra redundancy to the project, we introduced an additional region and created a traffic routing policy based on a health check from a status endpoint. We figured the chance of losing two geographically separate AWS regions was very low.
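The status endpoint behind that health check is conceptually simple: report healthy only when the critical dependencies respond, so the routing policy reflects what actually matters. The probes below are stand-ins:

```javascript
// status.js — health check endpoint used by the cross-region routing policy.
module.exports.status = async () => {
  const checks = {
    database: await checkDatabase(),
    queue: await checkQueue(),
  };
  const healthy = Object.values(checks).every(Boolean);

  return {
    statusCode: healthy ? 200 : 503,
    body: JSON.stringify({ healthy, checks }),
  };
};

// Stand-in probes; the real checks would ping RDS, SQS and so on.
async function checkDatabase() {
  return true;
}

async function checkQueue() {
  return true;
}
```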

We also backed up all deltas to S3 with a retention policy of five days, to ensure that we could replay them in the event of an SQS or RDS failure. We added timing code to all outbound dependencies using IOPipe and created a dependency management system so that we could pull external dependencies (such as Google Tag Manager or reCAPTCHA) out of the journey at speed.

IOPipe tracing

Based on a suggestion from AWS off the back of a Well-Architected review, we also added a regional AWS Web Application Firewall (WAF) to all of our endpoints. This introduced some basic protections, including things we already had covered, but higher up the chain, before API Gateway was even touched.

Another piece of the puzzle was getting decent insight into our delta publishes and processing, which gives us another way to get a good overview of what is happening with our system. We used InfluxDB to do this and consider it an optional dependency of our system.

It was important for us to understand what our application's critical dependencies were, as these form our application health check status and determine whether we fail over to our backup region. InfluxDB is fantastic; however, it is self-hosted. When AWS Timestream comes along, it will be out the door.
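Writing the delta metrics is a few lines with the node influx client; the measurement, tags and fields below are illustrative, and the write is deliberately allowed to fail without breaking the journey, since metrics are an optional dependency:

```javascript
// metrics.js — record delta-processing timings in InfluxDB.
const Influx = require('influx');

const influx = new Influx.InfluxDB({
  host: process.env.INFLUX_HOST,
  database: 'donations', // illustrative database name
});

async function recordDelta(provider, durationMs, success) {
  try {
    await influx.writePoints([
      {
        measurement: 'delta_processing',
        tags: { provider, success: String(success) },
        fields: { duration_ms: durationMs },
      },
    ]);
  } catch (err) {
    // Metrics are optional: never let a failed write break the donation journey.
    console.error('influx write failed', err);
  }
}

module.exports = { recordDelta };
```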

Delta processing from InfluxDB, viewed in Grafana

So the night of TV came and went on the 15th of March, and the system performed almost exactly as expected. The one unexpected, but now apparent, weak point was the amount of reporting we were trying to pull from the RDS read replica via Grafana and our live income reporting; we lowered our reporting requirements and were back on track in no time.

We originally used RDS so that we could keep compatibility with our legacy payment service layer; in the future we will probably replace it with something more serverless, relying on AWS Timestream for more real-time analysis (when it arrives).

API Gateway traffic spike for March 15th

So to sum up this epic and overly long rundown of the journey to 90% Serverless:

  • Try to get everything serverless if you can; our highest monetary cost is RDS. It's still nice to be able to run the SQL queries that we know and love, but Athena over S3 is probably a solid replacement.

The biggest lesson for me is that in my day job, I exist to solve business issues. I think sometimes as technologists we forget this. Serverless is the fastest way to decouple oneself from rubbish problems, move up the stack and move on to the next issue.

To learn more about our journey to serverless, check out these blogs from the Comic Relief Technology Blog about pipelines and our journey from microservices to functions.

Be sure to watch this presentation by our Engineering Lead Peter Vanhee talking through the current architecture at Serverless Computing London, as well as this presentation featuring our Product Lead Caroline Rennie around the previous donations platform and the problem space.
