Starting out with data puddles, then we’ll think about data lakes

Comic Relief is re-thinking its data ingestion, storage and query stack with Lambda, S3 & Athena. Here is a quick intro to how we are going about it and why.

Adam Clark
Jan 31, 2020

Doing data right is time-consuming and hard! There you go, the secret is out. But can we make it easier? Surely that is just part of engineering 101 and we should just accept it, right?

So what’s the problem?

Say we do all of the data integration work from the backend to our traditional data warehouse to satisfy our stakeholders’ requests. A month later the business needs change, and we have to bring new data into the warehouse and backfill. Well, then we have to:

  • Alter the data contracts to ensure data continuity and integrity.
  • Update the producers to ingest the new data into our staging databases.
  • Update the consumer to consume the new data.
  • Write the new data to our staging databases.
  • Figure out how to shoehorn the new data into the warehouse, working out where it fits across our abstractions.
  • Consume from the staging data and send it off to our user matching service.
  • Consume from the user matching service into the warehouse.
  • Backfill the data in a one-off job.
  • Not forget to write tests.

Defining our strategy

1. Commodity solutions where possible

The simple summation of this philosophy is: “buy it if every business has the same issue; build it if the problem is specific to just us or gives us a competitive advantage”.

Taking this approach also means that we can utilise modern integration techniques to remove some of the data shipping work and focus on the bigger issues. Stitch is an awesome tool for this and allows us to ship data from Braintree, Stripe, Salesforce, Freshdesk and many other platforms into S3 in about three clicks. Engineering time is the expensive thing; it can be better spent, and I can promise you it costs a lot more per month than Stitch does.

As a team, we need to find our way up the value chain. Automate tasks, level up and don’t do them again. Share components across our frontends, level up. It’s all about automating away the monotony.

It also means that if one of our solutions isn’t fulfilling the needs of its users, we just need to satisfy one target stakeholder group’s needs to replace the failing part, rather than running a massive multi-department consultation.

2. Flexible data puddles for the win

Data Stores (Puddle Consumers) then subscribe to this queue and consume based on the individual data store’s interests, whether that be transactional data, customer data or any other data.
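To make that concrete, a puddle consumer can be sketched as a Lambda function that reads messages off the queue and only acts on the types its store cares about. The message envelope, messageType field and storeToPuddle helper below are illustrative assumptions rather than our actual contract.

```typescript
// Sketch of a puddle consumer: a Lambda reading from the queue and keeping
// only the message types this particular data store is interested in.
// The message shape and names below are illustrative assumptions.
import { SQSEvent } from 'aws-lambda';

interface PuddleMessage {
  messageType: string; // e.g. 'transaction', 'customer'
  payload: Record<string, unknown>;
}

// This hypothetical store only cares about transactional data.
const INTERESTS = new Set(['transaction']);

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const message: PuddleMessage = JSON.parse(record.body);

    // Skip anything outside this store's interests.
    if (!INTERESTS.has(message.messageType)) {
      continue;
    }

    // Hand off to the store's write path (Parquet to S3, described below).
    await storeToPuddle(message.payload);
  }
};

// Placeholder for the persistence step covered in the next paragraph.
async function storeToPuddle(payload: Record<string, unknown>): Promise<void> {
  console.log('would write to the puddle here', payload);
}
```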

Each store defines an AWS Glue database and crawlers. The individual store takes the data from the queue and stores it in S3 in Parquet format. We then have some CTAS queries running on Lambda via events to concatenate and compress the files. Realistically we could have picked Glue Jobs for this use case, and might still do so; it just didn’t feel very serverless at the time of build.
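To give a flavour of that compaction step, the pattern looks something like the sketch below: an Athena CREATE TABLE AS SELECT that rewrites lots of small Parquet files into fewer, Snappy-compressed ones. The database, table and bucket names are placeholders, not our real resources.

```typescript
// Sketch of a Lambda (triggered by an event) that runs an Athena CTAS query
// to rewrite many small Parquet files into fewer, compressed ones.
// Database, table and bucket names are placeholders.
import { Athena } from 'aws-sdk';

const athena = new Athena();

export const handler = async (): Promise<void> => {
  const ctas = `
    CREATE TABLE transactions_compacted
    WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://example-puddle-bucket/transactions/compacted/'
    ) AS
    SELECT * FROM transactions
  `;

  await athena
    .startQueryExecution({
      QueryString: ctas,
      QueryExecutionContext: { Database: 'example_puddle_db' },
      ResultConfiguration: { OutputLocation: 's3://example-athena-results/' },
    })
    .promise();
};
```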

We are then able to query the data using AWS Athena, either in code or via Tableau.

The first failure

The first try was to replicate our entire data warehouse in something that looked kind of serverless (but 100% wasn’t), which we called the data ingestion pipeline and which ran on Lambda and Aurora. The alarm bells that this wasn’t the serverless masterpiece we had envisaged started to ring when we saw our AWS spend shoot up rapidly during the POC phase.

I won’t go into too much detail, for the sake of your sanity and this article. To sum it up though, the real failure was trying to eat an entire elephant in one go. The end result was getting squashed by the elephant, whereas I have been reliably told the only way to eat an elephant is in small chunks, with lots of breaks to digest.

The above is probably a bad analogy for something I should have known: iterative cycles, small single-purpose, single-responsibility services and all of the rest of the stuff that has been banged into my head for the previous 15 years. But hey, if you aren’t failing, you aren’t learning, right?

The first real use case

We created a simple Serverless project that took the data, stored it in Parquet format on S3 and defined the necessary resources in CloudFormation to create the database in AWS Glue.
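The write path of that first project can be sketched roughly as below: serialise the incoming records to a Parquet file and drop it into S3 under the prefix the Glue crawler watches. The parquetjs library, schema and bucket names here are assumptions for illustration; the real schema and Glue resources live in the project’s CloudFormation.

```typescript
// Rough sketch of the write path: serialise records to Parquet and upload
// to S3 under the prefix a Glue crawler is pointed at.
// parquetjs and all names below are illustrative assumptions.
import { S3 } from 'aws-sdk';
import * as fs from 'fs';

// parquetjs ships without type definitions, so require it directly.
const parquet = require('parquetjs');

const s3 = new S3();

// Example schema only; the real one matches the Glue table definition.
const schema = new parquet.ParquetSchema({
  id: { type: 'UTF8' },
  amount: { type: 'DOUBLE' },
  createdAt: { type: 'TIMESTAMP_MILLIS' },
});

export async function writeBatch(
  rows: Array<{ id: string; amount: number; createdAt: Date }>
): Promise<void> {
  // Write the batch to Lambda's /tmp scratch space first.
  const localPath = `/tmp/${Date.now()}.parquet`;
  const writer = await parquet.ParquetWriter.openFile(schema, localPath);
  for (const row of rows) {
    await writer.appendRow(row);
  }
  await writer.close();

  // Upload under the prefix the Glue crawler watches.
  await s3
    .putObject({
      Bucket: 'example-puddle-bucket',
      Key: `donations/raw/${Date.now()}.parquet`,
      Body: fs.readFileSync(localPath),
    })
    .promise();
}
```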

We then used an Athena NPM module to create a private endpoint from which the data could be fetched. All in all it was super simple and proved that the approach had some legs.
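As a sketch of the idea, a wrapper module such as athena-express (one example of this kind of NPM module, not necessarily the one we used) lets a Lambda run the query and hand back rows; the database, bucket and query below are placeholders.

```typescript
// Minimal sketch of a private endpoint backed by Athena, using athena-express
// as one example of this kind of wrapper module. Names and the query are
// placeholders.
import * as AWS from 'aws-sdk';

// athena-express wraps startQueryExecution and result polling for us.
const AthenaExpress = require('athena-express');

const athenaExpress = new AthenaExpress({
  aws: AWS,
  db: 'example_puddle_db',
  s3: 's3://example-athena-results/', // where Athena writes its result files
});

export const handler = async (): Promise<{ statusCode: number; body: string }> => {
  const result = await athenaExpress.query(
    'SELECT id, amount, created_at FROM donations LIMIT 100'
  );

  return {
    statusCode: 200,
    body: JSON.stringify(result.Items),
  };
};
```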

The marginally scary, should-have-picked-something-smaller second use case

Serverless has allowed us to confidently take up to 350 donations per second. The problem is we also need to report that to the BBC periodically so that they can show a nice big total on the TV. It is also hugely important to have an auditable trail of how we got to this total, so doing it at a transactional, replicable and thus immutable level makes a lot of sense (and makes finance & compliance happy).

This used to be fine, as we would just take a running tally. This year we decided that we wanted to do better and be able to provide faster real-time analysis of transaction attribution to calls to action and time-boxing. We also wanted to speed up the financial reconciliation process and allow transactions to be tracked from the first click all the way through to money in our bank account. We also need to be able to report on this in very near real time.

In previous years, we ran multiple systems, with speed layers and analytical layers with duplicated logic, to ensure that we could fulfil the business needs. However hard we tried, there was always some variance, whereas organisations want absolutes; divergence suggests inaccuracy, which isn’t what anyone wants.

So we used the same consumer as in the previous use case and created a transactional store that read messages off of it. We created some Glue databases and some Athena queries and, bam, we had our new way to get fast and very accurate totals. Realistically there are some massive engineering hurdles in the background here, and we are doing some caching and aggregation to make stuff work as it should.
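Conceptually, the totals come from a plain aggregation over the immutable transaction records, something like the sketch below. The table, column and database names are placeholders, and the real queries sit behind the caching and aggregation mentioned above.

```typescript
// Conceptual sketch of the kind of Athena query behind the running total:
// an aggregation over the immutable transaction records, bucketed by time.
// All table, column and database names are placeholders.
import { Athena } from 'aws-sdk';

const athena = new Athena();

export async function startTotalsQuery(): Promise<string> {
  const sql = `
    SELECT date_trunc('minute', created_at) AS minute,
           count(*)                         AS donations,
           sum(amount)                      AS total
    FROM transactions
    GROUP BY 1
    ORDER BY 1
  `;

  const { QueryExecutionId } = await athena
    .startQueryExecution({
      QueryString: sql,
      QueryExecutionContext: { Database: 'example_puddle_db' },
      ResultConfiguration: { OutputLocation: 's3://example-athena-results/' },
    })
    .promise();

  // The caller polls getQueryExecution / getQueryResults (or uses a wrapper
  // library) to collect the rows once the query succeeds.
  return QueryExecutionId!;
}
```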

The end result, though, is no servers, no databases and a system that scales with zero effort from us, which we can use and rely on at our peak second and then five months later, with no cost in between except for storage.

The lessons learned & what’s next

  • You have to get lost in the woods to know how to find your way out of them. The first attempts didn’t satisfy our needs; however, the underlying learnings took us through to a point where we could satisfy our business needs. This, unfortunately, takes perseverance, and that can be hard to justify within an org for business-critical systems.
  • We aren’t big data, we get lots of data fast. Listen to your own needs and not the general use case patterns of the systems that you are implementing.
  • Speed to implement, speed to run & cost to run. If the speed to implement is fast, the speed to run is fast enough and the cost to run is low, then why are you not doing it? Fast enough is a key point here and is why I enjoy working with a solid product team so much; the ability to get down to base stakeholder needs vs requests is a rare art.
  • Commodity solutions for the win. Your organisation has needs that make it special, so spend your time on problems that are individual to it. Sure, knitting’s fun for some people, but you’re on the company’s time: go buy the jumper from the store and don’t waste your time knitting a shitty one.
  • Data tooling has changed. Your TV can link up to your thermostat, and the same kind of tooling is arriving in the systems integration world. Focus on the queries and on generating actionable business insights, rather than spending all your time landing the data in a warehouse.
  • Focus on value creation. Try to automate away the most arduous of your team’s tasks, rip apart the process book and let them focus on the fun stuff that can challenge them.
  • Take yourself to your end-user. The first part of getting in front of a bad process is taking yourself to the process and then finding efficiencies. Iteration takes people on a journey, wholesale replacement is a dictatorship.

To learn more, check out these articles from the Comic Relief Digital & Innovation Blog about our journey to serverless and the business value it has created.
