Starting out with data puddles, then we’ll think about data lakes
Comic Relief is re-thinking its data ingestion, storage and query stack with Lambda, S3 & Athena. Here is a quick intro to how we are going about it and why.
Doing data right is time-consuming and hard! There you go, the secret is out. But can we make it easier? Surely that is just part of engineering 101 and we should just accept it, right?
So what’s the problem?
The issue for Comic Relief stems partly from the fact that we are now very comfortable with creating lightweight Serverless applications using commodity services (whenever possible). We reduce code to the point where very little can go wrong, because we are using off-the-shelf, battle-tested services and applications, with our own code being basically just lightweight glue. This has given us the drive to reduce our code footprint as much as possible. Surely if we write less code then we have fewer bugs :-)
Say we do all of the data integration work from the backend to our traditional data warehouse to satisfy our stakeholders’ requests. A month later the business needs change and we have to bring new data into the warehouse and backfill. Well, we have to:
- Alter the data contracts, to ensure data continuity and integrity.
- Update the producers to ingest the new data to our staging databases.
- Update the consumer to consume the new data.
- Write the new data to our staging databases.
- Figure out how to shoehorn the data into the warehouse, figuring out across our abstractions where the new data can fit.
- Consume from the staging data and send off to our user matching service.
- Consume from user matching service into the warehouse.
- Backfill the data in a one-off job.
- Not forget to write tests.
Defining our strategy
Organisations need to grow and evolve at speed. We need tooling that lets us stay abreast of those requests without imposing our current beliefs about what the data we are storing should look like. Instead, we should let the data dictate, and be able to satisfy whatever the business needs with ease. So there are two key changes of approach that we have made or are in the process of making:
1. Commodity Solutions where possible
Most of the business problems that we face have already been solved. The problem we face is the belief that we can find one amazing solution to solve them all. Breaking down our problems into single use-case applications allows us to focus on our stakeholders' needs and generally pick an off-the-shelf solution, unless it is something so specific to our needs that we need to build it, like a high-volume donation platform.
The simple summation of this philosophy is, “buy it if every business has the same issue, build if it is a specific problem to just us or gives us a competitive advantage”.
Taking this approach also means that we can utilise modern integration techniques to remove some of the data shipping work and focus on the bigger issues. Stitch is an awesome tool for this and allows us to ship data from Braintree, Stripe, Salesforce, Freshdesk and many other platforms into S3 in about three clicks. Engineering time is the expensive thing and can be better spent elsewhere, and I can promise you it costs a lot more than a month of Stitch.
As a team, we need to find our way up the value chain. Automate tasks, level up and don’t do them again. Share components across our frontends, level up. It’s all about automating away the monotony.
It also means that if one of our solutions isn’t fulfilling the needs of its users, then we just need to satisfy one target stakeholder group’s needs to replace the failing part, rather than run a massive multi-department consultation.
2. Flexible data puddles for the win
We have created a producer service written in Node.js with Serverless Framework that exposes a private REST endpoint for data consumption. It matches the incoming data to a data contract and then fires the data off into a queue that triggers consumption Lambdas every second.
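To make the "match to a data contract, then queue" step a bit more concrete, here is a minimal sketch. It is written in Python purely for illustration (our actual service is nodeJS), and the contract fields, `validate` and `to_queue_message` are all made-up names, not our real contract or code.

```python
import json
from typing import Any

# Hypothetical contract: field name -> required Python type.
# The real contracts are not shown in this article.
DONATION_CONTRACT: dict[str, type] = {
    "transaction_id": str,
    "amount_pence": int,
    "channel": str,
}


def validate(record: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return a list of contract violations (empty if the record conforms)."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


def to_queue_message(record: dict[str, Any], contract: dict[str, type]) -> str:
    """Serialise a record for the queue, refusing anything off-contract."""
    errors = validate(record, contract)
    if errors:
        raise ValueError("; ".join(errors))
    # Sorted keys keep the message body stable for identical records.
    return json.dumps(record, sort_keys=True)
```

The point of the sketch is only that bad data gets rejected at the front door, before it ever reaches a queue or a store.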
Data Stores (Puddle Consumers) then subscribe to this queue and consume based on each data store’s interests, whether that be transactional data, customer data or any other data.
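As a rough illustration of consuming "based on interests", the sketch below gives each store a predicate and routes a message to every store whose predicate matches. The store names, message types and the `route` helper are hypothetical, and the sketch is in Python rather than the nodeJS we actually use.

```python
from typing import Callable

Message = dict
Interest = Callable[[Message], bool]

# Hypothetical stores and the message types each one cares about.
STORE_INTERESTS: dict[str, Interest] = {
    "transactional": lambda m: m.get("type") == "donation",
    "customer": lambda m: m.get("type") in {"signup", "profile_update"},
}


def route(message: Message, interests: dict[str, Interest]) -> list[str]:
    """Return the names of the stores that should consume this message."""
    return [name for name, wants in interests.items() if wants(message)]
```

Adding a new puddle then means adding one more entry, without touching the producer or the existing stores.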
Each store defines an AWS Glue Database and crawlers. The individual store takes the data from the queue and stores it in S3 in Parquet format. We then have some CTAS queries running on Lambda via events to concatenate and compress the files. Realistically we could have picked Glue Jobs for this use case and might still do; it just didn’t feel very Serverless at the time of build.
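For anyone who hasn't met CTAS (CREATE TABLE AS SELECT) in Athena, the compaction idea looks roughly like the sketch below: read the many small files behind one table and rewrite them as compressed Parquet in a new location. The helper name, table names and S3 path are assumptions for illustration, not our real queries, and the sketch assumes the names come from trusted config since nothing is escaped.

```python
def build_compaction_ctas(source_table: str, target_table: str, target_location: str) -> str:
    """Build an Athena CTAS statement that rewrites a table as compressed Parquet.

    Athena expects the target S3 location to be empty when the query runs.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  parquet_compression = 'SNAPPY',\n"
        f"  external_location = '{target_location}'\n"
        f")\n"
        f"AS SELECT * FROM {source_table}"
    )


# Example with made-up names:
sql = build_compaction_ctas(
    "donations_raw",
    "donations_compacted",
    "s3://example-bucket/donations/compacted/",
)
```

A Lambda triggered by an event would then hand a statement like this to Athena to run.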
We are then able to query the data using AWS Athena, either in code or via Tableau.
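Querying Athena "in code" boils down to three calls: start the query, poll until it finishes, fetch the results. Below is a minimal Python sketch using the boto3 Athena client methods `start_query_execution`, `get_query_execution` and `get_query_results`. The function names are our own invention, the client is passed in so it can be faked, and the sketch assumes a SELECT whose results fit on a single page (where the first row is the header).

```python
import time


def rows_to_dicts(result: dict) -> list[dict]:
    """Turn a get_query_results response into a list of {column: value} dicts."""
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    # For a SELECT, the first row of the first page holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        # NULL cells have no VarCharValue key, so .get() yields None for them.
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def run_query(client, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> list[dict]:
    """Run a query on an Athena client (e.g. boto3.client("athena")) and return rows."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    return rows_to_dicts(client.get_query_results(QueryExecutionId=query_id))
```

Note that every value comes back as a string, so any totals need converting before you do arithmetic on them.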
The first failure
So, I was probably not going to mention the first attempt, but I sometimes forget that you need to fail sometimes in order to learn how to do something properly. It also makes a good story when everything isn’t awesome.
The first try was to replicate our entire data warehouse in something that looked kind of Serverless (but 100% wasn’t). We called it the data ingestion pipeline, and it used Lambda and Aurora. The alarm bells that this wasn’t the serverless masterpiece we had envisaged started to ring when we saw our AWS spend shoot up rapidly during the POC phase.
I won't go into too much detail for the sake of your sanity and this article. To sum it up though, the real failure was trying to eat an entire elephant in one go. The end result was getting squashed by the elephant, whereas I have been reliably told the only way to eat an elephant is in small chunks with lots of breaks to digest.
The above is probably a bad analogy for something I should have known. Iterative cycles, small single-purpose, single responsibility services and all of the rest of the stuff that has been banged into my head for the previous 15 years. But hey, if you aren’t failing, you aren’t learning, right?
The first real use case
So now for the first real use case. As with our initial foray into Serverless, it wasn’t a big mission-critical requirement. We had an internal request to make a Giro Request form for Schools. The data from the form needed to be stored, and an endpoint needed to be exposed to one of our suppliers so that they could fetch a batch of the data, create the forms and send them off to the schools.
We then used an Athena NPM module to create a private endpoint for the data to be fetched. All in all, it was super simple and proved that the approach would have some legs.
The marginally scary, should-have-picked-something-smaller second use case
So Comic Relief gets a lot of transactions in a short space of time, probably around 4 hours, with the peak coming in around a 45-minute slot at around 9 in the evening. We need to bring totals from our Online Donation platform, SMS providers, IVR and Call Centres and mash them into one massive total.
Serverless has allowed us to confidently take up to 350 donations per second. The problem is we also need to report that to the BBC periodically so that they can show a nice big total on the TV. It is also hugely important to have an auditable trail of how we got to this total, so doing it at a transactional & replicable and thus immutable level makes a lot of sense (and makes finance & compliance happy).
This used to be fine as we would just take a running tally. This year we decided that we wanted to be better and be able to provide faster real-time analysis of transaction attribution to calls to action and time-boxing. We also wanted to speed up the financial reconciliation process and allow for transactions to be tracked from the first click all the way through to money in our bank account. We also need to be able to report on this in very near real time.
In previous years, we ran multiple systems, with speed layers and analytical layers with duplicated logic to ensure that we could fulfil the business needs. However hard we tried, there was always some variance, whereas organisations want absolutes, as divergence suggests inaccuracy, which isn’t what anyone wants.
So we used the same consumer as in the previous use case and created a transactional store that read messages off of it. We created some Glue Databases and some Athena queries and bam, we had our new way to get fast and very accurate totals. Realistically, there are some massive engineering hurdles in the background here, and we are doing some caching and aggregation to make stuff work as it should.
The end result, though, is no servers, no databases and a system that scales with zero effort from us, which we can use and rely on at the peak second and then again 5 months later, with no cost in between except for storage.
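The reason a single transactional store removes the variance is that every total is derived from the same immutable rows, so the per-channel view and the time-boxed view have to reconcile. Here is a small Python sketch of that idea. The field names, the 15-minute box and the sample data are all invented for illustration, and in practice this aggregation would live in Athena queries rather than application code.

```python
from collections import defaultdict
from datetime import datetime


def totals_by_channel(transactions: list[dict]) -> dict[str, int]:
    """Sum amounts (in pence, to avoid float rounding) per channel."""
    totals: dict[str, int] = defaultdict(int)
    for tx in transactions:
        totals[tx["channel"]] += tx["amount_pence"]
    return dict(totals)


def totals_by_time_box(transactions: list[dict], minutes: int = 15) -> dict[str, int]:
    """Sum amounts per time box, keyed by the ISO timestamp of the box start."""
    totals: dict[str, int] = defaultdict(int)
    for tx in transactions:
        ts = datetime.fromisoformat(tx["timestamp"])
        # Round the minute down to the start of its box.
        box = ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)
        totals[box.isoformat()] += tx["amount_pence"]
    return dict(totals)


# Made-up sample transactions.
sample = [
    {"channel": "web", "amount_pence": 1000, "timestamp": "2019-03-15T21:07:30"},
    {"channel": "sms", "amount_pence": 500, "timestamp": "2019-03-15T21:16:05"},
    {"channel": "web", "amount_pence": 250, "timestamp": "2019-03-15T21:14:59"},
]
by_channel = totals_by_channel(sample)
by_box = totals_by_time_box(sample)
```

Both views sum to the same grand total because they are just different groupings of the same rows, which is exactly the property the old duplicated speed and analytical layers couldn't guarantee.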
The lessons learned & what’s next
The next steps for us are to rinse and repeat the process wherever possible and begin the process of lifting and shifting. So here are the lessons learned:
- You have to get lost in the woods to know how to find your way out of them. The first attempts didn’t satisfy our needs, however, the underlying learnings took us through to a point where we could satisfy our business needs. This, unfortunately, takes perseverance and that can be hard to justify within an org for business-critical systems.
- We aren’t big data, we get lots of data fast. Listen to your own needs and not the general use case patterns of the systems that you are implementing.
- Speed to implement, speed to run and cost to run. If the speed to implement is fast, the speed to run is fast enough and the cost to run is low, then why are you not doing it? Fast enough is a key point here and is why I enjoy working with a solid product team so much. The ability to get down to base stakeholder needs vs requests is a rare art.
- Commodity solutions for the win. Your organisation has needs that make it special, so spend your time on problems that are individual to it. Sure, knitting’s fun for some people, but you’re on the company’s time, go buy the jumper from the store and don’t waste your time knitting a shitty one.
- Data tooling has changed. Your TV can link up to your thermostat, and the same shift is happening in the systems integration world. Focus on the queries and on generating actionable business insights, rather than spending all your time landing the data in a warehouse.
- Focus on value creation. Try to automate away the most arduous of your team's tasks, rip apart the process book and let them focus on the fun stuff that can challenge them.
- Take yourself to your end-user. The first part of getting in front of a bad process is taking yourself to the process and then finding efficiencies. Iteration takes people on a journey, wholesale replacement is a dictatorship.