How we moved to serverless architecture at Goalwise using AWS Lambda
We believe in lean, efficient teams rather than large ones that carry the overhead of management just to stay productive. Goalwise, being a fintech start-up, focuses heavily on process automation, which reduces both the human resources required and the scope for manual errors.
Currently there are a lot of features that need number crunching on a daily basis, like capital gains calculations, streak calculations, scheduled transaction generation and many more. In the early stages of the company, cron jobs were written to run on our EC2 servers. This solution was not my first choice, but given the time constraints it was a good one. Cron jobs are a great way to get scheduled tasks done, as long as the server doesn't die.
As we shipped more features, the number of cron jobs kept growing, which created a new problem: EC2 capacity. I had managed to run all the scheduled jobs on a t2.micro instance through frugal memory usage in the code and some clever cron job scheduling. Though economical, this approach made the code very hard to maintain and extend. Eventually, as the number of users increased, I reached a point where the only way out was to increase the EC2 size. With the customer base growing by about 10% monthly, we would need to move to a larger EC2 instance roughly every 9 months. (This is a rough calculation based on the ratio of dropped API calls to the memory consumed by the running server processes: every time a dropped API call error is registered, more memory is allocated to the server, and eventually there is not enough memory left on the instance to allocate.) That translated directly into rising costs for the company, almost 2x every 9 months.
I started looking for ways to get past these two problems. One thing was very clear: increasing server capacity was not the solution. The first thing I did was review the existing code for anything that could be written better. I listed the improvements but did not make them straight away, as I was not sure we would need the code at all in the future.
Next, I had a discussion with the product team to get a better view of the upcoming features, not just for the next month or quarter but for the year. This covered not only the features already scheduled for release at some point, but also ideas that hadn't yet been thought through as fully formed features.
After the meeting with the product team, I met with the operations team to understand their pain points and which parts of their daily work could be automated.
Next was the customer experience team. Goalwise prides itself on top-notch customer experience, and anything that could help that team do their job better had to be done.
These meetings were very helpful: they gave me an overall idea of what was coming tech's way and how much heavy lifting the scheduled jobs would need to do. After analysing all the requirements I had gathered, I started looking at the options available on AWS.
After doing a bit of research I shortlisted two solutions which could fit my needs:
- the spot instance way, or
- the serverless Lambda way
To make my final decision I looked at the requirements again. It turned out most of them would be triggered by one of two events: a user action, or a schedule at a particular time of day. This essentially meant that the code written for a scheduled job would also ship in the product as an event-triggered calculation, and as a developer I did not want to maintain multiple sets of code doing the same thing. I saw two main problems with the spot instance approach:
- Reliably provisioning instances for time-sensitive tasks would be on me, while Lambda triggers have failure retries built in.
- I would still have to keep increasing the EC2 size as traffic on the website grew, while Lambda gave me the option to scale horizontally without any extra effort.
AWS Lambda was the way to go.
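The appeal of a single codebase for both trigger types can be sketched as one Lambda handler that dispatches on the event shape. This is a hypothetical sketch, not the actual Goalwise code: the event fields assumed here are the `source: aws.events` marker that CloudWatch Events scheduled triggers carry, and a made-up `user_id` field for the user-action path.

```python
import json

def handler(event, context):
    """One Lambda entry point shared by user-triggered and scheduled runs.

    Assumed event shapes (hypothetical): a CloudWatch Events schedule sends
    'source': 'aws.events'; a user-action trigger carries a 'user_id'.
    """
    if event.get("source") == "aws.events":
        # Scheduled batch trigger: run the calculation for every user.
        result = run_calculation(user_id=None)
    else:
        # User-action trigger: run the calculation for the caller only.
        result = run_calculation(user_id=event.get("user_id"))
    return {"statusCode": 200, "body": json.dumps(result)}

def run_calculation(user_id):
    # Placeholder for the shared calculation (e.g. capital gains).
    scope = "all-users" if user_id is None else user_id
    return {"scope": scope}
```

The same `run_calculation` body serves both paths, which is exactly the duplication the spot-instance route would have forced us to keep.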
I was all excited to start working with AWS Lambda, but it came with its own set of limitations. Though Lambda was easy to trigger on user actions, it was not designed to run batch jobs. Step Functions stepped in to save the day: it allowed job orchestration and reporting in the way that suited my needs best.
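The post doesn't show the actual state machines, but the orchestration idea can be sketched in Amazon States Language: a Parallel state fans the batch out to worker Lambdas, then a final Task combines the results. The function ARNs and branch count below are placeholders.

```python
import json

# Hypothetical Amazon States Language definition (built as a Python dict):
# fan a batch job out to worker Lambdas in parallel, then combine.
# The ARNs are placeholders, and three branches stand in for "many".
state_machine = {
    "Comment": "Sketch: run workers in parallel, then combine results.",
    "StartAt": "RunWorkers",
    "States": {
        "RunWorkers": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": f"Worker{i}",
                    "States": {
                        f"Worker{i}": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:worker",
                            "End": True,
                        }
                    },
                }
                for i in range(3)
            ],
            "Next": "CombineResults",
        },
        "CombineResults": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:combiner",
            "End": True,
        },
    },
}

# The JSON form is what you would pass to Step Functions' CreateStateMachine.
definition_json = json.dumps(state_machine)
```

Step Functions also records each state transition, which is what makes the reporting side of batch jobs tractable compared with bare Lambda invocations.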
Within a couple of weeks of coding, I was able to write a basic framework for serverless jobs at Goalwise. This framework focused mostly on dealing with the AWS APIs and laid out some basic standards for writing jobs. With these building blocks in place I could run hundreds of Lambdas in parallel to do the calculations, then combine the results and flush them to the database depending on the class of the database and the load on it :) I am not going into the details in this blog (maybe I will write one later).
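The fan-out-and-combine flow can be illustrated with a local stand-in. In this sketch `invoke_worker` replaces what would really be a per-batch Lambda invocation (e.g. via boto3's `lambda.invoke`), so the shape of the pattern runs end to end without AWS; all names here are illustrative, not the framework's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def invoke_worker(batch):
    # In production this would invoke a worker Lambda per batch;
    # summing the batch is a stand-in "calculation" for the sketch.
    return sum(batch)

def run_job(user_ids, batch_size=100):
    # Split the workload into batches and fan them out in parallel.
    batches = [user_ids[i:i + batch_size]
               for i in range(0, len(user_ids), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(invoke_worker, batches))
    return combine(partials)

def combine(partials):
    # Merge the partial results before flushing them to the database.
    return sum(partials)
```

Because each batch is independent, the only serial step is the final combine, which is what lets hundreds of Lambdas run side by side.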
I ran it in production alongside one of the existing cron jobs and monitored the results. After a few more changes to the framework, the code was performing as expected, and with a lot more reliability. It was time to flip the switch and go live.
Three months later, almost all the scheduled jobs are serverless, with the framework evolving further to perform more and more of the dev-ops tasks 'codomatically', thanks to the AWS APIs. At present the dashboard data, streak calculation, capital gains report calculation, KYC recon, graph calculation and a lot more run on Lambda.
To do the same thing on an EC2 instance I would need a t2.medium machine, which with a yearly no-upfront reserved instance applied would cost approximately $45. Current usage of Lambda and Step Functions combined is $4.87. By the end of this year I hope to migrate the remaining few jobs too and shut down the EC2 instance :) On the website and app we get roughly a million API calls a month, which is handled by a single t2.small instance. I hope to move all the API calls to Lambda by the third quarter of next year. By my calculations, this should reduce the compute cost by almost 80%!
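For the scheduled-job side, the saving implied by the two figures quoted above works out like this (assuming both the $45 reserved-instance figure and the $4.87 Lambda plus Step Functions figure cover the same billing period):

```python
# Rough check of the scheduled-job cost comparison quoted above.
# Assumption: both figures are for the same billing period.
ec2_cost = 45.00         # t2.medium, yearly no-upfront reserved (quoted)
serverless_cost = 4.87   # Lambda + Step Functions usage (quoted)

savings = 1 - serverless_cost / ec2_cost
print(f"Scheduled-job savings: {savings:.0%}")  # prints "Scheduled-job savings: 89%"
```

That is roughly a 9x reduction on the scheduled jobs alone; the "almost 80%" figure for the API migration is a separate estimate.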