CreditorWatch is well known in the Australian fintech space as one of the most innovative companies in the industry (if not the most), and it has won several awards for it. That innovation comes from a dev culture that encourages experimentation and, unfortunately, it sometimes comes at a price.
Adopting new technology in your production environment is exciting and challenging but also quite scary sometimes. This is our story of what we learned while adopting Lambda functions in our production system.
At CreditorWatch, we use lambda functions extensively in our event-driven architecture (https://medium.com/swlh/aws-dynamodb-triggers-event-driven-architecture-61dea6336efb) and to create push queues (https://medium.com/creditorwatch/how-to-successfully-create-a-push-queue-using-sqs-lambda-57f299056fe7).
At this point, we have more than 250 lambda functions that have executed over 2 billion times in total, so I guess we can say that we have been around them long enough to get a good grasp of what’s great and what’s not-so-great.
Most of the time (not always), we use them in combination with Kinesis streams or SQS queues, so some of the points below will apply to those services as well.
This hasn’t come without challenges and learning moments, so we’ve collected the most important ones in this short article.
Don’t expect things in order
This is, in our experience, mostly related to Lambda with Kinesis streams, but it applies to many situations where you have a lambda trigger.
Even though AWS states that lambda processes records in order (https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html), that guarantee applies per shard, and in our experience you can’t rely on your records being processed in the order they arrived. There are a few reasons for that.
The first one is that you might have several executions running at the same time (https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html). Even if the executions are triggered in order, one lambda function might take longer to spin up than another, so your code ends up running in the reverse order.
Another reason is batching. A lambda function connected to a Kinesis stream or SQS queue receives batches of records, not just one at a time (https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html).
Therefore, you might have one batch of 10 records executed first (let’s assume you process them sequentially in your lambda function) and, 1 second later, a second batch of 10.
For some reason, the first record of the first batch might take (let’s say) 10 seconds, while the second batch processes all its records in under 0.5s. In that case, the execution of the second batch finishes before the first one, effectively processing the records in a different order than they arrived.
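The scenario above can be sketched with a small simulation. This is only an illustration under simplified assumptions: batches run in parallel, records inside a batch run sequentially, and the record ids and durations are made up for the example.

```python
def completion_order(batches):
    """Return record ids in the order they finish processing.

    Each batch is (start_time_s, [(record_id, duration_s), ...]).
    Records inside a batch run sequentially; batches run in parallel.
    """
    finished = []
    for start, records in batches:
        clock = start
        for record_id, duration in records:
            clock += duration
            finished.append((clock, record_id))
    # Sort by finish time only (the sort is stable for ties).
    return [record_id for _, record_id in sorted(finished, key=lambda f: f[0])]


# First batch starts at t=0 and its first record takes 10s.
first = (0.0, [("a0", 10.0)] + [(f"a{i}", 0.04) for i in range(1, 10)])
# Second batch starts 1s later and needs about 0.4s for all ten records.
second = (1.0, [(f"b{i}", 0.04) for i in range(10)])

order = completion_order([first, second])
print(order[:3])  # records from the second batch come out first
```

Every record of the second batch completes before the first record of the first batch, even though the first batch arrived and started earlier.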
Idempotency is instrumental
I cannot stress this enough. Code fails—all the time. And you have to be ready for it. I know it might seem basic, but it’s a necessary reminder.
Lambda will fail as well, for many reasons: there might be an actual bug in your code, you might hit a timeout (https://docs.aws.amazon.com/lambda/latest/dg/configuration-console.html), you might run out of memory…
Once a lambda execution fails, it will retry the whole batch, not just the record that failed. This inevitably means that the records that succeeded before the failure will be re-processed.
And this is why idempotency is so important when working with lambda functions. Retries might and will happen, so your code needs to expect this.
Our lambda functions are mostly calls to endpoints (we use lambda functions to kind of act as a push queue or callback manager), so, in our case, those endpoints are the ones that need to be idempotent (and they are!).
The good thing about this approach is that all the logic for idempotency and, of course, the application logic is in the same place. Also, we would be able to swap Kinesis + lambda or SQS + lambda for a completely different architecture (such as implementing the Observer pattern instead of our current Pub/Sub pattern) without changing our codebase.
If you don’t have an application endpoint to handle idempotency and all you have is the code in your lambda, that’s fine; you can always store a hash of each record (or the provided record IDs, if you prefer) in some storage for processed records (Redis might be a good choice).
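As a rough sketch of that last idea, the snippet below keeps a hash of every processed record and skips records it has already seen when a batch is retried. It is not our production code: the in-memory set stands in for an external store such as Redis, and `record_key`, `handle_batch` and the sample records are hypothetical names made up for the example.

```python
import hashlib
import json

# Stand-in for an external store such as Redis. A real deployment needs
# storage that survives across lambda invocations.
processed = set()


def record_key(record):
    """Stable hash of a record's content, used as its idempotency key."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def handle_batch(records, process):
    """Process each record at most once, even if the batch is retried."""
    for record in records:
        key = record_key(record)
        if key in processed:
            continue  # already handled in a previous attempt
        process(record)
        processed.add(key)  # mark as done only after success


# Simulate a transient failure on the second record of the batch.
attempts = {"count": 0}
seen = []


def process(record):
    if record["id"] == 2 and attempts["count"] == 0:
        attempts["count"] += 1
        raise RuntimeError("transient failure")
    seen.append(record["id"])


batch = [{"id": 1}, {"id": 2}, {"id": 3}]
try:
    handle_batch(batch, process)
except RuntimeError:
    handle_batch(batch, process)  # the whole batch is retried

print(seen)  # [1, 2, 3]
```

Note that hashing the content treats two records with identical payloads as the same record, and that processing and marking are not atomic here, so a crash between the two steps would still cause one repeat. Using the provided record IDs as keys avoids the first issue.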
Control your executions
Lambda is built for scale, and that’s great. But don’t let lambdas outscale you. Even though we all like to say that our systems scale perfectly and so on, we all have a limited amount of resources (not just technical ones; engineering teams have budgets too!).
So, for that reason, even though you can have thousands, or even hundreds of thousands (https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html), of lambda executions for the same function at the same time, that might be your doom.
If you are doing heavy queries to a database, you can bring it down. If you are using it to do a callback to one of your services, you can overload it and, if you are calling an endpoint to a third-party service, you might receive a quite unpleasant phone call.
So, in this case, a good approach is to throttle your lambda executions (which, luckily, is already built into AWS).
Now, how fast you can crunch through your records is determined by 2 things: Concurrency and BatchSize.
You are probably surprised by the second one. This is due to the implementation of Lambda itself. In the end, even though it’s “serverless”, your code needs to live somewhere, and the lambda function needs to spin up somehow. We are not going to get into the specifics of how that happens, but note that the process takes some time.
That time will effectively impact the number of records you can process per second. Let’s put it in an example. Let’s say that lambda takes 0.3s to be invoked, and you have a batch size of 1 (you only process 1 record per lambda execution). If the processing time for 1 record is 0.2s, the whole process will take 0.5s. In that case, you can process 2 records per second for that function. If you have a concurrency of 5 (5 executions in parallel), you can process 10 records per second.
Now, let’s say you have a batch size of 10. In that case, you’ll pay the 0.3s penalty to invoke the lambda function and process 10 records in 2s, totaling 2.3s per execution. If you had 5 executions in parallel (just as before), you would process 50 records in 2.3s, or about 21.7 records per second.
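The arithmetic above fits in one small function. This is a simplified model that assumes a fixed invocation overhead and a fixed per-record processing time; `records_per_second` and its parameter names are made up for the example.

```python
def records_per_second(invoke_overhead_s, per_record_s, batch_size, concurrency):
    """Estimate throughput for a given batch size and concurrency."""
    # One execution pays the invocation overhead once, then processes
    # the records of its batch sequentially.
    execution_time = invoke_overhead_s + per_record_s * batch_size
    return concurrency * batch_size / execution_time


small_batches = records_per_second(0.3, 0.2, batch_size=1, concurrency=5)
large_batches = records_per_second(0.3, 0.2, batch_size=10, concurrency=5)

# Roughly 10 records/s with a batch size of 1, and a bit more than
# double that with a batch size of 10.
print(round(small_batches, 2), round(large_batches, 2))
```

In this model, the larger the batch, the smaller the share of each execution spent on the invocation overhead, which is why throughput grows without touching concurrency.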
Scaling using BatchSize might also save some money, since you pay per request (https://aws.amazon.com/lambda/pricing/), which is why most of our scaling tweaks come from BatchSize instead of Provisioned Concurrency.
The not-so-bright side of scaling by BatchSize is that, as we mentioned before, if one of your records fails, you’ll need to re-process many more records than with a smaller batch size.
Monitoring and alerting are key
Being serverless doesn’t mean that it’s not critical or that it can be monitored more lightly.
In our case, this architecture is instrumental; it is the backbone of the whole company. For that reason, it’s tightly monitored, and many alerts are configured.
I’d recommend a mix of APM metrics (if you are using callbacks, as we do) and CloudWatch metrics to have a full picture of the wellbeing of your system.
You can take a look at some metrics here. In my opinion, the most important ones are the invocations (which determine the cost), the error count and success rate (to detect sporadic or continuous failures) and IteratorAge, which is probably the most important of them all.
IteratorAge is how long your oldest record has been waiting to be executed (when linked to a Kinesis stream, SQS or other).
This gives you, in a simple glance, a metric of how healthy your lambda function is, and it’s the one we use in our alerting system. If some record fails and gets stuck re-processing, the IteratorAge will go up. If the batch size is too small and you can’t cope with your records, the IteratorAge will go up. If the concurrency limit is too low, the records will pile up trying to get executed, and (surprise!) the IteratorAge will go up.
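As an illustration of alerting on this metric, the sketch below builds the parameters for a CloudWatch alarm on a function’s IteratorAge. It assumes a stream-based event source; the function name, alarm naming scheme and thresholds are hypothetical, and this is not our actual alerting configuration.

```python
def iterator_age_alarm(function_name, max_age_seconds, periods=3):
    """Build CloudWatch alarm parameters for a function's IteratorAge."""
    return {
        "AlarmName": f"{function_name}-iterator-age",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,  # evaluate the metric once per minute
        "EvaluationPeriods": periods,  # consecutive periods above the threshold
        "Threshold": max_age_seconds * 1000,  # IteratorAge is reported in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "orders-listener" is a made-up function name.
params = iterator_age_alarm("orders-listener", max_age_seconds=300)
print(params["Threshold"])
```

With boto3 installed and credentials configured, these parameters could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`, ideally from the same reviewed pipeline that deploys the function.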
Obviously, you also need to monitor the resources that lambda uses (be it a DB, a Redis node or a service), and that’s why we use some APM tools that will alert us if the traffic is too high or the average response time seems to be climbing.
CI/CD rules apply
Lambda provides a straightforward way not just to deploy your application code but to change it interactively in its console.
Any engineer with their AWS CLI configured correctly can deploy a lambda function with just 1 command, without the change being peer-reviewed, approved or audited by anyone.
I strongly recommend against it.
On top of lambda functions and CloudFormation stacks, we also use the Serverless CLI tools (https://www.serverless.com/framework/docs/providers/aws/cli-reference/deploy/), which include the sls deploy command.
Especially at the beginning, when we were experimenting with scale (remember, a combination of BatchSize and Provisioned Concurrency), we would tweak some lambda parameters and deploy them straight from our terminals into production and, at the end of it, when we were happy with the performance, we’d commit the code.
That sounds like a reasonable plan, except for the times when we got distracted and forgot either to push (or merge) the changes, or to pull the most recent changes before deploying our own.
We learned quite quickly that this approach was obviously flawed and, even though deploying was just 1 line in a terminal, we needed to create deployment scripts and integrate them into our CI/CD pipeline.
We use Jenkins for that, so now all our listeners are deployed from there, and we can see exactly who made every deployment, when and how. Needless to say, this has made our development much safer.
This is just a small article with 5 lessons we learned, but we could go on forever about how we learned to master lambda functions and other uses (we use them as a crontab, for instance).
Heavy use of lambda was a risk we took when serverless was not yet mainstream, but with time (and a few painful lessons), we learned how to master it. Now lambda functions are ingrained in our day-to-day operations, and we wouldn’t choose otherwise.
I hope that with this article, you don’t make the mistakes I made and get encouraged to try them, either for your production environment or for any pet project you might have lying around.