Over two years ago, we started our Serverless journey at Immoweb.
As with every new technology, the learning curve was steep, mainly due to the asynchronous, event-driven microservice architecture pattern and to the fact that we were not managing our own infrastructure anymore.
Our experience remains positive so far, despite some pitfalls. I want to take the opportunity to review this journey while focusing on some critical steps of the migration: the bad, the good and the ugly.
What should we start with?
We took our first step in Serverless with a non-core business feature: the management of our surfers’ GDPR consent.
The perfect case to set up a microservice:
- It was a new feature
- more or less isolated
- with a short time to market target, due to European regulations
Our API team's main language at the time was Java, and it seemed quite natural for its members to keep developing with it. At first, we were a bit disappointed by the speed of our Lambda functions, mainly because of what we called the “cold start”. After reviewing several articles comparing Lambda startup speed across languages, we saw that Java was clearly slower to start up than alternatives such as Node.js or Python.
Including a lot of libraries or a heavy framework also turned out to be a bad idea. The overall size of your function (libraries included) heavily impacts performance when the function bootstraps for the first time. Fortunately, after the first execution, Amazon keeps your function warm and in memory for up to ten minutes. That’s good for frequently accessed functions, less so for sporadic executions.
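A common way to take advantage of warm containers is to keep expensive initialization outside the handler, so it only runs on a cold start. A minimal Node.js sketch (the resource and counter names are illustrative, not from the article):

```javascript
// Hypothetical sketch: module-level state is created once per container and
// reused while Lambda keeps the container warm, so expensive initialization
// should live outside the handler body.
let dbConnection = null; // placeholder for an expensive resource
let coldStarts = 0;      // counts how often we actually paid the init cost

function expensiveInit() {
  coldStarts += 1;
  dbConnection = { ready: true }; // e.g. open connections, load configuration
}

// This function would be exported as the Lambda handler in a real deployment.
async function handler(event) {
  if (!dbConnection) expensiveInit(); // only runs on a cold start
  return { statusCode: 200, coldStarts };
}
```

On a warm invocation the `dbConnection` object is already there, so the handler skips initialization entirely; this is one of the simplest ways to shave time off repeated executions.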
Finally, the amount of memory you allocate to your Lambda function impacts both execution speed and cost. Several tools are available to optimize this cost/performance trade-off; see the links below.
There are many fine-tuning practices, but these three at least are key:
- Coding Language
- Size of your code/package
- Memory allocation to run your Lambda
This project was released on time and with acceptable performance. But we knew that we had to dig deeper and learn more about how to use this new technology with better performance.
Is an AWS Lambda fast and robust enough to support a Search API?
Immoweb is a digital real estate classifieds player; our core business relies heavily on a robust and fast search feature. With around 500,000 visits and millions of search queries per day, there was little to no room for failure in the search API rewrite.
Fortunately, a couple of months earlier we had written a brand-new advertising service using Lambda, building on our previous experience, so we were not really starting from scratch. To implement this advertising server, we switched to lightweight Node.js and some caching techniques to speed up Lambda execution. This second attempt was a success: the two functions running this service reach up to 110 calls/second under heavy load, with an error rate close to zero.
We were ready to implement our core functionality, the Search API, in Serverless.
The Search API
The current architecture is quite common. The clients, our website and mobile applications, call a CloudFront cache. If a cached response already exists for the same search criteria, CloudFront sends it directly back to the client (in roughly 10 ms). If the cache entry doesn’t exist or has expired, the request is forwarded to the API Gateway, which triggers the Lambda function. This Lambda is responsible for translating the query into the Elasticsearch query format, fetching the response and applying some business transformations to produce a nested, readable JSON document sent back to the client.
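The translation step can be sketched as a pure function that maps incoming search criteria to an Elasticsearch bool query. This is a hypothetical illustration: the real field names, criteria and business rules are Immoweb's and are not shown in the article.

```javascript
// Hypothetical sketch of the query-translation step inside the search Lambda.
// Field names (postalCode, propertyType, price) are assumptions for illustration.
function toElasticsearchQuery(criteria) {
  const must = [];
  if (criteria.postalCode) {
    must.push({ term: { postalCode: criteria.postalCode } });
  }
  if (criteria.propertyType) {
    must.push({ term: { propertyType: criteria.propertyType } });
  }
  if (criteria.maxPrice) {
    must.push({ range: { price: { lte: criteria.maxPrice } } });
  }
  // Elasticsearch bool query with a default page size when none is given.
  return { query: { bool: { must } }, size: criteria.pageSize || 30 };
}
```

Keeping this step as a pure function makes it easy to unit-test the mapping without an Elasticsearch cluster, which matters when the function sits on the critical path of millions of queries per day.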
We are currently serving up to 1,900,000 search queries per day across our whole platform.
Our CloudFront cache serves up to 45% of the requests for the website and about 13% for the mobile applications. The average execution time of the search Lambda is about 70 ms.
This service has been running for several months without any major issue. It has been constantly updated and deployed with incredible ease of operation. We don’t have to do the complex math of server provisioning or implement complex autoscaling rules; the system adapts itself gently according to the number of requests to serve.
There is still room for improvement, such as decreasing the response time of the whole stack, planning canary deployments and setting up our search engine in a multi-region active-active DRP. (link below)
Event & Lambda, the “silver bullet” that might kill you?
Mastering Lambda takes time, and sometimes you can face a really bad situation, as we did recently. In order to insert several documents into a DynamoDB table, we decided to implement a simple ETL in Serverless with the following architecture: files are uploaded to an S3 bucket, and a Lambda function is triggered on the PUT and POST events fired when a file is uploaded. This Lambda reads and transforms the data to be loaded into DynamoDB. It’s a pattern we had already used on another project, and it had worked well.
Nonetheless, this project turned into a total failure, which occurred in one development environment, for the following reasons.
Several mistakes were made in the implementation, as represented by the red arrow on the schema. The Lambda was processing several rows from each .csv file. For each row that failed, an error log file was also written to the same S3 bucket, and the full content of this error file was sent to CloudWatch at the same time. Since the Lambda was triggered on S3 PUT/POST events, every write of a log file launched the function again. We had built a kind of event-based infinite loop, running silently behind the scenes.
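A cheap defence against this kind of self-triggering loop is a guard inside the handler that refuses to process its own output. A hypothetical sketch (the prefixes and extension are assumptions, not the actual layout of our bucket):

```javascript
// Hypothetical guard against event loops: only accept .csv files under the
// input folder, and explicitly ignore anything under the log folder the
// function itself writes to.
const INPUT_PREFIX = "input/"; // assumed folder for uploaded source files
const LOG_PREFIX = "logs/";    // assumed folder for error output

function shouldProcess(key) {
  if (key.startsWith(LOG_PREFIX)) return false;   // never react to our own logs
  if (!key.startsWith(INPUT_PREFIX)) return false; // ignore everything else
  return key.endsWith(".csv");                     // only the expected file type
}
```

Even when the bucket notification itself is filtered (see the S3 section below in the good practices), a belt-and-braces check like this in the handler costs almost nothing and would have broken our loop at the first iteration.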
Lambda concurrency reached the soft limit of our account, 1,000 concurrent executions, for several hours. Fortunately, a team member spotted, through our monitoring, an abnormally huge number of Lambda executions running in a dev environment and stopped the leak.
But it was too late: in just a few hours, the amount of AWS resources consumed was already way above normal usage.
Over the course of fourteen hours, we had run more than 280 million Lambda executions, and 42 TB of logs had been collected in CloudWatch. On top of that, several GB of data had been transferred through the NAT Gateway.
The cost impact of the Lambda executions was high but acceptable compared to the CloudWatch logs ingestion, where every GB ingested costs about $0.50.
Unfortunately for us, no budget alarm had been correctly set up on the account.
Serverless = Limitless: lessons learned
As I discussed with some colleagues: back when we were managing our own datacentre and servers, physical limits (hard drive space, the number of machines) naturally capped how far things could go wrong.
With cloud services and serverless, everything scales on demand, in both computing power and storage. Being careless can therefore lead to high expenses and disaster.
Fortunately, in this kind of scenario Amazon is, most of the time, open to discussion to limit the financial damage, provided you put measures in place to avoid the same scenario happening again.
Here are usual good practices to enforce, across teams:
Usage and Budget Alarms
It seems obvious, but always set several budget alarms. A good practice is to set them at different percentages of the previous month’s usage, for example at 60%, 70%, 80% and 90%.
If you receive several of these alarms over a short period of time, there is something fishy in your account.
You can also set alarms on the consumption of individual services.
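With AWS Budgets this can be scripted; a hedged sketch with the AWS CLI, where the account ID, budget name, amount and e-mail address are all placeholders, and one notification block per threshold (60/70/80/90%) would be added the same way:

```shell
# Sketch only: substitute your own account ID, monthly amount and subscribers.
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName": "monthly-cost", "BudgetLimit": {"Amount": "1000", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST"}' \
  --notifications-with-subscribers '[{"Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN", "Threshold": 60, "ThresholdType": "PERCENTAGE"}, "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}]}]'
```

Because this is account-level configuration rather than application code, it is easy to bake into the scripts that bootstrap every new account, so no environment is ever left without alarms.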
In some environments (like dev or test), limit the number of executions each Lambda function can run in parallel. In our case, capping it at 10 concurrent executions would have divided the bill by 100.
Link : https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
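Reserved concurrency can be set per function with a single CLI call; the function name below is a placeholder:

```shell
# Cap a single function (e.g. in a dev account) so a runaway event loop can
# never exceed 10 concurrent executions of this function.
aws lambda put-function-concurrency \
  --function-name my-dev-etl-function \
  --reserved-concurrent-executions 10
```

Note that reserved concurrency is carved out of the account-wide pool, so a cap of 10 on a dev ETL also guarantees the runaway function cannot starve other functions in the same account.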
Logs and CloudWatch
CloudWatch is powerful, but it can be expensive. It’s recommended to limit the amount of data you ingest into CloudWatch.
You can set an appropriate log level according to the environment and the need (don’t log your debug traces in production).
Force developers to output the bare minimum into the log traces.
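An environment-driven log level is a few lines of code; a minimal sketch, assuming a `LOG_LEVEL` environment variable (an assumption for illustration, not something the article prescribes):

```javascript
// Minimal sketch of environment-driven log levels. LOG_LEVEL is an assumed
// environment variable; the default of "info" silences debug traces.
const LEVELS = { debug: 0, info: 1, warn: 2, error: 3 };
const current = LEVELS[process.env.LOG_LEVEL || "info"];

function log(level, message) {
  const emitted = LEVELS[level] >= current;
  if (emitted) console.log(`[${level}] ${message}`);
  return emitted; // returned so the filtering behaviour can be tested
}
```

Setting `LOG_LEVEL=debug` only on dev and `LOG_LEVEL=warn` or `error` in production keeps CloudWatch ingestion, and therefore its bill, proportional to what you actually need.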
Manage S3 events carefully
Learn how events are generated, and carefully choose the event types and the paths on which you listen. It’s better to attach an S3 event to a dedicated folder and a specific file extension than to place the listener on the root of the bucket, triggered on every object.
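S3 supports this natively through prefix and suffix filter rules in the bucket notification configuration. A hedged sketch of such a configuration (the ARN, prefix and suffix are placeholders), which can be applied with `aws s3api put-bucket-notification-configuration`:

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:csv-loader",
      "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "input/" },
            { "Name": "suffix", "Value": ".csv" }
          ]
        }
      }
    }
  ]
}
```

With a filter like this, log files written elsewhere in the bucket never generate events at all, so the infinite loop described above cannot even start.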
Evaluate other Amazon services
Amazon offers hundreds of services, and sometimes an existing feature or managed service delivers better results in terms of efficiency and cost than developing your own solution on top of serverless functions. Building a simple ETL with Lambda might be a good option, but always compare it with services like AWS Glue or the Database Migration Service, which might improve your efficiency and keep your costs under control.
Tips to start the right way
The serverless world is evolving at a fast pace, with features and improvements added constantly. It’s truly a game-changer, helping developers (backend, API, DevOps, Big Data, …) respond faster to business needs while decreasing infrastructure complexity. To avoid some pitfalls and start serverless the right way, I recommend following the AWS Serverless Heroes community, and especially Yan Cui, as well as James Beswick, one of the AWS Serverless “Evangelists”.
AWS Heroes community : https://aws.amazon.com/developer/community/heroes/
Yan Cui website : https://theburningmonk.com/
James Beswick : https://www.linkedin.com/in/jamesbeswick/
Tuning Lambda:
AWS Lambda Power Tuning is an open-source tool that can help you visualize and fine-tune the memory/power configuration…
Active / Active multi-region DRP :