Last year, I designed and co-wrote a public-facing Single Page Application and API using completely serverless technologies on AWS. The stack comprised AWS Lambda and AWS API Gateway for the middleware, with AWS Elasticsearch, AWS Step Functions and AWS DynamoDB at the backend. The architecture looked like the diagram below, with some information redacted.
The infrastructure hosts about 160 million instrument reference records, and most searches complete within 4–5 seconds. The user submits search terms, which flow through to an AWS Lambda function. The function constructs Elasticsearch queries, injects a few business-sensitive filters and submits the queries to an Elasticsearch server. Lambda functions are also used to load over a million records every day onto Elasticsearch via a Step Functions-driven batch process. The Lambda functions were all written in Java, using popular open-source libraries such as the Elasticsearch Java client and the Apache Commons HTTP client.
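To illustrate the query-construction step, here is a minimal sketch of a Lambda-side helper that wraps the user's search term in an Elasticsearch bool query and injects a fixed filter clause before the query leaves the function. This is not the production code: the field names and the filter are invented (the real filters are business-sensitive and redacted), and a real implementation would use the Elasticsearch Java client's query builders rather than raw strings.

```java
// Illustrative only: buildQuery wraps the user's term in a bool query and
// injects a server-side filter the user never sees. All names are hypothetical.
class QueryBuilderSketch {

    static String buildQuery(String userTerm) {
        // Hypothetical stand-in for the redacted business-sensitive filters.
        String injectedFilter = "{\"term\":{\"visibility\":\"public\"}}";
        return "{\"query\":{\"bool\":{"
                + "\"must\":[{\"match\":{\"name\":\"" + escape(userTerm) + "\"}}],"
                + "\"filter\":[" + injectedFilter + "]"
                + "}}}";
    }

    // Minimal escaping so the user's term cannot break out of the JSON string.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The point of the shape is that the filter clause is appended inside the function, so the client-facing API never exposes or accepts it.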
Having run this application for about a year, I have a few lessons learnt, which I shall share with you.
When things go wrong, root cause analysis (RCA) is hard.
About six months in, we had a strange problem: the Elasticsearch Java client was intermittently unable to connect to the backend Elasticsearch servers. At first, we thought this was an AWS issue. We got into long calls with AWS support (who are excellent), but they concluded that there were no issues on either the Elasticsearch instances or the load balancers fronting the service. We started digging deep into the client code and the HTTP libraries underneath it. We then changed our safety-first approach of opening and closing HTTP connections on every request and opted to keep the connection open. This solved the intermittent issue but raised another odd problem: the first request to the function always failed to connect to Elasticsearch. I tried every trick, including adding a static block of code that would establish the connection as soon as the Java class was loaded. It didn't work.
Ordinarily, this is the point where I would start hunting for connection leakages using tools such as netstat, lsof and a few JVM tools. With serverless, this isn't possible. Even the AWS support folks don't have access to the containers that run a customer's Lambda functions.
This is where Serverless is still immature. Diagnosis is limited. Insights are only superficial.
Coding to Serverless only was a mistake
Going back to the previous problem: I was wrong to code the application so that it would only run on AWS Lambda. Had it been written so that the same code could be deployed on an EC2 instance or an ECS container, debugging with tools such as netstat would have been possible.
That’s the key. While it is easy to develop applications quickly using serverless frameworks, not being able to run the same code elsewhere is a problem waiting to happen. Adopting clean architecture principles, abstracting the core use cases of the application away from the surrounding interfaces, is the way to go.
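A minimal sketch of what that abstraction might look like, with invented names (SearchService, InstrumentSearch): the core use case is plain Java, and each deployment target gets only a thin adapter. The AWS SDK types are deliberately left out so the sketch stays self-contained; a real Lambda adapter would implement the SDK's RequestHandler interface and do nothing more than delegate.

```java
// Core domain: plain Java, no AWS imports. This is the code you want to be
// able to run and debug on EC2, in a container, or on a laptop.
interface SearchService {
    String search(String term);
}

class InstrumentSearch implements SearchService {
    @Override
    public String search(String term) {
        // The real implementation would build the ES query and apply filters here.
        return "results-for:" + term;
    }
}

// Thin Lambda-style adapter: in a real deployment this would be the only class
// importing AWS types, and would contain nothing but delegation.
class LambdaAdapter {
    private final SearchService core = new InstrumentSearch();

    public String handleRequest(String input) {
        return core.search(input);
    }
}

// Equally thin adapter for an EC2/ECS deployment, e.g. behind a servlet.
class HttpAdapter {
    private final SearchService core = new InstrumentSearch();

    public String onGet(String queryParam) {
        return core.search(queryParam);
    }
}
```

Because both adapters delegate to the same SearchService, the connection problem described above could have been reproduced on a box where netstat and lsof were available.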
Serverless has a problem with State
The underlying container can be reused between invocations, so the Lambda programming model is analogous to the Java Servlet model: mutable global variables are bad, locals are fine. But how does one efficiently manage resources such as socket connections? Should we create and close all HTTP and JDBC connections per request? That’s not efficient. The thing is, resource management is hard, and the more resources a Lambda function needs to fulfil a request, the trickier it gets to manage them.
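The compromise we ended up with can be sketched as follows: initialise the expensive resource once, outside the per-request path, and reuse it for as long as the container stays warm. This sketch uses the JDK's built-in HttpClient and an invented class name (ElasticsearchHandler) rather than the actual Elasticsearch client, but the pattern is the same.

```java
import java.net.http.HttpClient;
import java.time.Duration;

// Warm-container pattern: the client is created when the class is first used
// and survives across invocations served by the same container.
// ElasticsearchHandler is an illustrative name, not the original code.
class ElasticsearchHandler {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Every invocation sees the same client, so its connection pool is reused
    // instead of being rebuilt (and torn down) on each request.
    static HttpClient client() {
        return CLIENT;
    }
}
```

The trade-off is exactly the one described above: you gain efficiency on warm invocations, but you lose control over when the connection is actually closed, which is where our first-request failures came from.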
AWS has begun to recognise this problem, and offerings like RDS Proxy are a step in the right direction: https://aws.amazon.com/blogs/compute/using-amazon-rds-proxy-with-aws-lambda/.
However, we need similar managed proxies for making HTTP calls as well, or Amazon could allow developers to bring their own HTTP client library.
AWS Lambda is not the only option
With ECS and EKS maturing, it is not a bad idea to start with the ECS EC2 launch type, monitor the application’s resource usage over time, tune it, and only then move it to the Fargate launch type. Why not AWS Lambda? Because at runtime, Fargate and Lambda carry the same risks and opportunities.
For me, the real question is: am I running a microfunction or a microservice?
The Lambda programming model is not fit for everything.
The Lambda programming model and the underlying infrastructure encourage us to decompose systems into collections of lightweight functions that act upon a limited data domain. For example, it is best used for one-off tasks such as image processing, data validation, data transformation, or translating and mediating HTTP requests.
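To make the distinction concrete, here is the shape of a task Lambda handles well: one small, stateless transformation with no resources to manage. The example is invented (an identifier-normalising step such as our batch load might perform), not taken from the original application.

```java
import java.util.Locale;

// A microfunction-sized task: stateless, single-purpose, trivially testable.
// The name and field semantics are illustrative only.
class IsinNormaliser {

    // Trims and upper-cases an instrument identifier, rejecting blank input.
    static String normalise(String raw) {
        if (raw == null || raw.isBlank()) {
            throw new IllegalArgumentException("empty identifier");
        }
        return raw.trim().toUpperCase(Locale.ROOT);
    }
}
```

A function like this has no connections to pool and no state to carry between invocations, which is exactly why it fits the Lambda model where the search application did not.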
Modelling an entire domain with multiple aggregates, entities and use cases, and then shoe-horning it into a single Lambda function, is a very bad idea. One can indeed build a Lambda application that is essentially a collection of Lambda functions, but why would I do that? Why would I break up the six Java methods supported by my service class into six individual Lambda functions? Why wouldn’t I deploy the whole domain as a single microservice in a Docker container, as opposed to six different microfunctions?