A tale of how we managed to ensure log visibility in a microservice and serverless AWS environment.
When correctly managed, logs provide organizations with strong traceability, alerting and backup capabilities. This is why organizations with appropriate log management practices are able to quickly troubleshoot and understand issues that might arise in their business-critical systems.
In this story, we will focus on how we — at Ekonoo — implemented the appropriate tooling to collect and analyze functional logs generated by our serverless infrastructure.
Where we stand
Here at Ekonoo, we have decided to implement an AWS serverless architecture. This decision was made with one objective in mind: to provide, as quickly as possible, a robust and secure operational environment allowing us to release production-ready and scalable products.
In order to achieve this objective, we defined our infrastructure components as CloudFormation Infrastructure as Code (IaC) files which are automatically deployed in multiple environments via Continuous Integration and Continuous Delivery automated pipelines.
Although such an infrastructure presents obvious advantages, it also creates additional complexity in specific areas such as log management. Currently, our system relies on hundreds of Lambda functions distributed across many microservices. In an AWS environment, this implies managing hundreds of active CloudWatch Log Groups (at least one Log Group per Lambda function and one Log Group per infrastructure component, such as an API Gateway).
AWS provides its customers with services such as CloudWatch Metrics which allow users to build real-time dashboards and alerts to monitor infrastructure components: API Gateway statuses, DynamoDB quotas, Elasticsearch cluster health checks, etc. However, correlating, searching and interpreting logs across different functional domains is a tedious and complex job where CloudWatch does not shine.
Our objective: Provide our software engineers with appropriate visibility of cross-service logs for debug and troubleshooting purposes
The problem: How do we centralize log management in a decentralized and serverless architecture?
Even though CloudWatch’s UI provides users with a great way to access a specific Lambda’s logs, it is particularly painful to use this service when dealing with large numbers of Log Groups, as its interface does not enable users to query across multiple Log Groups and/or Log Streams in a straightforward and user-friendly manner. This left us with the following problem: how can we easily analyze and troubleshoot a business-related issue in an environment containing hundreds of log sources distributed across multiple services?
We therefore needed to find a way to centralize hundreds of individual and uncorrelated log groups into a single location where they could be analyzed and processed. As mentioned previously, at Ekonoo we prioritize the implementation of fully managed solutions in order to avoid managing infrastructure and non-business-related issues.
When we started looking for a natively supported, managed and robust log management platform on AWS, we immediately thought of using AWS’s Elasticsearch Service (ES), since it provides an integrated Kibana dashboard (even though the service has been raising some controversy lately).
The next (and last) step — detailed in this story — consisted of designing and implementing a way to automatically forward all our logs to AWS’s Elasticsearch Service. Our main requirements are described below:
• The solution should be automated as much as possible via a CloudFormation template. This provides us with the assurance that the deployed infrastructure is stable enough to pass unit tests defined in our Continuous Integration and Continuous Delivery pipelines.
• The implementation of the solution should not require us to make huge modifications to the currently deployed infrastructure and code base (lambda functions).
This is a non-trivial task, as CloudFormation templates currently do not allow us to (i) assign a specific Log Group to a Lambda function, and (ii) configure, for each Log Group, a Subscription Filter in charge of forwarding logs to an Elasticsearch cluster.
The studied approaches
As we could not natively define in CloudFormation which Log Group is associated with each Lambda and which Subscription Filter should be configured for each Log Group, we started looking into ways to automate this process.
You will find below a very quick description of the studied — but not retained — approaches. If you feel any of those would have been a viable solution, feel free to get in touch to discuss them.
I. Create a runtime extension connector for our Lambda functions
One of the studied approaches consisted of connecting to each Lambda via a runtime extension (1) (2) to capture and forward the logs to our Elasticsearch cluster. This solution appeared quite elegant but presented the following downsides:
- This solution would have required manual development of software in charge of pushing logs to Elasticsearch which implies the creation of an index, appropriate bulk request management and error handling.
- The solution would have required us to manually modify each CloudFormation template to update the lambda functions and configure their runtime extensions.
II. Create a custom Resource to automatically create log groups and log filters on Lambda creation
This approach consisted of implementing a custom resource that would retrieve, per template, the ARNs of each Lambda function and automatically create their Log Groups with the appropriate Subscription Filter in charge of forwarding logs to our Elasticsearch cluster.
This option presented some advantages:
- It can be fully automated in each CloudFormation template.
- It requires a relatively low effort to implement as we would use AWS-native services.
As well as some disadvantages:
- Implementing this custom resource organization-wide would require modifying each template for each one of our microservices.
- The implementation of the custom resource cannot be enforced on newly created microservices and could therefore be forgotten.
- For each CF template, the custom resource would need to be manually updated each time a lambda function is added or deleted. This process is human-based and therefore error-prone (i.e., if a developer creates a new lambda function and forgets to reference it in the custom resource, its logs would not be forwarded to our ES cluster).
The selected approach: Coupling AWS EventBridge events with CloudWatch
Before explaining the implemented solution, we will briefly describe the set of actions that occur when a Lambda function is created with a CloudFormation template and writes logs during its first execution.
The below image describes the process in a straightforward way:
- Whenever a Lambda function is created (either via a CloudFormation template or manually), its associated Log Group is not created at the same time.
- During its first execution, the Lambda function sends a request to create the Log Group where its logs will be stored.
- This request is caught by CloudWatch, which creates the appropriate Log Group, and by CloudTrail, which records it.
As our objective was to forward the logs stored in each Lambda’s Log Group, we needed to catch the event corresponding to the creation of a new Log Group. Thanks to CloudTrail, we learned that an event is delivered to the account’s EventBridge service whenever a Lambda function is executed for the first time (and therefore has to create a new Log Group).
The provided event has the following structure:
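The exact event is not reproduced here; the sketch below illustrates the standard shape of an “AWS API Call via CloudTrail” event delivered by EventBridge (account number, region and log group name are placeholders):

```json
{
  "version": "0",
  "detail-type": "AWS API Call via CloudTrail",
  "source": "aws.logs",
  "account": "123456789012",
  "region": "eu-west-1",
  "detail": {
    "eventSource": "logs.amazonaws.com",
    "eventName": "CreateLogGroup",
    "requestParameters": {
      "logGroupName": "/aws/lambda/my-service-function"
    }
  }
}
```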
As this Event contains the logGroupName parameter, we have everything we need to create a Lambda function that can be triggered whenever a Log Group is created.
Before jumping to the implementation, let’s consider the pros and cons of this solution.
The pros are straightforward:
- Self-contained stack: the solution reacts to events and does not create dependencies on other stacks.
- Low volume of code to write, as we only need two Lambdas, and one of them only needs tweaks since its code is provided by AWS.
- No impact on the dev teams: the development teams are not required to change anything in their code or CloudFormation templates.
- Essentially a set and forget operation: we can set it up and forget it’s even there.
But it has some downsides too:
- Essentially a set and forget operation: we should not forget that it’s there; the teams should be aware that logs are not streamed automagically. In addition, the solution needs to be well documented to be well understood.
- Possibility of losing logs: as the forwarding is done by a Lambda function, we should handle Lambda errors and retries to ensure logs are not lost into oblivion.
- We need to watch carefully how the solution evolves and performs with the quantity of logs we are going to process.
The implemented solution is illustrated below:
The set of events occurring when a Lambda runs for the first time and creates its first Log Group is detailed below:
1. The Lambda1 function is executed for the first time and sends an event requesting the creation of its Log Group.
2. This event is retrieved by CloudWatch. Two actions are then performed:
- A new Log Group is created and receives the logs from the Lambda1 function.
- The event triggers an EventBridge rule which we named LogGroupCreatedRule.
3. The LogGroupCreatedRule forwards the event to the LogGroupCreated Lambda function.
4. Upon receiving this event, a new Subscription Filter is created by the LogGroupCreated Lambda. This Subscription Filter is attached to the newly created Log Group and specifies that new logs need to be forwarded to a specific Lambda function.
5. Whenever a new log is provided to the Log Group, it is automatically forwarded to the LogForwarderLambda function. The source code of this function has been developed by Amazon: you can automatically generate it by creating a Subscription Filter that forwards logs to an Elasticsearch cluster. In our case, we have reused this code to ensure that we could create the whole solution via CloudFormation.
It is worth noting that this solution can be deployed from a CloudFormation template in any account.
Our Lambda functions are defined in the Resources section of our CF template as shown below:
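As an illustration, a trimmed-down SAM sketch of the two functions and the invoke permission might look as follows (logical IDs, handler names and runtimes are placeholders, not the exact template we deploy):

```yaml
Resources:
  LambdaLogGroupCreated:
    Type: AWS::Serverless::Function
    Properties:
      Handler: log_group_created.handler
      Runtime: python3.8
      Events:
        LogGroupCreatedRule:
          Type: EventBridgeRule
          Properties:
            Pattern:
              source:
                - aws.logs
              detail:
                eventName:
                  - CreateLogGroup

  LambdaLogsForwarder:
    Type: AWS::Serverless::Function
    Properties:
      Handler: logs_forwarder.handler
      Runtime: nodejs12.x

  # Allows CloudWatch Logs to invoke the forwarding function
  # from the subscription filters we create.
  LogsForwarderInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref LambdaLogsForwarder
      Action: lambda:InvokeFunction
      Principal: logs.amazonaws.com
```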
A few notes on the created Lambdas are documented below:
- We use SAM to package and deploy our templates, which is why the resource type is set to AWS::Serverless::Function.
- We need to specify a Lambda Permission rule allowing CloudWatch Logs to invoke the LambdaLogsForwarder function.
- In the LambdaLogGroupCreated function, we specify that any event with the name “CreateLogGroup” will trigger the Lambda, so that the logs end up forwarded to Kibana. If you are interested, you can also filter per log group name prefix (e.g., only forward events where the log group name begins with “aws/lambda/”).
You can find below extracts of the source code of the LambdaLogGroupCreated and the LambdaLogsForwarder functions.
LambdaLogGroupCreated: The function logic in charge of the creation of a subscription filter for a specified log group is the following:
LambdaLogsForwarder: This function’s full code can be retrieved manually by creating an Elasticsearch subscription filter on a Log Group:
- Navigate to CloudWatch Log Groups.
- Select a Log Group, navigate to its Subscription filters, select Create Elasticsearch subscription filter.
- Provide the required information to create the subscription filter and confirm the creation.
- Navigate to the Lambda console and select the newly created log forwarding function.
The function’s handler code is the following:
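The AWS-generated forwarder is not reproduced here; as an illustration, the payload-decoding step it performs can be sketched in Python (the handler body is a hypothetical stand-in for the bulk-indexing logic):

```python
import base64
import gzip
import json


def decode_cloudwatch_payload(awslogs_data):
    """Decode the base64-encoded, gzip-compressed payload that CloudWatch
    Logs delivers to a subscription-filter target Lambda."""
    return json.loads(gzip.decompress(base64.b64decode(awslogs_data)))


def handler(event, context):
    payload = decode_cloudwatch_payload(event["awslogs"]["data"])
    if payload.get("messageType") != "DATA_MESSAGE":
        return  # CONTROL_MESSAGE payloads carry no log events
    for log_event in payload["logEvents"]:
        # The AWS-generated code turns each log event into an
        # Elasticsearch bulk-index action and POSTs it to the cluster;
        # here we simply print it.
        print(payload["logGroup"], log_event["message"])
```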
This solution has been implemented in our test and staging environments on the 22nd of February 2021, which means there is still significant room for improvement and many learnings to validate, such as:
- How do we unify our logging syntax across all Lambda functions (usage of an internal logging library)?
- How can we efficiently be alerted in case of log forwarding failure? We have implemented SNS alerting at the Lambda level whenever a subscription filter or log forwarding action fails to complete; however, this could also be tweaked to raise alerts with CloudWatch Metrics.
- We should implement a feature allowing us to pre-process the logs and control which fields are analyzed.
- Performance and concurrency limits should be assessed to ensure that we do not exceed the account’s limit on concurrently running Lambdas.
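The SNS alerting mentioned above can be sketched as a small helper called from the Lambdas’ error paths (the topic ARN, helper name and injected client are hypothetical; `sns_client` is expected to expose boto3’s `publish` interface):

```python
def notify_failure(sns_client, topic_arn, log_group, error):
    """Publish an alert so the team is notified when log forwarding
    breaks for a given log group."""
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="Log forwarding failure",
        Message=f"Failed to forward logs for {log_group}: {error}",
    )
```

Injecting the client rather than creating it inside the helper keeps the alerting logic testable without an AWS environment.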
Please keep in mind that we have implemented this solution in order to back up and analyze business-related logs, not to monitor and manage alerts for our “infrastructure” components (e.g., API Gateway error rates, DynamoDB states, Elasticsearch cluster health, etc.). As mentioned in the introduction, we leverage AWS integrated features such as AWS CloudWatch Metrics for our infrastructure in order to define custom alerts in CloudFormation templates. Our main objective is to allow our software engineers to take ownership of our infrastructure alerting and monitoring by making them take part in the definition and implementation of the relevant alerts for each service they use.
If you enjoyed this article, don’t forget to clap 👏 at it and share!
If you are interested in engaging with us, feel free to get in touch via LinkedIn.
Ekonoo is a FinTech / AssurTech startup based in Luxembourg. We aim to disrupt the life insurance and long-term investment industry by allowing customers and organizations to efficiently invest in funds.
Xavier — CTO
Julien — AWS Architect & DevSecOps engineer
Jonathan — DevSecOps engineer, Certified AWS Solutions Architect Associate