Centralizing log management in a distributed AWS infrastructure

Jonathan Bernales
Mar 19 · 10 min read

A tale of how we managed to ensure log visibility in a microservice and serverless AWS environment.

When correctly managed, logs provide organizations with strong traceability, alerting and backup capabilities. This is why organizations with appropriate log management practices are able to quickly troubleshoot and understand issues that might arise in their business-critical systems.

In this story, we will focus on how we — at Ekonoo — implemented the appropriate tooling to collect and analyze functional logs generated by our serverless infrastructure.

Where we stand

Here at Ekonoo, we have decided to implement an AWS serverless architecture. This decision was made with one objective in mind: to provide, as quickly as possible, a robust and secure operational environment allowing us to release production-ready and scalable products.

In order to achieve this objective, we defined our infrastructure components as CloudFormation Infrastructure as Code (IaC) files, which are automatically deployed to multiple environments via automated Continuous Integration and Continuous Delivery pipelines.

Although such an infrastructure presents obvious advantages, it also creates additional complexity in specific areas such as log management. Currently, our system relies on hundreds of Lambda functions distributed across many microservices. In an AWS environment, this implies managing hundreds of active CloudWatch Log Groups (at least one Log Group per Lambda function and one Log Group per infrastructure component such as an API Gateway).

AWS provides its customers with services such as CloudWatch Metrics which allow users to build real-time dashboards and alerts to monitor infrastructure components: API Gateway statuses, DynamoDB quotas, Elasticsearch cluster health checks, etc. However, correlating, searching and interpreting logs across different functional domains is a tedious and complex job where CloudWatch does not shine.

Our objective: Provide our software engineers with appropriate visibility of cross-service logs for debug and troubleshooting purposes

The problem: How do we centralize log management in a decentralized and serverless architecture?

Even though CloudWatch’s UI provides users with a great way to access a specific Lambda’s logs, it is particularly painful to use this service when dealing with large numbers of Log Groups, as its interface does not enable users to query across multiple Log Groups and/or Log Streams in a straightforward and user-friendly manner. This left us with the following problem: how can we easily analyze and troubleshoot a business-related issue in an environment containing hundreds of log sources distributed across multiple services?

CloudWatch UI showing the Log Streams associated with one Log Group

We therefore needed to find a way to centralize hundreds of individual and uncorrelated log groups into a single location where they could be analyzed and processed. As mentioned previously, at Ekonoo we prioritize the implementation of fully managed solutions in order to avoid managing infrastructure and non-business-related issues.

When we started looking at AWS-native, managed and robust log management platforms, we immediately thought of using AWS’s Elasticsearch Service (ES), since it provides an integrated Kibana dashboard (even though the service has been raising some controversy lately).

Kibana interface captures — from https://www.elastic.co/kibana

The next (and last) step — detailed in this story — consisted in designing and implementing a way to automatically forward all our logs to AWS’s Elasticsearch Service. Our main requirements are described below:

• The solution should be automated as much as possible via a CloudFormation template. This provides us with the assurance that the deployed infrastructure is stable enough to pass unit tests defined in our Continuous Integration and Continuous Delivery pipelines.

• The implementation of the solution should not require us to make huge modifications to the currently deployed infrastructure and code base (lambda functions).

This is a non-trivial task, as CloudFormation templates currently do not allow us to (i) assign a Log Group to a Lambda function, and (ii) configure, for each Log Group, a Subscription Filter in charge of forwarding logs to an Elasticsearch cluster.

The studied approaches

As we could not natively define in CloudFormation which Log Group is associated with each Lambda and which Subscription Filter should be configured for each Log Group, we started looking into ways to automate this process.

You will find below a very quick description of the studied — but not retained — approaches. If you feel any of those would have been a viable solution, feel free to get in touch to discuss them.

I. Create a runtime extension connector for our Lambda functions

One of the studied approaches consisted in connecting to each lambda via a runtime extension (1) (2) to capture and forward the logs to our Elasticsearch cluster. This solution appeared to be quite elegant but presented the following downsides:

  • This solution would have required the manual development of software in charge of pushing logs to Elasticsearch, which implies the creation of an index, appropriate bulk-request management and error handling.

II. Create a custom Resource to automatically create log groups and log filters on Lambda creation

This approach consisted in implementing a custom resource that would retrieve, per template, the ARNs of each Lambda function and automatically create their Log Groups with the appropriate Subscription Filters in charge of forwarding logs to our Elasticsearch cluster.

This option presented some advantages:

  • It can be fully automated in each CloudFormation template.

As well as some disadvantages:

  • Implementing this custom resource organization-wide would require modifying each template for each one of our microservices.

The selected approach: Coupling AWS EventBridge events with CloudWatch

Before explaining the implemented solution, we will briefly describe the set of actions that occur when a Lambda function is created with a CloudFormation template and writes logs during its first execution.

The image below describes the process in a straightforward way:

  1. Whenever a Lambda function is created (either via a CloudFormation template or manually), its associated Log Group is not created with it: the Log Group only appears once the function is executed for the first time and writes its first logs.

As our objective consisted in forwarding the logs stored in each Lambda Log Group, we needed to catch the event corresponding to the creation of a new Log Group. Thanks to CloudTrail, we learned that an event is sent to the account’s EventBridge service whenever a Lambda function is executed for the first time (and therefore has to create a new Log Group).

The provided event has the following structure:

Image generated with https://carbon.now.sh
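For readers without access to the original screenshot, the event is a CloudTrail "AWS API Call via CloudTrail" notification delivered through EventBridge; a trimmed sketch in Python (field values are illustrative placeholders) and the extraction of the part we care about:

```python
# Trimmed sketch of the CloudTrail "CreateLogGroup" event delivered via
# EventBridge (account, region and log group name are placeholders).
sample_event = {
    "version": "0",
    "detail-type": "AWS API Call via CloudTrail",
    "source": "aws.logs",
    "region": "eu-west-1",
    "detail": {
        "eventSource": "logs.amazonaws.com",
        "eventName": "CreateLogGroup",
        "requestParameters": {
            "logGroupName": "/aws/lambda/my-function"
        },
    },
}


def extract_log_group_name(event: dict) -> str:
    """Pull the name of the freshly created Log Group out of the event."""
    return event["detail"]["requestParameters"]["logGroupName"]


print(extract_log_group_name(sample_event))  # /aws/lambda/my-function
```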

As this event contains the logGroupName parameter, we have everything we need to trigger a Lambda function whenever a Log Group is created.

Before jumping to the implementation, let’s consider the pros and cons of this solution.

The pros are straightforward:

  • Self-contained stack: the solution reacts to events and does not create dependencies on other stacks.

But it has some downsides too:

  • Essentially a set-and-forget operation: we should not forget that it’s there; teams should be aware that logs are not streamed automagically, and the solution needs to be well documented to be well understood.

The implemented solution is illustrated below:

Image generated with https://carbon.now.sh

The set of events occurring when a Lambda runs for the first time and creates its Log Group is detailed below:

1. The Lambda1 function is executed for the first time and sends an event requesting the creation of its Log Group.

2. A new Log Group is created and receives the logs from the Lambda1 function.

3. The LogGroupCreatedRule forwards the event to the LogGroupCreated Lambda function.

4. Upon receiving this event, a new Subscription Filter is created by the LogGroupCreated Lambda. This subscription filter is attached to the newly created Log Group and specifies that new logs need to be forwarded to a specific lambda function.

5. Whenever a new log is provided to the Log Group, it will be automatically forwarded to the LogForwarderLambda function. The source code of this function has been developed by Amazon. You can automatically generate this function by creating a Subscription Filter that forwards logs to an Elasticsearch cluster. In our case, we have reused this code to ensure that we could create the whole solution via CloudFormation.

It is worth noting that this solution can be deployed from a CloudFormation template in any account.

Our Lambda functions are defined in the Resources section of our CF template as shown below:

Image generated with https://carbon.now.sh

A few notes on the created Lambdas are documented below:

  • We use SAM to package and deploy our templates, which is why the resource type is set to AWS::Serverless::Function.
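For illustration, a minimal SAM fragment wiring the LambdaLogGroupCreated function to the LogGroupCreatedRule could look like the following (handler path, runtime, and policy scope are illustrative assumptions, not an extract of our actual template):

```yaml
Resources:
  LambdaLogGroupCreated:
    Type: AWS::Serverless::Function
    Properties:
      Handler: log_group_created.handler   # hypothetical handler path
      Runtime: python3.8
      Environment:
        Variables:
          # ARN of the forwarder function, assumed to be defined in the
          # same template.
          LOG_FORWARDER_ARN: !GetAtt LambdaLogsForwarder.Arn
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: logs:PutSubscriptionFilter
              Resource: "*"
      Events:
        LogGroupCreatedRule:
          Type: CloudWatchEvent
          Properties:
            # Match the CloudTrail "CreateLogGroup" API call only.
            Pattern:
              source: [aws.logs]
              detail-type: ["AWS API Call via CloudTrail"]
              detail:
                eventSource: [logs.amazonaws.com]
                eventName: [CreateLogGroup]
```

Note that an AWS::Lambda::Permission resource allowing the logs.amazonaws.com principal to invoke the LambdaLogsForwarder function is also needed for the Subscription Filter to deliver logs.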

You can find below extracts of the source code of the LambdaLogGroupCreated and the LambdaLogsForwarder functions.

LambdaLogGroupCreated: The function logic in charge of creating a Subscription Filter for a specified Log Group is the following:

Image generated with https://carbon.now.sh
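The screenshot is not reproduced here, but the core of this logic can be sketched in Python with boto3 (the filter name and the environment variable are illustrative assumptions):

```python
import os


def build_filter_params(log_group_name: str, destination_arn: str) -> dict:
    """Build the arguments for put_subscription_filter: an empty filter
    pattern matches every log event, so the whole Log Group is forwarded."""
    return {
        "logGroupName": log_group_name,
        "filterName": "forward-to-elasticsearch",  # illustrative name
        "filterPattern": "",
        "destinationArn": destination_arn,
    }


def handler(event, context):
    """Triggered by the LogGroupCreatedRule whenever a Log Group is created."""
    import boto3  # imported lazily so the pure logic stays testable offline

    log_group_name = event["detail"]["requestParameters"]["logGroupName"]
    params = build_filter_params(
        log_group_name, os.environ["LOG_FORWARDER_ARN"]
    )
    boto3.client("logs").put_subscription_filter(**params)
```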

LambdaLogsForwarder: This function’s full code can be retrieved manually by creating an Elasticsearch Subscription Filter on a Log Group:

  • Navigate to CloudWatch Log Groups.

The function’s handler code is the following:

Image generated with https://carbon.now.sh
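AWS’s generated forwarder is written in Node.js; as a rough Python sketch of the same decoding and bulk-preparation logic (the index name and document field mapping are simplified assumptions — the generated code derives the index name from the date):

```python
import base64
import gzip
import json


def decode_awslogs(event: dict) -> dict:
    """CloudWatch Logs delivers subscription data to Lambda as
    base64-encoded, gzip-compressed JSON under event["awslogs"]["data"]."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_bulk_actions(payload: dict, index: str = "cwl-logs"):
    """Turn the decoded payload into Elasticsearch bulk-API action pairs:
    one action line and one document line per log event."""
    for log_event in payload["logEvents"]:
        yield {"index": {"_index": index, "_id": log_event["id"]}}
        yield {
            "@timestamp": log_event["timestamp"],
            "message": log_event["message"],
            "logGroup": payload.get("logGroup"),
            "logStream": payload.get("logStream"),
        }
```

The pairs produced by to_bulk_actions would then be serialized and POSTed to the cluster’s _bulk endpoint.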

Final thoughts

This solution has been implemented in our test and staging environments on the 22nd of February 2021, which means there is still much room for improvement and a lot of learning to be validated, such as:

  • How do we unify our logging syntax across all Lambda functions (usage of an internal logging library)?
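As a hint of what such an internal library could normalize, here is a minimal JSON-line logger sketch (field names are illustrative, not our actual convention):

```python
import json
import time


def log(level: str, message: str, **context):
    """Emit one JSON line per log event so every Lambda shares the same,
    easily indexable syntax. Extra keyword arguments become fields."""
    line = json.dumps({
        "timestamp": int(time.time() * 1000),
        "level": level,
        "message": message,
        **context,
    })
    print(line)  # a Lambda's stdout lands in its CloudWatch Log Group
    return line


log("INFO", "user subscription created", user_id="42")
```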

Please keep in mind that we have implemented this solution in order to back up and analyze business-related logs, not to monitor and manage alerts for our “infrastructure” components (i.e.: API Gateway error rates, DynamoDB states, Elasticsearch cluster health, etc.). As mentioned in the introduction, we leverage AWS integrated features such as CloudWatch Metrics for our infrastructure in order to define custom alerts in CloudFormation templates. Our main objective is to allow our software engineers to take ownership of our infrastructure alerting and monitoring by making them take part in the definition and implementation of the relevant alerts for each service they use.

If you enjoyed this article, don’t forget to clap 👏 and share it!

If you are interested in engaging with us, feel free to get in touch via LinkedIn.

About us:

Ekonoo is a FinTech / AssurTech startup based in Luxembourg. We aim to disrupt the life insurance and long-term investment industry by allowing customers and organizations to efficiently invest in funds.

Xavier — CTO

Julien — AWS Architect & DevSecOps engineer

Jonathan — DevSecOps engineer, Certified AWS Solutions Architect Associate
