AWS Fargate Logs Pipeline: Let’s Log It Right!
It was like every other Monday — the gears were turning and the data was flowing. But then, the logs were gone. Our developers and support were blind, waging war against the machine, and we — The DevOps Team — had to SSH our way into the belly of the beast. We lost many logs that day; after that day, change was inevitable.
In this article, I will present the thought process and lessons we learned by implementing Amazon Elasticsearch Service and Amazon Kinesis Data Firehose at HiredScore as our new logging solution for our containerized environment (AWS Elastic Container Service).
Introduction
As microservices gain adoption and orchestration solutions like Kubernetes and Elastic Container Service (ECS) become more popular, the need for a faster and more efficient logging system increases.
As HiredScore’s product and client base grew, we faced a new technological challenge: serving a much larger client base while developing more services as part of the product’s growth. We needed high log visibility for faster troubleshooting and development. Processing and parsing logs manually was no longer an option; we needed to start writing logs in JSON format, which would simplify the pipeline and enable us to send logs directly to Elasticsearch.
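For illustration, a structured log line in that format might look like the following (the field names and values here are made up for the example, not our actual schema):
{
  "timestamp": "2020-06-15T08:21:43Z",
  "level": "ERROR",
  "service": "parsing-service",
  "message": "Failed to process document",
  "request_id": "3f9a2c77-1b2d-4e5f-9a1b-0c2d3e4f5a6b"
}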
The solution at the time was to write logs to CloudWatch (the best practice offered by Amazon) and then pull them into a self-managed ELK stack (installed on EC2 instances) using the “logstash-input-cloudwatch” plugin for Logstash, which ran on a single Logstash container.
As HiredScore grew larger, we started to notice that our single Logstash instance could not keep up with the large number of logs sent from our Fargate services in ECS, which numbered about 48 at the time and kept growing.
We faced a crossroad with two possible routes to take:
- We take the blue pill — the story ends, we wake up in our office chair and believe whatever we want to believe while we continue using Logstash and make it auto-scalable.
- We take the red pill — we stay in Wonderland, and see how deep the rabbit hole goes by eliminating Logstash and sending the logs directly to Elasticsearch.
We understood that migrating to a managed Elasticsearch solution, combined with removing Logstash, would leave us with fewer “moving parts” and resolve the bottleneck issue.
Assumptions & Considerations
- There are ~100M log records per day
- Each record is ~1.5KB
- Logs should be retained for a few months
- Logs older than 2 weeks would be deleted from Elasticsearch
- Should be scalable
- The cost should remain under the existing budget
- Should be easy to maintain
Quota Limits
- Each Kinesis Data Firehose delivery stream provides the following: 5,000 records/second, 2,000 requests/second, and 5 MiB/second.
- This Kinesis quota can be increased if needed by filing a request with AWS.
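A quick back-of-the-envelope check against the assumptions above: ~100M records per day averages out to roughly 1,160 records/second, and at ~1.5KB per record that is about 1.7MB/second (around 150GB/day). On average this sits comfortably within a single delivery stream’s quota, though real traffic is spikier than the average, so the headroom matters.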
Frequency of Delivery:
Kinesis Data Firehose buffers incoming data before delivering it to Amazon Elasticsearch Service. The frequency of data delivery is determined by the Elasticsearch buffer size and buffer interval values configured for the delivery stream.
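For reference, a delivery stream with explicit buffering settings can be created with aws firehose create-delivery-stream --cli-input-json file://stream.json. The sketch below shows the general shape of that input; the stream name, ARNs, index name, and buffer values are placeholders and assumptions to adapt to your environment:
{
  "DeliveryStreamName": "test-firehose",
  "DeliveryStreamType": "DirectPut",
  "ElasticsearchDestinationConfiguration": {
    "RoleARN": "arn:aws:iam::<<AWS-ACCOUNT-ID>>:role/<<YOUR-FIREHOSE-DELIVERY-ROLE>>",
    "DomainARN": "arn:aws:es:<<YOUR-AWS-REGION>>:<<AWS-ACCOUNT-ID>>:domain/<<YOUR-ES-DOMAIN>>",
    "IndexName": "fargate-logs",
    "IndexRotationPeriod": "OneDay",
    "BufferingHints": {
      "IntervalInSeconds": 60,
      "SizeInMBs": 5
    },
    "RetryOptions": { "DurationInSeconds": 300 },
    "S3BackupMode": "FailedDocumentsOnly",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::<<AWS-ACCOUNT-ID>>:role/<<YOUR-FIREHOSE-DELIVERY-ROLE>>",
      "BucketARN": "arn:aws:s3:::<<YOUR-BACKUP-BUCKET>>"
    }
  }
}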
The winning solution
AWS FireLens with a Fluent Bit sidecar. With this option, we can push the logs directly to Amazon Elasticsearch Service without the need for Logstash, using a fully managed AWS solution that allows us to scale at any point as needed.
AWS for Fluent Bit — output logs to:
- Amazon CloudWatch Logs
- Amazon Kinesis Data Firehose
- Amazon Kinesis Data Streams
- All destinations supported natively in Fluent Bit
The Dataflow Architecture
The main architecture is based on Fluent Bit as a sidecar for the Fargate task’s application container.
Fluent Bit has a very low CPU/memory footprint and many capabilities to filter/parse the streamed data.
In the diagram below we can see that:
- Fluent Bit resides as a sidecar with the Service in the task.
- Service logs are written to STDOUT.
- STDOUT is available to Fluent Bit (via the Docker Fluentd log driver), which streams the logs directly to Kinesis Data Firehose.
- Kinesis Data Firehose streams the Service logs to Elasticsearch.
- Firehose, Elasticsearch, and Fluent Bit write their own logs to CloudWatch.
Sounds great, so how do we actually do it?
Configure Fluentd/Fluent Bit as if it were a log driver:
"logConfiguration": { "logDriver":"awsfirelens", "options": { "Name": "firehose", "region": "us-west-2", "delivery_stream": "My-stream" }}
Add a side-car to your Task Definition:
{ "essential": true, "image": "amazon/aws-for-fluent-bit:latest", "name": "log_router", "firelensConfiguration": { "type": "fluentbit" }}
Putting it all together
This integration adds a new container to the task that runs the aws-for-fluent-bit image and uses FireLens to ship your AWS Fargate logs to Elasticsearch.
Next, we need to create our ECS task and use our Firehose to ship the logs:
Build the task execution role:
Create a new role in the IAM console.
- Select AWS service. It should already be selected by default.
- Under Choose a use case:
- Select Elastic Container Service
- Select Elastic Container Service Task (scroll to the bottom of the page to see it.)
- Click Next: Permissions to continue.
- Select AmazonECSTaskExecutionRolePolicy.
- Click Next: Tags and then Next: Review
- Set Role name to testEcsTaskExecutionRole, then click Create role to save.
Your new role should now be created.
- Click the newly created role to go to its Summary page.
- Copy the Role ARN (at the top of the page) and save it for the deployment JSON.
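If you prefer to script this step instead of clicking through the console, the key piece is the role’s trust relationship, which allows ECS tasks to assume it; the standard ECS task trust policy looks roughly like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
Note that, in addition to the execution role, the containers will also need permission to write to the delivery stream (firehose:PutRecordBatch), which is typically granted through a task role attached to the task definition.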
Create a Fluent Bit task definition:
In the ECS console, open the Task Definitions page.
- Select Create new Task Definition.
- Choose Fargate, and click “Next” to continue.
- Scroll to the bottom of the page to the Volumes section, and select Configure via JSON.
- Replace the default JSON with this code block:
{ "family": "test-fargate-task", "requiresCompatibilities": [ "FARGATE" ], "containerDefinitions": [ { "name": "test-log-router", "image": "amazon/aws-for-fluent-bit:latest", "essential": true, "firelensConfiguration": { "type": "fluentbit", "options": { "config-file-type": "file", "config-file-value": "/fluent-bit/configs/parse-json.conf" } }, "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/aws/ecs/test-fargate-logs", "awslogs-region": "<<YOUR-AWS-REGION>>", "awslogs-create-group": "true", "awslogs-stream-prefix": "aws/ecs" } } },{ "name": "app", "essential": true, "image": "<<YOUR-APP-IMAGE>>", "logConfiguration": { "logDriver": "awsfirelens", "options": { "delivery_stream": "test-firehose", "region": "<<YOUR-AWS-REGION>>", "Name": "firehose" } }}],"cpu": "256","executionRoleArn": "arn:aws:iam::<<AWS-ACCOUNT-ID>>:role/testEcsTaskExecutionRole","memory": "512","volumes": [ ],"placementConstraints": [ ],"networkMode": "awsvpc","tags": [ ]}
Pay attention: replace the placeholders in the code block (indicated by the double angle brackets << >>).
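The config-file-value above points to a JSON-parsing filter configuration that ships inside the aws-for-fluent-bit image. If you ever need to roll your own, a minimal Fluent Bit filter along these lines (a sketch, not necessarily the exact contents of that file) parses the log field as JSON while keeping the original record fields:
[FILTER]
    Name parser
    Match *
    Key_Name log
    Parser json
    Reserve_Data True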
Verify that the logs appear in AWS Elasticsearch
Give your logs some time to get from your service to Elasticsearch, and then open Kibana. You’ll be able to find these logs by searching for type: fargate.
Other pipeline alternatives considered
We listed all of the possible solutions and compared the pros and cons of each one (summarized in a comparison table) before settling on FireLens with Fluent Bit.
I hope you enjoyed this post, and that it inspires you to rethink and improve your container logging process! 🪵 ✌️️
Interested in this type of work? We’re always looking for talented people to join our team!
Thanks to Avner Cohen, Regev Golan, Yossi Cohn, Tal Suhareanu, and Ezra Wanetik.