Centralized Logging with Firelens, Firehose & Elasticsearch

Vinay Nadig
7 min read · Apr 15, 2020


At Joyn, we have dozens of services spread over multiple AWS Accounts. Each of these services generates anywhere from thousands to tens of millions of log entries per day, which have to be quickly accessible to developers for debugging. Every service also generates logs at a different rate, and we needed a solution that would let us manage log shipping with minimal maintenance.

Most of our services are based on AWS Fargate and our runtime of choice is Node.js. Our initial logging setup consisted of the standard ELK stack with Logstash running as a separate Fargate service. While this setup has mostly worked for us, we occasionally noticed Logstash lagging when services produced large quantities of logs, and dynamically scaling Logstash to accommodate these spikes in log generation required additional effort.

We started looking for a new log aggregation solution that would be cost-effective while remaining low-maintenance. Before that, though, we had a chat with our developers to figure out which parts of the current logging system worked well for them and which didn’t. The feedback we got about the existing log shipping solution was quite positive, though we knew there were areas for improvement. The feedback centered on the following areas.

  1. Developers overwhelmingly preferred the current Elasticsearch and Kibana interface for querying logs. This is no surprise, as both Elasticsearch and Kibana are mature products that provide powerful querying capabilities over large data sets.
  2. Developers did not want to be bothered with the specifics of logging and just wanted to “ship logs” without worrying about the medium through which the logs are shipped. The most preferred solution was to just log to standard out and let the supporting infrastructure figure out the best way to ship logs to the Elasticsearch cluster.
  3. Developers wanted to query over longer time ranges and, occasionally, to query logs that are months old.

In addition, since we have a small SRE team, we were also looking for a solution that we could set up once and not have to maintain over the long term.

With the above considerations, we came up with the following set of requirements.

Requirements

  1. Multi AWS Account support
  2. Robust authentication mechanism for Kibana access
  3. Ability to control index retention policies for individual services depending on their log size
  4. Long term Log retention support
  5. Automatic index rotation
  6. Low operational overhead

Architecture

AWS Architecture

The architecture consists of an Elasticsearch cluster located in a central AWS Account. The Elasticsearch cluster is not in a custom VPC, as the Firehose to Elasticsearch integration is only available for non-VPC Elasticsearch clusters, as documented here. Elasticsearch API access is protected through an Elasticsearch access policy, and Kibana access is protected by Cognito integration.

The architecture follows the sidecar pattern: a single Fluentbit container runs in every Fargate Task with the sole purpose of aggregating and shipping logs from the other application containers in the same Task. The logs are shipped to a Firehose Delivery Stream, which manages the eventual delivery of logs to Amazon Elasticsearch. The Fluentbit to Firehose integration is handled through the AWS Firelens feature, which was released recently.

Every Fargate Service that needs to ship logs to the Elasticsearch Cluster must implement the following components:

  • A sidecar container running the Fluentbit image provided by AWS. The image URI is region-specific; the URI for each region can be found here. The container definition must include a FirelensConfiguration field, and the log driver for the Fluentbit container itself is preferably set to awslogs. An example configuration is provided below.
- Name: fluentbit-container
  FirelensConfiguration:
    Type: fluentbit
    Options:
      enable-ecs-log-metadata: true
  LogConfiguration:
    LogDriver: awslogs
    Options:
      awslogs-group: firelens-container
      awslogs-region: us-east-1
      awslogs-create-group: true
      awslogs-stream-prefix: firelens
  Image: 906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest
  • Every container that would like to ship logs should specify the LogDriver as awsfirelens. In addition, the region and name of the Firehose stream should also be provided. An example configuration is provided below.
LogConfiguration:
  LogDriver: awsfirelens
  Options:
    Name: firehose
    region: us-east-1
    delivery_stream: LogShippingDeliveryStream

Once the Fargate task is started, the logs from the application containers are buffered by the Fluentbit container before being pushed to the Firehose Delivery Stream through the “PutRecordBatch” API.
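For reference, a minimal boto3 sketch of such a call could look like the following (the log payloads are hypothetical; the stream name matches the example configuration above):

import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Hypothetical structured log lines collected from an application container.
log_events = [
    {"level": "info", "message": "request handled", "service": "example-service"},
    {"level": "error", "message": "upstream timeout", "service": "example-service"},
]

# Each record is a newline-terminated JSON document, similar to what Fluentbit would emit.
response = firehose.put_record_batch(
    DeliveryStreamName="LogShippingDeliveryStream",
    Records=[{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in log_events],
)

# PutRecordBatch can partially fail; failed records have to be retried.
if response["FailedPutCount"] > 0:
    print(f"{response['FailedPutCount']} records failed and should be retried")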

Each Firehose Delivery Stream can deliver logs to one of the following destinations: Elasticsearch, S3 or Redshift. For the Elasticsearch & Redshift destinations, it is also possible to back up all records to S3 for archival purposes, or to back up only those records that fail to get delivered to Elasticsearch.

The index to which the records are delivered has to be specified when the Firehose Delivery Stream is created. Firehose can also handle index rotation automatically.
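Since CloudFormation does not currently support the Elasticsearch destination (see the shortcomings below), the delivery stream has to be created through the CLI or an SDK. A minimal boto3 sketch, with placeholder ARNs, names and retention settings, could look like this:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# All ARNs, the domain name and the bucket name below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="LogShippingDeliveryStream",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseToElasticsearchRole",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/central-logging",
        "IndexName": "example-service-logs",
        # Firehose appends a timestamp to the index name and rotates it automatically.
        "IndexRotationPeriod": "OneDay",
        "RetryOptions": {"DurationInSeconds": 300},
        # "AllDocuments" archives everything to S3; "FailedDocumentsOnly" backs up only failures.
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseToS3Role",
            "BucketARN": "arn:aws:s3:::example-log-archive-bucket",
        },
    },
)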

Components

  1. Fluentbit Container — The sidecar container that sits in every Fargate task and routes logs to Amazon Firehose. The logs of the Fluentbit container itself are routed to Cloudwatch.
  2. Firehose Delivery Stream — Logs from different containers are routed to different Firehose Delivery Streams. Each Delivery Stream writes to a single index in Elasticsearch.
  3. S3 — A S3 bucket can be used to back up either all records or records that fail to be delivered to Elasticsearch. Lifecycle policies can also be set to auto-archive logs.
  4. Cloudwatch Logs — The logs from the Fluentbit container itself are routed to Cloudwatch Logs rather than Elasticsearch. This makes sure that if there are any issues with log shipping, we are not left unable to debug them because the logs are unavailable in Elasticsearch.
  5. Elasticsearch — Elasticsearch hosted by Amazon which can be located in a separate AWS Account.
  6. Cognito — A Cognito User Pool & Identity Pool are used to control Kibana access.
  7. IAM — Cross-account IAM policies are used to control Elasticsearch API access.
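As an illustration of the cross-account access control, a resource-based access policy can be attached to the Elasticsearch domain to allow a role from another account to call the Elasticsearch API. A minimal boto3 sketch, with placeholder account IDs, role and domain names:

import json

import boto3

es = boto3.client("es", region_name="us-east-1")

# Allow a role from another AWS Account to call the Elasticsearch HTTP API.
# Account IDs, role name and domain name are placeholders.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/ExampleCrossAccountRole"},
            "Action": "es:ESHttp*",
            "Resource": "arn:aws:es:us-east-1:444455556666:domain/central-logging/*",
        }
    ],
}

es.update_elasticsearch_domain_config(
    DomainName="central-logging",
    AccessPolicies=json.dumps(access_policy),
)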

Scaling

Firehose

By default, each Firehose Delivery Stream can scale up to 5000 records/second or 5 MB/second (whichever limit is reached first) in the Ireland region. This is a soft limit that can be raised up to 10000 records/second or 10 MB/second. The limits are region-specific, so be sure to check the region-specific limits here.

As per AWS Support, Firehose can scale beyond the 10000 records/second and 10 MB/second limits as well. A Service Limit increase request has to be raised, which will be considered on a case-by-case basis.

Elasticsearch

The Elasticsearch cluster itself can be scaled both in terms of storage and compute power on AWS. Version upgrades are also possible. More information can be found on the AWS Elasticsearch documentation.

Cost

The following components contribute to the total cost of this setup.

Firehose

The pricing is calculated as the number of records multiplied by the size of each record, rounded up to the nearest 5 KB. Because the pricing is based purely on the number of records and record size, there is no hourly charge for each Firehose stream created. More information on Firehose pricing can be found here.

For example, if you ship a billion log messages of 3 KB each (a typical JSON-formatted nginx log is less than 2 KB in size, for reference), the total cost would be:

Total record size = 5 KB * 1,000,000,000 records = 5,000,000,000 KB
Record size in GB = 5,000,000,000 / (1024 * 1024) = 4,769 GB (rounded off)
Cost of shipping ~4,769 GB per month = 4,769 * 0.031 ≈ 148 USD in the Ireland region
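The same estimate as a short script, using the example figures above (the per-GB price is the Ireland region price at the time of writing; check the Firehose pricing page for your region):

# Rough Firehose ingestion cost estimate using the example figures above.
records_per_month = 1_000_000_000
billed_record_size_kb = 5        # each record is rounded up to the nearest 5 KB
price_per_gb = 0.031             # USD per GB, example price for the Ireland region

total_kb = records_per_month * billed_record_size_kb
total_gb = total_kb / (1024 * 1024)
cost_usd = total_gb * price_per_gb

print(f"{total_gb:,.0f} GB ingested -> ~{cost_usd:,.0f} USD per month")
# 4,768 GB ingested -> ~148 USD per month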

S3

If you enable the option to back up records to S3, standard prices for S3 requests and S3 storage apply — https://aws.amazon.com/s3/pricing/

Elasticsearch

Elasticsearch pricing is based on the size of the instances of the cluster and reserved instance pricing is available as well — https://aws.amazon.com/elasticsearch-service/pricing/

Advantages

  1. The pipeline is effectively fully managed and requires very little maintenance.
  2. Kibana authentication has always been a pain point while using Elasticsearch. With Cognito integration supported out of the box, this is no longer the case.
  3. Implementing an archival strategy for logs is trivial with S3 storage & lifecycle policies (see the sketch after this list).
  4. Log transformation before delivery to Elasticsearch is supported through Lambda, though pricing has to be considered carefully in such cases.
  5. Durability — Even if the Elasticsearch cluster fails, Firehose can retain records for up to 24 hours. In addition, records that fail to deliver are also backed up to S3. The chances of data loss are low with these options available.
  6. Fine-grained access control is possible for both Kibana & the Elasticsearch API through IAM policies.
  7. Fluentbit can be configured to ship logs to any platform it supports; the logs don’t necessarily have to be shipped to Firehose.
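For the archival point above, a lifecycle rule on the backup bucket is enough to move old log objects to a cheaper storage class and eventually expire them. A sketch with a placeholder bucket name, prefix and retention periods:

import boto3

s3 = boto3.client("s3")

# Move backed-up Firehose records to Glacier after 30 days and delete them after a year.
# Bucket name, prefix and periods are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-log-backups",
                "Status": "Enabled",
                "Filter": {"Prefix": "firehose-backup/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)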

Shortcomings

  1. Pricing has to be carefully considered and monitored. Firehose can handle large amounts of data ingestion with ease, so if an application goes rogue and starts logging large amounts of data, Firehose will deliver it without issues, and that can incur large costs.
  2. The Firehose to Elasticsearch integration is only supported for non-VPC Elasticsearch clusters.
  3. Firehose currently cannot deliver logs to Elasticsearch clusters that are not hosted by AWS. If you would like to self-host your Elasticsearch clusters, this setup will not work.
  4. CloudFormation support for the Firehose to Elasticsearch integration is not present currently. The Firehose Delivery Stream has to be created through the CLI or through the SDKs.

Conclusion

If you are looking for a solution that is completely managed and (mostly) scales without intervention, this would be a good option to consider. The auto backup to S3 with lifecycle policies also solves the log retention and archival problem easily.

If you would like to ship logs to an externally or self-hosted Elasticsearch cluster, or if you are unwilling to pay the overhead of managed solutions, this solution may not be the best option for you.

Another consideration to keep in mind is that a buffering system between your application and Elasticsearch is not always necessary. For low traffic systems, you can effectively bypass Firehose or any other buffering system. In such cases, Fluentbit or your application can directly push the logs to any Elasticsearch cluster or third party system. You can also consider this as an option if you have very low durability requirements on your logs.
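As an illustration of that bypass, an application could index log entries directly with an Elasticsearch client instead of going through Firehose. A sketch with a placeholder endpoint and index name, assuming the official Python client:

from datetime import datetime, timezone

from elasticsearch import Elasticsearch

# Direct indexing, bypassing Firehose; endpoint and index name are placeholders.
# Only suitable when durability requirements for logs are low.
es = Elasticsearch(["https://logs.example.com:9200"])

es.index(
    index="example-service-logs",
    body={
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "info",
        "message": "request handled",
        "service": "example-service",
    },
)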
