Building an Open Data Platform: Logging with Fluentd and Elasticsearch

Joel Vasallo
Redbox Tech Blog
Feb 21, 2019

At Redbox, we are working hard to rebuild our existing platform into a truly cloud native platform to support our new streaming product — Redbox On Demand. Part of this journey means rethinking our existing data platform and building a new open data platform — first stop, logging.

The Evolution of Logs

Vertical Scaling, Monoliths, Localized Logs

Logs, and the events written to them, are the cornerstone of visibility into application health. Many issues are solved simply by looking for errors and warnings in log files. Logging may be a rather “boring” story, but it is more important than ever, and it is increasingly important that developers and data teams work together on standardized, structured log schemas.

Not your Daddy’s Log File

As application development evolved, so did observability. In the days of vertically scaled infrastructure and monolithic applications, it was common to have localized log files (especially if all your services lived on a single node). This made debugging the various components simple; however, it was often a battle to maintain disk space for logs while also effectively rotating and compressing them. To further complicate things, as vertical scaling became increasingly difficult and costly, infrastructure teams turned to virtualized, load-balanced, horizontally scaled architectures. With this change, logs also had to change.

Horizontal Scaling, Micro Services, Centralized Logging

As applications became more and more distributed, you could no longer rely on a single node to handle all your traffic. In addition, applications began evolving into smaller microservices architectures to enable faster development and increase fault tolerance through work distribution. As the work became distributed, so did visibility. It was no longer easy to trace events from one system to another to establish correlation and causation.

It became clear that logging into a node, checking logs, and reporting errors was not going to scale in a load-balanced environment; enter centralized logging.

Centralized logging: like a sewer system in many ways… (Source: Schematic_of_the_Simplified_Sewer.jpg)

As logs were now distributed, many turned to the concept of centralized logging, using open source tools such as Graylog and Elasticsearch to collect and parse their log files. This was an excellent approach because it removed many inefficiencies and brought several clear benefits:

  1. SaaS — Adoption of these platforms was definitely boosted by the fact that many of them were offered as Software as a Service, such as Elastic Cloud. Having a third party handle the day-to-day scaling and performance allowed companies to truly focus on the most important thing: the data.
  2. Security — No more RDP or SSH sessions into running systems just to check log files.
  3. Event Correlation — You were now able to correlate log events by time and see the flow of traffic through your infrastructure.
  4. Visualization — These tools often came with a visual GUI (such as Kibana), enabling others in the business to help monitor and view performance without being in the weeds of log files.

While having a centralized way to ship logs was great, log rotation, disk space, and the general management of log files were still problems. In some cases, the inability to keep up with disk I/O could bleed into the application stack and cause application-level errors. To make matters worse, some solutions came with an agent of sorts to monitor and tail log files. Often these agents were resource intensive and required a worker to connect and register to a centralized endpoint with a “heartbeat”. With the advent of the cloud came the concept of on-demand capacity: servers can spin up under load and spin down when no longer needed. It was clear that logs once again needed to change.

On-Demand Scaling, Cloud Native Apps, and Data Streams

As you begin building modern applications with the cloud in mind, you will undoubtedly come across the term “Twelve-Factor.” If you have not, it is a highly recommended read for both developers and operations teams. A core tenet of creating a Twelve-Factor App is that you:

Treat logs as event streams

While this might seem like a foreign concept at first, after some thought it makes sense. Logs, at the end of the day, are streams of time-ordered events in text/JSON format with no fixed beginning or end; they continuously flow. Treating your logs as streams also has some clear advantages:

  1. Logging is now consistent — local dev all the way to production. Instead of writing out to a log file, you write out to stdout in a stream. Most developers are already accustomed to monitoring a log stream in their IDEs.
  2. Logs no longer have to be large rotated log files. Enough said!
  3. Ability to route logs as data. Using tools such as Fluentd, you are able to create listener rules and tag your log traffic. Based on tags, you are then able to transform and/or ship your data to various endpoints.
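
To make the idea of routing logs as tagged data concrete, here is a minimal, hypothetical Fluent Bit sketch (the tags, paths, and hostname are made up for illustration): two inputs tag their streams differently, and each output’s Match pattern decides which events it receives.

[INPUT]
    Name tail
    Path /var/log/app/orders.json
    Tag  app.orders

[INPUT]
    Name tail
    Path /var/log/app/debug.json
    Tag  app.debug

# Only events tagged app.orders are forwarded to an aggregator...
[OUTPUT]
    Name  forward
    Match app.orders
    Host  aggregator.example.com
    Port  24224

# ...while events tagged app.debug are simply printed locally.
[OUTPUT]
    Name  stdout
    Match app.debug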

With these advantages in mind, we began looking at various log routing tools. Long story short, we chose Fluentd!

Why Fluentd?

Overall, we are really embracing the concepts of Open Source at Redbox while also looking for ways to build community — both internally and externally. For that reason, we look to great organizations (and their members) such as the CNCF and the Apache Foundation for tools and solutions.

While open source and community were important factors, they were not the only ones. Some of the key reasons we chose Fluentd were:

  1. Docker and Kubernetes Integrations — The community has widely accepted the EFK stack for container-based workloads. In addition, Fluentd has a lightweight, C-based implementation called Fluent Bit (which in our experience uses under 1MB of memory) that is perfect for K8s. Both Fluent Bit and Fluentd can also expose Prometheus time-series metrics, making them easy to monitor (see the short sketch after this list).
  2. Performance and HA — Under load, we noticed that Fluentd offered the best story around scaling. The Fluentd aggregator uses a small memory footprint (in our experience sub 50MB at launch) and efficiently offloads work to buffers and various other processes/libraries. Less memory consumed by agents and tooling meant we could give that back to our developer applications. In addition, Fluentd works great in immutable environments, allowing us to gracefully replace nodes in place after buffers clear.
  3. Buffer and Retry Logic — Tools often fail to send logs for various reasons (such as load, a backend outage, or network issues). Fluentd has a built-in buffer to hold events while they wait for output, as well as built-in retry functionality, all of which is extremely customizable!
  4. JSON Schema — A JSON event structure makes scripting, parsing, and automation that much easier. Even if logs are not natively JSON, Fluentd gives them some basic structure for future processing. There are plenty of examples out there, but at the end of the day, it is up to developers and data teams to work together on a common logging format. In time, we might write about this as well.
  5. Pluggable Architecture — The community has already written well over 500 plugins for data inputs, outputs, and filters. This means less development overhead on our part while allowing us to route our events anywhere we need. Need log events to go to Elasticsearch, S3, and Kafka? No problem!
  6. Provider Freedom — Using Fluentd means we are not tied to specific vendor tools. This ultimately gives us full control over our destiny and our logging architecture. If we don’t like a provider, we can easily switch to another.
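
As a quick illustration of the monitoring point above, Fluent Bit can expose its internal metrics in Prometheus format through its built-in HTTP server. A minimal sketch of the [SERVICE] settings involved (the port and listen address here are just examples; check the Fluent Bit docs for your version):

[SERVICE]
    Flush       5
    Log_Level   info
    # Built-in HTTP server; metrics are then available in Prometheus
    # format at /api/v1/metrics/prometheus
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port   2020

On the Fluentd side, the fluent-plugin-prometheus plugin provides a similar metrics endpoint.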

Sounds great, but how does it work?

How does it work?

A simple way to get started is to run Fluent Bit on the nodes where logs are being generated. Fluent Bit is lightweight, portable, and highly configurable. Instead of doing heavy log transformations at the edge, just forward your logs securely to a Fluentd aggregator cluster, where you can apply filters, buffers, and routing rules. Doing this provides:

  1. A centralized way to transform data consistently — logs ideally remain consistent, making backend processing in data tools easier.
  2. Centralized routing — logs can go to Elasticsearch for warm usage and to something like S3 for development or historical storage.

Now that we have a basic understanding of the overall architecture, how do events actually get processed in Fluent?

Lifecycle of a Fluent Event

In short, there are many phases an event goes through. While it may seem intimidating at first, when broken down visually it is a bit easier to understand.

A Simple Lifecycle of an event through Fluent (Source: https://docs.fluentbit.io/manual/getting_started)
  1. Input — This is the main entry point of data, where data is also tagged. Inputs are typically log file content, data over TCP, built-in metrics, etc. Check out the various built-in input plugins here: https://docs.fluentbit.io/manual/input
  2. Parser — After an event is collected, you are able to convert the unstructured data gathered into structured data. Parsers are typically optional, but they have some interesting use cases, as you can see in the examples below.
  3. Filter — After an event has been collected and potentially parsed into a standard format, you are able to alter the data based on tags. For example, if you wanted to add an environment variable or some other field to make future querying easier, it would be done here.
  4. Buffer — Essentially where events live until they can be routed out.
  5. Routing — Based on tags, this is the “engine” that actually does the routing.
  6. Outputs — Where the data should go, based on the routing tags. Outputs typically forward data over TCP, write to a Kafka topic, etc. Check out the various built-in outputs: https://docs.fluentbit.io/manual/output
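
To tie these phases back to configuration, here is a small, hypothetical Fluent Bit sketch with each lifecycle stage called out in comments (the plugin names are real, but the paths, tags, and values are made up):

# Input: where events enter and get tagged
[INPUT]
    Name          tail
    Path          /var/log/app.json
    Tag           app.log
    # Parser: turn the raw line into structured fields
    Parser        json
    # Buffer: cap the memory used by events waiting to be flushed
    Mem_Buf_Limit 5MB

# Filter: alter matching records, e.g. add an environment field
[FILTER]
    Name   record_modifier
    Match  app.*
    Record env staging

# Routing + Output: the Match pattern is the routing rule that decides
# which tagged events reach this destination
[OUTPUT]
    Name  stdout
    Match app.*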

As we touched on at the beginning of the article, logging and data standardization are extremely critical. By settling on an accepted log format across an organization, you can focus on visualization, processing, and log collection.
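
As a purely hypothetical example of what an agreed-upon schema might look like, a structured log event could be one JSON object per line with a small set of required fields (the field names here are illustrative, not a Redbox standard):

{
  "time": "2019-02-21T15:04:05Z",
  "level": "error",
  "service": "checkout-api",
  "env": "prod",
  "trace_id": "abc123",
  "message": "payment authorization failed"
}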

Data Collection — Fluent Bit

So all this is great, but how does this actually look in practice? Let’s start with Data Collection.

Fluent Bit is lightweight and can be used for data collection

As seen above, Fluent Bit can run on EC2 or even inside a pod on a Kubernetes cluster. Fluent Bit is lightweight enough that it can run almost anywhere you need it.

The beauty of Fluent Bit is the simplicity of its configuration. Fluentd’s configuration is a bit more intimidating, but that is due in part to all the additional plugins available! Sometimes, less is more!

Installation:

There are many ways to install Fluent Bit, such as Debian packages, Docker containers, or even compiling from source! Pick the one that works best for you from the official installation documentation.

fluentbit.conf:

This is the base configuration file; it sets flush intervals and log levels, and also adds a basic filter. Filters in Fluent allow you to alter the data of an event.

[SERVICE]
    Flush        5
    Log_File     /var/log/fluent-bit/fluent-bit.log
    Daemon       on
    Log_Level    info
    Parsers_File parsers.conf

# Add a field to every log record; could be useful for display/parsing.
# Note: record_modifier's Record property takes a key and a value; the
# key name "cluster" here is illustrative.
[FILTER]
    Name   record_modifier
    Match  *
    Record cluster k8s.demo.${AWS_REGION}

# Wildcard includes for input and output file names. Makes configs easier
# to follow and also allows drop-ins.
@INCLUDE input_*.conf
@INCLUDE output_*.conf
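
For illustration, here is roughly what that record_modifier filter does to an event, assuming AWS_REGION=us-east-1 and the hypothetical “cluster” key from the config above:

# Record emitted by the input
{"level": "info", "message": "user signed in"}

# Same record after the record_modifier filter
{"level": "info", "message": "user signed in", "cluster": "k8s.demo.us-east-1"}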

parsers.conf:

Parsers are an important component of Fluent Bit. You can use them to take any unstructured log entry and give it a structure that makes processing and further filtering easier. For example:

[PARSER]
    Name        json
    Format      json
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name   kube-custom
    Format regex
    Regex  (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
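
As a quick, hypothetical illustration of the json parser above: the Time_Key/Time_Format pair tells Fluent Bit which field carries the event timestamp and how to parse it.

# Hypothetical raw line handed to the "json" parser above
{"time": "21/Feb/2019:15:04:05 +0000", "level": "info", "message": "cache warmed"}

# Result: level and message become structured fields, and the event
# timestamp is taken from the "time" key using Time_Format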

Take a look at some more examples in the official Fluent Bit parsers documentation.

input_demo.conf:

This is the entry point for data. Implemented through input plugins, this interface allows you to gather or receive data, e.g. log file content, data over TCP, built-in metrics, etc.

[INPUT]
    Name   tail
    Tag    demo.k8s.log
    DB     /etc/fluent-bit/demo-k8s.db
    Path   /var/log/demo-k8s.json
    Parser kube-custom

output_fluentd.conf:

An output defines a destination for the data. Destinations are handled by output plugins; in this case, a simple forward. The data can be delivered to multiple destinations if you want; see the short sketch after the config below!

[OUTPUT]
    Name  forward
    Match demo.*
    Host  aggregator.demo.com
    Port  62073
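
Because the base fluentbit.conf includes output_*.conf with a wildcard, sending the same events to a second destination is just another drop-in file. A hypothetical output_stdout.conf that also prints matching events locally (handy while debugging):

[OUTPUT]
    Name  stdout
    Match demo.*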

Data Aggregation — Fluentd

In the previous section, we talked about how to collect our data and logs; now, how do we aggregate and process them?

Fluentd can be used for Data Processing and Aggregation

In the previous section, you saw Fluent Bit collecting data at the source and forwarding it out to an endpoint via an output plugin. This is where Fluentd comes into the picture.

Fluentd’s configuration is similar to Fluent Bit’s, and the same lifecycle concepts apply. To best show it off, here is a simple example. In short:

  • Fluentd <system> = Fluent Bit [Service]
  • Fluentd <source> = Fluent Bit [Input]
  • Fluentd <match> = Fluent Bit [Output]

Installation:

Similar to Fluent Bit, there are many ways to install Fluentd, such as Debian packages, Docker containers, or even building from source! Pick the one that works best for you from the official installation documentation.

fluent.conf:

This is the core file used to configure Fluentd to do what it does! There are inline comments below explaining what each section does.

# System Level Settings
# ------------------------------------------------------------------
# Things like log level, root directory, and workers to use
# https://docs.fluentd.org/v1.0/articles/system-config
<system>
  root_dir /etc/fluentd
  log_level info
  # How many workers to split work across
  workers 4
</system>

# Input Plugins
# ------------------------------------------------------------------
# How will Fluentd collect data? In this case, we are listening for
# incoming logs forwarded to the port below.
# https://docs.fluentd.org/v1.0/articles/input-plugin-overview
<source>
  @type forward
  port 62073
  bind 0.0.0.0
</source>

# Output Plugins
# -----------------------------------------------------------------
# When a tag match is found based on the below match pattern, what
# should Fluentd do with the data? In our case below, send to
# Elasticsearch and Amazon S3
# https://docs.fluentd.org/v1.0/articles/output-plugin-overview
<match k8s.demo.**>
  @type copy

  # Send logs to Elasticsearch
  <store>
    @type elasticsearch
    # Basic auth and connection info
    scheme https
    ssl_version TLSv1_2
    host elk-demo.demo.com
    port 9100
    user <redacted>
    password <redacted>
    # Built-in support for logstash index format as well!
    logstash_format true
    logstash_dateformat %Y.%m
    logstash_prefix "${tag}"
    # How to buffer events in terms of time, tag, and size
    <buffer tag, time>
      @type file
      path /var/log/fluentd/es-buffer
      timekey 60
      flush_mode interval
      flush_thread_count 4
      flush_interval 60s
    </buffer>
    # Basic retry functionality in the event of downstream issues
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    request_timeout 120s
    retry_max_times 3
  </store>

  # Send logs to S3 as well
  <store>
    @type s3
    # External compression to help reduce load on Fluentd
    store_as gzip
    # Ability to use an IAM instance profile instead of keys! <3
    <instance_profile_credentials>
    </instance_profile_credentials>
    # Note: ability to use environment variables!
    s3_bucket "k8s-demobucket-#{ENV['CLOUD_VAR']}"
    path "logs/%Y/%m/%d/%H/${tag}"
    # How to buffer events in terms of time, tag, and size
    <buffer tag, time>
      @type file
      path /var/log/fluentd/s3-buffer
      timekey 3600
      timekey_use_utc true
      chunk_limit_size 256m
    </buffer>
  </store>
</match>
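
Once the aggregator is up, one way to smoke-test the forward input is with the fluent-cat utility that ships with the fluentd gem (the hostname and tag below are just examples; fluent-cat reads a JSON record from stdin and sends it to the given host and port):

echo '{"message": "hello from fluent-cat"}' | \
  fluent-cat -h aggregator.demo.com -p 62073 k8s.demo.smoketest

If everything is wired up, the event should show up in Elasticsearch once the buffer flushes.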

TD-Agent vs Fluentd — TD-Agent-Bit vs Fluent Bit

As we wrap up this article, I want to address some confusion you may hit when first using Fluentd. Often you’ll see the agent “td-agent” tossed around in the Fluentd ecosystem. You are not crazy. In short, TD stands for Treasure Data. Treasure Data maintains a stable release version of Fluentd and also conveniently builds packages for it. Either works, and they are interchangeable. Take a look at the diagram from the FAQ section of the Fluentd site.

https://www.fluentd.org/faqs

In short:

  • Fluentd == TD-Agent
  • Fluent Bit == TD-Agent Bit

Documentation

Both Fluentd and Fluent Bit have great documentation, which can be found at https://docs.fluentd.org and https://docs.fluentbit.io respectively.

We also gave a brief shout out to the folks at Elastic Cloud earlier. They were really great to work with in the initial phase of this journey and ultimately let us focus on learning and building on Elasticsearch.

Takeaways

At the end of the day, there are many tools out there that can do what Fluentd does. The important things to take away:

  • Logs have come a long way. They are no longer just used for debugging; often they can be used to correlate complex events across a distributed cluster as well.
  • Fluentd and Fluent Bit are two separate tools that do the same job. The best way to describe it: Fluent Bit is lightweight and only includes the bare minimum, whereas Fluentd is a bit heavier but has more plugins available. Check out the FAQ section for more details: https://www.fluentd.org/faqs
  • Treating logs as data streams enables complex routing and data transformations that make parsing that much easier. Not only does this give you the benefit of consistent data, it also alleviates some traditional operational workloads.

Going forward, we hope to write more about our findings in using and tuning Fluentd and Fluent Bit. There is still a huge missing piece around how to optimize buffers and file scanning for log rotation, for example. As you saw from the diagrams above, we still have opportunities to expand on the open data platform, our developers’ workflow within it, and time-series metrics with Prometheus! Look for articles coming soon around these topics!

We Are Hiring!

Building a truly cloud native platform requires hard work. We are always looking for great engineers to join us and help build! Feel free to check out our Careers page for job opportunities!
