Published in CodeX

Running Fluentd as a Daemonset in Kubernetes

Fluentd is an open source data collector that lets you unify data collection and consumption, so that data is easier to use and understand. It structures data as JSON wherever possible, which lets it handle every stage of log processing in one place: collecting, filtering, buffering, and outputting logs across multiple sources and destinations.

Fluentd architecture. Image credit: Fluentd.org

At giffgaff, we’ve chosen Fluentd as our data collector. We run it as a DaemonSet in our Kubernetes cluster, which schedules one Fluentd pod on each node. This setup ensures that the logs of all pods running on any of our nodes are collected and shipped to our Elasticsearch cluster. I talk about this setup in more detail in a separate article.

Fluentd is deployed using Helm. We build our Docker image using the official image as a base and add some plugins on top of it that allow us to enrich our logs and parse them correctly.

This is what our Dockerfile looks like:

FROM fluent/fluentd-kubernetes-daemonset:v1.7-debian-elasticsearch7-2
USER root
RUN fluent-gem install fluent-plugin-multi-format-parser
RUN fluent-gem install fluent-plugin-concat
RUN fluent-gem install fluent-plugin-detect-exceptions
RUN fluent-gem install fluent-plugin-rename-key

We use a simplified version of the configuration you can find here. We’ve deleted some directives we are not interested in (for example, those for logs generated by the masters, which are managed by AWS and where we can’t run pods). We’ve also modified parts of the configuration to support, for example, logs in multiple formats.

Understanding Fluentd Configuration

The configuration file consists of the following directives:

  • source directives determine the input sources.
  • match directives determine the output destinations.
  • filter directives determine the event processing pipelines.
  • system directives set system-wide configuration.
  • label directives group the output and filter directives for internal routing.
  • @include directives include other files.

source: where all the data comes from

Fluentd’s input sources are enabled by selecting and configuring the desired input plugins using source directives. The following source tails all log files in the /var/log/containers path and parses each line with a multi-format parser, trying JSON first and then a regular expression:

<source>
  @id fluentd-containers.log
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/containers.log.pos
  tag raw.kubernetes.*
  read_from_head true
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </pattern>
    <pattern>
      format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
      time_format %Y-%m-%dT%H:%M:%S.%N%:z
    </pattern>
  </parse>
</source>

Parsing every JSON log can hurt the performance of your Elasticsearch cluster, because the number of fields in your index can grow out of control. I cover this in my article about Elasticsearch index management:

https://matiasmct.medium.com/elasticsearch-index-management-17b1f2b28553

match: telling Fluentd what to do

The match directive looks for events with matching tags and processes them. The most common use of the match directive is to output events to other systems. For this reason, the plugins that correspond to the match directive are called output plugins.

In our case, we’re outputting all events to our Elasticsearch cluster:

<match **>
  @type elasticsearch
  @id out_es
  @log_level debug
  include_tag_key true
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  path "#{ENV['FLUENT_ELASTICSEARCH_PATH']}"
  ...
</match>

We can also use the match directive to discard logs we’re not interested in, for example:

<match fluent.**>
  @type null
</match>

The null output plugin just throws away events. Fluentd tries to match tags in the order that the match directives appear in the config file, and an event goes to the first one that matches, so make sure this directive comes before any directive that sends logs to other systems.
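
To make the ordering concrete, here is a minimal sketch of two match directives in the order Fluentd would need them. The stdout output is only a stand-in for a real destination and is an assumption for this illustration, not what we run in production.

```text
# Checked first: Fluentd's own internal events (tagged fluent.*) are dropped.
<match fluent.**>
  @type null
</match>

# Checked second: every event that was not matched above ends up here.
# stdout is a placeholder output for illustration only.
<match **>
  @type stdout
</match>
```

If the two blocks were swapped, the catch-all would consume the fluent.** events before the null output ever saw them.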

filter: Event processing pipeline

Filter plugins enable Fluentd to modify event streams. The filter directive has the same syntax as match, but filters can be chained to form a processing pipeline:

Input -> filter 1 -> … -> filter N -> Output

Some use cases are:

  • Filtering out events by grepping the value of one or more fields.
  • Enriching events by adding new fields.
  • Deleting or masking certain fields for privacy and compliance.
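
The use cases above can be sketched with two filter plugins that ship with Fluentd, grep and record_transformer. The app.** tag and the field names (log, environment, password) are hypothetical and only serve the illustration; they are not taken from our configuration.

```text
# Use case 1: keep only events whose "log" field contains ERROR.
<filter app.**>
  @type grep
  <regexp>
    key log
    pattern /ERROR/
  </regexp>
</filter>

# Use cases 2 and 3: add a new field, then drop a sensitive one.
<filter app.**>
  @type record_transformer
  remove_keys password
  <record>
    environment production
  </record>
</filter>
```

Because both filters match the same tag, events pass through them in the order they are written: first the grep, then the record_transformer.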

As an example, we use the parser filter plugin to apply custom regular expressions to the logs of specific applications before they are passed to the match directive:

<filter kubernetes.var.log.containers.nginx-ingress-controller-**.log>
  @type parser
  <parse>
    @type regexp
    expression /^(?<host>[^ ]*) (?<host_origin>[:.\-\w]*)[, .\d]*[^ ]*(?<user>[^ ]*) \[(?<time>[^\]]*)\] \\*"(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?\\*" (?<code>[^ ]*) (?<size>[^ ]*) \\*"(?<referer>[^\"]*)\\*" \\*"(?<agent>[^\"]*)\\*" (?<request_length>[^ ]*) (?<request_time>[^ ]*) \[(?<proxy_upstream_name>[^ ]*)\] \[(?<proxy_alternative_upstream_name>[^ ]*)\] (?<upstream_addr>[^ ]*) (?<upstream_response_length>[^ ]*) (?<upstream_response_time>[^ ]*) (?<upstream_status>[^ ]*) (?<reg_id>[^ ]*).*$/
    time_format %d/%b/%Y:%H:%M:%S %z
  </parse>
  key_name log
  reserve_data yes
</filter>

This filter only processes events whose tag matches the pattern kubernetes.var.log.containers.nginx-ingress-controller-**.log, that is, events read from the nginx ingress controller log files in the path indicated by the source directive.

system: Set system-wide configuration

System-wide configuration is set by the system directive. Most of its settings are also available via command line options.

We’re not making use of this directive.
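
For reference only, a system directive could look like the sketch below. The values are assumptions chosen for illustration and are not part of our setup.

```text
<system>
  # Verbosity of Fluentd's own log output.
  log_level warn
  # Name shown for the Fluentd processes in tools such as ps.
  process_name fluentd
</system>
```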

label: Grouping filter and output

The label directive groups filter and match directives for internal routing, which reduces the complexity of tag handling.

As with the system directive, we haven’t added any custom configuration using the label directive.
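
To show what label routing looks like, here is a minimal sketch. The label name @VM, the added field, and the stdout output are hypothetical; we do not run this configuration.

```text
# Events received by this source are sent straight to the @VM label,
# skipping any top-level filter and match directives.
<source>
  @type forward
  @label @VM
</source>

<label @VM>
  # Only events routed to @VM pass through this filter.
  <filter **>
    @type record_transformer
    <record>
      origin vm
    </record>
  </filter>
  # stdout is a placeholder output for illustration only.
  <match **>
    @type stdout
  </match>
</label>
```

Inside the label, a plain ** pattern is enough, because only events from the labelled source reach it. This is how labels save you from encoding routing decisions in tags.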

@include: Reusing configuration

Directives in separate configuration files can be imported using the @include directive. For readability, our configuration is split across ConfigMaps and then imported into our main configuration file:

@include "#{ENV['FLUENTD_SYSTEMD_CONF'] || 'systemd'}.conf"
@include kubernetes.conf
@include conf.d/*.conf

td-agent

Not all of our applications are containerised and running in Kubernetes, and some are (still) running on virtual machines. To collect logs from these apps, we run td-agent on each virtual machine configured to collect both system and application logs.

td-agent is a stable distribution package of Fluentd. You can find the main differences with Fluentd here.

We use Ansible to deploy td-agent and its configuration to the VMs. The configuration depends on the application running on each VM and on its log format.
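
As a rough sketch of what such a td-agent configuration might contain, the example below tails one system log and one application log. The paths, tags, and the stdout output are assumptions for illustration; the real files are generated by Ansible and differ per application.

```text
# System logs: /var/log/syslog is the usual location on Debian-based hosts.
<source>
  @type tail
  path /var/log/syslog
  pos_file /var/log/td-agent/syslog.pos
  tag system.syslog
  <parse>
    @type syslog
  </parse>
</source>

# Application logs: hypothetical path, each line kept as-is.
<source>
  @type tail
  path /var/log/example-app/app.log
  pos_file /var/log/td-agent/example-app.pos
  tag app.example
  <parse>
    @type none
  </parse>
</source>

# stdout is a placeholder output for illustration only.
<match **>
  @type stdout
</match>
```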

Conclusion

Logs, and in particular application logs, can contain a wide range of information that is not available otherwise. Running Fluentd as a daemonset guarantees that all logs across all of our applications are collected and persisted to Elasticsearch, and that we don’t need to worry about a pod or a node being terminated at any time.

Matías Costa

SRE engineer | Technology enthusiast | Learning&Sharing