Renaming/Removing log fields with Fluentd

Credit to: fluentd.org

Note: for a basic understanding of Fluentd, have a look at the following article

The Fluentd json parser plugin, one of the many Fluentd plugins, is in charge of parsing JSON logs. In combination with Elasticsearch's dynamic mapping, it makes shipping logs in JSON format to an Elasticsearch cluster very easy. This is convenient but poses two problems: incorrect field types and mapping explosions, since you often don't have much control over the data you receive.
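For context, a minimal sketch of such a pipeline could look like the one below. The log path, tag and Elasticsearch host are placeholders, and the output side assumes the fluent-plugin-elasticsearch plugin is installed:

<source>
  @type tail
  # assumed location of the container logs
  path /var/log/containers/*.log
  tag kubernetes.var.log.containers.*.log
  <parse>
    # parse every line as JSON
    @type json
  </parse>
</source>

<match kubernetes.var.log.containers.**.log>
  # provided by fluent-plugin-elasticsearch
  @type elasticsearch
  host elasticsearch.example.com
  port 9200
  # write to daily logstash-* indices, matching the template below
  logstash_format true
</match>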

To learn more about Elasticsearch Index Management and performance, check out the following article

Incorrect field types occur when Elasticsearch assigns the wrong type to a field. For example, Elasticsearch may map duration as text or integer when you want it to be a float, so that you can run numeric operations on it. This can be fixed by creating an index template that forces specific fields to a specific type:

{
  "index_patterns": ["logstash-*"],
  "mappings": {
    "properties": {
      "code": {
        "type": "text"
      },
      "duration": {
        "type": "float"
      },
      "exception": {
        "type": "text"
      },
      ....
    }
  }
}
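To see why the float mapping is forced, suppose the very first event indexed carries a whole-number duration (the values below are made up):

{
  "code": "200",
  "duration": 12,
  "exception": ""
}

With dynamic mapping alone, Elasticsearch would infer long for duration from this document, and later fractional values such as 0.153 would be truncated when indexed; pinning the field to float in the template avoids that.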

In this article we’ll discuss some tricks to avoid the second issue, mapping explosions.

Mapping Explosions

A mapping explosion occurs when the number of fields in an index grows without control, which can cause out-of-memory errors and situations that are difficult to recover from. As mentioned before, this is a common case when using dynamic mapping.

Consider a scenario where application logs contain request/response headers, and the application sits behind a CDN or a WAF that uses reverse proxies to handle incoming and outgoing traffic. The headers may contain the ID of the proxy that handled the request:

"req": {
  "url": "/home",
  "headers": {
    "x-request-id": "xxxxxxxxxxxxxxxxxxxx",
    "x-real-ip": "xx.xx.xx.xx",
    "x-forwarded-for": "xx.xx.xx.xx",
    "x-forwarded-host": "www.mysite.com",
    "x-forwarded-port": "443",
    "x-forwarded-proto": "https",
    "x-scheme": "https",
    "context.cdn-proxy-1": "OK",
    ...
  }
}

If the CDN has thousands of proxies, your index will grow a new field per proxy: cdn-proxy-1, cdn-proxy-2, … cdn-proxy-N.

If the value of these fields is important to you, you can rename the key part of the pair while leaving the value untouched, using the rename_key plugin (add it to your Docker image, as explained here).

<filter kubernetes.var.log.containers.**.log>
  @type rename_key
  enable_ruby true
  rename_rule1 cdn-proxy-(.+) cdn-proxy
  rename_rule2 (\s.+) input
</filter>

This way, every cdn-proxy-[N] key will be renamed to cdn-proxy, keeping its value available and untouched.
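As a rough sketch of the result, using the made-up headers from the example above, a record would come out of the filter looking like this:

"headers": {
  "x-request-id": "xxxxxxxxxxxxxxxxxxxx",
  "cdn-proxy": "OK",
  ...
}

Whichever proxy handled that particular request, its status now always lands in the single cdn-proxy field instead of creating a new field per proxy ID.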

A second option is to delete the pair entirely if you're not interested in it. In this case, we'll make use of the record_modifier plugin within a filter:

<filter kubernetes.var.log.containers.**.log>
  @type record_modifier
  <record>
    _remove_ ${if record['req'] && record['req']['headers']; record['req']['headers'].delete('cdn-proxy'); end; nil}
  </record>
  remove_keys _remove_
</filter>

Note: there's no way to apply a wildcard to delete all the keys at once (e.g. cdn-proxy-1, cdn-proxy-2, etc.), so you'll have to rename them first and then delete the single resulting key. You can apply this filter to any unique key you're not interested in.
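Putting the two steps together, a sketch of the combined pipeline (same assumed tag and key names as above, trimmed to the cdn-proxy rule) would be:

<filter kubernetes.var.log.containers.**.log>
  @type rename_key
  # collapse cdn-proxy-1 ... cdn-proxy-N into a single key
  rename_rule1 cdn-proxy-(.+) cdn-proxy
</filter>

<filter kubernetes.var.log.containers.**.log>
  @type record_modifier
  <record>
    _remove_ ${if record['req'] && record['req']['headers']; record['req']['headers'].delete('cdn-proxy'); end; nil}
  </record>
  # drop the helper field added above
  remove_keys _remove_
</filter>

Filters run in the order they appear in the configuration, so the rename has already collapsed the keys by the time the delete runs.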

And that's it for this post. I hope you found these two tricks useful for avoiding uncontrolled growth of your Elasticsearch indices and, at the same time, getting better performance out of your cluster.

Thanks for reading!

Matías Costa

SRE engineer | Technology enthusiast | Learning&Sharing