How TrueCar Uses Kafka for High Volume Logging Part 2
In my previous blog entry, I discussed configuring and setting up a production Kafka cluster to queue log messages. That article purposely started in the “middle” of the pipeline to describe how our logs make their way from an application server all the way to our operations and development users. This article backs up one step in the pipeline to show how we actually put log entries into the queue. Here’s a flowchart that describes the process and where we are:
The rsyslog project bills itself as the “Rocket-Fast System for Log Processing”. Rsyslog builds on the functionality of the standard “syslog” package that most OS platforms ship with. There are many other popular options, including “syslog-ng”, the “Next Generation” syslog, as well as other log shipping tools that are syslog-compatible. Another popular tool is Logstash from Elastic. You can even see in my flowchart above that we use Logstash later in the pipeline, and it will be featured in an upcoming blog post!
Which log system an administrator wants to use, or is already using, is not the issue here. We have other options and special-use cases for logging that use a different transfer method, and I’m sure readers of this post will come up with their own tools or platforms for processing logs that would work great. But the main focus of this article is how TrueCar implemented log transfers using Kafka. Here are several reasons why we chose rsyslog for the majority of our log processing and forwarding:
- rsyslog is fast
- rsyslog is very efficient and lightweight
- rsyslog is compatible with standard syslog messages
- rsyslog has a good variety of file ingestion methods and application connectors
- rsyslog has good features for internal queuing in memory and on disk
- rsyslog can perform quality-of-service decisions for dropping messages when the queues are disconnected or get too full
- rsyslog has a native Kafka output
- rsyslog supports TLS (encryption), RELP (reliable delivery), and other “modern” and “enterprise” features
But there are some drawbacks to using rsyslog:
- rsyslog has a lot of bugs
- rsyslog has shown stability problems in our environment (see below)
- rsyslog configuration is daunting
- rsyslog documentation is terrible
- There are probably better methods for processing application logs directly so that developers can have more control and choices over how they produce logs
The key reasons we chose rsyslog were that it is fast and relatively simple to integrate with Kafka. The main reasons we had difficulties with rsyslog were that it was unstable and hard to configure.
Getting Started with Rsyslog
rsyslog is not native to most OS platforms, but it should be available as an option. The good news is that rsyslog can be used directly in place of syslog if that is the default on your OS platform. A simple rsyslog configuration file is nearly identical to a basic syslog configuration file. Unfortunately, most OS platforms use very old versions of rsyslog because the latest versions have been unstable. Here are the minimum requirements:
- OMKAFKA (the Output Module for Kafka) was released in v8.7
- There are serious bugs affecting OMKAFKA under production loads that might not be fixed, even in v8.13
I recommend taking the absolute latest version 8-stable packages for testing in your environment. You will probably need the following packages if you are following my example (you can install more or fewer packages for your requirements):
- rsyslog (base, includes imfile)
- rsyslog-kafka (omkafka)
- asdiscon-librdkafka1 (kafka client libraries, mentioned in my previous post)
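Before going further, it is worth confirming that the packaged rsyslog actually meets the v8.7 minimum for OMKAFKA. Here is a small sketch of such a check (the `awk` parsing assumes the usual `rsyslogd N.N.N, compiled with:` banner on the first line of `rsyslogd -v`):

```shell
# Verify the installed rsyslog meets the v8.7 minimum for omkafka.
# version_ge A B -> true when dotted version A >= version B (uses sort -V).
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

installed="$(rsyslogd -v 2>/dev/null | awk 'NR==1 {print $2}' | tr -d ',')"
if version_ge "$installed" "8.7.0"; then
  echo "rsyslog $installed is new enough for omkafka"
else
  echo "rsyslog '$installed' is missing or older than 8.7.0"
fi
```

The `sort -V` version comparison is a common shell idiom and avoids depending on any packaging tool.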
Following is a complicated example that we use. It will work on the CentOS platform when the contents are stored in the /etc/rsyslog.conf file:
# rsyslog configuration file
# note that most of this config file uses old-style format,
# because it is well-known AND quite suitable for simple cases
# like we have with the default config. For more advanced
# things, RainerScript configuration is suggested.
# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html
# Debug options, uncomment if needed
#$DebugFile /var/log/rsyslog.debug
#$DebugLevel 2
# Order is extremely important; for example, hostnames will be
# shortened unless this comes first.
$PreserveFQDN on
# Order is extremely important; for example, modules will take the default
# 2KB message size if this is not listed above the module imports. Messages
# will be TRUNCATED to this size.
$MaxMessageSize 64k
#### MODULES ####
module(load="imuxsock") # provides support for local system logging (e.g. via logger command)
module(load="imklog") # provides kernel logging support (previously done by rklogd)
#module(load="immark") # provides --MARK-- message capability
# Provides UDP syslog reception
# for parameters see http://www.rsyslog.com/doc/imudp.html
#module(load="imudp") # needs to be done just once
# Provides TCP syslog reception
# for parameters see http://www.rsyslog.com/doc/imtcp.html
#module(load="imtcp") # needs to be done just once
#### GLOBAL DIRECTIVES ####
# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
# File syncing capability is disabled by default. This feature is usually
# not required, not useful, and an extreme performance hit
#$ActionFileEnableSync on
# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf
#### RULES ####
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure
# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog
# Log cron stuff
cron.*                                                  /var/log/cron
# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*
# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log
# ### begin forwarding rule ###
# The statements between the begin ... end lines define a SINGLE forwarding
# rule. They belong together; do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
#*.* @@remote-host:514
$WorkDirectory /var/lib/rsyslog # where to place spool files
$MainMsgQueueMaxDiskSpace 536870912 # 512MB, most containers have <8GB of space
Reading log files into Rsyslog
Some of the “fancy sauce” that we use is the IMFILE “Input Module for Files” directive and the “$MainMsgQueue” directives. These allow rsyslog to queue messages internally on disk and in memory for extremely busy servers.
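To make the main-queue tuning concrete, here is a sketch of a fuller set of legacy-style main-queue directives. The sizes shown are illustrative values, not our production numbers; tune them to your memory and disk budget:

```
$WorkDirectory /var/lib/rsyslog      # where spool files are written
$MainMsgQueueType LinkedList         # dynamic in-memory queue
$MainMsgQueueFileName mainq          # enables disk-assisted spooling
$MainMsgQueueSize 500000             # messages held in memory
$MainMsgQueueHighWaterMark 400000    # start writing to disk here
$MainMsgQueueDiscardMark 450000      # start discarding messages here
$MainMsgQueueDiscardSeverity 4       # drop anything below warning first
$MainMsgQueueMaxDiskSpace 536870912  # 512MB cap on spool files
$MainMsgQueueSaveOnShutdown on       # flush the queue to disk on shutdown
```

The key design point is that the queue degrades gracefully: it spills to disk when memory fills, and discards only low-severity messages when even the disk budget is threatened.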
The above configuration is a fairly straightforward “syslog”-style message handler. Most of TrueCar’s application logs are generated on disk or written to standard output. In most cases, the log files are human-readable text. Rsyslog and Logstash will also handle JSON output, and some of our application logs support it, but we have found that standard delimited fields are easier to read and handle than JSON at the moment. When we get to Logstash processing in a future blog post, we can go over the JSON log-format standards.
For now, we just read in an application’s log files by placing a file in /etc/rsyslog.d/application.conf (you can use any location and specify it in the $IncludeConfig directive above):
:msg, contains, "INFO" stop
:msg, contains, "TRACE" stop
I’ve cleverly included an example that parses Kafka’s own output logs. You can see that we filter certain messages at the source (before sending them to our Kafka queue) with the “stop” directive. The “tag” directive becomes important when we get to the Logstash post-processing.
You can include as many of these files as you like in your /etc/rsyslog.d directory (or wherever you put them) to monitor application or system logs.
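As a sketch of what a complete include file can look like, here is a legacy-style imfile block that tails Kafka’s own server log. The file path, tag, and facility are assumptions; adjust them for your installation:

```
# /etc/rsyslog.d/kafka.conf -- tail Kafka's server log (path is an example)
$InputFileName /var/log/kafka/server.log
$InputFileTag kafka:
$InputFileStateFile state-kafka-server
$InputFileFacility local1
$InputRunFileMonitor

# Drop noisy messages at the source, before they reach the queue
:msg, contains, "INFO" stop
:msg, contains, "TRACE" stop
```

The state file lets rsyslog remember its position across restarts, and the tag is what downstream processing keys on.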
Once you have a file being monitored for messages, the resulting log messages have to be sent to the Kafka queue. Rsyslog allows you to choose among many outputs (including Elasticsearch, flat files, MySQL databases, and so forth). In this example, I show how we write to a Kafka queue using the rsyslog OMKAFKA module:
module(load="omkafka") # provides omkafka
# Use RainerScript action() parameters, as below; legacy equivalents
# are kept here for reference only:
#$ActionQueueSize 1500000
#$ActionQueueMaxDiskSpace 536870912 # 512MB, most containers have <8GB of space
#$MainMsgQueueDiscardMark 400000 # Low < Discard < High < DiskSpace
#$MainMsgQueueDiscardSeverity 4 # Discard anything lower than warning
*.* action(type="omkafka" topic="rsyslog-prod"
           broker=["kafka1.example.com:9092","kafka2.example.com:9092"] # example brokers
           partitions.auto="on"
           queue.type="LinkedList" queue.filename="omkafka_q"
           queue.size="1500000" queue.maxdiskspace="536870912"
           queue.highwatermark="320000" # ~80% of discardmark
           queue.discardmark="400000" queue.discardseverity="4")
The topic name here is “rsyslog-prod”, but you can name it anything you want. This example includes a lot of “magic sauce” based on tuning for our high production loads. Here are a few things to notice:
- You must use the “RainerScript” style configuration for OMKAFKA. A lot of configuration examples seem to imply that you can use the older-style configuration directives on separate lines. Our testing showed that RainerScript was the only way to make the configuration work properly.
- The “*.*” directive applies to all messages that rsyslog receives. It is better to do filtering at each file include or in different modules before OMKAFKA is loaded. The choice is up to you, but this is the way that we implemented it.
- queue.highwatermark should be around 80% of queue.discardmark
- partitions.auto should be “on”. We have experimented with various settings and never found a suitable alternative. We are working on a patch that would allow a client rsyslog process to choose a single Kafka partition so that messages arrive in order for a downstream consumer. This, naturally, introduces load-balancing concerns. For now, use it as shown above.
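When experimenting with these settings, kafkacat can confirm what the action is actually writing to. Its metadata mode lists the brokers and partitions behind a topic, and consuming a single partition lets you inspect per-partition ordering (the broker address here is an example):

```
$ kafkacat -L -b localhost -t rsyslog-prod
$ kafkacat -C -b localhost -t rsyslog-prod -p 0 -o -10 -e
```

The first command prints the topic’s partition count and leader brokers; the second replays the last ten messages from partition 0 and exits.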
You can send a test message from the OS platform CLI and verify that the message has been consumed by the Kafka queue. Here is a quick CentOS example, using the kafkacat client tool mentioned in my previous post:
$ logger -t kafkatest Hello World
$ kafkacat -b localhost -C -o -1000 -e -t rsyslog-prod | fgrep kafkatest
2015-10-29T09:55:00.809297-07:00 kafka1.example.com kafkatest: Hello World
Unfortunately, rsyslog versions 8.9, 8.11, 8.12, and even 8.13 have been unstable in our environment. We have also seen log files that do not follow the correct rotation strategy. This has forced us to implement a “restart rsyslog every night” workaround, which we hope to remove at some point.
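For what it’s worth, the nightly-restart workaround is nothing more than a cron entry along these lines (the schedule and service command are assumptions; adjust for your init system):

```
# /etc/cron.d/rsyslog-restart
# Restart rsyslog nightly to work around file rotation bugs
30 3 * * * root /sbin/service rsyslog restart >/dev/null 2>&1
```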
We maintain a private set of patches against 8.12 that seem to improve the stability of rsyslog in production, and we are actively watching whether fixes for the serious bugs are being pulled in by the upstream maintainers. We also submit patches as we can to improve the stability of the code. The developers are very receptive to bug fixes and pull requests, so we are hopeful that these issues can be ironed out in the future.
As I’ve said before, rsyslog is fast. Kafka is also very fast. As a test, we generated several loads to stress rsyslog and measured the insertion rate into the Kafka queue. We set up a single virtual machine generating rsyslog messages into a Kafka cluster consisting of five virtual machines and only five partitions for testing purposes. Using this test setup, we were able to send and consume 100,000 messages per second. We were able to process these messages downstream using Logstash on the order of 9,000 messages per second. Here is a graph of some of our early end-to-end (rsyslog-to-Elasticsearch) tests:
The first cluster of data on the left was a batch of 100,000 messages processed serially by partition. The second spike on the right shows how we parallelized over all five partitions.
Our current production load consists of only three physical nodes processing approximately 10,000 messages per second. We consume nearly 300GB of disk space per day just in log messages. We retain approximately 1 week of log messages in the kafka queues. As you can see, our regular production load is well within the specifications of our initial testing and we can scale up as our business needs increase. We are constantly adding new applications, more logs, and more variety of logs to our infrastructure.
I am confident we can keep up with this growth and support quite heavy loads in this scenario.
If you are as exhausted as I am at this point, that is a good thing. Without a doubt, the Kafka project has been challenging, exciting, and difficult. I hope this series is helpful and makes it easier to roll out Kafka message queues in high-volume production environments.