Structured Logging for Rails using ELK

This post describes the structured logging setup used at Marketplacer to publish information about each request to our Rails application to our ELK stack.

Our primary goal for our ELK platform was forensics & long term usage analysis of back-office functionality. For example, we wanted to know:

  • has this route or controller been used at all in the last six months?
  • how did this advert’s price get set to zero, and by who?
  • we’re seeing a sudden uptick in transactions per second every day at 1 AM — is it one of our API users, or is it a bot trying to scrape us?

We needed to store everything we could about every request (including POST/PATCH parameters) and we needed to keep it for six months or longer. Storing only a fraction of the requests (as in an instrumentation platform like Google Analytics) or not storing the parameters of the request (as in traditional text log files) would not meet our goals.

We did not need to store detailed error information or debug output; we use Rollbar for that.

Our solution was to write a log subscriber that would sit at the same level as the normal Rails request logger and write to its own log file to be ingested by our ELK stack. This log file would not replace or combine with the usual production.log or unicorn.stderr.log produced by our application. filebeat (part of the ELK stack) would tail the file and send its contents to ur Elasticsearch logging server.

The solution is described below from bottom-up:

  • how we obtain events & emit log entries in Rails
  • how those events are sent to Elasticsearch via Filebeat
  • how Elasticsearch indexes those events via ingest pipelines
  • how we configure Kibana to display those events

Some of the code & Elasticsearch configuration used in the below has been made open source and is available in this gist.

Structured Output Format

Our log subscriber outputs one line of JSON per request. We use JSON as it can then be directly ingested in to Elasticsearch with little to no modification. This makes it much more reliable and flexible than a more traditional human readable log file. A reformatted example follows:

The only attribute outside of the rails object is @timestamp, which is used by the ELK stack to order events. We provider our own timestamp as we don't want filebeat to use the timestamp at which the log file was read, but the time at which the event occurred, as they may not be exactly the same.

Inside the rails object:

  • params contains all parameters that we can safely record or index, including parameters supplied as part of a PATCH/POST/PUT request
  • action and controller are duplicated from route to make it easier to search for them specifically within ELK
  • session_id is recorded partly so we can attempt to replay a user's actions from within ELK
  • original_path includes any query string parameters: this is primarily useful for discovering parameter decoding edge cases in our application or Rails itself

Structured Output Code

The code used to generate this output is split in to two main sections: the subscriber itself, and a routine to flatten the parameters.

Flattening Parameters

Parameters in Rails can be represented internally as objects containing objects, eg:

While we can insert this information directly into Elasticsearch, what we can’t then do is insert anything which contains the same key names but a different schema. For example, we can’t insert the above object and then insert another where advert is a single string instead of an object.

We work around this by flattening the parameters into a single object with many keys, eg:

The flattener is also responsible for:

  • trapping instances of Rack::Test::UploadedFile or Rack::Multipart::UploadedFile and representing them only as [FILE UPLOAD #{filename}]
  • truncating any string longer than 120 characters

Log Subscriber

The log subscriber attaches itself to the ActionController on the process_action and redirect_to methods. process_action handles publishing the log event to ELK, whereas redirect_to sets a thread local variable for process_action to later pick up and register as a redirect.

process_action does the following:

  • runs the request’s parameters through the flattener
  • attaches exception information, if present
  • attaches relevant headers (eg HTTP_HOST, HTTP_USER_AGENT, etc)
  • obtains a user ID and/or administrator ID from the session, if present
  • attaches overall, database and view runtimes
  • attaches route as a combination of controller and action to make searching easier in ELK
  • adds a @timestamp
  • encodes the result as JSON and emits it to the log

Each line is written to log/elk.json.log and is picked up by filebeat running on the production system to be sent to our ELK stack.

Filebeat

Filebeat is part of the “beats” suite of software, published by Elasticsearch as part of their ELK stack. Filebeat reads logs files, applies some basic filtering, and sends the contents on to Elasticsearch.

It can be used as a lightweight replacement for a heavy logstash instance. It is an entire replacement for logstash-forwarder.

Each production server logs ELK events to its local filesystem, and thus one filebeat instance runs on each production server.

Our filebeat configuration contains entries for other things we ingest. I’ve replicated only the portions relevant to our Rails logging below:

Note the following:

  • we tell Elasticsearch to use the ingest pipeline named rails on each event
  • the events from elk.json.log are tagged rails, which means they're then sent to the logs-rails-* indices
  • json.keys_under_root means we treat each log entry as the entire event, rather than embedding its contents in a json object when we send it to Elasticsearch

Elasticsearch

Apart from regular Elasticsearch configuration, there are two main things we had to consider when setting up ES to handle Rails log data:

  • Ingest Pipelines
  • Index Templates

Ingest pipelines transform the data before it is indexed. They can be used as a replacement for logstash transforms, and they even have the capability to understand logstash “grok” rules.

Index templates describe defaults for how new indexes are created, and both the format and limitations of the data to be added to the index. Elasticsearch can guess this based on the data, but it’s much safer to specify it ahead of time.

Ingest Pipeline

We created a simple ingest pipeline named rails. It uses the ingest-geoip and ingest-useragent plugins to find a location for the request's IP address and decode the HTTP User Agent respectively.

Index Template

Our index template for indices matching logs-* is based heavily on the defaults that filebeat provides for its indices, with some modifications:

  • we limit the index to one shard and zero replicas, to save space on our single machine Elasticsearch setup
  • we copy the geoip and user_agent mapping definitions from the nginx-access default mappings into our rails section
  • we specify dynamic mapping rules for our rails parameters:

Kibana

Once the data is in Elasticsearch we can use Kibana as a dashboard to analyze it. We found it useful to set up Kibana before setting up filebeat: its development tools allow for easy interaction with the Elasticsearch API.

Most of our interactions with Kibana are via the “Discover” tab. We use filters to identify long-term usage of site functionality or discover the series of events that led to an error. Right-clicking on a particular field also allows us to show a visualisation on that field: eg a breakdown of user agents or countries.

One big caveat with Kibana is that it can suffer when there are lots of different fields in your indexes, which is a problem we have. It’s occasionally unfortunately necessary to hard-restart Kibana as it enters a 100% CPU locked state.

Now that we have Kibana we can finally answer some of the questions raised above:

To find out if code is being used, we first add a filter against rails.route or rails.controller and then set the time range at the top of the page to one month.

To discover what happened to our bad advert, we can first try filtering rails.params.id for its ID and rails.method to limit results to POST requests. We then set the time range to when we suspect the incident occurred. If we find the relevant request, we can finally filter on rails.session_id to work out how they got to that point in our application.

To work out the origin of the extra transactions, we first set the time range to when those transactions are occurring. We can then create visualisations based on source IP address, destination controller, and GeoIP location to determine the nature of the traffic.


Originally published at mipearson.id.au.