Using the ELK stack to analyze your S3 logs

In this post I’m going to explain how to leverage the ELK stack for a one-time S3 log analysis. This is neither meant to be complete nor extensive documentation of all the possible applications of this stack, but it should give you an idea of what’s involved and of the further capabilities of the system. The reader is expected to have a basic understanding of the ELK components involved.

Shortly after joining die.socialisten.at last year to expand their team working on swat.io, we switched our storage for all media files over to Amazon S3. This was primarily driven by the fact that we wanted to support our then-upcoming new feature to publish videos on various social networks.

Looking at the pricing table it became clear that storage wouldn’t become an issue, but we were unsure about the traffic. I went through our nginx access logfiles and used some simple shell scripts to produce a number telling us how much traffic we were generating per month. Back then this was about 50GB, which, in retrospect, is a laughably small number, and that number alone probably wouldn’t have warranted the move. But we had big plans regarding the videos anyway, and the numbers compared with the pricing table looked economically sound for us.

Fast forward one year: our monthly traffic is now just a bit below 2TB. Bills went up, questions arose: so what’s going on?

The goal was not necessarily to reduce the costs or the traffic but to understand what’s going on: know thy system!

First mantra: log everything!

Well, in our case we didn’t. We hadn’t activated logging for S3 yet, so it was about time. This also meant that we would have to wait at least a few days to get enough data out of the system for interpretation. Interestingly, this gave me time to sleep on how we were going to analyze the data 😀

Some things learned from activating the S3 logs:

  • Logs are collected per region. We’re utilizing multiple regions so we have a dedicated bucket per region receiving all logs.
  • Logs are quick! Not fast, but quick, i.e. it doesn’t take long for new logs to appear.
  • There’s no “one big logfile”; in fact the logs are delivered in chunks. I assume they’re queued internally and then flushed to your bucket. Depending on the traffic on your specific bucket, a chunk may be as small as a few hundred bytes or as large as multiple megabytes.

I wouldn’t say we have a lot of buckets, but we have “quite some”. We wanted to collect all logs from all buckets of a region in one designated bucket, while still being able to easily separate the logs of different buckets. That’s why we chose to deliver the logs prefixed with the name of the bucket they originated from.
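For reference, enabling this with the AWS CLI looks roughly like the following call; the bucket names and the prefix below are placeholders for your own setup, and the target bucket has to live in the same region as the source bucket (which is why we ended up with one log bucket per region):

# enable server access logging for one source bucket, prefixing log keys with its name
# note: the target bucket needs to grant write access to the S3 Log Delivery group
aws s3api put-bucket-logging \
  --bucket my-media-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-region-log-bucket",
      "TargetPrefix": "my-media-bucket/"
    }
  }'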

Setting up ELK

Once we had enough data for analysis, I set up the stack in an existing Vagrant machine I was already using for day-to-day development. Some assumptions about the environment:

  • Java is installed; our VM is Ubuntu 14.04 based, so we’re using the PPA from https://launchpad.net/~webupd8team and are currently running 1.8.0_66.
  • You can access port 5601 (Kibana) of your VM with a browser.

None of the next steps should require escalated privileges and thus could be set up in any environment. Since running the ELK stack means running multiple applications, you either need multiple terminals or a terminal multiplexer (I highly recommend tmux).

For my preparation inside the VM I created a new directory ~/logstash and operated solely within it:

File layout after unpacking everything

Starting up ElasticSearch

There shouldn’t be much to configure; simply starting it should be enough. If you’re curious or paranoid, take a look at elasticsearch-2.1.0/config/elasticsearch.yml before starting it with elasticsearch-2.1.0/bin/elasticsearch. By default all the data will be put into elasticsearch-2.1.0/data/, which is quite convenient.
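To verify that ElasticSearch came up correctly, a quick curl against its HTTP port should return a small JSON document with the node name and version (assuming the default port 9200 on localhost):

# ElasticSearch answers on port 9200 by default
curl -s http://127.0.0.1:9200/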

Preparing the log data for Logstash

Due to the fact that the generated logs span multiple files, possibly thousands, we first need to consolidate them. I opted for generating a single big logfile which I then fed into Logstash. After downloading all the small chunks locally, I used the “find” utility to gather all files and create a single big file via shell redirection:

find s3-logs/ -type f -exec cat {} \; > s3.log

In my case this file ended up being 7.0GB, containing 20 million lines. Creating a single logfile this way will likely not preserve the chronological order of the entries, but this doesn’t matter: Logstash/ElasticSearch will be just fine with it.
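If you also want to fetch the chunks from S3 with the AWS CLI, syncing the log bucket (bucket name and prefix below are placeholders) into a local directory is the straightforward way; the find command above then picks everything up from there:

# download all log chunks for one source bucket into ./s3-logs/
aws s3 sync s3://my-region-log-bucket/my-media-bucket/ s3-logs/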

Configuring Logstash

Before starting Logstash we need to create a configuration to

  1. tell it where to load the data from
  2. apply some filtering (especially for the timestamp)
  3. tell it where to find ElasticSearch

For that I created a file ~/logstash/logstash.conf and started with the input file plugin configuration:

input {
  file {
    path => "/home/vagrant/logstash/s3.log"
    # path => "/home/vagrant/logstash/s3-sample.log"
    start_position => "beginning"
  }
}

As you can see, I also tested with a sample file first to ensure the configuration was right. If you want to do the same, be aware that Logstash keeps track of which files it has already parsed and up to which position. To re-run my tests I had to remove the file(s) matching ~/.sincedb_*.
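In other words, resetting for a fresh test run boils down to something like this (stop Logstash first; the sincedb files carry a hash suffix, hence the wildcard):

# throw away Logstash's "read up to here" bookkeeping so files are re-read from the start
rm -f ~/.sincedb_*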

Next up is configuring the filters:

  • what information to extract
  • properly extract the date time field
  • remove redundant fields

Luckily for us, Logstash already ships with a grok pattern for S3 access logs, S3_ACCESS_LOG, so this part is simple:

filter {
  grok {
    match => { "message" => "%{S3_ACCESS_LOG}" }
  }
}

The task of the grok filter is to break up these long access log lines into smaller fields which we later use to drive our data analysis.
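One way to sanity-check the pattern before the big import is to pipe a handful of real log lines through a throwaway stdin/stdout pipeline and inspect the extracted fields (a sketch; the -e flag takes the configuration as a string):

# feed a few real lines through grok and print the resulting fields
head -n 5 s3.log | logstash-2.1.0/bin/logstash -e '
  input { stdin { } }
  filter { grok { match => { "message" => "%{S3_ACCESS_LOG}" } } }
  output { stdout { codec => rubydebug } }
'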

Except for defining where to find ElasticSearch, this would already be enough to pump in the data. However, one annoying and one unnecessary thing would happen:

  1. All messages would be imported with the current timestamp at which they arrive in Logstash (i.e. the moment they’re read from the logfile), instead of the date time from the actual log line.
  2. We would also store the field “message”, which represents the complete log line. But since the grok filter already broke it up into distinct fields, this is unnecessary bloat for my test case and thus I removed the field.

Logstash works with any number of fields, but there are certain special fields, especially the field “@timestamp”. Unless we tell Logstash how to parse the date time from the log line into this field, it wouldn’t be much fun analyzing specific date ranges. It would still be possible to look at the data and work with it, but only as a whole, because all log entries would appear within the short time span of the import.

You may wonder: I’m already using S3_ACCESS_LOG, which extracts a “timestamp” field, so why isn’t this automatically put into “@timestamp”? The grok patterns delivered as part of Logstash form a very flexible system, built on top of each other to re-use matching expressions. As a result, they’re mostly concerned with breaking up log messages into their distinct parts for further processing, rather than providing a one-size-fits-all solution.

The date filter is able to do exactly this. Using the Joda-Time Java library under the hood, it is very flexible in how to match a date time. A date time from the S3 logs looks like this:

... [05/Oct/2015:18:12:53 +0000] ...

The S3_ACCESS_LOG grok pattern already extracts this into the field “timestamp”, but we now need to get it into the “@timestamp” field so Logstash knows about it (yes, the “@” makes the difference…):

filter {
  # ...
  date {
    locale => "en"
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}

The “locale” is necessary because Joda needs to understand “Oct” as a month and convert it into a numerical representation. By default the “date” filter already copies the information into the “@timestamp” field, so we’re set.

Next up is removing the “message” field. This is not strictly necessary, but I’d like to save space, as we’re not going to use this field anymore: the grok filter already split it up into all the distinct fields:

filter {
  # ...
  alter {
    remove_field => ["message"]
  }
}

Note: using the “alter” filter was the actual reason to download “Logstash (All plugins)”, because it’s a community-maintained plugin and not available by default.

The last step is to tell Logstash where to find ElasticSearch and to send the data there. I’ll skip this as a separate step; here is the complete configuration used for Logstash:

input {
  file {
    path => "/home/vagrant/logstash/s3.log"
    # path => "/home/vagrant/logstash/s3-sample.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{S3_ACCESS_LOG}" }
  }
  date {
    locale => "en"
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
  alter {
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
  }
}

Before starting Logstash:

  • I ensured the VM had enough resources: I gave it 8GB RAM and 4 cores; probably not necessary, but I didn’t want the data import to take forever.
  • Be prepared to receive irritating warnings and some error messages. Sometimes Logstash may pump in the data faster than ElasticSearch can handle, sometimes log entries don’t properly match the pattern, etc. With 20 million rows it didn’t matter to me if a few were possibly lost.

Starting Logstash:

vagrant@dev ~/logstash $ logstash-2.1.0/bin/logstash --config logstash.conf
Settings: Default filter workers: 2
Logstash startup completed

Logstash was designed to watch growing logfiles and run in the background, so unfortunately we don’t get any progress indicator. It’s a safe bet that this process can take one or more hours, depending on the speed of your system. However, the parsed log lines will immediately start to appear in ElasticSearch, so we can already start using Kibana anyway!
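If you still want a rough progress indicator, one option is to ask ElasticSearch how many documents it has indexed so far and compare that with the line count of the logfile (a sketch, assuming the default logstash-* index names):

# number of lines we expect to end up in ElasticSearch
wc -l s3.log

# how many documents have been indexed so far
curl -s 'http://127.0.0.1:9200/_cat/count/logstash-*?v'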

Starting Kibana

Review the configuration at kibana-4.3.0-linux-x64/config/kibana.yml and then start up Kibana:

kibana-4.3.0-linux-x64/bin/kibana

By default it listens on port 5601 on all interfaces.

Setting up Kibana

As with all applications of the ELK stack, the defaults usually work well, but sometimes a bit of massaging has its benefits.

Kibana should now welcome you at <your vagrant ip>:5601 with the following screen:

These settings are fine for us. Logstash automatically creates time-based indices for every day, i.e. “logstash-YYYY.MM.DD”. Also, we didn’t change the default “@timestamp” field; we actually made sure it contains useful data, so this is a go.

Next up is configuring the fields:

This can be overwhelming at first.

Most of the defaults are fine. For example, if you page through the fields you will find “bytes” and “object_size”, representing the bytes sent to the client and the actual size of the object, respectively (unless there was a problem with a particular HTTP access, those two fields will probably always be the same).

Kibana automatically detected these fields as type “number”, which is already enough to work with, but we can go further: we can tell Kibana that this field actually represents “bytes” (yes, it’s confusing that the field name and the Kibana type are called the same here). By editing the field with the pen icon on the right, we change the type and hit “Update field” (for both fields, “bytes” and “object_size”).


This should be everything and we can now go to the main point: “Discover”!

Using Kibana

It’s likely that the first time around you will be greeted with “No results found”. This is because the default view only shows the “last 15 minutes” of data. In our case the imported data is probably already a few days old, so we just need to expand the time range appropriately:

In our case we will choose “Last 60 days”

Finally, success at last!

Note: at this point I had already imported the complete data set!

This is already very good and gives us a quick first impression. It also says 20 million hits, which matches the overall number of lines in the logfile. The logfile is 7GB in size; I checked the ElasticSearch index and it’s 8.4GB. I’d say that’s not bad! The overhead for a system which can retrieve your data out of 20 million samples in a fraction of a second could be worse 😀
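The index size, by the way, can be read off ElasticSearch’s _cat API; something along these lines shows the document count and the size on disk per index:

# list all logstash indices with document count and store size
curl -s 'http://127.0.0.1:9200/_cat/indices/logstash-*?v'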

“Discover” is the part of Kibana where you can easily play around and try to get a general grasp of the data. But we had a specific question: who generated so much traffic?

Say hello to “Visualize”

Within this part of Kibana you can specifically select what to show on the X- or Y-axis, perform nested queries, etc. For a start, you’re asked to select a chart type. We’ll go with the good old boring vertical bar chart:

In the next step we start from a new search, and we end up with one big block and nothing more:

Not very exciting …

This block represents the previously selected time range, but we haven’t yet specified how we want to drill into the data.

To better understand why we’re seeing this, open up the “Y-axis” with the small arrow icon. It says “Count”, which means all it does is count the log entries matching our time span.


To see the actual generated traffic, we have to answer the following question:

What is the sum of all the “bytes” fields?

And this is exactly what we want to aggregate over! Select “Sum” instead of “Count”, choose the “bytes” field and click the “big white arrow inside the green background” to redraw the graph → presto!

Now we understand that our 20 million requests translate to 2.8TB of data over 60 days.
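For the curious: what Kibana builds here is essentially a plain ElasticSearch aggregation. Stripped of the time range filter Kibana adds, the query looks roughly like this (a sketch, assuming the field names from our grok pattern):

# sum up the "bytes" field over all indexed log entries
curl -s 'http://127.0.0.1:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "total_bytes": { "sum": { "field": "bytes" } }
  }
}'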

Next we want to somehow identify the consumers of the traffic, ideally the top ones. For that we need to decide on a metric: IP address? User agent? Let’s try the IP address first by further splitting this block into bars showing the individual IP addresses. As you can see in the above screenshot, it is already hinted that, by selecting “Split Bars”, we get yet another aggregation dialog.

This time, however, we select the “Terms” aggregation and the field “clientip.raw”. A “term” is usually (but not necessarily) something string-ish. For example, if you imagine something which has “tags”, each individual “tag” could be viewed as a “term”, which allows us to answer questions like “what are the top used tags?”. The “raw” version of a field means that its content has not been analyzed. This is desired for some fields and can backfire on others. In our case we want to treat the whole IP field as one term, so we use the “raw” (not analyzed) version.

In our case, the question is:

What are the top used IP addresses?

If you look more closely you will see that this actually only covers around 150GB of data. What happened to the rest?

By default we only get the “Top 5” IP addresses, which in our case don’t come close to covering the majority of the traffic. We need to increase the “size”, e.g. to 500.
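Expressed as a raw ElasticSearch query, this bar split corresponds to a terms aggregation with a nested sum, roughly like this (again a sketch with the assumed field names; the terms are ordered by document count by default):

# traffic per client IP: top 500 IPs, each with the sum of bytes sent
curl -s 'http://127.0.0.1:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "top_ips": {
      "terms": { "field": "clientip.raw", "size": 500 },
      "aggs": {
        "traffic": { "sum": { "field": "bytes" } }
      }
    }
  }
}'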


This now covers more of the traffic, but it is almost unusable as the IPs are too varied:

Well, this wasn’t very useful.

By changing the field to “agent.raw” (and not forgetting to click the “white arrow on green background” again), we can look at the other metric.

What are the top user agents?
Well, it’s “facebookexternalhit” in our case. Who would have thought?

This is now more useful and finally answers our question. To be honest, we’re also lucky that the variation of user agents stays within bounds.

If you remember the first screenshot showing time series data, we saw the log entries spread over the days. What if we want to combine this with our new knowledge and split the bars by user agent, showing the user agent usage per day? Well, yes, we can.

The trick is to create two nested aggregations for the X-axis: first we create a “Date Histogram” over the “@timestamp” field and then nest a “Split Bars” aggregation with the user agent, much in the same way we did before:
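The underlying ElasticSearch query nests the aggregations in the same order: a date histogram with a terms aggregation inside. Sketched below with the assumed field names; Kibana adds its own time range filter on top:

# daily buckets, each split into the top 5 user agents
curl -s 'http://127.0.0.1:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" },
      "aggs": {
        "agents": { "terms": { "field": "agent.raw", "size": 5 } }
      }
    }
  }
}'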

What else?

Once the data is available in ElasticSearch for Kibana to visualize, the approach is almost toy-like: fiddle around, change parameters, let yourself be surprised by the insights you can gain. The ELK stack is a departure from the “old grumpy charting days”, providing a modern approach to handling big data and inviting its users to play with their data and discover new things that way.

Kibana itself features three main categories:

  1. Discover
    Interactively explore your data, directly see the result of the matching documents. Filter, test, play, save your queries.
  2. Visualize
    Build your own visualizations based on new or saved queries with a range of different options, configure them as you like and save the visualizations for later re-use.
  3. Dashboard
    Display your saved visualizations in any way you like.

Any chart you create can be displayed on the Dashboard, allowing you to combine any number of insights you deem useful for you or your team.

Closing words

Although the whole setup looks quite involved, once you get the hang of it, it’s just a matter of running ElasticSearch, configuring Logstash properly to read your logfile and accessing Kibana to visualize the data from ElasticSearch.

This example went through the steps of analyzing the data of a single, static logfile. But Logstash actually originated as a realtime data collector for various sources, using its multitude of input/filter/output plugins to unify them in ElasticSearch. If we had configured Logstash to watch real, growing logfiles and continuously pump them into ElasticSearch, we would be able to visualize the updated metrics almost instantly in Kibana.

ElasticSearch is on the rise and well known for its full-text search capabilities. However, there’s a different side to this technology: the possibilities of the inverted index and its flexible aggregation functions open new doors, providing easily accessible analytics for the masses.