Tail Nginx logs to Clickhouse using Vector

Using Vector to feed Nginx logs to Clickhouse in real time

Denys Golotiuk · DataDenys · Aug 1, 2022

There are many ways to feed data into Clickhouse. One common case is when you need to continuously feed data from log file(s) into your favorite analytics database. Before reaching for advanced messaging solutions, let’s take a look at a super simple but powerful way to pipe Nginx (and not only Nginx) log files into Clickhouse, called Vector.

Configure Nginx log

We can work with the standard Nginx access log, but we can also go further and log additional data using a custom format:

log_format track '$remote_addr - $time_iso8601 "$request_uri" '
                 '$status $body_bytes_sent "$http_user_agent"';

server {
  location / {
    access_log /var/log/track.log track;
    return 200 'ok';
  }
}

This configuration will log all requests to the /var/log/track.log file. Example entry:

127.0.0.1 - 2022-08-01T17:19:38+03:00 "/?test=1" 200 2 "curl/7.81.0"

This entry was logged when a local curl made the following request:

curl "http://127.0.0.1/?test=1"

Clickhouse table

Now let’s create a Clickhouse table to write log data to:

CREATE TABLE log
(
    `ip` String,
    `time` DateTime,
    `url` String,
    `status` UInt16,
    `size` UInt32,
    `agent` String
)
ENGINE = MergeTree
ORDER BY toDate(time)

This table will let us do some very basic request analysis.
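For example, once data starts flowing, a query like the following (just an illustrative sketch against the log table defined above) would show daily request counts and traffic per status code:

SELECT
    toDate(time) AS day,
    status,
    count() AS requests,
    sum(size) AS bytes_sent
FROM log
GROUP BY day, status
ORDER BY day, status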

Set up Vector

Vector is a tool for building data pipelines. It supports Clickhouse out of the box. Custom file formats are easy to handle with the Vector Remap Language (VRL), which lets you parse anything unstructured and map it to a given structure.

Installation is simple; on Ubuntu we need to run:

curl -1sLf 'https://repositories.timber.io/public/vector/cfg/setup/bash.deb.sh' | sudo -E bash
sudo apt install vector

Then make sure it’s ready to use by checking the version:

root@desktop:~# vector --version
vector 0.23.0 (x86_64-unknown-linux-gnu 38c2435 2022-07-11)

Configure pipeline

Pipelining with Vector is quite simple. We configure “rules” based on which Vector collects data, processes it and sends it to a destination (Clickhouse in our case).

Configuration is done in the /etc/vector/vector.toml file and is based on three types of blocks:

  1. [sources.***] blocks define sources to collect data from.
  2. [transforms.***] blocks define how to turn unstructured data into structured data.
  3. [sinks.***] blocks define destinations to send/store structured data to.

Everything that goes in place of *** is a block name. We can choose any name and refer to it later. We can have any number of blocks of each type.

Collect data

Since we have (intentionally) changed the standard Nginx access log format, we need to configure our pipeline manually. Our /var/log/track.log file is now just a text file with unstructured data as far as Vector is concerned. First, we need to teach Vector how to read that data:

[sources.track]
type = "file"
include = ["/var/log/track.log"]
read_from = "end"

Here, we ask Vector to read data from the given log file. Note that Vector automatically picks up new entries from the log file as Nginx appends them in real time.

Structure data

To structure the data, we’ll use a regex with named capture groups (implemented in VRL) to process each entry via a transform block:

[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
'''

The transform code goes in the source parameter. This code parses each entry and saves the captured values as new fields on our data object. These fields will then be available to send to Clickhouse (or to process further).

Store data

Before storing data in Clickhouse, let’s check what our structured data looks like. We’ll use a console sink for that:

[sinks.print]
type = "console"
inputs = ["process"]
encoding.codec = "json"

Here we ask Vector to print data coming from the process transformation (which we defined earlier). Save the changes to /etc/vector/vector.toml and run Vector in interactive mode (I’ve sent a single sample request to 127.0.0.1/?test=3 in a separate terminal):

root@desktop:~# vector
...
2022-08-01T14:52:54.545197Z INFO source{component_kind="source" component_id=track component_type=file component_name=track}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/track.log file_position=497
{"agent":"curl/7.81.0","date":"2022-08-01","file":"/var/log/track.log","host":"desktop","ip":"127.0.0.1","message":"127.0.0.1 - 2022-08-01T17:52:58+03:00 \"/?test=3\" 200 2 \"curl/7.81.0\"","size":"2","source_type":"file","status":"200","time":"17:52:58","timestamp":"2022-08-01T14:53:04.803689692Z","url":"/?test=3"}

We can see the parsed fields together with several standard properties like message or timestamp. We should also make two additional changes to our transform procedure before saving data to Clickhouse:

  1. Build a single datetime property from the parsed date and time.
  2. Convert status and size to integers.

Let’s make that happen by altering our [transforms.process] block:

[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
.status = to_int!(.status)
.size = to_int!(.size)
.time = .date + " " + .time

'''

Check again to make sure we get the desired changes:

{"agent":"curl/7.81.0","date":"2022-08-01","file":"/var/log/track.log","host":"desktop","ip":"127.0.0.1","message":"127.0.0.1 - 2022-08-01T18:05:44+03:00 \"/?test=3\" 200 2 \"curl/7.81.0\"","size":2,"source_type":"file","status":200,"time":"2022-08-01 18:05:44","timestamp":"2022-08-01T15:05:45.314800884Z","url":"/?test=3"}

Everything is as expected. Finally, we can configure storing the data in Clickhouse. We add a new sink for that (we can either keep or delete the previous sink, which prints parsed data to stdout):

[sinks.clickhouse]
type = "clickhouse"
inputs = ["process"]
endpoint = "http://127.0.0.1:8123"
database = "default"
table = "log"
skip_unknown_fields = true

Here we ask Vector to take data from the process transformation and send it to the default.log table in Clickhouse. We also set the skip_unknown_fields option so that Clickhouse skips fields that don’t match a column in the table (like message or timestamp).

We save the changes, launch Vector and send some requests to Nginx. The log data shows up in our Clickhouse table almost instantly.
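For instance, a quick query like this (just an illustration; the actual rows will depend on the requests you sent) returns the freshly ingested entries:

SELECT ip, time, url, status, size, agent
FROM log
ORDER BY time DESC
LIMIT 10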

Going to production

When the configuration is ready and tested, start the Vector service so it runs in the background:

service vector start

Performance considerations

My local machine (16 cores, 32 GB RAM) easily processes 20k requests per second sent from 100 threads. It then takes a couple more seconds for that data to appear in the Clickhouse table. Still, since we expect frequent inserts here, we might consider using a Buffer table to optimize them.
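A minimal sketch of that approach, assuming the log table above lives in the default database: create a Buffer table in front of it and point the table option of the Vector sink at the buffer instead. The numeric thresholds below are placeholders to tune for your own load:

-- buffer inserts in memory, flush to default.log when thresholds are reached
CREATE TABLE log_buffer AS log
ENGINE = Buffer(default, log, 16, 10, 100, 10000, 1000000, 10000000, 100000000)

Clickhouse will then flush buffered rows to the underlying log table in the background, turning many small inserts into fewer, larger ones.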

Summary

The Vector data pipelining tool is a great way to pipe data from Nginx logs directly into Clickhouse in real time. Its powerful data structuring tools mean it can work with any log format.

A sample configuration to pipe a custom Nginx access log into a Clickhouse table:

[sources.track]
type = "file"
include = ["/var/log/track.log"]
read_from = "end"

[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
.status = to_int!(.status)
.size = to_int!(.size)
.time = .date + " " + .time
'''

[sinks.clickhouse]
type = "clickhouse"
inputs = ["process"]
endpoint = "http://127.0.0.1:8123"
database = "default"
table = "log"
skip_unknown_fields = true


Denys Golotiuk
DataDenys

Data-intensive apps engineer, tech writer, opensource contributor @ github.com/mrcrypster