Using Vector to feed Nginx logs to Clickhouse in real time
There are many ways to feed data to Clickhouse. One common case is continuously feeding data from log file(s) into your favorite analytics database. Before reaching for advanced messaging solutions, let’s take a look at a super simple but powerful way to pipe Nginx (and not only Nginx) log files into Clickhouse, using a tool called Vector…
Configure Nginx log
We can work with the standard Nginx access log, but we can also go further and log additional data using a custom format:
log_format track '$remote_addr - $time_iso8601 "$request_uri" '
                 '$status $body_bytes_sent "$http_user_agent"';

server {
    location / {
        access_log /var/log/track.log track;
        return 200 'ok';
    }
}
This configuration will log all requests to the /var/log/track.log file. Example entry:
127.0.0.1 - 2022-08-01T17:19:38+03:00 "/?test=1" 200 2 "curl/7.81.0"
This entry was logged when curl made the following local request:
curl "http://127.0.0.1/?test=1"
Clickhouse table
Now let’s create a Clickhouse table to write log data to:
CREATE TABLE log
(
    `ip` String,
    `time` DateTime,
    `url` String,
    `status` UInt16,
    `size` UInt32,
    `agent` String
)
ENGINE = MergeTree
ORDER BY toDate(time)
This table will allow us to do some basic request analysis. Note that status is UInt16, since HTTP status codes like 404 don’t fit into UInt8.
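Once data is flowing, a daily breakdown of traffic by status might look like this (an illustrative query, not part of the setup):
SELECT
    toDate(time) AS day,
    status,
    count() AS hits
FROM log
GROUP BY day, status
ORDER BY day, status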
Setup Vector
Vector is a tool to build data pipelines. It supports Clickhouse out of the box. Custom file formats are easy to handle with the Vector Remap Language (VRL), which allows parsing anything unstructured and mapping it onto a given structure.
Installation is simple. On Ubuntu we need to:
curl -1sLf 'https://repositories.timber.io/public/vector/cfg/setup/bash.deb.sh' | sudo -E bash
sudo apt install vector
Then make sure it’s ready to use by checking the version:
root@desktop:~# vector --version
vector 0.23.0 (x86_64-unknown-linux-gnu 38c2435 2022-07-11)
Configure pipeline
Pipelining with Vector is quite simple: we configure “rules” based on which Vector will collect data, process it, and send it to a destination (Clickhouse in our case).
Configuration is done in the /etc/vector/vector.toml file and is based on three basic block types:
- [sources.***] blocks define sources to collect data from.
- [transforms.***] blocks define how to make structure out of unstructured data.
- [sinks.***] blocks define destinations to send/store structured data to.
Everything that goes in place of *** is a block name. We can choose any name and refer to it later, and we can have any number of blocks of any type, chained together as in the minimal sketch below.
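Here is a minimal skeleton with hypothetical block names, just to show how the pieces refer to each other through inputs:
[sources.my_source]
type = "file"
include = ["/var/log/example.log"]

[transforms.my_transform]
type = "remap"
inputs = ["my_source"]      # consume events from the source block above
source = '.processed = true'

[sinks.my_sink]
type = "console"
inputs = ["my_transform"]   # consume events from the transform block above
encoding.codec = "json"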
Collect data
As we have (intentionally) changed the standard Nginx access log format, we need to configure our pipeline manually. To Vector, our /var/log/track.log log file is now just a text file with unstructured data. First, we need to teach Vector how to read that data:
[sources.track]
type = "file"
include = ["/var/log/track.log"]
read_from = "end"   # tail new entries only, skip what's already in the file
Here, we ask Vector to read data from the given log file. Note that Vector automatically picks up new entries from the log file as Nginx appends them in real time.
Structure data
To structure the data, we’ll use a regex with named capture groups, implemented in VRL, to process each entry via a transform block:
[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
'''
The transform code goes in the source param. This code will parse the data and save the captured values to new fields of our data object. These fields will then be available to send to Clickhouse (or to process further).
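To iterate on the remap program itself, the VRL REPL that ships with Vector (availability can depend on the build) lets us test expressions against sample events before wiring them into the pipeline:
vector vrl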
Store data
Before storing the data in Clickhouse, let’s check what our structured data looks like. We’ll use the console sink for that:
[sinks.print]
type = "console"
inputs = ["process"]
encoding.codec = "json"
As we can see, we ask Vector to print the data from the process transformation (which we defined earlier). Save the changes to /etc/vector/vector.toml and run Vector in interactive mode (I’ve sent a single sample request to 127.0.0.1/?test=3 in a separate terminal):
root@desktop:~# vector
...
2022-08-01T14:52:54.545197Z INFO source{component_kind="source" component_id=track component_type=file component_name=track}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/track.log file_position=497
{"agent":"curl/7.81.0","date":"2022-08-01","file":"/var/log/track.log","host":"desktop","ip":"127.0.0.1","message":"127.0.0.1 - 2022-08-01T17:52:58+03:00 \"/?test=3\" 200 2 \"curl/7.81.0\"","size":"2","source_type":"file","status":"200","time":"17:52:58","timestamp":"2022-08-01T14:53:04.803689692Z","url":"/?test=3"}
We can see the parsed fields together with several standard properties like message or timestamp. We should also make additional changes to our transform procedure before saving data to Clickhouse:
- Create a single datetime property from the parsed date and time.
- Convert status and size to integers.
Let’s make that happen by altering our [transforms.process] block:
[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
.status = to_int!(.status)
.size = to_int!(.size)
.time = .date + " " + .time
'''
Check again to make sure we get the desired changes:
{"agent":"curl/7.81.0","date":"2022-08-01","file":"/var/log/track.log","host":"desktop","ip":"127.0.0.1","message":"127.0.0.1 - 2022-08-01T18:05:44+03:00 \"/?test=3\" 200 2 \"curl/7.81.0\"","size":2,"source_type":"file","status":200,"time":"2022-08-01 18:05:44","timestamp":"2022-08-01T15:05:45.314800884Z","url":"/?test=3"}
Everything is as expected. Finally, we can configure storing the data in Clickhouse. We add a new sink for that (and we can either keep or delete the previous sink, which outputs parsed data to stdout):
[sinks.clickhouse]
type = "clickhouse"
inputs = ["process"]
endpoint = "http://127.0.0.1:8123"
database = "default"
table = "log"
skip_unknown_fields = true
Here we ask Vector to take data from the process transformation and send it to the default.log table in Clickhouse. We also use the skip_unknown_fields option so that Clickhouse ignores event fields that have no matching table column.
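Before launching, we can also ask Vector to check the final configuration; the validate subcommand is handy here:
vector validate /etc/vector/vector.toml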
We save the changes, launch Vector, and send some requests to Nginx. Almost instantly, we can see the log data in our Clickhouse table.
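A quick sanity query (just an illustrative check) shows the latest rows:
SELECT * FROM log ORDER BY time DESC LIMIT 5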
Going to production
When the configuration is ready and tested, start the Vector service so it works in the background:
service vector start
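On systemd-based distributions, you will likely also want the service to start on boot:
sudo systemctl enable --now vector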
Performance considerations
My local machine with 16 cores and 32G of RAM easily processes 20k requests per second sent from 100 threads. It then takes a couple more seconds to see that data in the Clickhouse table. Still, we might consider using a Buffer table to optimize inserts, since we expect frequent small inserts here; a sketch follows below.
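A possible Buffer table sketch (the numeric thresholds mirror the example in the Clickhouse documentation and should be tuned for real load):
-- rows accumulate in RAM and are flushed to default.log once
-- time/row/byte thresholds are crossed
CREATE TABLE log_buffer AS log
ENGINE = Buffer(default, log, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
The Vector sink’s table option would then point at log_buffer instead of log, so Clickhouse batches the frequent small inserts before they hit the MergeTree table.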
Summary
The Vector data pipelining tool is a great way to pipe data from Nginx logs directly to Clickhouse in real time. Its powerful data structuring tools let it work with any log format.
Sample configuration to pipe a custom Nginx access log to a Clickhouse table:
[sources.track]
type = "file"
include = ["/var/log/track.log"]
read_from = "end"[transforms.process]
type = "remap"
inputs = ["track"]
source = '''
. |= parse_regex!(.message, r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<date>\d+\-\d+\-\d+)T(?P<time>\d+:\d+:\d+).+?"(?P<url>.+?)" (?P<status>\d+) (?P<size>\d+) "(?P<agent>.+?)"$')
.status = to_int!(.status)
.size = to_int!(.size)
.time = .date + " " + .time
'''

[sinks.clickhouse]
type = "clickhouse"
inputs = ["process"]
endpoint = "http://127.0.0.1:8123"
database = "default"
table = "log"
skip_unknown_fields = true