Configuring the ELK stack with Docker containers and Logstash filtration: tutorial part 1

David Oodugama
Dec 4, 2022 · 14 min read


Before going into the configuration of the ELK stack with Docker containers, let's get a basic understanding of the tools involved.

What is a docker container and why do we use it?

A Docker image is a lightweight, standalone executable package that bundles an application together with everything it needs to run: code, runtime, system tools, and system libraries. An image becomes a container when it runs on the Docker Engine, inside an isolated environment. This lets a developer run the application on any machine without installing the dependencies on each computing environment. Containers run on top of the Docker Engine, which allocates the necessary resources from the host OS. Compared with a virtual machine, the difference is what gets virtualized: a VM virtualizes the OS itself, so each VM carries its own guest OS and consumes a lot of resources from the host. A Docker container, by contrast, virtualizes only the application layer and shares the host OS kernel, so it needs far fewer resources and no separate OS per container.
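To make this concrete, here is a minimal sketch of running a container directly from the command line; the nginx image and the port mapping are just placeholder examples, not part of the ELK setup:

docker run -d --rm --name demo-nginx -p 8080:80 nginx:latest   # start a container from the nginx image, mapping host port 8080 to container port 80
docker ps                                                      # list running containers
docker stop demo-nginx                                         # stop (and, because of --rm, remove) the container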

What is a docker-compose file?

There are two ways to run Docker containers: by typing docker commands directly, or by describing the containers in a docker-compose file. A docker-compose file is a structured way of expressing those docker commands, which makes it easy to update any configuration. Starting multiple containers by hand every time is tedious, which is why docker-compose is used. Another advantage is that the file doubles as a record: at any point, you can see exactly which configuration a specific container was started with.
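For example, the Elasticsearch service defined later in this tutorial could be started with a long docker run command, roughly like the sketch below (abbreviated, shown only for comparison); with docker-compose, the same settings live in docker-compose.yml and all four services come up with a single docker-compose up:

docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -e discovery.type=single-node \
  -e ELASTIC_PASSWORD=<elastic_password> \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.0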

What is ELK stack?

The ELK stack is a combination of three components: Elasticsearch, Logstash, and Kibana. On top of these we will use an additional component called Filebeat, and I will explain each component one by one in this blog. The ELK stack is a monitoring tool: it monitors computer/VM performance by collecting metrics, or monitors applications by collecting their logs, and visualizes everything in real-time dashboards to give the user a broader picture of what is happening for further analysis. The stack offers many other features, including machine-learning capabilities, but in this blog we will focus mainly on log monitoring. The same technology is also used as a security tool for identifying cyber-attacks.

What is Elasticsearch?

Elasticsearch is the storage layer of the stack: a schema-flexible, document-oriented database where all the application logs that come through the pipeline are stored in indices. It is accessed via RESTful APIs. It provides easy governance and straightforward implementation, offers sophisticated queries for in-depth analysis, and stores data in a centralized location. It also behaves much like the Google search engine: it retrieves the data relevant to a query and responds quickly, which lets users run fast searches.
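Because everything goes through the REST API, you can talk to Elasticsearch with plain HTTP once the stack is up. A quick sketch using curl (substitute the password you set via ELASTIC_PASSWORD; the index and field names are the ones created later in this tutorial):

curl -u elastic:<elastic_password> "http://localhost:9200/_cat/indices?v"                            # list all indices in the cluster
curl -u elastic:<elastic_password> "http://localhost:9200/my_test_index/_search?q=response_code:404" # search the tutorial index for 404 responses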

What is Logstash?

Logstash is the component that collects logs from various files, or receives them from Beats, and processes them, filtering and cleaning the data so that it can be analyzed in depth. Once this processing is complete, the cleaned and preprocessed data is stored in an index in Elasticsearch.

What is Filebeat?

Filebeat is a lightweight additional tool from Elastic that reads log files much more efficiently, taking the reading workload off Logstash and letting it focus its resources on preprocessing the data. Filebeat currently cannot read files in gzip or zip format (this is still under development), so for the moment gzip files have to be read by Logstash itself.

What is Kibana?

Kibana is the visualization tool: it retrieves data in JSON (JavaScript Object Notation) format by making RESTful API requests to Elasticsearch, and presents it in a web-based UI (user interface) that makes it easy for users to analyze the incoming logs.

What is log analysis?

Log analysis is the process of examining, interpreting, and understanding computer-generated logs. It is one of the ways enterprises improve security and monitor and find bugs in an application. It is also used to identify cyber-attacks by monitoring user behavior, since everything a user does in a system is written to a log file.

For this tutorial, the file structure is shown below:

File structure of ELK stack docker configuration
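In plain text, the layout looks roughly like this (reconstructed from the files we create in this tutorial):

.
├── docker-compose.yml
├── filebeat/
│   └── filebeat.yml
├── kibana/
│   └── kibana.yml
├── logstash/
│   ├── logstash.conf
│   └── logstash.yml
└── mylog/
    └── apache-daily-access.log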

Ok, now that we have a basic understanding of the tools we are going to use, let's dive into the configuration of each one. First, download the Apache log dataset from this link and put it inside the mylog folder. To deploy these tools we will use a docker-compose file; one thing to note is that every container image must use the same version (7.17.0 here), otherwise the ELK stack will not run properly. Create a docker-compose.yml file and add the code below.

version: '3.6'
services:

  elasticsearch:
    container_name: elasticsearch
    restart: always
    ports:
      - '9200:9200'
      - '9300:9300'
    ulimits:
      memlock:
        soft: -1
        hard: -1
    environment:
      - bootstrap.memory_lock=true
      - discovery.type=single-node
      - xpack.security.enabled=true
      - xpack.security.authc.api_key.enabled=true
      - ELASTIC_PASSWORD=<elastic_password>
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    image: 'docker.elastic.co/elasticsearch/elasticsearch:7.17.0'
    network_mode: bridge

  kibana-01:
    container_name: kibana-01
    ports:
      - '5601:5601'
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    volumes:
      - ./kibana/kibana.yml:/usr/share/kibana/config/kibana.yml
    image: 'docker.elastic.co/kibana/kibana:7.17.0'
    ulimits:
      memlock:
        soft: -1
        hard: -1
    links:
      - elasticsearch:elasticsearch
    depends_on:
      - elasticsearch
    network_mode: bridge

  logstash:
    image: logstash:7.17.0
    container_name: logstash
    ports:
      - 5044:5044
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always
    environment:
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    volumes:
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml
    links:
      - elasticsearch:elasticsearch
    depends_on:
      - elasticsearch
    network_mode: bridge

  filebeat:
    user: root
    container_name: filebeat-01
    volumes:
      - ./mylog:/usr/share/filebeat/mylog
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: filebeat -e -strict.perms=false
    image: 'docker.elastic.co/beats/filebeat:7.17.0'
    ulimits:
      memlock:
        soft: -1
        hard: -1
    deploy:
      mode: global
    depends_on:
      - logstash
    network_mode: bridge

I will be explaining a few of these configurations so that you will have an idea of what’s happening.

Restart: always => Docker automatically restarts the container whenever it stops or crashes (and when the Docker daemon itself restarts), so the service keeps running without manual intervention.

Ulimits => ulimits limit a program’s resource utilization to prevent a run-away bug or security breach from bringing the whole system down.

Memlock => Controls how much memory the process may lock so that it is never paged out. (A page is a chunk of memory that can be moved from RAM to swap space on disk as a substitute for physical memory.) Setting both the soft and hard limits to -1 allows the container to lock an unlimited amount of memory.

Environment => All the environment variables are declared in this section; whenever the container runs, they are automatically set in its environment.

“Network_mode: bridge” => A bridge network lets containers connected to the same bridge network communicate with each other, while containers that are not on that network remain isolated.

Volumes => File systems mounted into Docker containers to preserve data generated by the running container. They also let users back up data and easily share files between containers and the host machine.

Links => Creates a (legacy) link so that one container can reach another by name over the default bridge network.

Depends_on => Controls the order in which services start (and stop): this service starts only after the services it depends on have been started.

Command => Overrides the default command that runs inside the container when it starts.

“Deploy.mode: global” => The replication model used to run the service. Setting it to global means one task of the service runs on every node (this setting only takes effect when deploying to a swarm).

Let's discuss the Elasticsearch environment variables now.

“bootstrap.memory_lock=true” => Elasticsearch performance can suffer greatly if the node is permitted to swap memory to disk. Setting the bootstrap memory lock to true configures Elasticsearch to prevent memory swapping on its host machine.

“discovery.type=single-node” => The Elasticsearch node chooses to be the master and does not join a cluster with any other node.

“xpack.security.enabled=true” => This enables Elasticsearch's security features, so you have to authenticate with credentials before you can retrieve data from it.

“xpack.security.authc.api_key.enabled=true” => This is needed for configuring alerts in Kibana, which we will not be covering in this tutorial.

“ELASTIC_PASSWORD=xx” => You can set the Elasticsearch password from here.

“ES_JAVA_OPTS=-Xms512m -Xmx512m” => The JVM heap memory for the container; here both the minimum (-Xms) and maximum (-Xmx) are set to 512 MB.

Let's discuss the Filebeat configurations now.

In volumes, “./mylog” is the local folder where my log files are located; it is bind-mounted to “/usr/share/filebeat/mylog” inside the container, so the files become visible at that path in the container.

The “-e” flag sends Filebeat's own log output to stderr instead of syslog/file output, and “-strict.perms=false” disables strict permission checks on the mounted configuration file (useful when filebeat.yml is owned by a different user on the host).

Now let's configure Filebeat so it can read and forward the log files. Create a file named filebeat.yml inside the filebeat folder.

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /usr/share/filebeat/mylog/apache-daily-access.log
    clean_removed: true
    scan_frequency: 5s

output.logstash:
  hosts: ["172.17.0.4:5044"]

Here you have to specify the input type you want Filebeat to read; in my case it is log. enabled: true turns the input on. In the paths section you specify where your logs are stored inside the Filebeat container, i.e. the container path you bind-mounted in the docker-compose Filebeat section. clean_removed is set to true so that the state of files which can no longer be found on disk is removed immediately. scan_frequency is set to 5s, which scans the path for new files every 5 seconds. In output.logstash.hosts you have to specify the Logstash container's network IP (together with port 5044, the Beats port Logstash listens on). To find the IP, first run the Logstash container, then open a terminal and run docker ps to get the Logstash container ID, like below,

Logstash container ID

Then run docker inspect <container_id> and you will get an output like the one below; scroll down to the network section to get the Logstash container's network IP and put that IP into the filebeat.yml file.

Logstash container network IP
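If you prefer not to scroll through the full inspect output, a Go-template format string can print the IP directly. A quick sketch, assuming the container is named logstash as in the compose file:

docker ps --filter "name=logstash"                                                      # confirm the container is running
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' logstash   # prints the container IP, e.g. 172.17.0.4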

Next, create a logstash folder and, inside it, a logstash.yml file to configure the settings of Logstash.

# logstash.yml
http.host: "0.0.0.0"
xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
xpack.monitoring.elasticsearch.username: "xxx"
xpack.monitoring.elasticsearch.password: "xxx"
pipeline.ecs_compatibility: disabled
config.reload.automatic: true

pipeline.workers: 6
xpack.monitoring.enabled: true
pipeline.batch.size: 6096

http.host: "0.0.0.0" => The host/IP the Logstash HTTP API binds to; 0.0.0.0 means any network interface.

xpack.monitoring.elasticsearch.hosts => The address Elasticsearch is running on. Here I have used the container name, but you can use the container's network IP as well. You also have to provide the Elasticsearch credentials (username and password) here so Logstash can send data to it.

pipeline.ecs_compatibility => ECS, in other words the Elastic Common Schema, is an open-source specification developed with support from the Elastic user community. ECS defines a common set of fields for storing event data, such as logs and metrics. In this case we have disabled it.

pipeline.workers => The number of workers that execute the filter and output stages of the pipeline in parallel. You can raise this above 6 for better performance, depending on your CPU.

xpack.monitoring.enabled => Set to true to collect and monitor Logstash pipeline metrics.

pipeline.batch.size => Controls the number of events passed from the inputs to the filter and output sections of Logstash in each batch.

Next, let's look at the Kibana settings. Create a folder named kibana and inside that folder create a kibana.yml file with the following settings.

## Default Kibana configuration from Kibana base image.
## https://github.com/elastic/kibana/blob/master/src/dev/build/tasks/os_packages/docker_generator/templates/kibana_yml.template.js
server.name: kibana-01
server.host: "0.0.0.0"
elasticsearch.hosts: [ "http://elasticsearch:9200" ]
xpack.monitoring.ui.container.elasticsearch.enabled: true

elasticsearch.username: elastic
elasticsearch.password: elastic

Here as well you have to specify the Elasticsearch credentials so that Kibana can send requests to it and retrieve data. The rest is the same as explained above.

Finally, let's look at the preprocessing part of Logstash, where we will configure the filtering. Create a logstash.conf file inside the logstash folder and put in the code below; I will explain it afterward.

input {
  beats {
    port => 5044   # the port Filebeat ships events to
  }
}

## If you want to read gzip files, comment out the beats input above and uncomment the file input below
# input {
#   file {
#     type => "gzip"
#     path => "/usr/share/logstash/mylog/456_cdr.*.gz"
#     mode => "read"
#     # start_position => "beginning"
#     file_completed_action => "log"
#     file_completed_log_path => "/usr/share/logstash/mylog/processed.log"
#     sincedb_path => "/tmp/gzip.db"
#     # codec => "gzip_lines"
#   }
# }

filter {
  grok {
    match => {
      "message" => ['%{IPORHOST:remote_ip} - %{DATA:user_name} \[%{MONTHDAY:date}/%{MONTH:month}/%{YEAR:year}:%{DATA:time} +%{NUMBER:number}\] \"%{WORD:http_method} %{DATA:url} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\"']
    }
  }
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
    target => "@timestamp"
  }
  geoip {
    source => "remote_ip"
    target => "geoip"
    # add_field => [ "[location][lon]", "%{[geoip][longitude]}" ]
    # add_field => [ "[location][lat]", "%{[geoip][latitude]}" ]
  }
  mutate {
    add_field => {
      "[location][lat]" => "%{[geoip][latitude]}"
      "[location][lon]" => "%{[geoip][longitude]}"
    }
  }
}

output {
  elasticsearch {
    hosts => "172.17.0.2:9200"
    user => "elastic"
    password => "elastic"
    index => "my_test_index"
  }
}

First, Logstash has to receive the data that Filebeat ships, and that happens in the input block: the beats input listens on port 5044, the port Filebeat sends events to. If you want to read gzip files then, as explained previously, you cannot read them with Filebeat; they have to be read by Logstash itself, which is why I have left that code commented out for you. If you need it, uncomment it and comment out the beats input (and the Filebeat service).

The incoming data is processed inside the filter block. The grok filter is where you break each raw log line down into multiple named fields. You can check whether your grok patterns are correct by using the Grok Debugger online tool.

Here, match tells grok where to look (the message field) and which pattern to apply. You can refer to this link for the pattern types that are available. When you write your pattern, keep the original log line next to it and wrap the pattern in “” or ‘’. The pattern must mirror the message exactly, including the whitespace and symbols, otherwise it will not match. To extract a specific piece of data and assign it to a field we use %{<pattern_type>:<field_name>}. A backslash (\) escapes literal characters such as [, ] and " so they are matched as-is rather than being interpreted as regex syntax. If you do not want to assign a particular piece of data to a field, you can just use %{DATA} without a field name.

Comparison of the log line and the pattern

In the figure above I have highlighted how the pattern lines up with the log line. Notice that the spaces and the literal characters “[”, “+” and “]” appear in the pattern purely for matching; they are not assigned to any field.
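To make this concrete, here is a hypothetical Apache access-log line in the format the pattern expects, followed by (roughly) the fields grok would extract from it; the values are made up for illustration:

83.149.9.216 - - [04/Dec/2022:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 3267 "http://example.com/start" "Mozilla/5.0"

# Extracted fields (approximately):
#   remote_ip       => 83.149.9.216
#   user_name       => -
#   date/month/year => 04 / Dec / 2022
#   time            => 10:15:32
#   number          => +0000   (the timezone offset)
#   http_method     => GET
#   url             => /index.html
#   http_version    => 1.1
#   response_code   => 200
#   body_sent_bytes => 3267
#   referrer        => http://example.com/start
#   agent           => Mozilla/5.0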

The date filter parses a date from a field and uses it as the Logstash timestamp for the event. Its match option is also a pattern: it names the source field and the date format to parse it with, and the parsed value is written to the field given in target, here @timestamp.

Next is geoip. If the log line contains a public IP address, this filter enriches the event with information about where the request came from. This is a valuable feature, since it provides location details such as city, state, country and postal code, as well as latitude and longitude values, to name a few. You just have to specify the field that holds the extracted IP; in my case it is remote_ip.
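For a public IP address, the enriched event ends up with a geoip object roughly along these lines (the field values are purely illustrative):

"geoip" => {
    "country_name" => "Russia",
    "city_name"    => "Moscow",
    "postal_code"  => "129223",
    "latitude"     => 55.7527,
    "longitude"    => 37.6172,
    "timezone"     => "Europe/Moscow"
}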

Next is mutate, which you can use to change existing fields or create new ones. For example, I have used the add_field option to take the latitude and longitude produced by the geoip filter and assign them to a new location field. You cannot capture these in the grok pattern, because the fields only exist after the geoip filter has run. If you want to copy one field into another field, you can also use the copy option inside mutate, as shown below.

mutate {
  copy => {
    "[geoip][longitude]" => "[location][lon]"   # <source_field> => <destination_field>
  }
}

If you want to convert a field from one data type to another, you can use the convert option inside the mutate block.

mutate {
  convert => [ "[location][lat]", "float" ]
  convert => [ "[location][lon]", "float" ]
  convert => [ "[body_sent_bytes]", "integer" ]
}

Finally, after the filtering is done, Logstash sends the processed data to Elasticsearch. In the output block, hosts must be the Elasticsearch container's network IP and the port it is running on. Then come the Elasticsearch username and password; if these are wrong, the data cannot be sent. Last is the index name, which you can loosely think of as a database name; your filtered data will be stored in this index.
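Later, once all four containers are up and data is flowing, a quick way to confirm that documents are actually arriving in the index is to query Elasticsearch from the host; a sketch (substitute your own password):

curl -u elastic:<elastic_password> "http://localhost:9200/my_test_index/_count"                  # number of documents indexed so far
curl -u elastic:<elastic_password> "http://localhost:9200/my_test_index/_search?size=1&pretty"   # look at one parsed event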

After everything is in place, and before you run all four containers, first comment out the Filebeat and Logstash services in the docker-compose.yml file. The reason is that if you want to use the Map visualization in Kibana to display data points, you first have to create a mapping in your index. So start only Elasticsearch and Kibana: open a terminal, go to the folder containing the docker-compose file, and run,

docker-compose up

Give it a few minutes for the two containers to start. Then open your browser and go to localhost:5601, which is the Kibana web page, log in with your Elasticsearch username and password, and go to Dev Tools.

In the Dev Tools console, paste the command below:

PUT my_test_index
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
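You can confirm the mapping took effect with a follow-up request in the same Dev Tools console; the response should show the location property with type geo_point:

GET my_test_index/_mapping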

At first the index has no data; you are only creating it and adding a mapping so that the lat/lon values are recognized as the geo_point data type, which is what allows Kibana to plot the data points on the Map visualization. Once this is complete, go back to your docker-compose.yml file, uncomment the Logstash and Filebeat services, and run the command below again.

docker-compose up

When you run this, only those two containers will start; the two that are already running are not affected. Give it a few minutes, then open the Kibana web view and go to the Discover tab.

After that you will see the window below,

Extracted and preprocessed logs from Logstash

From here you can add the fields you want to analyze by clicking the “+” icon that appears when you hover over a field name, like below,

Selecting relevant fields for further analysis

So this is it for my tutorial on configuring the ELK stack with Docker containers. I hope you now understand the theory behind the tools and the settings we used to deploy it. In future posts I will focus on Elasticsearch queries as well as how to enable the SSL security feature in this pipeline. Have a blessed day, everyone.
