Optimizing Log Management with a Hybrid Loki-Fluentd Infrastructure on AWS

Alberto C.
Nethive Engineering
Dec 7, 2023

Streamlining Costs in Container Logging & Log Visualization

Context: While addressing the challenges presented by our client’s production environment, we carried out an in-depth analysis of the architecture to design an optimal solution for monitoring and archiving logs from an entire network of Docker containers running on a Linux platform. This preliminary assessment was crucial, as we wanted a solution that was not only efficient but also met specific essential criteria:

  • Secure Transmission: It was mandatory that data was transferred using SSL encryption, ensuring the integrity and confidentiality of information during its transit.
  • Physical Separation: The logged data had to be stored in a location geographically separate from the production Data Center, providing an added layer of security and redundancy.
  • Stability and Cost-Effectiveness: Equally important, the solution had to ensure consistent and reliable performance without excessively burdening the budget.

Solution: How to Tackle the Challenge?

In the overview we examined, we encountered:

  • Virtual Machines: A substantial network of operational virtual machines (VMs).
  • Docker Containers: A vast fleet of Docker containers, each with its unique requirements and workloads.
  • Data Volume: A massive influx of information, with tens of thousands of logs generated every minute.

With these elements on the table, the question becomes: how to proceed effectively?

Data Transmission

  • In considering options for data transmission, we initially leaned towards Fluentd. This mature and highly performant application, written in Ruby, seemed the ideal choice. However, to tailor it to the nuances of our environment, densely populated by containers, we integrated the “fluent-plugin-docker” plugin.

With the data transport mechanism established, a pivotal question arose: where to direct this data?

  • The immediate solution might seem to be sending the data directly from each individual machine to a node in our Data Center. But, upon reflection, this strategy doesn’t appear as the most efficient. Indeed, it’s crucial to maintain meticulous control over the direction and volume of data flows.

Our Proposal: Given the complexities of managing disparate data flows, we proposed consolidating all data streams onto a single platform. By doing this, we could dramatically reduce the traffic directed towards the Data Center by focusing on one efficient communication channel. This centralized approach offers numerous benefits:

  • Efficiency: By streamlining data into a single channel, we can reduce redundancy, minimize the risk of data loss, and ensure faster data processing.
  • Manageability: Centralization simplifies monitoring, troubleshooting, and data management, making it easier for IT teams to oversee and control the data flow.
  • Security: Limiting the data paths to a single point reduces vulnerabilities and provides a more secure data transit environment.
  • Cost-effectiveness: Reducing the number of data streams can lead to savings in both infrastructure costs and operational overheads.

In essence, this strategy not only improves the efficiency of data transmission but also optimizes the overall operational processes.


Hypothesis and Architecture Definition

  • At first glance, the solution might seem straightforward: since the provider has its own Data Center, the problem should already be resolved. However, the situation is slightly more nuanced.
  • If the provider had chosen to use their Data Center, they would have had to expose a public address and a specific port for data transfer. This would create a potential security vulnerability for the Data Center. Beyond that, for legal reasons, the data could not be stored there.

So, here emerges our definitive solution: the Cloud!

Transferring data to a cloud platform like AWS turned out to be the ideal approach. This would allow us to access the data in read-only mode from the provider’s Data Center and then display such data through a Grafana server with a public address.

Diagram: Final Architectural Design

Let’s start configuration!

Data Forwarding: To forward data from the Docker machines to the aggregator machine (“proxy”), proceed with the following installation and configuration steps.

curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-jammy-td-agent4.sh | sh
apt update
apt upgrade
apt install fluentd-apt-source
sudo apt install td-agent
cat <<'EOF' > /etc/td-agent/td-agent.conf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
  tag docker
</source>
<match *.**>
  @type forward
  <server>
    name server-proxy
    host server-proxy
    port 24224
  </server>
</match>
<match **>
  @type stdout
  <format>
    @type json
  </format>
</match>
EOF
systemctl start td-agent.service
systemctl enable td-agent.service

Important Note: While starting each Docker container, it’s crucial to add specific configuration strings. This enables the collection of container logs and their transmission to localhost, where Fluentd is ready to receive them.

--log-driver=fluentd \
--log-opt tag="docker.{{.Name}}-{{.ID}}" \
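
For reference, here is a minimal sketch of a container start command with these options. The image and container name my-app is just a placeholder, and the fluentd-address option is shown only for clarity, since the Fluentd log driver defaults to localhost:24224 anyway:

docker run -d \
  --name my-app \
  --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt tag="docker.{{.Name}}-{{.ID}}" \
  my-app:latest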

Server-Proxy Configuration:

Role of the Server Proxy
The core function of this server is to act as an intermediary for data collection and aggregation. It receives data from various container servers and subsequently funnels them into a unified stream directed towards the AWS server.

The process occurs as follows:

  • The data is transmitted to the Fluentd server through port 24224.
  • Fluentd filters and collects all logs tagged with “docker.”
  • Once aggregated, this data is forwarded to localhost on port 3100.
  • At this point, the Loki Docker container is waiting and ready to receive and process the forwarded logs.

Graphical representation of the data flow process
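
One prerequisite worth noting: the Loki output used in the configuration below (@type loki) is not bundled with td-agent. It is provided by the fluent-plugin-grafana-loki gem, which (assuming td-agent 4) can be installed with:

td-agent-gem install fluent-plugin-grafana-loki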

curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-jammy-td-agent4.sh | sh
apt update
apt upgrade
apt install fluentd-apt-source
sudo apt install td-agent
cat <<'EOF' > /etc/td-agent/td-agent.conf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<filter docker.**>
  @type parser
  key_name log
  format json
  reserve_data true
  time_key time
  time_format %Y-%m-%dT%H:%M:%S%z
  reserve_time true
</filter>
<filter **>
  @type record_transformer
  enable_ruby
  <record>
    file ${record['file']}
    func ${record['func']}
    level ${record['level']}
    msg ${record['msg']}
    params ${record['params']}
    time_stamp ${time.strftime('%Y-%m-%dT%H:%M:%S%z')}
  </record>
  remove_keys "FlowSet ID"
</filter>
<match docker.**>
  @type loki
  url "http://localhost:3100"
  tenant "docker"
  flush_interval 10s
  flush_at_shutdown true
  buffer_chunk_limit 2m
  buffer_queue_limit 30
  <label>
    source
    container_name
  </label>
</match>
<match **>
  @type stdout
</match>
EOF
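
After writing the configuration, restart the service so td-agent picks up the new file, and enable it at boot, just as on the container hosts:

systemctl restart td-agent.service
systemctl enable td-agent.service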

AWS Cloud Preparation

Definition of Amazon S3 Bucket: Amazon S3, which stands for Simple Storage Service, is one of the most renowned storage services provided by Amazon Web Services (AWS). This service serves as object storage, making it ideal for storing files of all sizes and types, ranging from small notes to large-scale videos or entire database backups.

AWS S3 Configuration Guide

  • Log in to the AWS console with administrator privileges.

S3 Bucket creation on AWS

  • In the AWS console, use the search menu to find “S3” and click on the “S3” option that appears in the results.
  • Once on the S3 page, click the “Create bucket” button.
  • In the “Bucket name” box, enter “bucket-s3-aws”.
  • Select the desired AWS region. For example, for Europe (Frankfurt), choose “eu-central-1”.
  • Keep all other settings as default and click “Create bucket”.
  • Once the bucket is created, create (or reuse) an IAM user with read/write access to the bucket, and save its access_key_id and secret_access_key; Loki will use these credentials to write to S3.
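
For those who prefer the command line, a rough equivalent with the AWS CLI might look like the following; the user name loki-writer is illustrative, and an IAM policy granting that user access to the bucket still has to be attached separately:

aws s3api create-bucket --bucket bucket-s3-aws \
  --region eu-central-1 \
  --create-bucket-configuration LocationConstraint=eu-central-1
aws iam create-user --user-name loki-writer
aws iam create-access-key --user-name loki-writer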

Loki-writer Container configuration

Loki-Writer Installation:

The proxy server will host the “loki-write” container, which will be configured to listen on localhost at port 3100. This container will be responsible for receiving, aggregating, and subsequently sending logs to the AWS S3 bucket.
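
Assuming the working directories do not exist yet, create them before writing the configuration file:

mkdir -p /loki-write/config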

cat <<'EOF' > /loki-write/config/loki.yaml
auth_enabled: true
http_prefix:
server:
  http_listen_address: 0.0.0.0
  http_listen_port: 3100
  log_level: info
memberlist:
  join_members: ["localhost"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
  gossip_interval: 2s
ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 5m
  wal:
    enabled: true
    dir: /loki/wal
  max_chunk_age: 10m
  chunk_encoding: snappy
  flush_op_timeout: 10s
schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    bucketnames: bucket-s3-aws
    region: eu-central-1
    access_key_id: '******'       # insert AWS credentials
    secret_access_key: '******'   # insert AWS credentials
    s3forcepathstyle: true
limits_config:
  enforce_metric_name: false
  ingestion_rate_mb: 2000
  ingestion_burst_size_mb: 200
  split_queries_by_interval: 15m
chunk_store_config:
  max_look_back_period: 14d
table_manager:
  retention_deletes_enabled: true
  retention_period: 14d
query_range:
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 1024
querier:
  query_ingesters_within: 2h
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 5m
EOF

Creating the run script for the Docker container

  • The sole function of this container is to write logs to AWS.

cat <<'EOF' > /loki-write/run.sh
#!/bin/bash
docker container rm --force loki-write
docker run -d --name loki-write \
-p 3100:3100 \
-p 7946:7946 \
-v /loki-write/config/:/etc/loki \
--restart always \
grafana/loki:2.8.2 \
-config.file=/etc/loki/loki.yaml \
-target=write \
-config.expand-env=true
EOF

  • Start the Docker container to begin forwarding logs to AWS:

sh /loki-write/run.sh

If you don’t encounter any issues during the Docker container startup phase, you will see three new items in your S3 bucket: two folders named “docker” and “index,” along with a file named “loki_cluster_seed.json.” Well done! You’re halfway through the entire process.
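
A couple of quick checks can confirm that the writer is healthy; the /ready endpoint is exposed by Loki itself, and the last command assumes the AWS CLI is installed and configured with the same credentials:

curl http://localhost:3100/ready
docker logs loki-write --tail 50
aws s3 ls s3://bucket-s3-aws/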

Data Reading on AWS via Loki-Reader Server

The Reading Process

  • To perform data reading, a series of services need to be started.
    These include nginx, loki-read, the AWS S3 backend, and the Grafana Loki datasource.
  • To streamline and optimize the configuration, we’ve opted to use “docker-compose.”

Installing Docker Compose

#!/bin/bash
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
sudo groupadd docker
sudo usermod -aG docker $USER
sudo systemctl enable docker
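
A quick sanity check that both the engine and the Compose plugin are available (you may need to log out and back in for the docker group membership to take effect):

docker --version
docker compose version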

Creating the docker-compose.yaml File for Loki-Reader

  • To configure the loki-reader server, you need to create a file named docker-compose.yaml. This file will outline the required services and their specific configurations.
cat <<'EOF' > /loki/docker-compose.yaml
version: "3.8"

networks:
  loki:

volumes:
  grafana:

services:
  init:
    image: grafana/loki:2.8.2
    user: root
    entrypoint:
      - "chown"
      - "10001:10001"
      - "/loki"
    volumes:
      - ./loki:/loki
    networks:
      - loki

  loki-gateway:
    image: nginx:1.19
    volumes:
      - ./config/nginx.conf:/etc/nginx/nginx.conf
      - ./config/certs:/etc/loki/certs:ro
      - ./config/pw:/etc/loki/pw:ro
    ports:
      - "8080:80"
      - "3100:3100"
    networks:
      - loki

  loki-read:
    image: grafana/loki:2.8.2
    volumes:
      - ./config:/etc/loki/
      - ./config/certs:/etc/loki/certs/
      - ./rules:/loki/rules:ro
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki.yaml -target=read -config.expand-env=true"
    networks:
      - loki
    restart: always
    deploy:
      mode: replicated
      replicas: 3

  loki-write:
    image: grafana/loki:2.8.2
    volumes:
      - ./config:/etc/loki/
      - ./config/certs:/etc/loki/certs/
    ports:
      - "3100"
      - "7946"
    command: "-config.file=/etc/loki/loki.yaml -target=write -config.expand-env=true"
    networks:
      - loki
    restart: always
    deploy:
      mode: replicated
      replicas: 3
EOF

Creating the datasources.yaml File

cat <<'EOF' > /loki/config/datasources.yaml
apiVersion: 1
datasources:
  - access: proxy
    basicAuth: false
    jsonData:
      httpHeaderName1: "X-Scope-OrgID"
    secureJsonData:
      httpHeaderValue1: "docker"
    editable: true
    isDefault: true
    name: loki
    type: loki
    uid: loki
    url: https://loki-gateway
    version: 1
EOF

Creating the loki.yaml File

cat <<'EOF' > /loki/config/loki.yaml
auth_enabled: true
http_prefix:
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info
common:
  compactor_address: https://loki-write:3100
memberlist:
  join_members: ["loki-read"]
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
  gossip_interval: 2s
ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
    ring:
      replication_factor: 3
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1m
  wal:
    enabled: true
    dir: /loki/wal
  max_chunk_age: 1m
  chunk_retain_period: 30s
  chunk_encoding: snappy
  flush_op_timeout: 10s
schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    bucketnames: bucket-s3-aws    # must match the bucket the proxy's loki-write ships to
    region: eu-central-1
    access_key_id: '******'       # insert AWS credentials
    secret_access_key: '******'   # insert AWS credentials
    s3forcepathstyle: true
limits_config:
  enforce_metric_name: false
  ingestion_rate_mb: 2000
  ingestion_burst_size_mb: 20
  split_queries_by_interval: 15m
chunk_store_config:
  max_look_back_period: 14d
table_manager:
  retention_deletes_enabled: true
  retention_period: 14d
query_range:
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 1024
querier:
  query_ingesters_within: 2h
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 5m
EOF

Creating the nginx.conf File

cat <<'EOF' > /loki/config/nginx.conf
error_log /dev/stderr;
pid /tmp/nginx.pid;
worker_rlimit_nofile 8192;

events {
  worker_connections 4096; ## Default: 1024
}

http {
  default_type application/octet-stream;
  log_format main '$remote_addr - $remote_user [$time_local] $status '
                  '"$request" $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';
  access_log /dev/stderr main;
  sendfile on;
  tcp_nopush on;

  upstream read {
    server loki-read:3100;
  }
  upstream write {
    server loki-write:3100;
  }
  upstream cluster {
    server loki-read:3100;
    server loki-write:3100;
  }

  server {
    listen 80;
    listen 3100 ssl;
    server_name server-loki;
    auth_basic "loki auth";
    auth_basic_user_file /etc/loki/pw/pwd;

    location / {
      proxy_read_timeout 1800s;
      proxy_connect_timeout 1600s;
      proxy_pass https://server-loki;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
      proxy_redirect off;
    }
    location /ready {
      proxy_pass https://server-loki;
      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
      proxy_redirect off;
      auth_basic "off";
    }

    ssl_certificate /etc/loki/certs/server.crt;
    ssl_certificate_key /etc/loki/certs/server.key;
    ssl_protocols TLSv1.2 TLSv1.1 TLSv1;
    proxy_ssl_verify on;
    proxy_ssl_trusted_certificate /etc/loki/certs/ca.pem;

    location = /ring {
      proxy_pass http://cluster$request_uri;
    }
    location = /memberlist {
      proxy_pass http://cluster$request_uri;
    }
    location = /config {
      proxy_pass http://cluster$request_uri;
    }
    location = /loki/metrics {
      proxy_pass http://cluster$request_uri;
    }
    location = /ready {
      proxy_pass http://cluster$request_uri;
    }
    location = /loki/api/v1/push {
      proxy_pass http://write$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass http://read$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass http://read$request_uri;
    }
  }
}
EOF

SSL Certificate Management (optional)

To ensure communication security, it’s necessary to place SSL certificates in the designated path

  • /loki/config/certs
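
If you don’t already have certificates from an internal CA, a self-signed set matching the file names used above can be generated with openssl. This is only a sketch: the CN/SAN value server-loki must match the server_name configured in nginx and the server name you later enter in Grafana:

mkdir -p /loki/config/certs && cd /loki/config/certs
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout ca.key -out ca.pem -subj "/CN=loki-ca"
openssl req -newkey rsa:4096 -nodes \
  -keyout server.key -out server.csr \
  -subj "/CN=server-loki" -addext "subjectAltName=DNS:server-loki"
printf "subjectAltName=DNS:server-loki\n" > san.cnf
openssl x509 -req -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
  -days 365 -out server.crt -extfile san.cnf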

Authentication Configuration for Nginx

If you intend to secure access with authentication, you need to create a file named “pwd” containing the user credentials:

  • /loki/config/pw
  • Inside this folder, create a file named “pwd”.
  • In the file, enter your credentials in the htpasswd format (nginx expects the password to be hashed, not stored in plain text):

user:hashed_password

Remember to replace “user” and the password hash with your actual credentials.
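
A minimal sketch for generating the file, assuming the apr1 (htpasswd-compatible) scheme and a user named admin:

mkdir -p /loki/config/pw
printf 'admin:%s\n' "$(openssl passwd -apr1 'your-password')" > /loki/config/pw/pwd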

Summary of Required Files: To ensure proper configuration, verify the presence of the following files in their respective folders:

  • /loki/docker-compose.yaml
  • /loki/config/datasources.yaml
  • /loki/config/loki.yaml
  • /loki/config/nginx.conf
  • /loki/config/certs/ca.pem
  • /loki/config/certs/ca.key
  • /loki/config/certs/server.crt
  • /loki/config/certs/server.key
  • /loki/config/pw/pwd

Starting docker-compose of Loki Reader

Go to the /loki folder and launch the loki-reader stack with the following command:

docker compose up -d
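
Once the stack is up, it is worth checking that all the containers are healthy, for example from the /loki folder:

docker compose ps
docker compose logs loki-read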

Reading Logs on Grafana

Starting Grafana Server

  • From your browser, access the Grafana console.

Datasource Configuration:

  • Once you’re in the Grafana interface, locate the gear icon (settings) and click on it.
  • From the side menu, select the option to add a new datasource.
  • In the list of available datasource types, choose Loki.
  • In the “URL” field, input the address of your proxy server: https://server-proxy:3100.

If you have configured your connection to use security certificates:

  • Enable the options: “Basic Auth”, “TLS Client Auth”, and “With CA Cert”.
  • In the Basic Auth Details section, input the credentials (username and password) you defined in the pwd file.
  • In the TLS/SSL Auth Details section, enter the server name. Ensure that this name matches what’s defined in the FQDN or SAN field of the certificate.

In the Custom HTTP Headers section:

  • Add a new header named X-Scope-OrgID.
  • Set its value to “docker,” in accordance with what’s defined in the datasources.yaml file.
  • Click the Save button to confirm the changes.

Visualization of Logs on Grafana

Grafana Dashboard

  • Start Grafana and access your dashboard in the browser.

Log Exploration

  • Search and click on the “Explore” option in the top bar.
  • Start filtering the logs by clicking on the “label filter” bar.

Select the logs

  • In the dropdown menu that appears, select the desired value to specifically filter your logs.
  • If you see a preset filter like “line contains” you can remove it by clicking the “X” next to it.

Formatting Logs

  • Once you’ve set the desired filters, navigate to “Operations”.
  • From the menu that appears, move to “Formats”.
  • Here, select the “LogFmt” option to view the logs in a specific format.

Select Time Range

  • In the top right corner, you’ll find a time range selector.
  • Select the desired interval, for example, “1h”, to display logs from the last hour.
  • Click on “Run Query” to initiate the search and display the logs based on the selected filters and options.

Quick Search for Specific Values

  • For quickly searching a specific term, like “error,” use the “line contains” function. Enter the term you want to search for to filter the results accordingly.
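
Behind the scenes, the Explore builder produces a LogQL query. A hand-written equivalent of the steps above, assuming the container_name label exported by the proxy configuration and a hypothetical container named my-app, looks roughly like this:

{container_name="my-app"} |= "error" | logfmt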

Considerations & Conclusion

The adoption of Grafana for log visualization has proven its effectiveness, particularly when dealing with logs exported from Docker in JSON format. Thanks to this key-value conformity, data presentation becomes immediate and clear. However, it’s crucial to bear in mind that data doesn’t always come in such a user-friendly format as JSON.

When encountering logs that don’t adhere to this format, all hope is not lost. There are several approaches to address this inconvenience. One solution could involve adapting the data formatting at the source itself, by modifying Fluentd or the server configurations responsible for sending logs to the proxy. An alternative, which might require a bit more effort but offers greater flexibility, involves direct intervention in the configuration of the proxy itself. This can be achieved by implementing parsing based on regular expressions to transform the received data into the desired format.

In summary, even though challenges might arise in the process of centralizing and visualizing logs, there are tools and methods available to overcome them, ensuring accurate and timely analysis of information.

Example:

If we were to encounter a non-JSON log like the one below:

[2013-02-28 12:00:00 +0900] alice engineer 1

<parse>
  @type regexp
  expression /^\[(?<logtime>[^\]]*)\] (?<name>[^ ]*) (?<title>[^ ]*) (?<id>\d*)$/
  time_key logtime
  time_format %Y-%m-%d %H:%M:%S %z
  types id:integer
</parse>

Through the configuration mentioned above, the result will be:

time:
1362020400 (2013-02-28 12:00:00 +0900)
record:
{
  "name" : "alice",
  "title": "engineer",
  "id"   : 1
}

The interplay between Fluentd, AWS, and Grafana has not only showcased technological synergy but has also underscored how modern engineering can address intricate requirements with targeted and sustainable solutions. This combination has enabled us to monitor our container operations in real time while upholding high-level security standards.

What makes this achievement particularly remarkable, however, is the cost efficiency achieved: an example of how innovative solutions can reconcile resource optimization and cost containment. In an era of rapid digital evolution, this experience reminds us of a fundamental truth: with the right strategies and a willingness to embrace cutting-edge technologies, secure, detailed, and economically sustainable information management is attainable.

References and Sources

Information on Fluentd filters and plugins was sourced from the following official documents:

https://docs.fluentd.org/parser/regexp

https://docs.fluentd.org/parser/json

For the correction and optimization of the JSON configuration, I relied on the following article:

“Fluentd: Add log path to the record”, by Marta Tatiana.

Grafana Loki Documentation:

https://grafana.com/docs/loki/latest/configuration/examples/
