Improve observability by adding logs to Grafana on a Kubernetes cluster

I have previously written about monitoring and observability of your Kubernetes cluster using Grafana and Prometheus but this focussed on metrics and left out a key component of monitoring — logs. In this article I address this by adding logs to Grafana using Promtail and Grafana Loki.

Martin Hodges
17 min read · Jan 15, 2024
Monitoring logs

Logs versus Metrics

Previously I have written about how to measure, monitor and alert on metrics. Whilst metrics are good for showing resource usage and identifying potential problems, that is only part of the picture.

The other part is logs. A log is a time-ordered list of events that shows what happened. We use logs extensively to help identify the cause of problems, including those identified through the metrics and alerts.

Because logs are time ordered, it is very important that logged events from across your servers can be correlated, compared and sequenced. This requires centralised monitoring of all logs, and that is the subject of this article.

Grafana Loki

Just as Prometheus collects metrics, Grafana Loki collects, collates and stores logs in a way that Grafana can query and display. So, to see our logs in Grafana we need Grafana Loki, or just Loki for short.

Loki is deployed from a Helm chart like Grafana. You can see how to install Helm here.

Persistent storage

Logs are a continuous stream of events. When Loki receives these logs it must store them for you to query and display at a later date. It breaks the logs down into two parts:

  • The actual text of the log line
  • The metadata and labels that define the log

Unlike other solutions, Loki does not index the log line itself, only the metadata and labels associated with the log and/or its source. It chunks up and compresses the logs separately. This means that Loki requires two stores when it comes to persistent storage:

  • A filesystem for the log indexes
  • An object store for the chunks

To be fair, Loki can be deployed in a number of ways, some of which only use a filesystem to store both.

For the purposes of this article, so we can see how Loki works, we will use a filesystem Persistent Volume (PV) and an object store based on MinIO.

MinIO

If you are not familiar with MinIO, you can read about it in my article here, which includes installation and configuration instructions.

The rest of this article assumes you have a MinIO (or S3) object store service available to use.

Note, the Loki Helm chart includes a MinIO implementation that you can enable. If you do this, your MinIO data will be spread across your cluster. If you want more control over where your data ends up, you should have a separate MinIO instance, which is what I do in this article.

Persistent Volume

Rather than creating our PVs by hand, we will use dynamic provisioning. With dynamic provisioning, when a PVC gets created, the dynamic provisioner will automatically create a PV for us and bind the PVC to the PV. This prevents the need to keep creating PVs manually.

First we will create a share for our dynamic provisioning. On the nfs-server, as root, add a new shared folder to export.

/etc/exports

/pv-share *(rw,async,no_subtree_check)
/pv-share/grafana *(rw,async,no_subtree_check)
/pv-share/prometheus *(rw,async,no_subtree_check)
/pv-share/auto *(rw,async,no_subtree_check)

Your exports file may have other or different entries. What is important is the additional auto path.

You also need to create the new folder and restart the NFS service:

mkdir /pv-share/auto
systemctl restart nfs-server
systemctl status nfs-server

Now, log back into your k8s-master server (I am assuming this is where you are running kubectl to access your cluster. If not, use whatever access you typically use to deploy to your cluster). We will now deploy the dynamic provisioner.

We will use Helm to manage this process. If you have not installed Helm previously, you can do as follows:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

Now add the required Helm repository:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner

Now install the provisioner into the monitoring namespace (replace the < > fields with your own values):

helm install -n monitoring --create-namespace nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=<nfs IP address> --set nfs.path=/pv-share/auto

You should now have the ability to create a PV automatically given a Persistent Volume Claim (PVC). To do this, create the following file:

test-pvc.yml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: monitoring
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

The important bit about this PVC is that the storageClassName is set to nfs-client. This is what the provisioner looks for and tells it to create a PV. It also means that the provisioner can live side-by-side with your manually created PVs.
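
If you want to confirm the provisioner is ready before creating the claim, you can list the storage classes (nfs-client is the chart's default class name):

kubectl get storageclass

You should see an nfs-client entry backed by the NFS provisioner you just installed.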

Now apply the test PVC and look at the result with:

kubectl create -f test-pvc.yml
kubectl get pv

This should show you the PV it created:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS       REASON   AGE
...
pvc-6bc3785b-3fcf-4f18-9b4a-2aed29c9f282 1Gi RWX Delete Bound monitoring/test-pvc nfs-client 36s
...

So, how do you know that this is the one? Check the PVC you created:

kubectl get pvc -n monitoring

You should now see:

NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
...
test-pvc Bound pvc-6bc3785b-3fcf-4f18-9b4a-2aed29c9f282 1Gi RWX nfs-client 15s
...

There you can see the volume name we saw earlier.

Congratulations, you now have a dynamic PV provisioner that we can use with Loki.

Install Loki

First check that you have the right repository installed. Execute the following from wherever you run kubectl. I do this from k8s-master.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

You can look for all the different Helm charts using:

helm search repo loki

This will give something like:

NAME                         CHART VERSION APP VERSION DESCRIPTION                                       
grafana/loki 5.41.5 2.9.3 Helm chart for Grafana Loki in simple, scalable...
grafana/loki-canary 0.14.0 2.9.1 Helm chart for Grafana Loki Canary
grafana/loki-distributed 0.78.0 2.9.2 Helm chart for Grafana Loki in microservices mode
grafana/loki-simple-scalable 1.8.11 2.6.1 Helm chart for Grafana Loki in simple, scalable...
grafana/loki-stack 2.9.12 v2.6.1 Loki: like Prometheus, but for logs.
grafana/fluent-bit 2.6.0 v2.1.0 Uses fluent-bit Loki go plugin for gathering lo...
grafana/lgtm-distributed 1.0.0 6.59.4 Umbrella chart for a distributed Loki, Grafana,...
grafana/promtail 6.15.3 2.9.2 Promtail is an agent which ships the contents o...

That is a lot to choose from!

There are 3 ways (or strategies) to install Loki:

  1. Monolith: a single pod that does not support scalability but is ok for gigabytes of logs a day
  2. Simple Scalable: a scalable solution that can scale parts of the solution (read, write, backend)
  3. Microservice: a fully scalable solution where each component can be scaled, allowing terabytes of logs to be processed each day

Monolith is the simplest but most restricted deployment and only uses filesystem storage. Microservice is the most complex to deploy. As we want to use MinIO, the strategy we will use is the Simple Scalable solution.

We will install Loki alongside Grafana and Prometheus in the monitoring namespace using the grafana/loki chart, which deploys the Simple Scalable mode by default and will suit most logging requirements. By using a Helm chart, Loki will be automatically plumbed in to Kubernetes and will be able to access its log streams without further configuration.

When using Helm charts, you can override values by defining them in a values file. You can get a list of values with this:

helm show values grafana/loki

This will show a very long list of options, but the set of values we need to override is much smaller.
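
If you want to keep the full default list for reference, you can dump it to a file and search it for the keys we are about to override. This is just a convenience and the file name is my own choice:

helm show values grafana/loki > loki-default-values.yml
grep -n -E 'storage|replicas|storageClass|nodePort' loki-default-values.yml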

Create the following file (remember to replace < > fields with your own values):

loki-values.yml

loki:
  auth_enabled: false
  storage:
    filesystem: null
    type: s3
    s3:
      endpoint: http://<minio IP address>:9000/loki
      insecure: true
      accessKeyId: <minio access key>
      secretAccessKey: <minio secret>
      s3ForcePathStyle: true

read:
  replicas: 2
  persistence:
    storageClass: nfs-client

write:
  replicas: 2
  persistence:
    storageClass: nfs-client

backend:
  replicas: 2
  persistence:
    storageClass: nfs-client

gateway:
  service:
    type: NodePort
    nodePort: 31100

It is important to understand this as there is a lot going on, so let’s break it down.

storage:
  filesystem: null
  type: s3
  s3:
    endpoint: http://<minio IP address>:9000/loki
    insecure: true
    accessKeyId: <minio access key>
    secretAccessKey: <minio secret>
    s3ForcePathStyle: true

In this segment the first important line is filesystem: null. By default, the Helm chart adds filesystem configuration and you would end up storing chunks on your filesystem instead of MinIO. Setting this to null prevents that.

The endpoint points to the MinIO instance.

I have assumed that you will be storing your Loki information in the loki bucket. You will need to create this in MinIO manually.
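
One way to create the bucket is with the MinIO client, mc, assuming you have it installed on a machine that can reach your MinIO server (the alias name myminio is arbitrary):

mc alias set myminio http://<minio IP address>:9000 <minio access key> <minio secret>
mc mb myminio/loki
mc ls myminio

You can equally create the bucket through the MinIO web console.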

Note that in my example, I have removed TLS from my MinIO implementation to avoid the complexity of certificate management. This is shown by the http path and insecure: true. I have also included the standard MinIO port of 9000.

If you use the built-in version of MinIO, you should be aware that the endpoint is not a straight copy of your configuration; it is constructed as an internal service endpoint based on the name given to the MinIO pods.

Next, the access fields are from the access key and secret that you create within MinIO.

Then there is s3ForcePathStyle: true. This tells Loki to address the bucket as part of the URL path rather than as a subdomain (the virtual-hosted style that AWS S3 defaults to), which is what MinIO expects.
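
As an illustration, with a hypothetical hostname of minio.example.com and a hypothetical object key, the two addressing styles look like this:

Path-style (what s3ForcePathStyle: true forces, and what MinIO expects):
http://minio.example.com:9000/loki/<object key>

Virtual-hosted style (the AWS S3 default):
http://loki.minio.example.com:9000/<object key>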

Next come three identical blocks, shown here for read.

read:
  replicas: 2
  persistence:
    storageClass: nfs-client

In the simple scalable deployment of Loki, the chart sets up the architecture so that there are three types of pod: read, write and backend. Each of these contains multiple Loki components and can be scaled individually.

By default, the Helm chart will request 3 replicas of each of these pods but I have overridden this to be 2.

Here’s a gotcha. The pod replication is designed so that no two instances of a pod will be deployed to the same node. It also ensures that no instance is deployed to the master node. On a 3 node cluster (one master and two workers), a third replica will therefore be stuck in the Pending state. To avoid this, I have limited the number of replicas to 2.

The second part is to set the storageClass to nfs-client. You may remember from earlier that this triggers the automatic creation of the PV.

Note that it is very easy to get the case wrong on some of these settings. Unfortunately, this will not cause the install to fail; it just means that your override will be ignored.

The next segment concerns the gateway, which manages the flow of log information into Loki.

gateway:
  service:
    type: NodePort
    nodePort: 31100

I have changed this to be a NodePort so that our servers that are external to the cluster can still send their logs to Loki.

If you follow my articles, you will know that I use an Australian bare bones cloud provider called Binary Lane. They do not provide LoadBalancer functionality and so I have to provide that myself. In this case, though, I do not want my Loki service accessible from the Internet, only from servers within my private network, and so a NodePort does exactly what I need.

Ok, so we now have a values override file that we can use with our Helm chart to deploy Loki.

Now install Loki with:

helm install loki-monitoring grafana/loki -n monitoring -f loki-values.yml

You can now check that everything is up and running:

kubectl get pods -n monitoring

It may take a little while (1–2 minutes) to start. This is what it looks like on my system:

NAME                                                        READY   STATUS        RESTARTS   AGE
grafana-monitoring-655dbf8ddb-l7x4j 1/1 Running 0 7d12h
loki-backend-0 2/2 Running 0 15m
loki-backend-1 2/2 Running 0 15m
loki-canary-8g48k 1/1 Running 0 15m
loki-canary-qsnl5 1/1 Running 0 15m
loki-gateway-589957f6f8-s7hds 1/1 Running 0 15m
loki-monitoring-grafana-agent-operator-6d7d5b796d-xrzc8 1/1 Running 0 15m
loki-monitoring-logs-mhpzw 2/2 Running 0 15m
loki-monitoring-logs-qttjw 2/2 Running 0 15m
loki-read-8898b6b65-55l6g 1/1 Running 0 15m
loki-read-8898b6b65-mg4xm 1/1 Running 0 15m
loki-write-0 1/1 Running 0 15m
loki-write-1 1/1 Running 0 15m
nfs-subdir-external-provisioner-79bffb855c-9tgrq 1/1 Running 0 38h
prometheus-monitoring-kube-state-metrics-84945c4bd5-8m9pb 1/1 Running 0 8d
prometheus-monitoring-prometheus-node-exporter-55cm9 1/1 Running 0 8d
prometheus-monitoring-prometheus-node-exporter-877s9 1/1 Running 0 8d
prometheus-monitoring-prometheus-node-exporter-ntz6r 1/1 Running 0 8d
prometheus-monitoring-server-94f974648-r4mm7 2/2 Running 0 8d

If you have any problems, delete any pods stuck in the Terminating state (this can happen when you uninstall and reinstall or update Loki) with:

kubectl delete pod <PODNAME> --grace-period=0 --force -n monitoring

If you still have problems, you may have to uninstall Loki, and then reinstall. Note that this does not impact your NFS files or MinIO objects.
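
You can also confirm that the dynamic provisioner did its job for Loki. The exact claim names depend on the chart version, but the write and backend replicas (and, depending on the version, the read replicas) should each have a Bound PVC using the nfs-client storage class:

kubectl get pvc -n monitoring | grep -i loki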

Check your services:

kubectl get svc -n monitoring

You should see that the gateway is accessible from your Virtual Private Cloud (VPC) subnet.

This is what I see on my system:

NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
grafana-monitoring NodePort 10.111.14.71 10.240.0.19 80:31000/TCP 7d12h
loki-backend ClusterIP 10.106.41.200 <none> 3100/TCP,9095/TCP 16m
loki-backend-headless ClusterIP None <none> 3100/TCP,9095/TCP 16m
loki-canary ClusterIP 10.103.139.91 <none> 3500/TCP 16m
loki-gateway NodePort 10.99.124.132 <none> 80:31100/TCP 16m
loki-memberlist ClusterIP None <none> 7946/TCP 16m
loki-read ClusterIP 10.99.9.108 <none> 3100/TCP,9095/TCP 16m
loki-read-headless ClusterIP None <none> 3100/TCP,9095/TCP 16m
loki-write ClusterIP 10.109.116.136 <none> 3100/TCP,9095/TCP 16m
loki-write-headless ClusterIP None <none> 3100/TCP,9095/TCP 16m
prometheus-monitoring-kube-state-metrics ClusterIP 10.106.150.145 <none> 8080/TCP 8d
prometheus-monitoring-prometheus-node-exporter ClusterIP 10.107.228.81 <none> 9100/TCP 8d
prometheus-monitoring-server NodePort 10.96.203.65 10.240.0.19 9090:31190/TCP 8d
query-scheduler-discovery ClusterIP None <none> 3100/TCP,9095/TCP 16m

You should see that the loki-gateway is available as a NodePort.

You can also open the MinIO console and check that data is starting to fill your loki bucket.

Prove that you can access Loki from your nfs-server by running this on that server (remember to replace < > fields with your own values):

curl <k8s-master IP address>:31100 -v

You should get a response of OK. If you do, we are ready for the next part — installing the Promtail agent.
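
Before moving on, if you want a slightly stronger check, you can query the Loki API through the same NodePort; the gateway proxies the standard /loki/api/v1/ paths by default (remember to replace < > fields with your own values):

curl http://<k8s-master IP address>:31100/loki/api/v1/labels

At this point it may only return a handful of labels from the cluster's own logs, but a JSON response confirms that the read path is working.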

Promtail

Now that we have access to Loki from outside the cluster, we can start collecting logs from our non-Kubernetes servers (the Kubernetes logs are automatically collected as part of the Helm chart deployment).

This is done using an application called Promtail. It is installed as a service (also known as an agent) on a server and scrapes the logs it is configured to read. It then passes these back to Loki for processing and storage.

Loki supports a number of different agents but, as I have used Promtail successfully before, I thought I would show you how to use this one.

We will need to repeat the following installation on each server (outside the cluster) that we want to monitor.

Installing Promtail

First log into the server. You will need to be able to execute commands as root. Note that if you follow my articles you will know that I am working with Ubuntu servers.

As Promtail is not available via a package manager, you need to download the binaries. You can find the latest version here: https://github.com/grafana/loki/releases. Look under Assets and expand the list to find the Promtail release you need.

You will need to download the zipped file, uncompress it and make it executable. If you do not have unzip, you can obtain it with apt install unzip -y on Ubuntu. We then need to move Promtail to a more suitable place that is on the path.

curl -O -L https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
chmod a+x promtail-linux-amd64
cp promtail-linux-amd64 /usr/local/bin/promtail

You should now be able to check that it installed correctly with:

promtail --version

Configuring Promtail

Promtail needs a number of configuration settings to work and we will place the configuration file under /etc/promtail.

Promtail creates logs of its own and we need to ensure we collect these too. We will place them in /var/log/promtail.

mkdir /etc/promtail
mkdir /var/log/promtail

Now create the following file as root (remember to replace the < > fields with your own values):

/etc/promtail/promtail-config.yaml

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: 'http://<loki IP address>:31100/loki/api/v1/push'

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: <server>_varlogs
          __path__: /var/log/**/*.log

Note that if you change this file you will need to restart the Promtail service.

In my set up, as Loki is running as a NodePort, which exposes port 31100 on all the cluster nodes, I will use the k8s-master IP address. I could, if required, add a load balancer into my VPC but for now I do not need that.

  • server: Promtail itself acts as a server and the first two lines set up the port it will use.
  • positions: Tells Promtail where to keep track of the logs it has scraped and how far through it has got.
  • clients: In this model, Promtail pushes its log changes to Loki and so we need to tell it where to send them.
  • scrape_configs: this is where Promtail will look for logs (see here for more detail)

Under scrape_configs we can define any number of log locations. Each set of locations is classed as a job and when we look at our logs through Grafana, we can search by these job names.

Each job defines the targets it will scrape and, in this case, it will scrape all logs in the /var/log folder and its subfolders on this server (localhost is the only option for Promtail) that end in .log.

To assist with filtering and searching of logs, we are able to classify our logs by giving them labels; in this case, the log folder is tagged with job: <server>_varlogs. Note that we can also select by filename in Grafana as part of the additional metadata that is collected with the log.

You may remember I said that Loki stores logs as an index of labels and metadata and as chunks of logs.

This makes it very efficient as it does not index the logs themselves. It does mean that every log requires a label to add to the index to allow it to be found. Promtail will add other labels, such as filename, to the logs it feeds back to Loki, but you should define your own here to make your logs easier to find in Grafana. You may also want to create separate jobs for separate log files and paths.

Remember to change the labels for each server so you can find them effectively in Grafana.
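
As a sketch of what that might look like, here is a hypothetical configuration for a server I will call gw, with a second job added for nginx access logs (the gw name and the nginx path are my own assumptions; adjust them to the logs your server actually produces):

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: gw_varlogs
          host: gw
          __path__: /var/log/**/*.log
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: gw_nginx
          host: gw
          __path__: /var/log/nginx/*.log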

We are now ready to start scraping logs for this server and streaming them back to Loki so we can view them in Grafana.

Set up as service

We need to set up Promtail as a service to ensure it is always running. Create the following file as root:

/etc/systemd/system/promtail.service

[Unit] 
Description=Promtail service
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/promtail -config.file /etc/promtail/promtail-config.yaml
Restart=on-failure
RestartSec=20
StandardOutput=append:/var/log/promtail/promtail.log
StandardError=append:/var/log/promtail/promtail.log

[Install]
WantedBy=multi-user.target

Now we will start and enable this service, noting that we first have to tell systemctl to load this new service definition. We will then start it, make sure it started and then enable it to be started if the server reboots.

systemctl daemon-reload
systemctl start promtail
systemctl status promtail
systemctl enable promtail

Remember to repeat this on any other server and also, if required, change the scrape target.
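
A quick way to confirm Promtail is healthy on each server is to look at its own log file and, if your version serves it, the built-in web page listing the files it is tailing (on the http_listen_port we configured):

tail -n 20 /var/log/promtail/promtail.log
curl localhost:9080/targets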

Connecting Loki to Grafana

Ok, you should now have Promtail installed and running on all your non-Kubernetes servers, Loki running on the cluster along with Grafana (installed in previous articles).

Now the final step, connecting Loki to Grafana.

Log in to Grafana and head to Home -> Data sources. If you have been following along, you should see your Prometheus data source.

  • Click + Add new data source
  • Find and select Loki
  • Give your data source a name or use the default Loki
  • Add a connection URL, in this example, http://loki-gateway (no need to specify the port)
  • Leave the rest as is
  • Click Save & test

All being well, you should end up with a green tick.

Note that Kubernetes sets up an internal DNS. When it creates a service, it adds a number of DNS entries based on the following names:

  • <service name>.<namespace>.svc.cluster.local
  • <service name>.<namespace>.svc
  • <service name>.<namespace>
  • <service name>

This is why our configuration for the gateway can just be loki-gateway, using the last form of DNS entry (this works because Grafana runs in the same namespace as Loki). If this is ambiguous, feel free to use any of the other forms instead.
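
If you want to confirm the name resolves before saving the data source, one option is a throwaway pod that runs nslookup inside the cluster (busybox is just a convenient image for this):

kubectl run dns-test -n monitoring --rm -it --restart=Never --image=busybox -- nslookup loki-gateway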

What to do if all is not well

If you do not get a green tick, then you need to start looking into the problem. If you go to your kubectl location (ie: k8s-master for me), use:

kubectl get pods -n monitoring

First check that your Loki pods are running. You can check the status with:

kubectl describe pods <pod name> -n monitoring

You can look at logs with:

kubectl logs <pod name> -n monitoring

If you need to, you can add -f to this to follow it in real time.

I tend to clear the screen with Command K (on a Mac) and then try the operation again to make it clear what happened.

There are too many things that can go wrong for me to explore here, but now you know how to view the logs (and you can use this to look at your Grafana logs too), so with the help of Google you will hopefully have it up and running.

Restarting a pod does not generally fix the problem, so you can always reinstall with:

helm uninstall loki-monitoring -n monitoring
helm install loki-monitoring grafana/loki -f loki-values.yml -n monitoring

Exploring the data source

Ok, so I am assuming you have managed to get a green tick and your data source is available. To check this we will now explore this data source with Grafana.

Viewing the Logs

Once Loki is connected to Grafana, you can select Home -> Explore in your Grafana console.

It is important to select Loki as the data source you wish to look at.

You can now start building your queries. Under Select label choose job.

Select the Loki data source

You will see that the Select value field now lists all the Loki label values associated with the label job.

Select the job label

Select a Promtail job (eg: gw_varlogs) and then click + and select a new label of filename. You will now see all the filenames associated with the gw_varlogs label in the Select value field.

Select a log filename

Select a filename and then click the big blue button at the top called Run query.

You will now see a time ordered list of log entries from that file at the bottom of the page, including a graph of activity. Note that the default is 1,000 lines at a time.

A detailed description of how to use Grafana is beyond the scope of this article, but hopefully you now have some idea of how to explore your logs. You can even turn your logs into metrics, display them on a dashboard and alert on them!
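
If you prefer typing queries to clicking, the same selections can be written directly as LogQL in the query editor. These examples assume the hypothetical gw_varlogs job label from earlier; the first filters the stream for lines containing "error", the second turns that into a per-5-minute count that you could graph or alert on:

{job="gw_varlogs"} |= "error"
count_over_time({job="gw_varlogs"} |= "error" [5m])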

Summary

Although this is a long article, we covered a lot of ground.

In this article we looked at how we provide persistent storage to our Kubernetes cluster using a dynamically provisioned PV backed by an NFS server as well as a standalone MinIO server.

We then set up some configuration overrides and deployed Loki from a Helm chart that installed Loki using the Simple Scalable strategy.

After showing that Loki was up and running, we then used Promtail to scrape logs from our non-clustered servers and send them back to Loki.

Finally we connected Loki to Grafana and saw how we can look at logs to monitor our system.

If you found this article of interest, please give me a clap as that helps me identify what people find useful and what future articles I should write. If you have any suggestions, please add them in the comments section.
