From Chaos to Clarity: Solving Log Management Challenges in Kubernetes

Ahmad Ghassemi
MCINext

--

Logging in apps often means writing stuff to files or showing it on the screen (stdout), and we need to keep those logs safe somewhere. But how do we gather them all up?

At MCINext, we deal with thousands of microservices that log to files inside pods across several large Kubernetes clusters, and gathering these logs is a real challenge for our SRE team.

Tools like Fluentd, Fluent Bit, and Vector help by rounding up logs from different places, like files and stdout. But it gets trickier when we want logs from files inside containers tucked away in pods in Kubernetes.

There are two common approaches to gathering logs from applications in pods:

Sidecar Container

Using a sidecar is a suitable approach for a small-scale Kubernetes cluster. At large scale, however, the situation is different. Some disadvantages of the sidecar pattern in a large-scale cluster are:

  • Resource consumption: More containers mean more memory consumption and CPU utilization. From a resource perspective, it’s sometimes more efficient to host all the processes inside a single container, or to run complementary tasks at the node level.
  • Management complexity: Sidecars increase the total number of containers you need to monitor and manage, not to mention the relationships between them. You’ll need a comprehensive monitoring system that can track, for instance, exactly how the failure of a sidecar container affects your main applications.
  • Update compatibility: It may require more work to ensure that updates to your main application container are compatible with the sidecar that supports it, and that those updates can carry across all the related components without issues.

Exposing logs directly from the application

Exposing application logs directly to the logging store can cause issues in large-scale environments or under heavy service loads:

  • Vulnerability: This approach allows multiple access points for users, which can increase the threat of a security breach.
  • Complexity: Serving multiple clients in one instance of an application/database adds an extra level of complexity to the codebase and database maintenance.
  • Global problems: If a technical problem occurs on the provider’s end, it can lead to issues for all users. This may apply to uptime, system upgrades, and other global processes.

Solution

On the SRE team, we decided to develop Harvester: a centralized log collector system.

The benefits of this centralized system are lower resource consumption, less management complexity, and fewer security risks. Additionally, it can be scaled easily without serious headaches.

With this system, you can simply create a manifest, provide some information, and voilà! Logs are effortlessly collected, rotated, and transferred to a Kafka system as easy as pie!

You only need to create a manifest like this:

apiVersion: mcinextlog/v1alpha1
kind: Harvest
metadata:
  name: harvest-sample
  namespace: logging-system
spec:
  podLabels: # labels of the target pods
    app: logger-pod
  containerName: logger-pod # container name in the matched pods
  logs:
    - backend: elastic # only the elastic backend is supported
      path: /app/json.log # path of the log inside the container (ignored if source is std)
      source: file # supported sources: file and std
      tag: emptydir # tag of the log; if empty, the file is only rotated
      type: json # parse type (txt, json)
      rotationSize: 20M # rotate the file at the specified size (defaults to 10M)
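For illustration, on the operator side the manifest above might map to Go types roughly like the following. This is a hypothetical sketch: only the field names come from the manifest, while the struct layout, json tags, and package are assumptions.

// Hypothetical sketch of the Harvest CRD types; field names mirror the manifest above.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type LogSpec struct {
	Backend      string `json:"backend"`                // only "elastic" is supported
	Path         string `json:"path,omitempty"`         // log path inside the container
	Source       string `json:"source"`                 // "file" or "std"
	Tag          string `json:"tag,omitempty"`          // if empty, the file is only rotated
	Type         string `json:"type"`                   // "txt" or "json"
	RotationSize string `json:"rotationSize,omitempty"` // defaults to 10M
}

type HarvestSpec struct {
	PodLabels     map[string]string `json:"podLabels"`
	ContainerName string            `json:"containerName"`
	Logs          []LogSpec         `json:"logs"`
}

type Harvest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              HarvestSpec `json:"spec"`
}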

Before delving into the main idea, it’s essential to become acquainted with two key terms:

  • Harvest: a “harvest” or “harvest data” refers to a manifest like the one shown above, describing which logs to collect and how.
  • Harvester: a “harvester” is both a pod and an application; it generates the Vector and logrotate configuration files and communicates with the operator.

Architecture

The centralized log collector system consists of two main components:

  • Operator
  • Harvester

Operator

If you are not familiar with operators, you can read this article.

This operator exposes two APIs:

  • Get generation

Returns the generation number, which changes whenever at least one of the Harvests changes.

  • Get harvests

Returns all Harvests
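As a rough sketch (the endpoint paths, port, and payload shapes are assumptions, not the operator’s actual implementation), the two APIs could look like this:

// Hypothetical sketch of the operator's two endpoints.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

var (
	mu         sync.RWMutex
	generation uint64    // incremented by the reconcile logic (not shown) whenever a Harvest changes
	harvests   []Harvest // the Harvest type sketched earlier
)

func main() {
	http.HandleFunc("/generation", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		json.NewEncoder(w).Encode(map[string]uint64{"generation": generation})
	})
	http.HandleFunc("/harvests", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		json.NewEncoder(w).Encode(harvests)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}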

Harvester

Harvester is a DaemonSet written in Golang, acting as an agent on every node.

The harvester pod contains three containers: vector, logrotate, and the harvester application itself. The pod has an emptyDir volume that stores the Vector and logrotate configuration files generated by the harvester. Additionally, the /var path and the CRI socket file are mounted into the pod as hostPath volumes.

To understand how the log system works in this context, let’s delve into the process:

When you apply a harvest manifest, the generation number in the operator increases. The harvesters check the generation number every 10 seconds; if it has changed, they call the ‘get harvests’ API to retrieve all the harvests.
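In Go, that loop might look roughly like this (a sketch built on the assumed endpoints above; fetch and the reconcile hook are hypothetical helpers, not the real client code):

// Sketch of the harvester's polling loop against the operator.
package agent

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// fetch GETs a JSON document from the operator and decodes it into out.
func fetch(ctx context.Context, url string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// pollOperator checks the generation every 10 seconds and re-fetches all
// harvests only when it has changed. reconcile is a hypothetical hook that
// regenerates the node's Vector and logrotate configs.
func pollOperator(ctx context.Context, operatorURL string, reconcile func([]Harvest)) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	var lastGen uint64
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		var gen struct {
			Generation uint64 `json:"generation"`
		}
		if err := fetch(ctx, operatorURL+"/generation", &gen); err != nil || gen.Generation == lastGen {
			continue // operator unreachable or nothing changed; try again on the next tick
		}

		var harvests []Harvest // the Harvest type sketched earlier
		if err := fetch(ctx, operatorURL+"/harvests", &harvests); err != nil {
			continue
		}
		reconcile(harvests)
		lastGen = gen.Generation
	}
}

Polling a single generation endpoint keeps the common case cheap: most ticks cost one tiny request, and the full harvest list is transferred only when something actually changed.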

When the harvester on a node retrieves the harvests, it searches for pods whose logs need to be gathered, according to the podLabels and containerName specified in each harvest.
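For instance, with client-go the harvester could list only the pods scheduled on its own node that match a harvest’s podLabels (a sketch; the NODE_NAME variable is assumed to be injected through the Downward API in the DaemonSet spec):

// Hypothetical sketch: find pods on this node matching a harvest's podLabels.
package agent

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

func podsForHarvest(ctx context.Context, cs *kubernetes.Clientset, podLabels map[string]string) ([]string, error) {
	opts := metav1.ListOptions{
		LabelSelector: labels.Set(podLabels).String(),
		FieldSelector: "spec.nodeName=" + os.Getenv("NODE_NAME"), // only pods on this node
	}
	pods, err := cs.CoreV1().Pods("").List(ctx, opts)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		names = append(names, p.Namespace+"/"+p.Name)
	}
	return names, nil
}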

After identifying the pods on a node, the harvester connects to the CRI to obtain the paths of their log files on that node.

For example, a log file located at /app/log.json inside a pod is found on the node at /var/lib/.../nonRelatedName.json. Using the CRI API, the harvester can easily resolve the absolute path on the node.
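A hedged sketch of that lookup using the CRI API is shown below. The socket path and the final path-resolution step are assumptions: where a container file actually lives on the node depends on the runtime and on the volume backing it.

// Hypothetical sketch of looking up a container through the CRI socket.
package agent

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// containerInfo returns the runtime's verbose info for the named container,
// e.g. socket = "unix:///run/containerd/containerd.sock" (assumed path).
func containerInfo(ctx context.Context, socket, namespace, pod, container string) (map[string]string, error) {
	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	client := runtimeapi.NewRuntimeServiceClient(conn)

	// kubelet labels every CRI container with its pod and container name
	list, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{
		Filter: &runtimeapi.ContainerFilter{LabelSelector: map[string]string{
			"io.kubernetes.pod.namespace":  namespace,
			"io.kubernetes.pod.name":       pod,
			"io.kubernetes.container.name": container,
		}},
	})
	if err != nil || len(list.Containers) == 0 {
		return nil, err
	}

	// the verbose status exposes runtime-specific details from which the
	// node-side path of a file like /app/json.log can be derived
	status, err := client.ContainerStatus(ctx, &runtimeapi.ContainerStatusRequest{
		ContainerId: list.Containers[0].Id,
		Verbose:     true,
	})
	if err != nil {
		return nil, err
	}
	return status.Info, nil
}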

After retrieving the path of the log file, the system generates a Vector and a logrotate configuration file for each path. These files use the harvest data to set the rotation size, tags, and other relevant parameters.

In the Harvester’s environment variables, there are settings for the Kafka bootstrap servers and the topic to which the collected logs are sent. From there, you can consume these logs and route them to any desired destination.
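Putting the last two steps together, the harvester could render a Vector config per resolved path roughly like this (a minimal sketch: the template, output location, and environment variable names are assumptions, and the matching logrotate config is generated the same way and omitted here):

// Hypothetical sketch of rendering a per-path Vector config with a Kafka sink.
package agent

import (
	"os"
	"text/template"
)

const vectorTmpl = `
[sources.{{ .ID }}]
type = "file"
include = ["{{ .NodePath }}"]

[sinks.{{ .ID }}_kafka]
type = "kafka"
inputs = ["{{ .ID }}"]
bootstrap_servers = "{{ .Bootstrap }}"
topic = "{{ .Topic }}"
encoding.codec = "json"
`

type vectorParams struct {
	ID        string // derived from the harvest name and tag
	NodePath  string // absolute log path on the node, resolved via CRI
	Bootstrap string
	Topic     string
}

func writeVectorConfig(dst, id, nodePath string) error {
	p := vectorParams{
		ID:        id,
		NodePath:  nodePath,
		Bootstrap: os.Getenv("KAFKA_BOOTSTRAP_SERVERS"), // assumed variable names
		Topic:     os.Getenv("KAFKA_TOPIC"),
	}
	f, err := os.Create(dst) // written to the emptyDir shared with the vector container
	if err != nil {
		return err
	}
	defer f.Close()
	return template.Must(template.New("vector").Parse(vectorTmpl)).Execute(f, p)
}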

Note that the harvester has to work with any location where a log file might live, such as Persistent Volume Claims (PVCs), emptyDir volumes, ephemeral storage, and so on. This ensures it can gather logs from various sources within the Kubernetes cluster, regardless of where the log files are stored.

References:

https://www.orbitanalytics.com/multi-tenant/
