Grafana Loki — Our journey replacing Elasticsearch and adopting a new logging solution at Arquivei

João Guilherme Luchetti
Engenharia Arquivei
7 min read · Sep 19, 2023

Logging is an important aspect of any software system. As a company grows, collecting, storing, and presenting logs can become a challenge, especially in microservices architectures where many different services are running concurrently and interacting with each other.

Having multiple log sources without a log aggregation tool quickly becomes a huge problem, since storing and analyzing them independently is inefficient and impractical. Various log management tools can help solve this, and each one is better suited to a specific scenario.

Our legacy logging solution

At Arquivei, the originally implemented logging solution was the EFK stack: Elasticsearch, Logstash, Kibana, Fluent Bit, and Fluentd.

Our previous logging architecture using Elasticsearch

For the company-specific scenario, the logging flow would work as follows:

  • Fluent Bit agents running on the hosts were responsible for gathering the application logs and sending them to Fluentd.
  • Fluentd would then process the logs based on a set of custom rules and publish them to Redis.
  • Finally, Logstash subscribed to Redis to consume the logs, which were then indexed by Elasticsearch and presented via Kibana.

That solution served us well until a few years ago, but as the company grew, the volume of logs also increased. With that increase, we started facing some problems:

  • Complexity — Different components running on VMs required significant effort from the SRE team to maintain and update. Also, in case of failures, debugging was not easy, as any of those pieces could be the cause.
  • Cost — Due to the large volume of processed logs, 11 Elasticsearch nodes with 2 TB SSDs each were needed; together with the other components of the infrastructure, this added up to a total cost of about $60,000 per year for only 30 days of log retention.
  • Indexing Problems — At the time, there were also recurring index-related issues that frequently resulted in logs being lost.

With that in mind, the team started looking for a new logging solution. That is when we found Loki and realized it could be less complex to maintain, cheaper, and able to solve our problem of losing logs.

Loki and the PLG Stack

Loki

Loki is a log aggregation tool developed by Grafana Labs. Unlike other logging solutions, it does not index the log content itself; instead, it creates labels (key/value pairs) that are used as metadata to describe a log stream, similar to what Prometheus does with time series.

That approach has advantages and disadvantages: Loki is more cost-effective and performant than Elasticsearch, but the trade-off is weaker text search capabilities, as it does not process the log content into an optimized data structure for searching.

To collect the logs from applications and send them to Loki we can use Promtail.

Promtail is an agent that can be installed on a set of hosts and configured to collect logs and apply labels before shipping them to Loki. It is also worth noting that Promtail ships with a standard configuration for Kubernetes that includes some default labels (namespace, job, pod, etc.).

That is particularly good for our scenario, since most of our applications run on Kubernetes, so we get labeled logs out of the box simply by installing Promtail and integrating it with Loki.
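
For reference, those default labels come from Kubernetes service discovery and relabeling rules in Promtail’s generated scrape configuration. A simplified sketch (not our exact configuration, which comes from the chart defaults) looks like this:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                                       # discover every pod running on the node
    relabel_configs:
      # map Kubernetes metadata to Loki labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container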

Then, in Grafana, we can add Loki as a data source to view logs, create dashboards, and use everything else Grafana provides.
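
For example, besides adding it through the UI, the Loki data source can be provisioned declaratively. A minimal provisioning sketch (the URL assumes the loki-distributed gateway service and is not our exact endpoint):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-loki-distributed-gateway.loki.svc.cluster.local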

The Prometheus, Loki, and Grafana stack is also known as PLG and the fully implemented architecture is as follows:

PLG stack architecture

Looking at the architecture in more detail:

  • Loki is deployed on a main Kubernetes cluster named “Office”, where Grafana is also deployed.
  • All the production GKE clusters have Promtail installed and integrated with Loki. As this is a multi-cluster scenario, Promtail is configured to add an extra label with the corresponding cluster name, which helps us filter logs by the cluster where the application is hosted.
  • Loki is integrated with GCS storage for longer log retention (3 months) at a lower storage cost. With GCS it is possible to create lifecycle rules that keep logs for a specific amount of time and change the storage class by age (see the example after this list).
  • Loki is deployed in microservices mode. This approach gives us a highly scalable environment, with the possibility of scaling each Loki component individually, and is suitable for production environments with large log volumes.
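
As an illustration of those lifecycle rules, a policy similar to the sketch below (the ages are placeholders, not our actual values) moves older objects to a colder storage class and deletes them after the retention window. It can be applied with gsutil lifecycle set <file> gs://<bucket-name>:

{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
      "condition": { "age": 30 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "age": 90 }
    }
  ]
}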

Implementation

Diving into the implementation, we mainly used the Promtail and Loki Helm charts with a set of custom parameters according to our needs.

Promtail installation:

helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install promtail grafana/promtail  -n promtail --values values.yaml

default values: https://github.com/grafana/helm-charts/blob/main/charts/promtail/values.yaml

Below is the section added to the values file to configure the extra cluster label with the cluster name:

extraArgs:
  - -client.external-labels=cluster=<cluster-name>

Also, it is important to remember that, for Promtail to collect logs from applications running on nodes with specific taints, the corresponding tolerations need to be added to the values file.
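
Putting these pieces together, a trimmed-down values.yaml for Promtail could look like the sketch below (the Loki gateway URL and the toleration key/value are placeholders, not our actual settings; depending on the chart version, the Loki endpoint is configured under config.clients):

config:
  clients:
    # push logs to the Loki gateway (placeholder URL)
    - url: http://loki-loki-distributed-gateway.loki.svc.cluster.local/loki/api/v1/push

extraArgs:
  - -client.external-labels=cluster=<cluster-name>

tolerations:
  # placeholder toleration so Promtail also runs on tainted nodes
  - key: dedicated
    operator: Equal
    value: special-workload
    effect: NoSchedule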

Loki distributed installation:

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-distributed -n loki --values values.yaml

default values: https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml

When using the Loki distributed Helm chart, each component needs to be enabled and sized individually according to your needs. Based on our log volume, Loki was configured as follows:

  • Ingester — 5 replicas (Statefulset)
  • Distributor — 3 to 5 replicas (autoscaling)
  • Querier — 4 to 10 replicas (autoscaling)
  • Query Frontend — 3 to 5 replicas (autoscaling)
  • Query Scheduler — 3 replicas
  • Index Gateway — 3 replicas
  • Compactor — 1 replica
  • Memcached Frontend — 1 replica
  • Memcached Chunks — 2 replicas
  • Memcached Index Queries — 3 replicas

Details about each component can be found in Grafana Loki’s official documentation.
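
As an illustration of how that sizing maps to the chart values (a sketch based on the loki-distributed chart structure, with the numbers from the list above), the relevant sections look roughly like this:

ingester:
  replicas: 5
  persistence:
    enabled: true          # ingesters run as a StatefulSet with persistent volumes

distributor:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 5

querier:
  autoscaling:
    enabled: true
    minReplicas: 4
    maxReplicas: 10

queryFrontend:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 5

queryScheduler:
  enabled: true
  replicas: 3

indexGateway:
  enabled: true
  replicas: 3

compactor:
  enabled: true            # runs as a single replica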

In addition, taking into account resource utilization and preemptibility, we created two distinct node pools to split the Loki components according to their specific requirements.

Since Queriers have high CPU and memory usage and scale based on the number of queries being executed, they were allocated to a separate node pool with e2-highcpu preemptible instances, named Loki Preemptible.

Meanwhile, the other components were allocated to a node pool that uses e2-standard-4 instances, named Loki Critical.
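
To pin the components to those pools, the chart allows setting a nodeSelector (and tolerations, if the pools are tainted) per component. A sketch assuming the pools expose a “pool” node label (the label name and values are placeholders):

querier:
  nodeSelector:
    pool: loki-preemptible    # placeholder label on the preemptible node pool

ingester:
  nodeSelector:
    pool: loki-critical       # placeholder label on the on-demand node pool

distributor:
  nodeSelector:
    pool: loki-critical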

Regarding the configuration for the GCS integration, the following snippet can be used to enable it in the values.yaml file:

loki:
  # configuring GCS to store the logs
  schemaConfig:
    configs:
      - object_store: gcs
        store: boltdb-shipper
        schema: v11
        index:
          prefix: index_loki_
          period: 24h
        chunks:
          prefix: chunk_loki_
          period: 24h
  storageConfig:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      shared_store: gcs
      cache_location: /var/loki/cache
      resync_interval: 5s
      index_gateway_client:
        grpc_client_config:
          max_recv_msg_size: 1.048576e+11
          max_send_msg_size: 1.048576e+11
    gcs:
      bucket_name: <bucket-name>

Testing and tuning

Before establishing Loki as our new log aggregation solution, we evaluated its performance under different usage and load scenarios.

Our main goal was to monitor key indicators such as response time, processing performance, data throughput, resource consumption, and scalability, to ensure that Loki was fast, efficient, and reliable, and also to identify the setup that delivered the best performance.

To conduct our tests we used the k6 framework integrated with the xk6-loki extension, and to run them we used Testkube.

Each test was configured with a different number of VUs (Virtual Users) and iterations and executed 3 times to ensure consistency. During each test, we also monitored pod autoscaling behavior and CPU and memory usage on the nodes.

During our initial tests, we experienced a lot of slowness and timeouts while running simple queries for multiple users in parallel, even though resource usage was not a problem.

By searching and discussing the issue on the Loki forums, we found that the configuration parameters below could be impacting performance:

  1. Split Queries By Interval — Time window used to split a query into smaller queries that are executed in parallel.
  2. Querier Max Concurrent — Maximum number of queries a Querier can run concurrently.
  3. Parallelise Shardable Queries — Enables parallelization by sharding queries.

Initially, the greatest performance improvement was obtained by setting the parallelise_shardable_queries parameter to “false”. With this setting enabled we would get lots of “context canceled” and timeout errors, and after disabling it these errors ceased to occur.

Even then, there was still slowness in returning results when running multiple heavy queries in parallel and when querying larger time ranges.

After testing multiple combinations of values, what proved to be the best for our scenario was split_queries_by_interval: 24h and max_concurrent: 10.

With this configuration, during the tests Loki was able to execute 400 queries over a 90-day range (20 virtual users executing 20 queries each in parallel) in about 10 minutes.

The results of these tests were extremely important to ensure a better experience for our users.

A snippet of the mentioned configuration and other relevant parameters is below:

config: |
  auth_enabled: false
  server:
    http_listen_port: 3100
    grpc_server_max_concurrent_streams: 500
    grpc_server_max_recv_msg_size: 1.048576e+11
    grpc_server_max_send_msg_size: 1.048576e+11
    http_server_read_timeout: 30m
    http_server_write_timeout: 30m
    http_server_idle_timeout: 30m
  querier:
    max_concurrent: 10
    query_timeout: 30m
    engine:
      timeout: 30m
  ...
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 24h
    per_stream_rate_limit: 30MB
    per_stream_rate_limit_burst: 50MB
    max_entries_limit_per_query: 20000
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
    max_query_parallelism: 32
    max_query_length: 0h
    max_query_series: 1000000
    max_global_streams_per_user: 10000
  ...
  query_range:
    parallelise_shardable_queries: false
    align_queries_with_step: true
    max_retries: 5
    cache_results: true
  ...
  query_scheduler:
    max_outstanding_requests_per_tenant: 10000000
    grpc_client_config:
      max_recv_msg_size: 1.048576e+11
      max_send_msg_size: 1.048576e+11

Final thoughts

During our tests, Loki proved to be a great tool. Although achieving peak performance requires some exploration of the documentation and tuning of configuration parameters, Loki stands out as a comprehensive solution that can be easily deployed in a Kubernetes cluster. It simplifies the process of deploying, ingesting, and querying logs from various workloads.

Also, our adoption of a new logging solution was partly motivated by the desire to optimize costs. In this regard, Loki has delivered great results, reducing our monthly logging infrastructure expenses by approximately 70%. This translates to estimated monthly savings of $3,700, culminating in annual savings of over $40,000.

These substantial financial gains, coupled with the reduced complexity and effort required to operate and maintain the solution, were the deciding factors in embracing Loki as Arquivei’s definitive logging platform.
