Soluto by asurion

Overwhelmed by logs? Kubernetes to the rescue!

At Soluto we use a microservices architecture: each service ships its logs to our third-party logs provider, where we query them to create alerts and dashboards and to improve visibility.

As logs multiplied, the issues started…

The continuous growth of our company presented a new challenge to our log collection system. With new services added to our backend every week, our log volume grew exponentially. All good up to here. However, it eventually hit our logs quota, and our provider started blocking log shipping on a daily basis.

This meant no log collection, no visibility into our services, and a lot of hassle in getting things up and running again, every single day.

When the logs quota is reached, the on-call company engineer receives an alert and needs to figure out what caused so many logs to be written. In most cases, the reason is that a service is ‘acting up’ or a database is down. All services that use this resource then start writing error logs such as “Can’t write to database”, “Connection timed out”, etc.

For example, if a service handles 1000 messages from a queue and, for some reason, cannot write them to the database, it ends up writing 1000 repetitive error messages about the same issue. Not that useful.

At this point the on-call company engineer has two possible solutions: either increase the logs quota, or find the offending microservice and fix it so things return to a normal state.

Increasing the logs quota never helped us for more than a few days.

As established by Parkinson’s Law, increasing the amount of a resource only leads to people maximizing their use of it. Raising the quota would therefore encourage inefficient use of resources, and the same problem would recur, with a bigger bill as a bonus.

We therefore usually opt for the second solution: the on-call engineer contacts the relevant service owner, who identifies the source of the problem, fixes it, and reduces the log count. Profit. Unfortunately, though, finding the offending microservice and asking the responsible team to reduce its logs is not a straightforward task, and it causes friction between teams and frustration for the engineer. It is also resource-heavy, as it requires both the on-call engineer and an additional team to intervene.

This is how dealing with such an incident would look in our internal communication channels.

This issue kept recurring at Soluto for several years, until Kubernetes came along.

The Solution

Since we introduced Kubernetes two years ago, our services have been gradually migrating onto the new orchestrator, and new services are created exclusively on it.

In Kubernetes, log shipping is greatly simplified: services write their logs to stdout, while Fluentd collects them behind the scenes and sends them to our third-party logs provider. Having log shipping in a centralized service has given us much better control over it.
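To make this concrete, here is a minimal sketch (not Soluto’s actual code) of what a service does under this model: it simply writes structured log lines to stdout, and Fluentd tails the container log files behind the scenes. The field names are assumptions for illustration.

```ruby
require "json"
require "time"

# Emit a structured log line to stdout; Fluentd picks it up from the
# container's log file without the service knowing anything about shipping.
def log(level, message)
  line = {
    "level"   => level,
    "message" => message,
    "time"    => Time.now.utc.iso8601
  }.to_json
  puts line
  line
end

log("Information", "service started")
log("Error", "Can't write to database")
```

The service needs no log-shipping library at all; the orchestrator and Fluentd handle everything downstream.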

The adoption of Kubernetes and Fluentd, along with our constant effort to reduce our log volume, provided an opportunity to reevaluate the logs quota issue. We wanted to implement a mechanism to handle repetitive logs. After analyzing the options to make sure we chose the best path, we found that Fluentd could offer a solution: throttling.

We then discovered fluentd-plugin-throttle, a cool Fluentd plugin that throttles logs by configurable groups. It felt promising, but one key feature was missing: the capacity to ignore some logs (in our case, logs of level “Information” or “Debug” are sent to a different logs storage and shouldn’t be throttled). We added this capability to the plugin, opened a pull request (which hasn’t been merged yet), installed the plugin from our local fork, and defined a configuration to filter logs by pod groups.
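The exemption logic we added can be sketched roughly like this (a simplified illustration, not the plugin’s real API): records whose log level matches a configurable regex bypass throttling entirely.

```ruby
# Levels matching this regex are exempt from throttling; everything else
# is subject to the group buckets. Mirrors the `key`/`regex` pair in the
# Fluentd configuration below.
IGNORE_LEVEL_REGEX = /^([Ii]nfo|[Ii]nformation|[Dd]ebug)$/

# Returns true if the record should skip throttling.
def exempt_from_throttling?(record)
  level = record["level"].to_s
  !!(level =~ IGNORE_LEVEL_REGEX)
end
```

This way, high-volume but cheap informational logs flow to their separate storage untouched, while error-level floods are the only thing the throttle ever drops.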

Here is a snippet of our Fluentd configuration:

<source>
  @type tail
  read_from_head true
  tag "kubernetes.*"
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.**>
  @type throttle
  group_key kubernetes.pod_name
  group_bucket_period_s 60
  group_bucket_limit 60
  key level
  regex /^([Ii]nfo|[Ii]nformation|[Dd]ebug)$/
</filter>

As you can see, we collect all the logs from the container files, enrich them with the kubernetes_metadata plugin, and then use the throttle plugin to group the logs by pod name. Finally, we set the throttling limit to 60 logs per minute.
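The core of what the throttle plugin does for us can be sketched as a per-group bucket counter (a simplified model under our assumptions, not the plugin’s actual implementation): each group, keyed by pod name, may emit at most `limit` logs per `period_s`-second window, and everything beyond that is dropped.

```ruby
# Simplified model of per-group log throttling: one counting bucket per
# group, reset every period_s seconds.
class GroupThrottle
  def initialize(limit:, period_s:)
    @limit = limit
    @period_s = period_s
    @buckets = Hash.new { |h, k| h[k] = { start: nil, count: 0 } }
  end

  # Returns true if the record should be kept, false if throttled.
  def allow?(group, now = Time.now)
    bucket = @buckets[group]
    # Start a fresh window if this is the first record or the window expired.
    if bucket[:start].nil? || now - bucket[:start] >= @period_s
      bucket[:start] = now
      bucket[:count] = 0
    end
    bucket[:count] += 1
    bucket[:count] <= @limit
  end
end
```

Because the key is the pod name, one misbehaving service flooding errors gets capped on its own, without affecting the log flow of any other pod.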

We’re enjoying great results that took effect immediately: logs are throttled at all times, without the developers even noticing. Logs often contain a lot of redundant information, and 60 messages per minute are enough to understand the root cause of any issue that may arise and to take appropriate action.

Putting it simply: we found a solution to our problem. The on-call engineer is no longer getting annoyed by those alerts, other company engineers get flagged on time rather than when the service is already blocked, and well… everyone is happy.

You’re welcome to try it as well! Happy throttling!
