Microservice Logging

In today’s world, microservices bring us many benefits. But at the same time, they bring some difficulties with them. In this article, we will talk about one of them: logging.

Most of the time the importance of logging is ignored, but knowing your system’s health and tracking its critical parts can save your life. Logging can also yield valuable analytics about your system’s behaviour.

On monolithic systems, logs are generally kept on the same filesystem, so determining errors and debugging the system is easy. We can’t say the same about microservices.

Correlation

The main point is that your services are independent of each other, but any process flow you try to track will touch multiple instances. To follow the clues, you need a unique identifier for the flow, such as a request id or correlation id, and that identifier has to travel with the flow. The questions are: “When do we create that correlation id?” and “Who will create it?”

There are many ways to do this, but personally I prefer to do it at the gateway level. Especially if you provide API sets, your gateway is the single entrance, and from there the flow forks out to your services.

Example for Nginx:

location / {
    proxy_pass http://my-microservice;
    proxy_set_header X-Request-Id $request_id;
}

See the Nginx documentation for more information about $request_id.

That identifier has to be propagated to every downstream service in the flow.
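As a sketch of that propagation, a service can reuse the correlation id it received from the gateway (or mint one as a fallback) and attach it to every outgoing call. The header name matches the Nginx example above; the function names are illustrative.

```python
import uuid

# Hypothetical middleware sketch: read the gateway-assigned request id from
# the incoming headers, or generate one as a fallback, and forward it on
# every outgoing call so the whole flow shares one correlation id.
HEADER = "X-Request-Id"

def get_correlation_id(incoming_headers):
    """Return the existing correlation id, or mint a new one."""
    return incoming_headers.get(HEADER) or str(uuid.uuid4())

def outgoing_headers(incoming_headers):
    """Headers to attach to downstream requests."""
    return {HEADER: get_correlation_id(incoming_headers)}

# The id set by Nginx survives the hop to the next service.
incoming = {"X-Request-Id": "a1b2c3"}
print(outgoing_headers(incoming))  # {'X-Request-Id': 'a1b2c3'}
```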

A Common Pattern

The second tip is the logging pattern. Whenever you ingest your logs, you have to be sure about the pattern: if every service uses a different pattern or format, the data becomes much harder to ingest.

At a minimum, logs should include the following information:

Correlation id, service name/id, IP address, message-received time in UTC, and severity.

Optionally, depending on your purpose, you can add the time taken, method name, call stack, etc.
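A minimal sketch of such a record, with the fields listed above (field names are illustrative, not a standard):

```python
import json
import socket
from datetime import datetime, timezone

def make_log_entry(correlation_id, service, severity, message):
    """Build a log record carrying the minimal common fields."""
    try:
        ip = socket.gethostbyname(socket.gethostname())
    except OSError:
        ip = "127.0.0.1"  # fallback when the hostname does not resolve
    return {
        "correlation_id": correlation_id,
        "service": service,
        "ip": ip,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "message": message,
    }

entry = make_log_entry("a1b2c3", "orders-service", "ERROR", "payment timed out")
print(json.dumps(entry))
```

Emitting every entry as one JSON object per line keeps the pattern identical across services, which is exactly what makes ingestion simple later.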

Also, depending on the use case, you have to decide on a log rotation strategy. At the end of the day, storing logs costs money, so at some point you have to purge lower-severity logs. For example, you can define retention rules by severity, such as “Trace logs are kept for about 1 week” and “Exception logs are kept for about 1 month.”
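Such severity-based retention can be sketched as a simple lookup plus an expiry check; the retention windows below are just the illustrative values from the rule above.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention rules: lower-severity logs expire sooner.
RETENTION_DAYS = {"TRACE": 7, "DEBUG": 7, "INFO": 14, "ERROR": 30}

def is_expired(severity, written_at, now=None):
    """True when a log entry is older than its severity's retention window."""
    now = now or datetime.now(timezone.utc)
    days = RETENTION_DAYS.get(severity, 30)  # default to the longest window
    return now - written_at > timedelta(days=days)

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
old_trace = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(is_expired("TRACE", old_trace, now))  # True: 21 days > 7-day window
print(is_expired("ERROR", old_trace, now))  # False: 21 days < 30-day window
```

A cleanup job would run a check like this over stored entries and delete the expired ones.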

Centralized storage

To reduce management complexity, avoid using a separate store for every instance. There are many ways to use shared storage: for example, you can send logs to a storage instance on the fly via HTTP or a socket, or you can write to files with small rotations and upload them to storage via FTP/rsync, and so on.
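The on-the-fly variant can be as simple as sending newline-delimited JSON over a TCP connection to a central collector. The sketch below demonstrates the idea against a throwaway local listener; a real setup would point at a shared log host instead.

```python
import json
import socket
import threading

def ship_log(entry, host, port):
    """Send one JSON-encoded, newline-delimited log entry to a collector."""
    line = (json.dumps(entry) + "\n").encode()
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line)

# Demo: a throwaway in-process "collector" standing in for shared storage.
received = []
server = socket.create_server(("127.0.0.1", 0))
host, port = server.getsockname()

def collector():
    conn, _ = server.accept()
    with conn:
        received.append(conn.makefile().readline())

t = threading.Thread(target=collector)
t.start()
ship_log({"severity": "INFO", "message": "hello"}, host, port)
t.join()
server.close()
print(received[0].strip())
```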

Search

Alright, let’s assume we have correlated the log entries, defined a logging pattern, and sent everything to storage. Now, how is the team going to use the logs to investigate a problem? If your system lives on AWS, CloudWatch, Athena, and S3 will save your life. Alternatively, Papertrail or the ELK stack would also work.

Let’s try to build our log infrastructure with the ELK stack and some AWS services. For monitoring, harvesting, and visualizing purposes, the ELK stack is an all-in-one solution.

Let’s explain the diagram above part by part.
For common metrics, Elastic Beats are useful: for instance, collecting memory/CPU consumption or running service health checks.
We can assume every service generates its own logs, within its own flow, carrying the correlation id. Those logs are streamed to one shared instance. To handle high traffic, we need a buffer mechanism so we don’t create a bottleneck on the Elasticsearch side. AWS Kinesis solves that problem: it basically works as a data-streaming gateway. In that service we create a new data stream, and then a delivery stream for sending logs to Elasticsearch. For data transformation, we can use a Lambda function to convert each raw line into JSON.
Metricbeat monitors resource consumption for every single instance.
Auditbeat performs general audits over our services and sends those logs to Elasticsearch.
Heartbeat checks the uptime of our services and sends those logs to Elasticsearch.
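The raw-line-to-JSON transformation step mentioned above can be sketched as a Kinesis Data Firehose transformation Lambda. The event and return shape follow the Firehose data-transformation contract; the field names inside the produced JSON document are illustrative.

```python
import base64
import json

def handler(event, context):
    """Firehose transformation sketch: wrap each raw log line in a JSON object."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8").strip()
        doc = {"message": raw}  # a real transform would parse fields out of raw
        payload = (json.dumps(doc) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}

# Local smoke test with one fake record:
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"ERROR payment timed out").decode()}]}
print(handler(event, None))
```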

In conclusion, when we try to detect a problem, there are many more correlation points. For example, when a request fails, you can find the related requests across all services; after locating the problematic service, you can see its monitoring, audit, and uptime information.
