Centralizing logs at Naukri.com with Kafka and ELK stack

Logs are an important part of any system, as they give deep insight into what is happening within it. They also help in figuring out what went wrong when something unexpected happens. Most applications generate logs in one form or another, and they are generally written to files on the local disk. A web application consists of various components, and each of them generates logs. A few of them are mentioned below.

  • Access logs from web servers like Nginx, Apache
  • Logs from back-end applications (Java, PHP, Python, etc.)
  • Logs from databases, cache systems, queues, etc.
  • System Logs from the server

This blog is about how we centralized our logs at Naukri.com using Filebeat, Kafka, Logstash, Elasticsearch and Kibana.

Why Centralized Logging?

When we build a large system comprising multiple applications, each application or component is deployed on different servers, and even a single application is deployed on multiple servers to cater to varying traffic and performance requirements. In such systems, the applications are generally interconnected, and it becomes increasingly difficult to access all the logs and correlate them when things go wrong. It also leads to under-utilization of the rich information that logs could provide. Moreover, it compounds the risk of compromising security, as the person debugging the system has to be given access to all the servers.

Architectural Approach

Any centralized logging approach needs the following components: collection, transport, processing/enrichment, storage, analysis and alerting. There are tools already available for each of them, but orchestration is required to ensure that they are connected seamlessly. An in-house tool, AURA, is used for alerting, log correlation and RCA. A detailed look at AURA as an alert management and correlation system is a topic for another day.

Initial Approach:

We started off by processing the logs at the source, with a custom agent placed on each server pushing the processed logs to a central store powered by an Elasticsearch cluster. Kibana was used for visualization and analysis of the logs.

This approach had the following advantages:

  • Ability to filter out unwanted logs or pre-aggregate at the source itself, thereby reducing the overall data transfer.
  • Log processing at the source means a lower resource burden at the central location.

While this worked fine, we soon started facing challenges:

  • Changing the processing logic was cumbersome, as the change had to be propagated to all the servers.
  • Spikes in log production left the Elasticsearch cluster unable to handle the load, leading to intermittent dropping of logs.
  • A central temporary log retention layer became necessary to minimize data loss.

Final Approach:

In order to have a more resilient, easier-to-maintain and efficient system, we changed the stack to do minimal work at the source and all processing centrally. A Kafka layer was also introduced to hold the logs temporarily, to handle spikes and unexpected storage failures.
Each server has one or more applications writing logs to the local disk. A Filebeat agent on each server, configured with multiple prospectors, picks up logs from different applications like Nginx, Apache, MySQL and the web applications, and pushes them to a central Kafka cluster.

Sample Filebeat Config to get prospectors from a folder and output to a Kafka cluster:
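A minimal sketch of such a filebeat.yml, assuming a Filebeat 6.x-style setup (the broker addresses, paths and the `log_topic` field name are hypothetical, not the actual production values):

```yaml
# filebeat.yml (sketch; hosts, paths and field names are hypothetical)
filebeat.config.prospectors:
  enabled: true
  # Pick up prospector definitions from a folder and reload on change
  path: /etc/filebeat/prospectors.d/*.yml
  reload.enabled: true
  reload.period: 10s

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  # Route each event to the Kafka topic named in its custom log_topic field
  topic: '%{[fields.log_topic]}'
  partition.round_robin:
    reachable_only: true
  compression: gzip
  required_acks: 1
```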

Sample Filebeat prospector to collect logs from Nginx, application logs and MySQL slow logs:
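Along the same lines, a sketch of a prospector file; the paths, multiline patterns and topic names are illustrative:

```yaml
# prospectors.d/logs.yml (sketch; paths and topic names are illustrative)
- type: log
  paths:
    - /var/log/nginx/access.log
  fields:
    log_topic: nginx-access

- type: log
  paths:
    - /var/log/webapp/*.log
  # Fold continuation lines (e.g. stack traces) into the preceding event
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
  fields:
    log_topic: app-logs

- type: log
  paths:
    - /var/log/mysql/mysql-slow.log
  # A MySQL slow-log entry starts with a "# Time:" header line
  multiline.pattern: '^# Time:'
  multiline.negate: true
  multiline.match: after
  fields:
    log_topic: mysql-slow
```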

Each prospector is provided with additional fields to segregate different types of logs into different Kafka topics.
Spring-based microservices also push logs directly to Kafka using the Logback Kafka appender.
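One open-source option for this (the exact appender used is not specified in the post) is logback-kafka-appender; a minimal logback.xml sketch, with hypothetical topic and broker names:

```xml
<!-- logback.xml (sketch; topic name and broker list are hypothetical) -->
<configuration>
  <appender name="KAFKA" class="com.github.danielwegener.logback.kafka.KafkaAppender">
    <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
      <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %msg%n</pattern>
    </encoder>
    <topic>app-logs</topic>
    <producerConfig>bootstrap.servers=kafka1:9092,kafka2:9092</producerConfig>
    <!-- Do not block application threads if Kafka is unavailable -->
    <producerConfig>max.block.ms=0</producerConfig>
  </appender>
  <root level="INFO">
    <appender-ref ref="KAFKA"/>
  </root>
</configuration>
```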

Multiple Logstash instances consume from the Kafka cluster, segregate the logs, enrich them with more information and push them to a central Elasticsearch cluster. For Nginx logs, Logstash adds geolocation, user-agent details and URL summarization. It also takes care of segregating data into different date-rotated Elasticsearch indices for the individual applications.
Sample Logstash parser config:
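A sketch of such a pipeline; the topic names, field names, hosts and index naming scheme are hypothetical:

```
# logstash.conf (sketch; topics, hosts and index naming are hypothetical)
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["nginx-access", "app-logs", "mysql-slow"]
    codec => "json"
  }
}

filter {
  if [fields][log_topic] == "nginx-access" {
    # Parse the combined log format into discrete fields
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
    }
    # Enrichment: geolocation and user-agent details
    geoip { source => "clientip" }
    useragent { source => "agent" target => "ua" }
  }
}

output {
  elasticsearch {
    hosts => ["es1:9200"]
    # One date-rotated index per application/log type
    index => "%{[fields][log_topic]}-%{+YYYY.MM.dd}"
  }
}
```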

The above process converts a raw Nginx access log line into a structured document enriched with additional information.
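For illustration (all values below are hypothetical, not actual production data), a combined-format line and an abridged version of the resulting document:

```
# Raw Nginx access log line (hypothetical values):
203.0.113.5 - - [07/May/2018:10:05:01 +0530] "GET /jobs-in-delhi HTTP/1.1" 200 5321 "https://www.naukri.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/66.0 Safari/537.36"

# After Logstash parsing and enrichment (abridged):
{
  "clientip": "203.0.113.5",
  "verb": "GET",
  "request": "/jobs-in-delhi",
  "response": 200,
  "bytes": 5321,
  "referrer": "https://www.naukri.com/",
  "geoip": { "country_name": "India", "city_name": "New Delhi" },
  "ua": { "name": "Chrome", "os": "Windows" }
}
```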

The Elasticsearch cluster is provided with appropriate index templates to store data for the different types of logs. Each index is date-rotated to allow easy curation and faster queries for alerting.
Sample index template for nginx logs:
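A sketch of such a template (pre-6.x template syntax; the index pattern, shard counts and field names are illustrative):

```json
{
  "template": "nginx-access-*",
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1,
    "index.refresh_interval": "10s"
  },
  "mappings": {
    "doc": {
      "properties": {
        "clientip": { "type": "ip" },
        "verb":     { "type": "keyword" },
        "request":  { "type": "keyword" },
        "response": { "type": "short" },
        "bytes":    { "type": "long" },
        "geoip": {
          "properties": {
            "country_name": { "type": "keyword" },
            "location":     { "type": "geo_point" }
          }
        },
        "ua": {
          "properties": {
            "name": { "type": "keyword" },
            "os":   { "type": "keyword" }
          }
        }
      }
    }
  }
}
```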

Kibana is used for visualization and analysis of the logs. An in-house tool, AURA, uses the data in Elasticsearch for alerting, metrics reporting, log correlation and RCA.

Hardware for the setup

Kafka Cluster: 3 * (4 core CPU, 16GB RAM and 300GB SSD)
Elasticsearch cluster: 4 * (24 core CPU, 64GB RAM and 2TB SSD)
Logstash: 8 * (4 core CPU, 16GB RAM and 100GB SSD)
AURA & Kibana: 1 * (4 core CPU, 16GB RAM and 100GB SSD)

Some Metrics:

No. of log lines consumed per day: ~1,600,000,000
Peak average message rate on Kafka: ~35,000 messages per second
Average delay from log generation to display in Kibana: 3 sec
Elasticsearch cluster size: 4TB

Index data rollover and Curation

With logs for each day consuming about 500GB of storage, retaining them for a longer period became a costly proposition. Moreover, general application logs are not needed for a long duration. For access logs, however, we wanted to retain the information about request trends, error trends and response time patterns for a longer period of time. The exact logs are kept for just 3 days to allow debugging with full details. So a custom curator system was put in place to aggregate the logs. The curator does the following:

  • Aggregates and retains information about the total requests made for every URL, response code, percentile response times, request medium (desktop/mobile/bots) and application server.
  • Aggregation is done for every 3-hour period on the access log indices.
  • Runs on indices older than 3 days, reducing their storage size from ~500GB to ~5GB while still retaining trend information, thereby allowing performance, error and traffic data comparison over a much longer duration.
  • Deletes other detailed application log indices older than 3 days.
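The per-index retention decision behind these steps is simple to sketch. Assuming a hypothetical date-suffixed index naming convention such as `nginx-access-2018.05.07` (the naming scheme and function below are illustrative, not the actual curator code):

```python
from datetime import date, datetime, timedelta

def curation_action(index_name, today, retention_days=3):
    """Decide what the curator should do with a date-rotated index.

    Index names are assumed to look like 'nginx-access-2018.05.07'
    (hypothetical naming convention).
    """
    name, _, suffix = index_name.rpartition("-")
    index_date = datetime.strptime(suffix, "%Y.%m.%d").date()
    if today - index_date <= timedelta(days=retention_days):
        return "keep"        # recent: keep full detail for debugging
    if name.endswith("access"):
        return "aggregate"   # access logs: roll up into 3-hour summaries
    return "delete"          # other application logs: drop entirely
```

A real curator would then issue the corresponding aggregation queries and index-deletion calls against Elasticsearch for each index.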

Conclusion

The complete system has enabled log centralization at reduced cost, while also providing a mechanism for debugging.

  • The total traffic trend and performance of every application is tracked.
  • Any anomaly in traffic trends or errors is reported within a minute.
  • Traffic distribution to the different host servers is also tracked.
  • Geo-location information for every application is also provided.
  • Application logs are correlated with the actual requests, providing faster debugging.

A related blog about how we tuned the Elasticsearch Garbage collection for this centralized logging can be found at Garbage Collection in Elasticsearch and the G1GC.

Follow us on our official blog at Naukri Engineering.