Logs are a critical part of any system. They give you deep insight into what your application is doing and, when something goes wrong, what caused the error. Virtually every system generates logs in some form, and these logs are usually written to files on local disks. When you’re building an enterprise-level application, your system runs on multiple hosts, and managing logs across all of them can be complicated. Debugging an error across hundreds of log files on hundreds of servers is time consuming. A common approach to this problem is to build a centralized logging application that collects and aggregates different types of logs in one central location.
There are many tools available, and each solves some part of the problem, but we need to combine them to build a robust application.
There are four main parts in a centralized logging application: collecting, transporting, storing and analysing logs, with alerting as a useful addition on top. We are going to look at each of these parts in depth and see how we can build an application.
Applications create logs in different ways: some log through syslog and others write directly to files. On a typical web application running on a Linux server, there will be a dozen or more log files in /var/log, plus a few application-specific logs in home directories and other locations. In short, different applications generate logs in different places.
Now, suppose you have a web application running on a server and something goes down. Your developers or operations team need to access log data quickly in order to troubleshoot live issues, so you need a solution that can monitor changes in the log files in almost real time. A simple way to start centralizing logs is the replication approach:
- Replication Approach:
In the replication approach, files are replicated to a central server on a fixed schedule. You set up a cron job that copies the log files from each Linux server to your central server. Even a one-minute cron job might not be fast enough for troubleshooting: when your site is down, you will be waiting for the relevant log data to be replicated.
The replication approach works well for analytics: if you need to analyze log data offline, for calculating metrics or other batch work, it might be a good fit.
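As a rough illustration of the replication approach, the sketch below copies log files that changed since the last run into a central directory. In practice this job is usually done by a tool such as rsync triggered from cron. The function name `replicate_logs`, the directory layout and the `*.log` pattern are assumptions made for this example, and the "central server" is just a local directory here.

```python
import shutil
from pathlib import Path
from typing import List


def replicate_logs(source_dir: str, central_dir: str, host: str) -> List[str]:
    """Copy new or changed *.log files into central_dir/host/.

    Hypothetical helper, meant to be run on a schedule (for example from cron).
    Returns the names of the files that were copied on this run.
    """
    target = Path(central_dir) / host
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(source_dir).glob("*.log")):
        dest = target / src.name
        if dest.exists():
            s, d = src.stat(), dest.stat()
            # Skip files that look unchanged: same size and not modified
            # more recently (compared at one-second resolution).
            if s.st_size == d.st_size and int(s.st_mtime) <= int(d.st_mtime):
                continue
        shutil.copy2(src, dest)  # copy2 also copies the modification time
        copied.append(src.name)
    return copied
```

Each run only sees what was on disk at that moment, which is why the delay between runs becomes the delay in your troubleshooting data.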
If you have multiple hosts running, log data can accumulate quickly. There should be an efficient and reliable way to transport this data to the centralized application and ensure that no data is lost.
There are many frameworks available to transport log data. One way is to plug input sources directly into a framework, which then starts collecting logs. Another way is to send log data via an API: application code is written to log directly to the transport system, which reduces latency and improves reliability.
If you want to provide a number of input sources you can use:
- Logstash: open source log collector, written in Ruby (runs on JRuby)
- Flume: open source log collector, written in Java
- Fluentd: open source log collector, written in Ruby
These frameworks not only provide input sources but also natively support tailing files and transporting their contents reliably. They are a better fit for general-purpose use.
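To make "tailing a file" concrete, here is a minimal sketch of the idea: remember the byte offset that has already been shipped and, on each poll, read only the complete lines appended since then. This illustrates the concept only and is not how Logstash, Flume or Fluentd implement it. The class name `LogTailer` is hypothetical, and log rotation and truncation are deliberately ignored.

```python
from typing import List


class LogTailer:
    """Return lines appended to a file since the previous poll."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offset = 0  # byte position of the first unread byte

    def poll(self) -> List[str]:
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Only consume up to the last newline, so a line that is still
        # being written is picked up whole on a later poll.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return data[:end].decode("utf-8", errors="replace").splitlines()
```

A shipper would call `poll()` in a loop and forward the returned lines. A real collector would also persist the offset so that a restart does not resend or skip data.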
To log data via APIs, which is generally the preferred way to send logs to a central application, the following frameworks can be used:
- Scribe: open source software by Facebook, written in C++
- nsq: open source, written in Go
- Kafka: widely used open source software from Apache, written in Scala and Java
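To illustrate logging via an API, the sketch below uses Python's standard `logging` module with a custom handler that serializes each record to JSON and hands it to a transport callable. The `send` callable stands in for a real client (for example a Kafka or nsq producer). `TransportHandler` and the JSON field names are assumptions made for this example.

```python
import json
import logging
import socket
from typing import Callable, List


class TransportHandler(logging.Handler):
    """Forward each log record as a JSON string to a transport callable."""

    def __init__(self, send: Callable[[str], None], service: str) -> None:
        super().__init__()
        self.send = send  # hypothetical stand-in for a producer client
        self.service = service
        self.host = socket.gethostname()

    def emit(self, record: logging.LogRecord) -> None:
        event = {
            "timestamp": record.created,  # seconds since the epoch
            "level": record.levelname,
            "service": self.service,
            "host": self.host,
            "message": record.getMessage(),
        }
        try:
            self.send(json.dumps(event))
        except Exception:
            # A transport failure should not crash the application.
            self.handleError(record)


# Usage: the "transport" here is just a list, so the example runs anywhere.
outbox: List[str] = []
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.propagate = False  # do not also pass records to the root logger
logger.addHandler(TransportHandler(outbox.append, service="checkout"))
logger.info("order %s placed", 42)
```

Because the application emits structured events itself, nothing has to wait for a file to be written and then tailed, which is where the latency benefit comes from.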
So that covers transport. Now let’s see what would be an efficient way to store such a large amount of log data.
Now that we have transport in place, the logs need a destination: a storage system where all the log data will be saved. The system should be highly scalable, because the data will keep growing and the storage has to handle that growth over time. The volume of log data depends on how large your application is; if it runs on multiple servers or in many containers, it will generate more logs.
There are a few things we need to keep in mind when choosing the storage.
- Time (how long should the logs be stored?): The choice of storage system depends on how long you would like to keep your data. If the logs are kept for the long term and do not require immediate analysis, they can be archived on S3 or AWS Glacier, which offer a relatively low cost for large amounts of data. If you only need a few days or months of logs, you can use a distributed storage system such as Cassandra, MongoDB, HDFS or Elasticsearch. And lastly, if you want to store just a few hours of data, you can use Redis.
- Volume (how large will your data be?): Google and Facebook generate far more data in a day than a simple Node.js application does in a week. The storage system you choose should scale horizontally as your data grows.
- Access (how will you access the logs?): The storage system you choose also depends on how you access the logs. Some storage systems are not suitable for real-time analysis; for example, AWS Glacier can take hours to retrieve a file. AWS Glacier or tape backup won’t work if you need to access data for troubleshooting. Elasticsearch or HDFS is a good choice for interactive data analysis and for working with raw data more effectively.
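The three questions above can be read as a simple decision rule. The sketch below encodes it as a function, purely as an illustration: the function name `suggest_storage` and the cut-offs (one day, 180 days) are assumptions chosen for the example, not fixed rules, and volume is left out because every option listed is expected to scale.

```python
def suggest_storage(retention_days: float, interactive: bool) -> str:
    """Map retention and access needs to one of the storage tiers above.

    The cut-offs (1 day, 180 days) are illustrative assumptions.
    """
    if retention_days < 1:
        # Only a few hours of data: an in-memory store is enough.
        return "in-memory store (e.g. Redis)"
    if retention_days <= 180 or interactive:
        # Days to months, or anything that must be queried interactively.
        return "distributed store (e.g. Elasticsearch, Cassandra, MongoDB, HDFS)"
    # Long-term data that nobody needs to query quickly.
    return "archive (e.g. S3 or AWS Glacier)"
```

Note that the access question overrides retention here: long-lived logs that still need interactive queries stay out of the archive tier.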
Logs are meant for analysis and analytics. Once your logs are stored in a centralized location, you need a way to analyze them. There are many tools available for log analysis. If you need a UI, you can parse the data, index it in Elasticsearch, and use Kibana or Graylog2 to query and inspect it. Grafana and Kibana can be used to show real-time analytics.
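Before logs can be queried in a tool such as Kibana, raw lines are usually parsed into structured fields. The sketch below assumes a hypothetical line format of `<timestamp> <LEVEL> <service> <message>` and counts events per level, the kind of aggregation a dashboard would show. The format, the function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed line format: "<timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_level(lines: List[str]) -> Counter:
    """Count parsed events per log level, skipping unparseable lines."""
    events = (parse_line(line) for line in lines)
    return Counter(e["level"] for e in events if e is not None)


sample = [
    "2024-01-01T10:00:00Z INFO web request served",
    "2024-01-01T10:00:01Z ERROR web database timeout",
    "2024-01-01T10:00:02Z ERROR worker database timeout",
    "not a log line",
]
levels = count_by_level(sample)
```

Once events are structured like this, the same fields (`level`, `service`, `timestamp`) are what you would filter and group on in a query UI.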
Alerting is the last component in the centralized logging application. It’s nice to have an alerting system that notifies us of any change in log patterns or calculated metrics.
Logs are very useful for troubleshooting errors. It’s far better to have alerting built into the logging system, so that it sends an email or notifies us, than to have someone keep watching the logs for changes. There are many error reporting tools available; you can use Sentry or Honeybadger. These aggregate repetitive exceptions, which gives you an idea of how frequently an error is happening.
Alerting is also useful for monitoring hundreds of servers. Logs report the status of different applications, and you can set up an alert system to check whether your system is up or down. Alerting is really useful for error troubleshooting, monitoring and threshold reporting. Riemann is very good software for monitoring and alerting.
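Threshold reporting of the kind described above can be sketched as a sliding window over error events: if more than a given number of errors arrive within the window, a notification is triggered. The class name `ErrorRateAlert`, the thresholds and the `notify` callable (a stand-in for sending an email or a chat message) are assumptions for this example and are not part of any of the tools mentioned.

```python
from collections import deque
from typing import Callable


class ErrorRateAlert:
    """Call notify when more than max_errors occur within window_seconds.

    Timestamps are expected in non-decreasing order (seconds).
    """

    def __init__(self, max_errors: int, window_seconds: float,
                 notify: Callable[[str], None]) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.notify = notify
        self.timestamps: deque = deque()

    def record_error(self, timestamp: float) -> bool:
        """Register one error event; return True if an alert was sent."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_errors:
            self.notify(f"{len(self.timestamps)} errors in the last "
                        f"{self.window_seconds:g}s")
            self.timestamps.clear()  # start over to avoid repeated alerts
            return True
        return False
```

Clearing the window after an alert is one simple way to avoid a flood of notifications for the same incident; a real system would more likely use a cool-down period or group alerts by error type.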
So in Part 1, we talked about the available software and the components we need to build a centralized logging application. In Part 2, we will start building our application, beginning with transport: we will see how to set up the transport component for a simple Node.js application that sends logs to a central system.
If you liked the article, don’t forget to show some love and follow me to receive the updates on Part 2 of this series.
And with that, I will end this article. I am open to suggestions and feedback on the technical details of the blog post. As always, I’m looking to work on amazing projects. If you are working on something interesting, let’s talk! You can comment here to share what you think. Stay tuned for Part 2 :)