Log Analytics — How to Log Smart — Elastic Stack
Every system generates logs whenever an event occurs, and we can control their frequency, format, and level of detail. To make use of all this information and achieve the desired results, you need a supporting system that can:
- collect all the data
- extract and format the relevant information out of it
- insert it into a big data system
- fetch and consume the collected information quickly and easily
- analyze it in real time
What is a Log?
A log is recorded information about an event, and it can come in any kind or format. It mainly consists of two things:
Log = TimeStamp + Data
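As a minimal sketch of the "Log = TimeStamp + Data" idea, the Python snippet below splits a single log line into those two parts. The line layout (an ISO 8601 timestamp, one space, then free-form data) and the function name split_log_line are assumptions made for illustration; real logs vary widely.

```python
from datetime import datetime


def split_log_line(line):
    """Split a log line into (timestamp, data).

    Assumed layout: ISO 8601 timestamp, one space, then free-form data.
    """
    raw_timestamp, _, data = line.partition(" ")
    return datetime.fromisoformat(raw_timestamp), data


# Hypothetical sample line, not taken from a real system.
timestamp, data = split_log_line(
    "2024-01-15T10:32:07+00:00 user=42 action=login status=ok"
)
print(timestamp.isoformat())
print(data)
```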
Types of Logs
Each system or application may have multiple sources of information, but almost every system has at least the following:
Application Error Logs — These logs are generated every time a handled or unhandled exception or error condition occurs in the application. They usually give low-level, detailed information on why, when, and where the exception happened.
Business Error Logs — These logs are also generated by the application, but they are not exceptions. Whenever an expected business behavior fails or is compromised in your application, you can log that event.
Example: A user enters an invalid password when logging in to the system. This is not an exception but a business error. Some developers may argue that it is business information rather than an error. It doesn’t matter, as long as you are recording and using this information.
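One possible way to record such a business error is with Python's standard logging module, sketched below. The logger name "business", the event names, and the in-memory VALID_PASSWORDS table are all hypothetical stand-ins for a real authentication check; the point is only that a failed login is logged as an event, not raised as an exception.

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
business_log = logging.getLogger("business")

# Hypothetical stand-in for a real credential check (demo only).
VALID_PASSWORDS = {"alice": "s3cret"}


def login(username, password):
    """Return True on success; log a business error on a wrong password."""
    if VALID_PASSWORDS.get(username) != password:
        # Not an exception: an expected business rule simply failed.
        business_log.warning(
            "login_failed user=%s reason=invalid_password", username
        )
        return False
    business_log.info("login_succeeded user=%s", username)
    return True


failed = login("alice", "wrong-password")
succeeded = login("alice", "s3cret")
```

Whether the failed attempt is logged at WARNING or INFO level is the same judgment call as the "business error versus business information" debate; what matters is that the event is captured.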
Application Information Logs — Any type of information related to the application that can be used to improve and optimize the system.
Example: In a banking system, the application owner may log every user event: all pages visited by a user after logging in, the order of page visits, the number of clicks, the IP address or location of the login, and the type of device used to access the system.
Example: We can capture and log how new users came to our website, for example via Google, Facebook, or Twitter.
Business Information Logs — This information is related to business operations and can help in understanding user behavior. Capturing critical events or milestones can help with predictive analysis. You can identify and analyze patterns in historical and transactional data, and use these patterns to identify potential opportunities as well as future risks.
Network Logs — What you capture here depends on the skills of the team and the architecture of the application. It may consist of:
- If you have multiple domains, you can capture and log the count and location of visits to each domain
- Log all API calls/requests
- Web server internal logs (Nginx, Apache, etc.)
- Load Balancer internal logs
- Health Check logs of network services
System/Server Logs — These are the logs from hardware and core services:
- Metrics of all servers that give real-time information on:
  - Resources used and available
  - CPU usage
  - Memory usage
- Internal logs of database servers:
  - SQL Server
  - MongoDB
  - PostgreSQL
- Health check logs of all servers
- Number of requests handled by each server
- Linux server logs
- Windows application and security logs
- AWS CloudWatch logs or similar from Azure/GCP
Challenges with Log Management
The volume of logs can range from a few to several million every day. Logs are data-rich and can be used in a wide variety of use cases. However, they come with their own set of challenges:
Format is not consistent — Every system, server, and piece of software involved in an application may generate logs in a different format. Understanding these formats and the information hidden in the logs is difficult and requires expertise, especially for logs that are not generated by your own application code.
Logs are Decentralized — A typical mid-size web application may consist of a few API servers, a few database servers, web servers, and load balancers. All of them generate logs in different formats and store them in separate physical locations or on local storage. To study them together and extract meaningful information, you need a centralized log management system.
Time format and zone is not consistent — With the advent of cloud and distributed computing, the resources supporting an application are spread across multiple physical locations and time zones.
Some systems and servers generate logs in their local time zone and format, while others use UTC and a universal date-time format. Correlating all these logs across multiple systems can be a daunting task.
Logs are Unstructured — Log data is unstructured, which makes it even more difficult to analyze directly. It is therefore very important to perform low-level processing and transformation that converts raw log data into a structure that is easy to store and fetch, and that is efficient in terms of the storage space and processing power it consumes.
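A common way to make such logs comparable is to normalize every timestamp to UTC before correlating them. The sketch below assumes that the format string and UTC offset of each source are already known; the to_utc helper and both sample timestamps are hypothetical. Fixed offsets ignore daylight saving time, so named zones (for example via Python's zoneinfo module) would be a better fit for sources that observe it.

```python
from datetime import datetime, timedelta, timezone


def to_utc(raw, fmt, utc_offset_minutes):
    """Parse a local timestamp and convert it to UTC.

    `fmt` and `utc_offset_minutes` describe the source system and are
    assumed to be known in advance.
    """
    source_tz = timezone(timedelta(minutes=utc_offset_minutes))
    local = datetime.strptime(raw, fmt).replace(tzinfo=source_tz)
    return local.astimezone(timezone.utc)


# Two hypothetical servers logging the same instant in different ways:
# one in local time at UTC+05:30, the other already in UTC.
from_local_server = to_utc("15/01/2024 16:02:07", "%d/%m/%Y %H:%M:%S", 330)
from_utc_server = to_utc("2024-01-15 10:32:07", "%Y-%m-%d %H:%M:%S", 0)

print(from_local_server.isoformat())
print(from_local_server == from_utc_server)
```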
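To illustrate the kind of transformation described above, here is a small sketch that turns one raw text line into a structured record using a regular expression. The line layout, the field names, and the parse_line helper are assumptions chosen for the example and do not correspond to any particular server's real log format.

```python
import json
import re

# Assumed layout: ip [timestamp] "METHOD path" status bytes
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so they can be filtered and aggregated later.
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record


# Hypothetical sample line.
raw = '203.0.113.7 [2024-01-15T10:32:07+00:00] "GET /api/orders" 500 1234'
record = parse_line(raw)
print(json.dumps(record))
```

Returning None for lines that do not match keeps malformed input from stopping the pipeline; such lines can be counted or stored separately for later inspection.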
Application of Logs
Logs contain rich information about the following areas and can be used in several ways:
- State and behavior of a system
- Behavior of the users
- Ecosystem it is running inside
- Predictive analytics
- Frequency of events
- Troubleshooting
- Auditing
- Health Check, Alerts and Notifications
Practical Application scenarios
Use-Case — Capture and study user behavior, time spent on the website, the physical location of the user, and the device type used. Using this information, the business can make the following decisions:
- Whether to invest more time and resources in the iOS/Android app or the desktop website
- The marketing team can use this information to study trends and run focused marketing campaigns
Use-Case — Analyze spikes in the CPU and memory usage of the servers. This can help in making the following decisions:
- Use IaC (Infrastructure as Code) to upgrade or downgrade cloud servers to save cost while providing a smooth user experience.
- Review the applications and services running on the server that shows the spike. They may need improvements in code or configuration.
- The spike may be caused by a high number of online users or heavy traffic.
Use-Case — Keep a real-time health check by integrating a monitoring and notification system.
- Send notifications to Slack, a cellphone, email, etc., whenever there is a spike in error logs. In my project, we send a notification whenever there are more than 10 error logs in the last 5 minutes.
- Send a notification if any service or server goes down.
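The "more than 10 error logs in the last 5 minutes" rule can be sketched as a sliding-window counter. This is only an illustration of the rule itself, not how any particular alerting tool implements it; the ErrorSpikeDetector class is hypothetical, and it assumes error timestamps arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorSpikeDetector:
    """Flag a spike when more than `threshold` errors fall inside `window`."""

    def __init__(self, threshold=10, window=timedelta(minutes=5)):
        self.threshold = threshold
        self.window = window
        self._timestamps = deque()

    def record_error(self, when):
        """Record one error and return True if a notification is due."""
        self._timestamps.append(when)
        cutoff = when - self.window
        # Drop errors that are no longer inside the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


detector = ErrorSpikeDetector()
start = datetime(2024, 1, 15, 10, 0, 0)
# Eleven errors, ten seconds apart: only the eleventh crosses the threshold.
results = [
    detector.record_error(start + timedelta(seconds=10 * i)) for i in range(11)
]
# A single error ten minutes later falls into a fresh, nearly empty window.
after_quiet_period = detector.record_error(start + timedelta(minutes=10))
```

In a real setup, a True result would trigger the Slack, phone, or email notification instead of just being returned.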
Use-Case — Tracking login events
- Create alerts and notifications when there are too many unsuccessful login attempts. This may indicate an organized attack on the system.
- Too many logins from system users can indicate a problem with their active sessions.
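As a rough sketch of the first check, failed login events can be counted per source and compared against a limit. The event dictionaries, the "login_failed" event name, and the limit of 5 are assumptions for illustration; a real system would also restrict the count to a time window.

```python
from collections import Counter


def suspicious_sources(events, max_failures=5):
    """Return {ip: failure_count} for sources exceeding `max_failures`."""
    failures = Counter(
        event["ip"] for event in events if event["event"] == "login_failed"
    )
    return {ip: count for ip, count in failures.items() if count > max_failures}


# Hypothetical events: six failures from one address, one from another.
events = (
    [{"ip": "198.51.100.9", "event": "login_failed"}] * 6
    + [{"ip": "203.0.113.7", "event": "login_failed"}]
    + [{"ip": "203.0.113.7", "event": "login_succeeded"}]
)
flagged = suspicious_sources(events)
print(flagged)
```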
Next Article in this series
I will explain in detail how I am using the Elastic Stack along with Redis/Kafka/RabbitMQ, Grafana, Nagios, PagerDuty, OpsGenie, and Slack to achieve everything we discussed above.
Please don’t forget to clap if you like the article.