I have been dealing with logs and log analysis for a while now and have encountered quite a lot of confusion around the topic. Not only at the C-suite level or in other high-ranking non-technical meetings, but even among technical personnel who are not that familiar with the different aspects of logs. This inspired me to write a short overview.
When talking about logging, there are five different aspects to consider.
1. Event logging
The reasons for logging and log analysis vary, from the need to deal with system performance to financial fraud or insider threats. Whatever the reason, it is important to actually log something in your systems so that you have something to analyse. At the end of the day, it all boils down to the quality of the log events.
Deciding what to log and where to log it is quite tricky: you can’t log everything (due to load issues, analysis problems, compliance issues etc.), but you still need enough to reach your goal. For in-house development it is an additional overhead to develop loggers and to log meaningful events. For out-of-the-box products there is always the question of log formats, logging detail etc. In a nutshell, it is important to define your requirements for event logging and to actually make sure that log events are created when actions occur in the system. Without this step there is nothing to analyse. Or, to look at it from another perspective: if you are not planning to analyse anything, you perhaps don’t need the overhead of logging either.
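As a minimal sketch of what "logging meaningful events" can look like for in-house development, here is a structured (one JSON object per line) logger built on Python's standard logging module. The field names ("user", "src_ip") and the event name are illustrative assumptions, not a standard schema.

```python
# A sketch of structured event logging with Python's standard logging
# module: each event becomes one JSON line that is easy to analyse later.
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Merge any structured fields passed in via `extra={"fields": ...}`.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log a meaningful, analysable event rather than free-form text
# (field names here are illustrative):
logger.info("login_failed", extra={"fields": {"user": "alice", "src_ip": "203.0.113.7"}})
```

The point of the machine-readable format is that every later phase (collection, correlation, alerting) can rely on named fields instead of guessing at free text.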
2. Collecting, handling and archiving
Log analysis should not create additional load on the production system. It is also important to safeguard the log files in case of an attack: adversaries tend to remove the traces of their actions, and the best way to do that is to delete any logs that might reveal them.
In some cases, system administrators and log analysts are not within the same user rights domain, so it is not wise to grant access to production systems merely for the sake of accessing log files.
This is why it is important to transport the log files to a more suitable location, e.g. a centralised log server, where proper access rights and archiving retention policies are set. Any load and analysis done in the remote log repository should not affect the production environment, and a remote location gives you an additional layer of event integrity by keeping your events safe from direct compromise.
There are different solutions for this (from syslog servers to Splunk, Elasticsearch etc.), but since the goal is to keep the logs as raw as possible, a simple syslog server with a distributed and structured file system might suffice. Raw logs carry some disk utilisation overhead, but it is good to have them in case you need to re-evaluate what happened in your systems, or in case you make a mistake in the next (event handling) step: with raw logs you can always “go back to the original source” and review what the system originally logged about itself. If you aggregate your events while collecting them, you have already discarded some of the events or aspects that would prove beneficial in later phases. Without this step you create additional overhead when analysing logs.
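To make the shipping step concrete, here is a sketch of forwarding events from a production host to a central syslog server using Python's standard library. The address is a placeholder (shown as localhost here); in practice you would point it at your central log server.

```python
# A sketch of shipping events off the production host to a central
# syslog server over UDP. The address below is a placeholder; replace
# it with your actual central log server.
import logging
import logging.handlers

CENTRAL = ("localhost", 514)  # placeholder for the central syslog server

logger = logging.getLogger("shipper")
handler = logging.handlers.SysLogHandler(address=CENTRAL)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Events leave the box as they happen; the central server applies the
# access rights and retention policies, and analysis load stays off
# the production environment.
logger.info("user=alice action=login result=success")
```

The handler forwards the raw event line as-is, which fits the goal of keeping logs as raw as possible at the collection stage.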
3. User event identification, log enrichment and event handling
You can get plenty of events from log files that do not reveal much on their own. To understand the actions behind log events, it might be important to correlate, aggregate or otherwise handle them. This is what I call the “normalization phase” or “creating a meaningful event”: this is what we are actually after. E.g. a successful login to your systems from a foreign country might not be something you’re interested in. However, if that event is preceded by 12 failed logins with different user accounts from the same country, then it might be something to look at.
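The failed-logins-then-success example above can be sketched as a small sliding-window correlation. The event shape (timestamp, country, outcome) and the window length are illustrative assumptions; the threshold of 12 mirrors the example.

```python
# A minimal sketch of correlating raw events into one meaningful event:
# flag a successful login preceded by many failures from the same
# country within a short window.
from collections import defaultdict, deque

WINDOW_SECONDS = 600     # illustrative sliding window
FAILURE_THRESHOLD = 12   # as in the example above


def correlate(events):
    """Yield (ts, country, failure_count) for each success preceded by
    >= FAILURE_THRESHOLD failures from the same country in the window."""
    failures = defaultdict(deque)  # country -> timestamps of recent failures
    for ts, country, outcome in events:
        window = failures[country]
        # Drop failures that have fallen out of the sliding window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if outcome == "failure":
            window.append(ts)
        elif len(window) >= FAILURE_THRESHOLD:
            yield (ts, country, len(window))


# 12 failures followed by one success from the same country -> one alert
burst = [(i, "XX", "failure") for i in range(12)] + [(13, "XX", "success")]
assert list(correlate(burst)) == [(13, "XX", 12)]
```

A single failed login yields nothing; only the combination crosses the threshold and produces a meaningful event worth looking at.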
This is one of the substantive parts of log analysis. Many security information and event management (SIEM) solutions try to help you here with their general-purpose features, but unfortunately they are often not good enough: you might need something more tailor-made that a universal SIEM solution cannot provide.
4. Notifications and alerting
It is good if you deal with log analysis, but it is great if you have automated reporting (e.g. notifications or alerts) coming your way when certain events occur. Not everybody has the ability to look at logs on a daily basis. However, you might wish to know when events fall outside your baseline or thresholds are exceeded. So if you have quantitative thresholds you are watching for, automated alerts or dashboards can give you a quick overview of interesting events. Without this, you might miss important activities in your systems.
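One simple way to express "out of your baseline" as a quantitative threshold is a standard-deviation check against historical counts. This is a tiny sketch; the baseline numbers and the three-sigma cut-off are illustrative assumptions, not a recommendation.

```python
# A small sketch of a quantitative threshold check against a baseline,
# assuming you already count events (e.g. failed logins) per hour.
def exceeds_baseline(count, baseline_mean, baseline_std, sigmas=3):
    """Alert when the hourly count is more than `sigmas` standard
    deviations above the baseline mean."""
    return count > baseline_mean + sigmas * baseline_std


# Illustrative baseline: ~40 failed logins per hour, std dev 10.
assert exceeds_baseline(120, 40, 10)      # well above normal: alert
assert not exceeds_baseline(55, 40, 10)   # within normal variation
```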
It is important to keep the number of alerts to a minimum. If you drown yourself in a flood of alerts, the value of each notification devalues, and at the end of the day you might not pay any attention to the alerts at all. They should be like weekends: rare enough to feel like celebrating, but frequent enough to know that the alerting system still works.
There are multiple solutions for alerting, from your own custom e-mailer scripts to monitoring solutions (e.g. Nagios or Zabbix) that trigger different actions on your behalf, but in the end you should use a solution that fits your working habits best. If you use Slack for everyday communication, then you might look into a solution that can send notifications to the respective Slack channels, and so on.
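For the Slack case, an incoming webhook is enough for simple notifications. A sketch using only the standard library follows; the webhook URL is a hypothetical placeholder you would generate in your own Slack workspace.

```python
# A sketch of posting an alert to a Slack channel via an incoming
# webhook. WEBHOOK_URL is a placeholder, not a real endpoint.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def build_payload(text):
    """Build the JSON body Slack incoming webhooks expect."""
    return json.dumps({"text": text}).encode("utf-8")


def notify_slack(text, url=WEBHOOK_URL):
    """POST a simple text alert to the webhook; True on HTTP 200."""
    req = urllib.request.Request(
        url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200


# Example call (needs a real webhook URL to actually deliver):
# notify_slack("Threshold exceeded: 120 failed logins in the last hour")
```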
5. Log analysis, incident handling and root cause identification
Although this step is mentioned last, it is the most crucial one. The first thing to do is to build an understanding of the baseline of your systems. Simply put, try to understand what is normal and what the usual problems in your systems are. The overall picture of your systems reveals itself through log analysis.
Spending time just reading your logs and analysing what you see gives you a lot of insight. When dealing with analysis, you can answer the first question: what do you need to log (point 1) in the first place. It also tells you how long and what kind of events you should keep in the long term (point 2). Next, you get input on the events that are actually happening: any kind of log event correlation assumes that you know what you are looking for (point 3). Then, the baseline (what is normal and what is not) for alerting can only come from the existing situation in your systems (point 4). And of course, the current point itself: you cannot handle incidents or find the root cause of problems if you don’t have the ability to analyse your log files quickly, effectively and in any log format you might have (keeping in mind that the log format might change).
Finding flexible tools for this kind of analysis isn’t easy: most SIEM solutions and log parsers use pre-defined log event patterns, which means that by the time you have to analyse the logs, developers have changed the log format so many times that no useful information is actually being saved to the database. There is also the issue of licensing priced by data volume, etc. A really good option I have found to work is SpectX. Defining patterns for log formats on the go, accessing log events from different sources (e.g. flat files, databases, your S3 repository etc.) and correlating them with each other gives you the flexibility of walking through your whole log repository in a matter of seconds. Together with a structured flat-file log server, it is a really powerful combination for getting a picture of the health of your organisation’s systems.
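SpectX defines patterns in its own query language; as a plain-Python sketch of the same idea, here is a parser whose pattern is supplied at analysis time, so a changed log format only means changing one regex rather than re-ingesting a database. The sample log line format is an illustrative assumption.

```python
# A sketch of "patterns on the go": the raw lines stay untouched, and
# the pattern is chosen at query time, not at ingestion time.
import re


def parse(lines, pattern):
    """Parse raw log lines with a pattern supplied at analysis time."""
    rx = re.compile(pattern)
    for line in lines:
        m = rx.match(line)
        if m:
            yield m.groupdict()


# Illustrative raw lines in a web-access-like format:
raw = [
    '203.0.113.7 - alice [10/Oct/2024:13:55:36] "GET /login" 200',
    '203.0.113.7 - alice [10/Oct/2024:13:55:37] "GET /account" 200',
]

# Pattern defined on the go, against the raw files:
pattern = (r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
           r'"(?P<req>[^"]+)" (?P<status>\d+)')

events = list(parse(raw, pattern))
assert events[0]["user"] == "alice" and events[0]["status"] == "200"
```

If the developers change the format next month, only the regex changes; the raw log repository from step 2 remains the single source of truth.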
It may be that not all of the mentioned aspects apply in your scenario, but make sure to think these topics through before you start dealing with logs.