5 things we learned from auto-clustering log records
When we founded Coralogix, one of our main goals was what we call “Make Big Data small”: allowing our users to view a narrow set of unique log entries instead of just indexing terabytes of data and offering textual search. We knew this could be valuable, but our prior assumptions about how log data varies were quite far from reality. I want to share 5 interesting facts with you so we can discuss them in the comments if you like.
One definition I have to make before we start is “Log Template”. A log template is similar to the printf format string you have in your code: it contains the words of the log entry, split into constants and variables.
For instance, if I have these 3 log entries:
- User Ariel logged in at 13:24:45 from IP 220.127.116.11
- User James logged in at 12:44:12 from IP 18.104.22.168
- User John logged in at 09:55:27 from IP 22.214.171.124
Then the template for these entries would be:
- User * logged in at * from IP *
And I can say that this template arrived 3 times.
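To make the idea concrete, here is a minimal Python sketch that derives a template from the three entries above. It is not the clustering algorithm we run in production; it assumes the entries are already grouped together and have the same number of whitespace-separated words, and the function name derive_template is made up for this example. Note that it keeps the word IP as a constant, since it appears in all three entries.

```python
def derive_template(entries):
    """Replace every word position that varies across entries with '*'.

    Illustrative only: assumes all entries belong to the same template
    and split into the same number of whitespace-separated words.
    """
    token_rows = [entry.split() for entry in entries]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("this sketch needs entries with the same word count")
    # zip(*token_rows) walks the entries column by column (word position by word position).
    return " ".join(
        column[0] if len(set(column)) == 1 else "*"
        for column in zip(*token_rows)
    )


entries = [
    "User Ariel logged in at 13:24:45 from IP 220.127.116.11",
    "User James logged in at 12:44:12 from IP 18.104.22.168",
    "User John logged in at 09:55:27 from IP 22.214.171.124",
]

template = derive_template(entries)
print(template)      # User * logged in at * from IP *
print(len(entries))  # how many times the template arrived: 3
```

Real log streams mix many templates with different lengths, so a real clustering step has to group the entries first; this sketch only shows the constants-versus-variables split.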
Now that we are on the same page, here’s what we’ve learned from clustering terabytes of log data every day for over 50 customers:
1) 70% of your log data is generated by just 5 different log templates. This shows that our software usually has one main flow that is used over and over, while other features and capabilities are rarely in use.
2) Over 90% of the queries run by users are on the top 5 templates. These statistics show that we are so blinded by these templates’ dominance that we simply ignore other events.
3) 97% of the exceptions you log are generated by fewer than 3% of the distinct exceptions in your code. You know those “known errors” that always arrive? They are blinding you from seeing the real errors in your system.
4) 0.7% of your templates have a severity level of Error or Critical, and they generate 0.025% of your traffic. This demonstrates just how easy it is to miss these errors, not to mention that most of them are generated by the same exceptions.
5) Templates that arrive fewer than 10 times a day are almost never queried (1 query every 20 days on average, across all 50 customers together!). This is an amazing detail that shows how companies keep missing those rare errors and only encounter them once they become a widespread problem.
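If you want to check figures like these against your own logs, a rough sketch could look like the following. It assumes you already have a count of daily arrivals per template; the function names, the template strings, and the numbers in daily_counts are all invented for illustration and are not customer data.

```python
from collections import Counter


def top_n_share(template_counts, n=5):
    """Fraction of all log entries produced by the n most frequent templates."""
    total = sum(template_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in template_counts.most_common(n))
    return top / total


def rare_templates(template_counts, max_daily=10):
    """Templates that arrived fewer than max_daily times, sorted by name."""
    return sorted(t for t, c in template_counts.items() if c < max_daily)


# Hypothetical daily counts per template (made-up numbers).
daily_counts = Counter({
    "User * logged in at * from IP *": 700,
    "Request * completed in * ms": 250,
    "Cache miss for key *": 43,
    "Payment failed for order *": 5,
    "Disk usage above * percent": 2,
})

print(top_n_share(daily_counts, n=2))  # 0.95
print(rare_templates(daily_counts))
```

The second list is the interesting one: these are the templates nobody queries, and a good place to start looking for errors before they become widespread.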
The facts above show how our current approach to logging is shaped by the variance of the logs themselves and not by our own perspective. We react to our data instead of proactively analyzing it according to our needs, because the masses of data are so overwhelming that we can’t see past them.
By automatically clustering log data back to its original structure, we allow our users to view all of their log data in a fast and simple way and to quickly identify suspicious events that they might otherwise ignore.