Log File Reduction with TF-IDF Ranking
And Outlier/Anomaly Detection Using Rank and Time with Time-Series Fitting
Summary
We use TF-IDF (Term Frequency–Inverse Document Frequency) vectorization to find the terms, or log lines, that are least common. To explain the concept, let's take a snippet from a modified syslog file.
Log file snippet
Here we can see that certain logs occur only a few times in the document, while the majority of logs are periodic. The problem could lie in the few logs that do not keep repeating, for example an I/O driver logging a possible disk error. It could also be that there is an inherent problem in the system and the periodic logs are a symptom of it. In practice, we observe that many of the repeating logs in a commercial, non-critical software system can be ignored.
Our goal is a system that can suppress the majority of periodically repeating logs but is able to highlight the infrequent ones.
So we have two aspects here: the time pattern and the frequency of the logs themselves. Let's take each of these in turn.
Log Frequency
We need to use a ranking where the most repeated similar logs have similar rankings and the least repeated logs are outliers.
We need a similarity match and not an exact match, because logs that denote the same event can still differ in some numerical data, such as record numbers or timestamps. An example is below.
There are many text similarity algorithms for this https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings
Examples include the Jaccard index, cosine similarity, and the Jaro–Winkler distance. For some of these, the edit distance between two sequences forms the basis of the similarity measure; for others, we vectorize — that is, convert each word in a sentence to a numerical representation — and then use some similarity index over the resulting vectors to compare two sentences. The simplest is cosine similarity.
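As a small sketch of the vectorize-and-compare idea, two log lines that differ only in a record number score close to 1 under cosine similarity (the log text here is shortened from the syslog sample; any count-based vectorizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two log lines that describe the same event but differ in one numeric field
log_a = "could not delete change record 11855 rc 32"
log_b = "could not delete change record 11856 rc 32"

# Vectorize: each line becomes a word-count vector over the shared vocabulary
vectors = CountVectorizer().fit_transform([log_a, log_b])

# Cosine similarity of the two vectors; close to 1 means near-identical wording
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(round(score, 2))
```

An exact string comparison would call these lines different; the cosine score captures that they describe the same event.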
However, on their own, these pairwise similarity algorithms are not very accurate for our purpose.
We use scikit-learn's text feature-extraction library, specifically the TfidfVectorizer. It first vectorizes the words of the documents (here, each log line is a document), then assigns a higher weight to terms that are repeated less often across the document than to common terms.
A log line contains many words, and most of them are common. So once we vectorize each line, we select the word with the highest TF-IDF score in that line as the representative score for the whole log line.
Illustration below
Log rows
Mar 31 09:31:48 1compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.659318010 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11855 (rc: 32)
Mar 31 09:31:48 2compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.751333697 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11856 (rc: 32)
Mar 31 09:31:48 3DDDDcompx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.888424524 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11864 (rc: 32)
Mar 31 09:31:48 4compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.783381048 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11857 (rc: 32)
Mar 31 09:31:48 5compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.826483871 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11858 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.156622971 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11859 (rc: 32)
The TF-IDF-based score for the log lines above is shown in the last column.
We can see that only the last line consists of terms that are repeated most frequently in the entire document, and so it gets the lowest score, whereas the other lines contain terms like 2compx-**, each unique in the entire document, and so get the highest scores.
TF-IDF score snapshot
Output — https://gist.github.com/alexcpn/07e40d4bb46397632f83ffdc0362e9bb#file-tfidfrank-csv
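The ranking described above can be sketched with scikit-learn as follows; the log lines here are shortened, made-up stand-ins for the syslog sample (the real input is the full log file), and the score per line is the maximum TF-IDF weight among its terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each log line is treated as one "document" for TF-IDF purposes
logs = [
    "could not delete change record rc 32 compx1",
    "could not delete change record rc 32 compx2",
    "could not delete change record rc 32",
]

# Fit TF-IDF over all log lines; terms that occur in fewer lines get a higher IDF weight
matrix = TfidfVectorizer().fit_transform(logs)

# Representative score per line: the highest TF-IDF weight among its terms
scores = matrix.max(axis=1).toarray().ravel()
for line, score in zip(logs, scores):
    print(f"{score:.3f}  {line}")
```

The last line, made only of terms common to every line, gets the lowest representative score; the first two lines each contain a unique host-like token (compx1, compx2) and score higher, mirroring the behaviour seen in the snapshot above.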
Time Pattern
Once we have the TF-IDF rank, we plot this number against the log time. We use the time-series forecasting tool Prophet to fit the TF-IDF rank against time. The reason is that Prophet computes a lower and upper threshold band within which it tries to fit the observations. This means that if there is a time pattern in the logs — that is, if similar logs repeat at similar time intervals — they are fitted inside the upper and lower band. Prophet normally uses this band to plot a trend for forecasting; we instead use it to treat logs that fall inside the band as expected, and surface only those logs that fall outside it.
Final Sample Output
Note that the graph above is for a proper, full syslog file, while below is a small snippet of syslog output, shown only to illustrate the final result.
Also here https://gist.github.com/alexcpn/07e40d4bb46397632f83ffdc0362e9bb#file-output-csv
Mar 31 09:31:48 2compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.751333697 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11856 (rc: 32)
Mar 31 09:31:48 3DDDDcompx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.888424524 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11864 (rc: 32)
Mar 31 09:31:48 4compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.783381048 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11857 (rc: 32)
Mar 31 09:31:48 5compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.826483871 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11858 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.441979340 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11860 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.501112749 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11861 (rc: 32)