Log File Reduction with TF-IDF Ranking
And Outlier/Anomaly Detection Using Rank and Time with Time-Series Fitting
Summary
We use TF-IDF (Term Frequency–Inverse Document Frequency) vectorization to find the terms, or log lines, that are least common. To explain the concept, let's take a snippet from a modified syslog file.
Log file snippet
Here we can see that certain logs occur only a few times in the document, while the majority of logs are periodic. The problem could lie in the few logs that do not keep repeating, for example an I/O driver logging a possible disk error. It could also be that there is an inherent problem in the system and the periodic logs are a symptom of it. In practice, we observe that many of the repeating logs in a commercial, non-critical software system can be ignored.
Our goal is a system that can suppress the majority of periodically repeating logs but is able to highlight the infrequent ones.
So we have two aspects here: the time pattern and the frequency of the logs themselves. Let's take each of these in turn.
Log Frequency
We need to use a ranking where the most repeated similar logs have similar rankings and the least repeated logs are outliers.
We need a similarity match and not an exact match, because logs that denote the same event can still differ in some numerical data, such as record numbers or timestamps. An example is below.
There are many text similarity algorithms for this https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings
Examples include the Jaccard index, cosine similarity, and the Jaro–Winkler distance. For some of these, the edit distance between two sequences forms the basis of the similarity measure; for others, we vectorize — that is, convert each word in a sentence to a numerical representation — and then use some similarity index over the resulting vectors to compare two sentences. The simplest is cosine similarity.
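As a small sketch of the vectorize-and-compare idea, two log lines that differ only in a record number score close to 1 under cosine similarity (the log text here is shortened from the syslog sample; any count-based vectorizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two log lines that describe the same event but differ in one numeric field
log_a = "could not delete change record 11855 rc 32"
log_b = "could not delete change record 11856 rc 32"

# Vectorize: each line becomes a word-count vector over the shared vocabulary
vectors = CountVectorizer().fit_transform([log_a, log_b])

# Cosine similarity of the two vectors; close to 1 means near-identical wording
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(round(score, 2))
```

An exact string comparison would call these lines different; the cosine score captures that they describe the same event.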
However, on their own, these pairwise similarity algorithms are not very accurate for our purpose.
We use scikit-learn's text feature-extraction library, specifically the TfidfVectorizer. It first vectorizes the words of the documents (here, each log line is a document), then assigns a higher weight to terms that are repeated less often across the document than to common terms.
A log line contains many words, and most of them are common. So once we vectorize each line, we select the word with the highest TF-IDF score in that line as the representative score for the whole log line.
Illustration below
Log rows
Mar 31 09:31:48 1compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.659318010 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11855 (rc: 32)
Mar 31 09:31:48 2compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.751333697 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11856 (rc: 32)
Mar 31 09:31:48 3DDDDcompx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.888424524 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11864 (rc: 32)
Mar 31 09:31:48 4compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.783381048 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11857 (rc: 32)
Mar 31 09:31:48 5compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.826483871 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11858 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.156622971 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11859 (rc: 32)
The TF-IDF-based score for the log lines above is shown in the last column.
We can see that only the last line consists of terms that are repeated most frequently in the entire document, and so it gets the lowest score, whereas the other lines contain terms like 2compx-**, each unique in the entire document, and so get the highest scores.
TF-IDF score snapshot
Output — https://gist.github.com/alexcpn/07e40d4bb46397632f83ffdc0362e9bb#file-tfidfrank-csv
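The ranking described above can be sketched with scikit-learn as follows; the log lines here are shortened, made-up stand-ins for the syslog sample (the real input is the full log file), and the score per line is the maximum TF-IDF weight among its terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each log line is treated as one "document" for TF-IDF purposes
logs = [
    "could not delete change record rc 32 compx1",
    "could not delete change record rc 32 compx2",
    "could not delete change record rc 32",
]

# Fit TF-IDF over all log lines; terms that occur in fewer lines get a higher IDF weight
matrix = TfidfVectorizer().fit_transform(logs)

# Representative score per line: the highest TF-IDF weight among its terms
scores = matrix.max(axis=1).toarray().ravel()
for line, score in zip(logs, scores):
    print(f"{score:.3f}  {line}")
```

The last line, made only of terms common to every line, gets the lowest representative score; the first two lines each contain a unique host-like token (compx1, compx2) and score higher, mirroring the behaviour seen in the snapshot above.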
Time Pattern
Once we have the TF-IDF rank, we plot this number against the log time. We use the time-series forecasting tool Prophet to fit the TF-IDF rank against time. The reason is that Prophet computes a lower and upper threshold band within which it tries to fit the observations. This means that if there is a time pattern in the logs — that is, if similar logs repeat at similar time intervals — they are fitted inside the upper and lower band. Prophet normally uses this band to plot a trend for forecasting; we instead use it to treat logs that fall inside the band as expected, and surface only those logs that fall outside it.
Final Sample Output
Note that the graph above is for a proper, full syslog file, while below is a small snippet of syslog output, shown only to illustrate the final result.
Also here https://gist.github.com/alexcpn/07e40d4bb46397632f83ffdc0362e9bb#file-output-csv
Mar 31 09:31:48 2compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.751333697 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11856 (rc: 32)
Mar 31 09:31:48 3DDDDcompx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.888424524 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11864 (rc: 32)
Mar 31 09:31:48 4compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.783381048 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11857 (rc: 32)
Mar 31 09:31:48 5compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:48.826483871 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11858 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.441979340 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11860 (rc: 32)
Mar 31 09:31:49 compx-f476c9ffb-8gtc4 ns-slapd[2762]: [31/Mar/2022:09:31:49.501112749 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 11861 (rc: 32)