An AI model to predict system operation failures through incremental log analysis, using the fast.ai libraries

Vinay Rao
Published in Analytics Vidhya
Oct 9, 2019

Problem statement:

Can I train an NLP-based model to read service logs and predict the likely outcome of an operation (pass/fail) before the operation completes?

Abstract:

Given that the logs generated by any operation are our point of reference when troubleshooting an issue, I wanted to explore NLP solutions in which a trained model continuously monitors the logs generated by a process and "understands" the logging patterns well enough to estimate, in real time, the probability of the operation going through versus failing.

I used the fast.ai framework to first build a language model, in order to build a "process-centric" vocabulary. This was necessary so that my model could "make sense" of system-specific expressions, like the folder path "/opt/" or the word "Python", and not confuse them with "choice" and "snake" respectively.

Once the language model was built, I used its vocabulary to build a classifier on top of the weights of a pre-trained AWD-LSTM model (i.e., transfer learning).

The training set was a pre-processed list of system-update (operation) logs from various pass and fail scenarios.

The test set was a separate list of pre-processed system-update logs that the model had not seen before.

The exercise was to feed in this test set and compare the model's predictions to the actual outcomes.
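The pre-processing described above can be sketched as follows. This is a minimal illustration, not the author's actual code: the log excerpts, column names and labels here are hypothetical stand-ins for real system-update logs.

```python
import pandas as pd

# Hypothetical raw log excerpts; real data would come from system-update runs.
pass_log = """starting update
verifying packages
update complete"""
fail_log = """starting update
dependency error
rolling back"""

def preprocess(text):
    """Collapse a multi-line log file into one whitespace-normalized string."""
    return " ".join(text.split())

# One row per operation: the full log as a string (X) and its outcome (Y).
df = pd.DataFrame(
    {"log": [preprocess(pass_log), preprocess(fail_log)],
     "status": ["pass", "fail"]}
)
```

In practice the same dataframe layout would be built for both the training set and the held-out test set.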

Details:

The model is able to predict the probability of the system-update operation's pass/fail status from incremental log-file information (as the system update progresses).

With some more training, I think we should be able to generate better predictions and capture a failure signal much earlier in the process cycle.

I fed 10%, 30%, 50%, 75%, 95%, 99% and 100% of the log file to the model for 5 samples from the test set and recorded these failure predictions.

Chart: X-axis is the percentage of the log file fed to the model; Y-axis is the predicted failure probability.
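The incremental evaluation above can be sketched as a simple truncate-and-predict loop. The `predict` callable is injected so the truncation logic is self-contained; in practice it would wrap the trained fast.ai classifier's prediction call (that wrapper is an assumption here, not shown).

```python
# Fractions of each log file fed to the model, as in the experiment above.
FRACTIONS = [0.10, 0.30, 0.50, 0.75, 0.95, 0.99, 1.00]

def log_prefix(text, fraction):
    """Return the leading portion of a log, cut at a fraction of its length."""
    return text[: int(len(text) * fraction)]

def failure_curve(text, predict):
    """Build (fraction, P(fail)) points for one log.

    predict(s) -> float is assumed to return the failure probability for a
    (possibly partial) log string, e.g. by wrapping a fast.ai learner.
    """
    return [(f, predict(log_prefix(text, f))) for f in FRACTIONS]
```

Plotting one such curve per test sample gives the chart described above.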

As you can see, initially, i.e. at around 10% (X-axis), the failure probability in all 5 cases is close to 50% (Y-axis), which is understandable: the model does not yet have enough information to predict the eventual outcome one way or the other.

In cases 1, 2 and 3, the actual outcome was "fail", and the model did predict a failure. The interesting point for me was that the failure trend emerged at around the 50% mark (X-axis), which suggests that after seeing the logs from the first half of the system update, the model did a reasonable job of plotting a failure trend.

Case #4 was a mis-classification. The prediction was seemingly heading in the right direction up until 99% of the data, when for some reason the model changed its mind and classified it as a "fail" scenario.

Case #5 was a log file from a successful system-update operation. As you can see, its failure probabilities stay consistently below the 40% (Y-axis) mark.

Steps involved:

  1. Pre-process your log files to convert each file into a single string
  2. Generate a Pandas dataframe with 2 columns: Log (X) and Status (Y)
  3. Split the data into train and test sets
  4. Generate a language model using the fast.ai libraries, which will help create a domain-specific vocabulary
  5. Use this vocabulary to create a TextDataBunch that can then be used to train a classifier
  6. Train the language model and then the classifier
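The steps above map onto the fast.ai v1 text API roughly as follows. This is an illustrative sketch rather than the author's code: the dataframe names, column names and hyperparameters are assumptions, and running it requires fastai v1 plus a one-time download of the pre-trained AWD-LSTM weights.

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# df_train / df_valid: dataframes with a "log" text column and a "status"
# label column, produced by steps 1-3 (names are illustrative).
data_lm = TextLMDataBunch.from_df(
    ".", train_df=df_train, valid_df=df_valid,
    text_cols="log", label_cols="status")

# Step 4: fine-tune the pre-trained AWD-LSTM language model on the logs,
# so the vocabulary becomes "process-centric".
lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 1e-2)
lm_learn.save_encoder("ft_enc")

# Step 5: build the classifier DataBunch reusing the language model's vocab.
data_clas = TextClasDataBunch.from_df(
    ".", train_df=df_train, valid_df=df_valid,
    text_cols="log", label_cols="status", vocab=data_lm.vocab)

# Step 6: train the pass/fail classifier on top of the fine-tuned encoder.
clf_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf_learn.load_encoder("ft_enc")
clf_learn.fit_one_cycle(1, 1e-2)

# Prediction on a (possibly partial) log string returns the predicted
# category along with the class probabilities:
# category, idx, probs = clf_learn.predict(partial_log_string)
```

The `save_encoder`/`load_encoder` pair is what carries the fine-tuned language-model weights over to the classifier, which is the transfer-learning step described earlier.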

A few interesting observations:

1. The amount of data I needed in order to generate these results was fairly minimal.

2. The base language model that I built, and whose vocabulary I consumed, provided a good foundation to build the classifier on. My earlier attempts with general-purpose representations like Word2Vec and TF-IDF, which are built from English-language articles, failed because their contextual relevance to our system logs was low (for example, the words most similar to "Python" in word2vec are "snake" and other reptiles, whereas we want system-specific neighbours).

Summary:

We should be able to build similar models with better accuracy for various critical operations and perform these tasks:

  1. Generate relevant alerts early in the cycle
  2. Take corrective measures either automatically or manually to fix the issue

References:

  1. A lot of good work done by Gaurav Arora; one of many notebooks I used as a reference to get familiar with the fast.ai libraries: https://github.com/goru001/nlp-for-kannada/blob/master/classification/Kannada_Classification_Model.ipynb
  2. Deep Learning with Python and fast.ai, Part 2 (NLP classification with transfer learning): https://medium.com/datadriveninvestor/deep-learning-with-python-and-fast-ai-part-2-nlp-classification-with-transfer-learning-e7aaf7514e04
  3. fast.ai: https://www.fast.ai/
