Predicting system behavior and anticipating anomalies using Python and machine learning

Iowasandroid
Mar 12, 2022 · 4 min read


ML and AI have been in the news a lot as of late, and what better time to take a deep dive than now?

Story of my life!!!

Problem statement

I asked myself, “Is there a way to anticipate when servers will experience downtime, or when the environment will break down? Or, to be more precise, is there a way I can gauge when the next memory leak, CPU choke, or stuck thread that has the potential to render the environment offline will occur? If yes, can I build an automaton that alerts me of impending doom, determines the cause, and then self-heals, thus avoiding the environment going down altogether?”. No doubt this sounds hefty, and I did not want to be overly ambitious, but I had to see where this train of thought carried, so I decided to explore and give it a try.

Benefits

Once this automaton is in place, your system will seldom go offline. As a rough estimate, environment availability lands around 85%, which is more than enough to improve stability, increase productivity, and boost the business.

Skills Required

Machine Learning — Supervised/Unsupervised algorithms
Evaluation methods
Machine Learning Libraries: Python (numpy, scikit-learn, pandas, tensorflow)

Target audience

Padawans or beginners, Jedi knights or intermediates (insert any pop culture reference here to sound nerdy!!!).

Design

There is always more than one way to solve a problem, but the simplest way is usually the best and most optimal. I decided to split the problem into two parts: first, the anomaly detector; second, the system behavior predictor. Let me take some time to explain the two parts, what problems they address, and how they do it.

Anomaly detector - As the name implies, the aim of this tool is to catch anomalies, specifically in log files. Anomalous entries in log files are potential disruptors that tend to bring the environment down. The script in place has to read the log file in real time to catch the anomaly red-handed. I chose Python because, why not, I am a huge fan of it. I targeted the application server logs first and tailored the code to catch log lines that read like “NullPointerException” or “FileNotFound” errors. I captured the timestamp at which each anomaly was recorded so as to keep a “memory” of events and avoid rerunning through the same lines in case the script stopped or crashed. Once an anomalous event is detected, the script checks its records for a match, looks up the corresponding resolution, and acts on it, thus fixing the issue. The action could be something like clearing a cache, rebooting the servers, or redeploying the application. It's impossible to record and address every anomaly and its corresponding resolution, but keep in mind that 80% of the issues are caused by 20% of the problems (Pareto principle, anyone?). So once you address that 20% of the problems, you have significantly improved environment availability.
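To make this concrete, here is a minimal sketch of what such a log watcher could look like. The log path, the anomaly patterns, the remediation scripts (redeploy_app.sh, clear_cache.sh, restart_server.sh), and the state file are all hypothetical placeholders for illustration, not the exact code running in my environment.

```python
import re
import time
import subprocess

LOG_FILE = "/opt/app/logs/server.log"   # hypothetical application server log path
STATE_FILE = "last_seen.txt"            # keeps a "memory" of the last anomaly handled

# Map known anomaly patterns to a remediation action (hypothetical scripts).
KNOWN_ISSUES = {
    r"NullPointerException": ["./redeploy_app.sh"],
    r"FileNotFoundException": ["./clear_cache.sh"],
    r"OutOfMemoryError": ["./restart_server.sh"],
}

def tail(path):
    """Yield new lines as they are appended to the log file."""
    with open(path) as f:
        f.seek(0, 2)                    # jump to the end so only fresh entries are read
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

def remember(timestamp):
    """Persist the timestamp of the last anomaly so a restart doesn't reprocess it."""
    with open(STATE_FILE, "w") as f:
        f.write(timestamp)

for line in tail(LOG_FILE):
    for pattern, action in KNOWN_ISSUES.items():
        if re.search(pattern, line):
            timestamp = line.split()[0]  # assumes the log line starts with a timestamp
            remember(timestamp)
            subprocess.run(action)       # fire the mapped remediation script
            break
```

The KNOWN_ISSUES dictionary is the “memory” of events mentioned above: every time a new anomaly and its fix are discovered, they get added as another pattern/action pair.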

System behavior predictor - This part anticipates system behavior before an anomalous event occurs. To achieve that, I relied on machine learning. The code consumes historical data on memory consumption, disk space usage, CPU time, etc., and feeds it to a script. The ML algorithm is linear regression: it consumes the trove of data, splits it 80/20, trains on the 80, and tests with the 20 to check whether the predictions match the expected outcomes. I settled on a 90% accuracy threshold and retrained over multiple epochs whenever the accuracy dipped below that mark. Once the code was in place, I ran a test against a server for a couple of days to see how it behaved. The accuracy hovered somewhere around 83%, which was not the mark I initially aimed at, so I went back and retrained the model. Remember, you can always rerun the training until the model matches the expected accuracy. This is just a starter; I have only taken into account metrics such as memory, RAM, and CPU. Other metrics can be considered too, like thread usage count, file descriptor count, etc.
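Here is a rough sketch of that flow using scikit-learn's LinearRegression with an 80/20 split. The CSV file name and column names (metrics.csv, cpu_pct, mem_pct, disk_pct, mem_pct_next_hour) are assumptions for illustration, not the actual dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Historical resource metrics; file name and columns are assumed for illustration.
df = pd.read_csv("metrics.csv")
X = df[["cpu_pct", "mem_pct", "disk_pct"]]   # current resource readings
y = df["mem_pct_next_hour"]                   # the future value we want to anticipate

# 80/20 split: train on 80% of the history, hold out 20% to verify predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# model.score returns R^2 on the held-out 20%; treat it as the accuracy gate.
score = model.score(X_test, y_test)
print(f"Held-out score: {score:.2f}")
if score < 0.90:
    print("Below the 90% threshold - collect more data and retrain.")
```

Linear regression is the simplest possible baseline here; the same split-train-score loop works unchanged if you later swap in a fancier model.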

Final thoughts

It's always fun to explore a “What if?”, and this was one of them. With the above two systems in place, the environment should stay online most of the time. This is my first tech blog, and I am super excited. Watch out for more write-ups here.
