Machine Learning in Monitoring is BS
Much has been made of the impact of big data and machine learning techniques on a wide variety of businesses. Indeed, amazing things are being accomplished in fields that previously have been solely the domain of humans: image classification, optical character recognition, even driving cars. However, despite the claims of countless vendors, machine learning has yet to pay off big in the monitoring field. To understand why, we need to dig much deeper into the machine learning techniques available today as well as the current state of the monitoring industry.
The monitoring world today is primarily oriented around the collection, storage, and visualization of time series metrics data. This telemetry is familiar, easy to collect, and easy to graph. Unfortunately, it is not easy to alert on. At its core, alerting is an exercise in modeling. A model of the system, be it static or dynamic, is built to predict expected system behavior. When actual behavior deviates sufficiently from the model, an alert is generated and a human operator is summoned to deal with the aberrant behavior. At first glance this seems like a great use for machine learning techniques, which are all about training models to classify datasets. However, machine learning has yet to work in the general case and has instead found success only in niche use cases.
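To make the "alerting is modeling" idea concrete, here is a minimal Python sketch of the two kinds of model mentioned above. It is only an illustration: the function names, the latency series, the fixed bound of 150 ms, and the window and sigma settings are all hypothetical choices, not anything taken from a real monitoring product.

```python
from statistics import mean, stdev


def static_alerts(series, upper):
    """Static model: expects every sample to stay at or below a fixed bound."""
    return [i for i, x in enumerate(series) if x > upper]


def dynamic_alerts(series, window=5, max_sigma=3.0):
    """Dynamic model: expects each sample to sit near the mean of the
    preceding `window` samples, within `max_sigma` standard deviations."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A perfectly flat history has sigma == 0; this sketch skips it
        # rather than alerting on any change at all.
        if sigma > 0 and abs(series[i] - mu) > max_sigma * sigma:
            alerts.append(i)
    return alerts


# Hypothetical request latencies in milliseconds, with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]

print(static_alerts(latency_ms, upper=150))  # [6]
print(dynamic_alerts(latency_ms))            # [6]
```

Both models flag the spike, but neither can say whether it matters. The bound, the window, and the sigma multiplier all encode someone's knowledge of what is normal for this particular metric on this particular system.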
There are a number of contributing factors to why ML hasn’t taken over monitoring:
- Ambiguity of the data: Most platforms come with a standard set of metrics out of the box. These mainly center on resource usage and, generally speaking, are not very valuable for spotting user-facing issues. The metrics that are manually instrumented from within application code are usually much more useful for spotting a fault. However, they are often added after the fact, and their interpretation is very specific to the application in question. Reading graphs of metrics to ascertain system health is like trying to read an EKG. Is this bump on the graph good, or is it bad? Does it require intervention, and if so, what kind? It takes years for an intelligent human being to build up the experience to divine those kinds of insights from a graph, and even then it’s an art, not a science.
- Multiplicity of variables: Machine learning excels at tasks that do not require massive amounts of context. Expecting machine learning techniques to alert on a mass of metrics is like asking a vet whether your farm is doing OK by faxing them unlabeled EKGs for all of your animals. The data requires a massive amount of context to analyze properly: which application did it come from, what was the level of traffic, what language is it written in, and so on. Even worse, modern systems are constantly changing, and the performance characteristics of any one system may change drastically from one release to the next. As a cheap experiment in context-free anomaly detection, one could conceivably build an alerting system by asking Mechanical Turk workers to classify graphs as anomalous or healthy. Would you want to be the one on call for such a system?
- User experience: Machine learning is a black box. The internals of a model are as resistant to analysis as those of any other stochastically built system. Waking up a developer in the middle of the night to fix a problem is already a bad experience. Woe to the alert that wakes up that developer in error, without a discernible explanation.
Solving these problems by brute force, by pointing a general-purpose model at an undifferentiated mass of metrics, is basically equivalent to building artificial general intelligence. If your ML algorithm can replace a DevOps professional, then it can replace any number of other highly skilled professions en masse. This is not to say that the problem of ML-assisted alerting is intractable. Rather, starting from a dataset oriented around visualization makes it so. I believe that monitoring companies which eschew visualization-based workflows will naturally wind up with datasets more amenable to machine learning techniques.
At Opsee we’re focused on making monitoring in AWS simple for developers and taking away the pain of being on call. If you want to try us out during our private beta, please sign up and take our survey.
(thanks to Ben Linsay and ts waterman for feedback)