Datastream.io scikit-learn integration

Mentat Innovations · Feb 12, 2018

A few days ago we open-sourced our platform for anomaly detection in Python — you can read more about that here.

This post is focused on one feature of our framework: integration with scikit-learn. Scikit-learn is the flagship ML toolbox for Python, and it is growing by the day. To ignore its models and design patterns would be to reinvent the wheel.

So we have added a small example of how you can bring the full strength of scikit-learn to bear on your detection problem while still using dsio. Consider the following file, which you can find in the examples folder:

datastream.io/examples/lof_anomaly_detector.py

LOF stands for “Local Outlier Factor”, an old and well-tested technique for detecting anomalies in Euclidean space (although it can be generalised to any space on which you are comfortable defining a distance metric). The basic idea of LOF is to identify points whose nearest neighbours are not so near, compared to other points in the dataset. This allows us to detect anomalies whose values might not look abnormal when compared to the maximum and minimum values found in the dataset, but which in truth occupy an empty space somewhere in the middle, where no other data lives. For one-dimensional data it might be counterintuitive to imagine such anomalous gaps, but as the dimension of the data increases it becomes increasingly likely that your anomalies will not take extreme values in every dimension.
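To make the idea of an anomalous gap concrete, here is a small illustrative snippet (not taken from the dsio examples): a two-dimensional point whose coordinates both lie well inside the observed min/max ranges, yet which sits alone in an empty region of the space.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two dense clusters of "normal" data in opposite corners of the unit square.
cluster_a = rng.normal(loc=[0.1, 0.1], scale=0.02, size=(100, 2))
cluster_b = rng.normal(loc=[0.9, 0.9], scale=0.02, size=(100, 2))

# A single point in the empty middle: each coordinate is far from the
# dataset's extremes, so a per-dimension min/max check never flags it.
gap_point = np.array([[0.5, 0.5]])

X = np.vstack([cluster_a, cluster_b, gap_point])
print("per-dimension min:", X.min(axis=0))   # roughly [0.05, 0.05]
print("per-dimension max:", X.max(axis=0))   # roughly [0.95, 0.95]
# (0.5, 0.5) is comfortably inside these bounds, yet no other data lives near it.
```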

Scikit-learn contains an implementation of LOF, wonderfully explained here. However, the Sklearn framework itself does not contain an interface for anomaly detection: it only supports classification/regression (supervised learning) and clustering (unsupervised learning where the main objective is to assign datapoints to clusters, rather than to produce anomaly scores). This is by no means a crippling disadvantage: any clustering algorithm can be easily modified to produce an anomaly detector via ideas similar to LOF.
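As a quick illustration of the scikit-learn side (a sketch assuming a reasonably recent sklearn version, and not the dsio example file itself), LocalOutlierFactor will happily flag the in-between point from above even though its coordinates are unremarkable:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
cluster_a = rng.normal(loc=[0.1, 0.1], scale=0.02, size=(100, 2))
cluster_b = rng.normal(loc=[0.9, 0.9], scale=0.02, size=(100, 2))
gap_point = np.array([[0.5, 0.5]])
X = np.vstack([cluster_a, cluster_b, gap_point])

# fit_predict returns -1 for outliers and 1 for inliers;
# negative_outlier_factor_ holds the (negated) LOF scores of the training data.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

print("label of the gap point:", labels[-1])                 # -1, i.e. outlier
print("its LOF score:", -lof.negative_outlier_factor_[-1])   # much larger than ~1
```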

However, our proposed interface is a step forward in recognising anomaly detection as a core data science problem category. Following sklearn design patterns, we have introduced it as a mixin rather than an object, which means that you can use pretty much any class you want, as long as you introduce (or override, if they already exist) the following methods:

fit, update, score_anomaly, flag_anomaly

Our fit function will be revised soon to follow sklearn conventions fully (currently it only supports unidimensional input, so it diverges), so that you don’t have to worry about it. The score_anomaly function is probably the most important, as it produces the final output of the detector, and the flag_anomaly function serves to produce a binary output where one is needed.
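For concreteness, here is a minimal sketch of what a detector built against this interface might look like. The method names follow the list above, but the AnomalyMixin import path and the percentile-based logic are assumptions for illustration, not the actual dsio implementation:

```python
import numpy as np
from dsio.anomaly_detectors import AnomalyMixin  # assumed import path

class PercentileDetector(AnomalyMixin):
    """Toy unidimensional detector: scores points by how extreme they are
    relative to the values seen so far. Purely illustrative."""

    def __init__(self, threshold=0.99):
        self.threshold = threshold
        self.sample_ = np.array([])

    def fit(self, x):        # learn an initial state from a batch
        self.sample_ = np.asarray(x, dtype=float)
        return self

    def update(self, x):     # fold new data into the existing state
        self.sample_ = np.concatenate([self.sample_, np.asarray(x, dtype=float)])
        return self

    def score_anomaly(self, x):
        # 0..1 score: fraction of seen points whose deviation from the median
        # is no larger than that of the new point.
        deviations = np.abs(self.sample_ - np.median(self.sample_))
        new_dev = np.abs(np.asarray(x, dtype=float) - np.median(self.sample_))
        return np.array([np.mean(deviations <= d) for d in new_dev])

    def flag_anomaly(self, x):   # binary decision on top of the score
        return self.score_anomaly(x) > self.threshold
```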

The update function

The update method present in AnomalyMixin is another key innovation of dsio: we require that all supported models provide a way to update their state given new data, rather than having to refit from scratch. We will be writing another post soon to delve deeper into the world of model updates!
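As a rough sketch of the streaming pattern this enables (continuing the toy detector above, with made-up batch sizes and data): fit once on an initial window, then call update and score_anomaly as new data arrives, with no refit from scratch.

```python
import numpy as np

rng = np.random.RandomState(1)

detector = PercentileDetector(threshold=0.99)
detector.fit(rng.normal(size=1000))          # initial training window

for _ in range(10):                          # simulated stream of new batches
    batch = rng.normal(size=100)
    scores = detector.score_anomaly(batch)   # score against the current state...
    flags = detector.flag_anomaly(batch)
    detector.update(batch)                   # ...then absorb the batch into the state
    print(f"{flags.sum()} points flagged in this batch")
```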
