Datastream.io: Open Source Anomaly Detection
We are proud to launch the very first version of our open-source project for Anomaly Detection and Behavioural Profiling on data streams, datastream.io (dsio on GitHub).
We have a long roadmap ahead of us but, as they say, release early and release often. So here is a minimum viable, full-stack Python anomaly detector:
pip install -e git+https://github.com/MentatInnovations/datastream.io#egg=dsio
Features
The purpose of the project is to perform the following functions:
- Consume data from a variety of file and stream formats.
- Transform data streams on the fly to derive statistics of interest such as aggregations, counts, sessions, groupings, or extract features.
- Model the resulting stream via unsupervised machine learning to capture normal baseline behaviour either globally, or at the level of a device/user.
- Score every new event by comparing it to the baseline model.
- Visualise anomalous events on a customisable dashboard with a lightweight back-end, with minimal fuss for the user.
In the spirit of a minimal first release, we start by supporting consumption from CSV files, filtered by column, and a couple of basic modelling and scoring options, followed by visualisation via an Elastic-Kibana solution with a dashboard that is auto-generated according to the column names.
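The consume, model and score stages described above can be illustrated with a small standard-library sketch. This is not dsio's implementation: the inline sample data, the column name `speed`, the helper names `read_column` and `gaussian_scores`, and the threshold of 3 are all assumptions made for illustration, and the score is a plain z-score against a Gaussian baseline.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a CSV file; column names are assumptions.
SAMPLE_CSV = """time,speed,rpm
0,50,2000
1,52,2100
2,51,2050
3,49,1990
4,120,2075
5,50,2010
"""


def read_column(handle, column):
    """Consume a CSV stream and keep a single numeric column."""
    return [float(row[column]) for row in csv.DictReader(handle)]


def gaussian_scores(values, baseline):
    """Score each value by its absolute z-score against the baseline sample."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return [abs(v - mean) / std for v in values]


speeds = read_column(io.StringIO(SAMPLE_CSV), "speed")
baseline = speeds[:4]  # treat the first few events as "normal" behaviour
scores = gaussian_scores(speeds, baseline)

# Flag events whose score exceeds an (assumed) threshold of 3 standard deviations.
anomalies = [i for i, s in enumerate(scores) if s > 3.0]
print(anomalies)
```

In a real stream the baseline would be updated as events arrive rather than fixed up front; the fixed slice here only keeps the sketch short.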
Bring-your-own detector
Those of you who read our previous post know that we are about to unleash some pretty powerful anomaly detection models in this project. But like any open-source project, our main ambition is to create a platform. So for the first release, we have offered two basic example detectors (see below), as a template for you to build your own! All you need to do is support some basic interfaces: a way to update your model, a way to train it from scratch (this addresses the cold start problem), and a way to detect anomalies, which will often involve a threshold on a scoring function that numerically describes how likely each new event appears in comparison to the model.
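To make those three responsibilities concrete, here is a rough sketch of what such a detector could look like. The class name `PercentileDetector` and the method names `train`, `update`, `score` and `detect` are hypothetical and chosen for illustration; they are not taken from the dsio code base, so check the example detectors in the repository for the interface the project actually expects.

```python
import bisect


class PercentileDetector:
    """Hypothetical detector that flags values outside a percentile band.

    Method names are illustrative only and are not dsio's actual interface.
    """

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper
        self.sorted_values = []

    def train(self, values):
        """Build the baseline from scratch (the cold-start case)."""
        self.sorted_values = sorted(values)

    def update(self, values):
        """Fold new observations into the existing baseline."""
        for v in values:
            bisect.insort(self.sorted_values, v)

    def score(self, value):
        """Return the percentile rank (0 to 100) of value within the baseline."""
        rank = bisect.bisect_left(self.sorted_values, value)
        return 100.0 * rank / len(self.sorted_values)

    def detect(self, value):
        """Flag the value if its score falls outside the [lower, upper] band."""
        s = self.score(value)
        return s < self.lower or s > self.upper


detector = PercentileDetector(lower=5.0, upper=95.0)
detector.train(range(100))       # cold start on an initial batch
print(detector.detect(50))       # a typical value
print(detector.detect(500))      # far above anything in the baseline
detector.update([60, 61, 62])    # incremental update with new events
```

Keeping `score` separate from `detect` means the threshold can be tuned without retraining the model.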
You can try one of our own detectors from the command line like this:
dsio --detector gaussian1d examples/data/cardata_sample.csv
to run against a sample dataset comprising IoT measurements from a car. But if you’d like to write your own, just add your module and run instead:
dsio --modules examples/detector.py --detector Percentile1D examples/data/cardata_sample.csv
Here is the result:
We are looking forward to your feedback and contributions! We will also be adding exciting work from our friends and colleagues in UK academia and from our industrial partners.