Datastream.io: Open Source Anomaly Detection

Mentat · Published in Mentat Innovations · Jan 30, 2018

We are proud to launch the very first version of our open-source project for Anomaly Detection and Behavioural Profiling on data streams, datastream.io (dsio on GitHub).

We have a long roadmap ahead of us but, as they say, release early and release often. So here it is: a minimum viable full-stack Python anomaly detector:

pip install -e git+https://github.com/MentatInnovations/datastream.io#egg=dsio

Features

The purpose of the project is to perform the following functions:

  • Consume data from a variety of file and stream formats.
  • Transform data streams on the fly to derive statistics of interest (aggregations, counts, sessions, groupings) or to extract features.
  • Model the resulting stream via unsupervised machine learning to capture normal baseline behaviour either globally, or at the level of a device/user.
  • Score every new event by comparing it to the baseline model.
  • Visualise anomalous events on a lightweight, customisable dashboard with a lightweight back-end and minimal fuss for the user.

In the spirit of a minimal first release, we start by supporting consumption from CSV files filtered by column, a couple of basic modelling and scoring options, and visualisation via an Elastic-Kibana solution with a dashboard that is auto-generated in accordance with the column names.
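To make "consumption from CSV files, filtered by column" concrete, here is a minimal sketch using only the Python standard library. It is not dsio's own loader: the helper name read_columns, the sample data and the column names are all made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are invented for this sketch.
SAMPLE_CSV = """time,speed,rpm
0,50.0,2000
1,52.5,2100
2,51.0,2050
3,120.0,5200
"""


def read_columns(handle, columns):
    """Yield one dict per CSV row, keeping only the requested columns as floats."""
    for row in csv.DictReader(handle):
        yield {name: float(row[name]) for name in columns}


# Keep only the "speed" column, as a detector working on one signal would.
rows = list(read_columns(io.StringIO(SAMPLE_CSV), ["speed"]))
speeds = [row["speed"] for row in rows]
```

In a real run the handle would be an open file rather than an in-memory string, and the selected columns would be the ones you want modelled and plotted.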

Bring-your-own detector

Those of you who read our previous post know that we are about to unleash some pretty powerful anomaly detection models in this project. But like any open-source project, our main ambition is to create a platform. So for the first release, we offer two basic example detectors (see below) as templates for you to build your own! All you need to do is support some basic interfaces: a way to update your model, a way to train it from scratch (this addresses the cold start problem), and a way to detect anomalies, which will often involve a threshold on a scoring function that numerically describes how likely each new event is under the model.
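As a rough illustration of those three interfaces, here is a toy one-dimensional Gaussian detector. It is a sketch under our own assumptions, not the code shipped in dsio: the class name and the method names train, update, score and detect are hypothetical, so check the example detectors in the repository for the interface the project actually expects.

```python
import math


class GaussianDetector1D:
    """Toy 1-D detector: models values as Gaussian and flags large z-scores.

    Hypothetical interface for illustration; dsio's real base class and
    method names may differ.
    """

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # z-score above which an event is flagged
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def update(self, values):
        """Fold new observations into the running mean and variance."""
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def train(self, values):
        """Train from scratch, discarding any previous state."""
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.update(values)

    def score(self, x):
        """Absolute z-score of x under the current baseline."""
        if self.count < 2:
            return 0.0  # not enough data to estimate a spread yet
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def detect(self, x):
        """Flag x as anomalous when its score exceeds the threshold."""
        return self.score(x) > self.threshold


detector = GaussianDetector1D(threshold=3.0)
detector.train([50.0, 52.5, 51.0, 49.5, 50.5])
normal_flag = detector.detect(51.5)   # close to the baseline
spike_flag = detector.detect(120.0)   # far outside the baseline
```

Keeping score separate from detect means the threshold can be tuned without retraining, and the raw scores remain available for plotting.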

You can try one of our own detectors from the command line like this:

dsio --detector gaussian1d examples/data/cardata_sample.csv

to run against a sample dataset comprising IoT measurements from a car. But if you’d like to write your own, just add your module and run instead:

dsio --modules examples/detector.py --detector Percentile1D examples/data/cardata_sample.csv

Here is the result:

[Figure: datastream.io in action]

[Figure: datastream.io Kibana dashboard]

We are looking forward to your feedback and contributions! We will be adding exciting contributions from our friends and colleagues in UK academia and industrial partners.
