Absynthe: A (branching) Behaviour Synthesiser

Generating sophisticated, labelled data for log analysis at scale

The machine learning community has started addressing an omnipresent engineering problem, but gathering high-quality, labelled training data has been a challenge. Until now.

Monitoring and troubleshooting deployments of production systems are major challenges for IT Operations and Site Reliability Engineers in modern enterprises. It is not unusual for companies to rely on some variant of the ELK stack (i.e., Elasticsearch-Logstash-Kibana) as the mainstay of their operations.

Although ELK is an excellent setup for indexing, searching, and visualising, it leaves a lot to be desired when it comes to identifying and fixing the root causes of problems. This is because the setup does not provide any intelligence or insight into what is normal and what is anomalous, and troubleshooting through manual search is cumbersome for several reasons.

  1. The volume of log messages is enormous.
  2. Multiple applications could be dumping log messages to the same stream at different rates.

Unsurprisingly, machine learning has a lot to offer when it comes to analysing logs to model the behaviour of production systems and identify anomalies. However, one of the challenges in adopting an ML-based solution for log analysis is the availability of "labelled data".

It is possible to turn on logging on personal computers and obtain gigabytes of logs like those in the image above. However, training and evaluating classifier models require labelled training data at scale. Hand-labelling logs is an ungainly, error-ridden process, especially since training a generalisable model requires training data that include different kinds of application flows.

This is the motivation behind Absynthe, an open-source library that can generate arbitrary quantities of labelled data for training and evaluating ML algorithms for log analysis.

Absynthe models each application module as a control flow graph (CFG), as illustrated in the image below. This representation is analogous to code: each node in the CFG stands for a statement; some statements induce branching, while others are convergence points for different branches. A single execution of a module is captured by a single traversal of the CFG from a root to a leaf.

A simple, "tree-like" CFG without loops and skip-level edges.
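To make the idea concrete, here is a minimal sketch of such a traversal. The adjacency map and node names below are toy examples, not Absynthe's actual data structures or API: starting at a root, each node hands control to one of its successors until a leaf is reached.

```python
# A minimal sketch of a CFG traversal (toy adjacency map, not Absynthe's API):
# each node picks one successor at random until a leaf is reached.
import random

cfg = {
    "root":    ["auth", "cache"],    # a branching node with two successors
    "auth":    ["db"],
    "cache":   ["db"],               # "db" is a convergence point for both branches
    "db":      ["respond"],
    "respond": [],                   # a leaf: the traversal ends here
}

def traverse(graph, node):
    """Return the nodes visited in one execution, i.e. one root-to-leaf walk."""
    path = [node]
    while graph[node]:               # keep going while the node has successors
        node = random.choice(graph[node])
        path.append(node)
    return path

print(" -> ".join(traverse(cfg, "root")))   # e.g. root -> cache -> db -> respond
```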

It is possible to generate CFGs of different kinds and sizes at random by specifying a number of parameters, viz., the number of roots and leaves, the number of internal nodes, and the average branching degree. It is also possible to generate multiple CFGs at a time, traverse them simultaneously, and "interleave" the logs that their nodes emit. In this manner, Absynthe can generate logs that simulate complex, life-like situations.

Log messages generated by Absynthe.
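As an illustration of how such interleaving mixes log lines from different graphs into one stream, here is a toy sketch. The node sequences, session and graph labels, and line format are made up for illustration; they are not Absynthe's actual output.

```python
# A toy illustration of interleaving (labels and format are hypothetical, not
# Absynthe's actual output): two root-to-leaf walks are advanced in random
# order, so their log lines end up mixed in a single stream.
import random

walks = {
    "sess-01 graph-A": iter(["root", "auth", "db", "respond"]),
    "sess-02 graph-B": iter(["root", "cache", "db", "respond"]),
}

while walks:
    label = random.choice(list(walks))   # pick which traversal emits next
    node = next(walks[label], None)
    if node is None:
        del walks[label]                 # that traversal has reached its leaf
    else:
        print(f"{label} {node}")         # e.g. "sess-02 graph-B cache"
```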

Log messages generated in this manner can be treated as labelled data in the following sense. Each traversal has a unique session ID associated with it; that is, all log messages simulating a single execution of a module bear the same session ID, which is followed by the graph ID in the log message. Each session starts at a root node and ends at a leaf node of the CFG. Moreover, the CFG itself can be treated as the "ground truth" model that generated all the logs.
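To see why this counts as labelled data, here is a small sketch, using the same made-up line format as above rather than Absynthe's actual format, that regroups a mixed stream by session ID and recovers one root-to-leaf trace per execution.

```python
# A sketch of the labelling idea (line format is hypothetical): every line
# carries a session ID, so the mixed stream can be regrouped into
# per-execution traces, each of which maps back to one ground-truth CFG.
from collections import defaultdict

lines = [
    "sess-01 graph-A root",
    "sess-02 graph-B root",
    "sess-01 graph-A auth",
    "sess-02 graph-B cache",
    "sess-01 graph-A db",
]

sessions = defaultdict(list)
for line in lines:
    session_id, graph_id, node = line.split()
    sessions[session_id].append(node)

for session_id, path in sessions.items():
    print(session_id, "->", " -> ".join(path))   # one labelled trace per session
```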

This is a work in progress, and Absynthe is at version 0.0.1 right now. Some of the current and planned functionalities are as follows.

  1. During a CFG traversal, each node chooses its successor uniformly at random. It is possible to implement different kinds of nodes that exploit other parametric probability distributions to select their successors, and a single CFG can have nodes of all these different types (see the sketch below).
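As a rough illustration of that idea, a node could draw its successor from a weighted distribution instead of uniformly. The class below is a hypothetical sketch, not a planned Absynthe interface.

```python
# A hypothetical sketch (not Absynthe's interface): a node type whose successor
# is drawn from a weighted distribution instead of uniformly at random.
import random

class WeightedNode:
    def __init__(self, name, successors, weights=None):
        self.name = name
        self.successors = successors
        self.weights = weights          # None falls back to a uniform choice

    def next_node(self):
        if not self.successors:
            return None                 # a leaf node has no successor
        if self.weights is None:
            return random.choice(self.successors)
        return random.choices(self.successors, weights=self.weights, k=1)[0]

dispatch = WeightedNode("dispatch", ["fast_path", "slow_path"], weights=[0.9, 0.1])
print(dispatch.next_node())             # "fast_path" about 90% of the time
```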

Update Jun 25, 2019: This article was written alongside the release of v0.0.1 and is now out of date. For the latest features of Absynthe, check the README.

Please consider contributing to the project by helping implement any of the functionalities mentioned above, or anything else that you fancy. In the meantime, your feedback is most welcome, here or on the project's GitHub page.
