Absynthe: A (branching) Behaviour Synthesiser
Generating sophisticated, labelled data for log analysis at scale
The machine learning community has started addressing an omnipresent engineering problem, but gathering high-quality, labelled training data has been a challenge. Until now.
Monitoring and troubleshooting deployments of production systems are major challenges for IT Operations and Site Reliability Engineers in modern enterprises. It is not unusual for companies to rely on some variant of the ELK stack (i.e. Elasticsearch, Logstash, Kibana) as the mainstay of their ops.
Although ELK is an excellent setup for indexing, searching, and visualising, it leaves a lot to be desired when it comes to identifying and fixing the root causes of problems. This is because the setup provides no intelligence or insight into what's normal and what's anomalous, and troubleshooting through manual search is cumbersome for many reasons.
- The volume of log messages is enormous.
- There could be multiple applications or multiple modules pumping their logs into a single log stream, implying that consecutive log messages could come from unrelated sources.
- Session IDs or execution IDs, which help demarcate execution sequences, might not be present.
- SREs don't necessarily have access to source code or product documentation.
- Even the presence of keywords like ERROR, WARNING, or FATAL might not necessarily indicate problems; they could simply be artefacts of peculiar deployment configurations.
Unsurprisingly, machine learning has a lot to offer when it comes to analysing logs for modelling the behaviours of production systems and identifying anomalous behaviours. However, one of the challenges in building an ML-based solution for log analysis is the availability of labelled data.
It is possible to turn on logging on personal computers and obtain gigabytes of logs like those in the image above. However, training and evaluating classifier models require labelled training data at scale. Hand-labelling logs is an ungainly, error-ridden process, especially since training a generalisable model requires training data that cover different kinds of application flows.
This is the motivation behind Absynthe, an open source library that can generate arbitrary quantities of labelled data for training and evaluating ML algorithms for log analysis.
Absynthe models each application module as a control flow graph (a CFG), as illustrated in the image below. This representation is analogous to that of code. Each node in the CFG stands for a statement. Some statements induce branching while others are convergence points for different branches. A single execution of a module is captured by a single traversal of the CFG from a root to a leaf.
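To make the idea concrete, here is a minimal sketch of a CFG node and a single traversal in plain Python. The `Node` class, `traverse` function, and node IDs are hypothetical illustrations, not Absynthe's actual API:

```python
import random

class Node:
    """One CFG node: a statement that may branch to several successors."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.successors = []  # an empty list marks a leaf

def traverse(root):
    """Simulate one execution of a module: a walk from a root to a leaf."""
    path = [root.node_id]
    node = root
    while node.successors:
        node = random.choice(node.successors)  # branching point
        path.append(node.node_id)
    return path

# A tiny CFG: root branches to a or b; both converge on the same leaf.
leaf = Node("leaf")
a, b = Node("a"), Node("b")
a.successors = [leaf]
b.successors = [leaf]
root = Node("root")
root.successors = [a, b]

print(traverse(root))  # e.g. ['root', 'a', 'leaf'] or ['root', 'b', 'leaf']
```

Each call to `traverse` yields one possible execution path, which is exactly what a single "session" of log messages would record.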
It is possible to generate CFGs of different kinds and sizes at random by specifying a handful of parameters, viz., the number of roots and leaves, the number of internal nodes, and the average branching degree. It is also possible to generate multiple CFGs at any given time, traverse them simultaneously, and "interleave" the logs that the nodes emit. In this manner, Absynthe can generate logs that simulate complex, life-like situations.
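The interleaving step can be pictured as a round-robin merge of the log lines from several concurrent traversals, so that consecutive lines come from unrelated graphs. This is an illustrative sketch with made-up log lines, not Absynthe's implementation:

```python
from itertools import zip_longest

def interleave(*traversals):
    """Round-robin merge of several log streams: take one line from each
    traversal in turn until all streams are exhausted."""
    merged = []
    for lines in zip_longest(*traversals):
        merged.extend(line for line in lines if line is not None)
    return merged

# Two traversals of two different (hypothetical) CFGs, A and B.
logs_a = ["A.root", "A.n1", "A.leaf"]
logs_b = ["B.root", "B.n2", "B.n5", "B.leaf"]

print(interleave(logs_a, logs_b))
# ['A.root', 'B.root', 'A.n1', 'B.n2', 'A.leaf', 'B.n5', 'B.leaf']
```

The merged stream mimics a production log file into which multiple modules write concurrently.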
Log messages generated in this manner can be treated as labelled data in the following sense. Each traversal has a unique session ID associated with it; that is, all log messages simulating a single execution of a module bear the same session ID. The session ID is followed by the graph ID in the log message. And each session starts at a root node and ends at a leaf node of the CFG. Moreover, the CFG itself can be treated as the "ground truth" model that generated all the logs.
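A labelled log line of this shape can be sketched as follows; the `emit` helper, the field layout details, and the IDs are assumptions for illustration, not Absynthe's exact log format:

```python
import uuid

def emit(session_id, graph_id, node_id):
    """One labelled log line: session ID first, then graph ID, then payload."""
    return f"{session_id} {graph_id} {node_id}"

session = uuid.uuid4().hex[:8]           # one ID per traversal
path = ["root", "n3", "n7", "leaf"]      # one traversal of graph "G1"
lines = [emit(session, "G1", node) for node in path]

for line in lines:
    print(line)
```

Because every line of a traversal carries the same session ID, the label for a line can be recovered by grouping on the first field, which is what makes the output usable as supervised training data.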
This is a work in progress and Absynthe is in version 0.0.1 right now. Some of the current and planned functionalities are as follows.
- During a CFG traversal, each node chooses its successor uniformly at random. It is possible to implement different kinds of nodes that exploit other parametric probability distributions to select their successors. A single CFG can have nodes of all these different types.
- Each node emits a log message that is a repetition of its unique node ID. Each node currently sets the number of repetitions at random. It is possible to implement nodes that emit more life-like log messages.
- Simple, tree-like CFGs can be built at present. But it is possible to generate more complex CFGs that have loops and whose edges skip levels.
- The interleaving of CFG traversals is "monospaced" right now. This means that the time difference between successive log lines is constant, regardless of which CFGs these log lines come from. It is possible to extend the implementation to fine-tune the interarrival time of log lines to individual CFGs.
Update Jun 25, 2019: This article was written alongside the release of v0.0.1 and is now out of date. For latest features of Absynthe, check the README.