Universal log (auto)parsing — Log analysis with PacketAI (Part 2)

Published in PacketAI, Jan 13, 2022

TL;DR

Logs are valuable information describing runtime events. However, because of the sheer amount of logs generated by large online services, it is impossible to get any insights from them in real time using traditional observability tools. This is why at PacketAI, we built a real-time log analysis engine based on text mining and machine learning algorithms.

Since logs are mainly unstructured textual data, the first step towards automated log analysis is the “representation phase”, where raw log lines are transformed into structured data ready for downstream tasks such as anomaly detection. In this article, we present our design for Log2template, a universal log parsing engine that is agnostic to log formats. At its core, Log2template is based on word embedding using neural networks.

Introduction

A log message is mostly an unstructured (sometimes semi-structured) line of text printed by logging statements (e.g., printf()) defined by developers. Large-scale services usually generate millions of logs describing a vast range of events that record service runtime information. Logs are therefore crucial for service management tasks such as resource allocation, scheduling, and troubleshooting.

In the previous article of this series, we presented the high-level design of our AI-based log (auto)analysis engine. We touched upon compressing raw logs into log templates (template extraction) for better representation and ease of analysis, then briefly ran through the different types of log anomalies.

PacketAI Log Anomaly Detection framework

In the following sections, we dig a bit deeper into the first step, i.e., extracting log templates from raw logs (yellow elements in the diagram above). We first motivate this step, then discuss state-of-the-art tools and techniques, and finally present “the PacketAI way”.

Current observability practices don’t fully exploit logs

As mentioned before, logs matter because they record valuable system runtime information. This information describes the inner workings of the system, as opposed to the outward behavior captured by metrics. In practice, however, logs are mostly used offline, for postmortem analysis and auditing. The main reason for this offline rather than online usage is that it is infeasible to process huge log volumes with existing descriptive tools, such as search engines like Elasticsearch, Datadog, or Loki.

Typically, a log analysis tool provides parsing modules, a search engine, and dashboards. This is great for postmortem investigations and offline auditing. These tools usually employ regex-based parsing to extract key-value fields, then index them for quick search when needed. A few products, like the paid version of ELK (X-Pack modules) and Datadog, go a step further by clustering logs into templates (or patterns).

The major limitation of these clustering methods is that they also require regex: first to identify the log format and extract the log content, and second to cluster the log content itself. As a consequence, these methods work acceptably on standard logs from standard sources like common databases and operating systems, where log formats are limited, open source, and well understood. However, they either don’t work at all or perform poorly on custom logs generated by custom applications. We know this because we have built these methods into our product and have long experience maintaining them for our clients. Register here to try it out.

Unless the log parser (we use the term to mean the transformation of logs from unstructured to structured data) is agnostic to log formats, there will always be a need to manually create and maintain log formatting rules (regex), which is tedious and prone to human error. Even worse, there will always be a need to go back and forth with users to determine log formats and update the rules whenever those formats change.
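
To make the maintenance burden concrete, here is a minimal sketch of regex-based field extraction. The pattern, the field names, and both sample log lines are made up for illustration and are not taken from any particular product.

```python
import re

# Hypothetical rule for ONE log format:
# "<date> <time> <LEVEL> <component>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[\w.]+): "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the extracted fields as a dict, or None if the format differs."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


standard = "2022-01-13 10:15:02 ERROR db.pool: connection to 10.0.0.7 timed out"
custom = "[worker-3] 13/Jan/2022 10:15 | job 8841 failed after 3 retries"

print(parse_line(standard))  # dict with timestamp, level, component, message
print(parse_line(custom))    # None: this format needs its own hand-written rule
```

Every new or changed format means another pattern like `LINE_PATTERN` to write, test, and keep in sync with the application that emits the logs.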

Log2template: A universal log parser

Introducing Log2template

Our vision was to build a universal log parser that takes in raw logs of any arbitrary format and outputs a set of log clusters (templates) based on a defined similarity metric. The system must not need any additional regex rules.

To build such a system, we experimented with multiple techniques and finally settled on a simple, proprietary neural network architecture that is in turn based on word2vec embedding. The idea is to map each log line to a vector in a multi-dimensional vector space where similar log lines share the same vector representation.
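
The intuition can be illustrated with a toy example. This is not word2vec and not the proprietary Log2template architecture: it is a count-based co-occurrence embedding, swapped in because it rests on the same distributional idea (tokens that appear in the same contexts get similar vectors) and needs only the standard library. The function names and sample lines are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def train_token_vectors(lines):
    """Count-based stand-in for word2vec: a token's vector counts the
    tokens that co-occur with it on the same line."""
    vectors = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[token][other] += 1
    return vectors


def embed_line(line, vectors):
    """Line vector = sum of the vectors of its known tokens."""
    total = Counter()
    for token in line.split():
        total.update(vectors.get(token, {}))
    return total


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


corpus = [
    "connection to 10.0.0.7 timed out",
    "connection to 10.0.0.9 timed out",
    "user alice logged in",
    "user bob logged in",
]
vectors = train_token_vectors(corpus)
a, b, c = (embed_line(line, vectors) for line in corpus[:3])

print(a == b)                  # True: both timeout lines get the same vector
print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0: unrelated lines share no context
```

The two IP addresses only ever appear in the same context, so they receive identical token vectors, and the two timeout lines collapse onto one representation without any rule describing what an IP address looks like.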

The input of Log2template is a log message, with or without its header. When the header is kept, we call it a raw log line. Note that removing the header from the input of this step gives better results, since the embedding engine then works on clean, human-readable log content.

High level design

The Log2template system works in two stages:

Offline stage (training phase)

During this phase, the system is trained on logs collected over a pre-defined period of time. At PacketAI, we have a process that determines this period based on log volumes, quality, and other business requirements. The workflow is as follows:

the historical log data set is split into training and test sets
a word embedding neural network model is trained on the training set, then evaluated on the test set
test results are evaluated against a template extraction heuristic, augmented with expert feedback, and model weights are adjusted accordingly
when the desired result quality is achieved, the weights of the chosen model are stored in a weights database
the corresponding templates are stored in a log template database
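
The shape of this workflow can be sketched as follows. Everything in the sketch is a placeholder: a token-frequency table stands in for the neural network, "fraction of test lines that map to a known template" stands in for the evaluation heuristic and expert feedback, and a single `min_count` knob stands in for weight adjustment. The chronological split, the function names, and the sample data are likewise assumptions.

```python
from collections import Counter


def split_history(lines, train_ratio=0.8):
    """Chronological split: older lines train, newer lines test (an assumption)."""
    cut = int(len(lines) * train_ratio)
    return lines[:cut], lines[cut:]


def train_model(train_lines):
    """Stand-in for the embedding network: token frequencies in the training set."""
    return Counter(token for line in train_lines for token in line.split())


def to_template(line, model, min_count):
    """Tokens seen fewer than min_count times are treated as variables."""
    return " ".join(
        token if model[token] >= min_count else "<*>" for token in line.split()
    )


def offline_stage(history, target_coverage=0.9, max_min_count=5):
    train, test = split_history(history)
    if not train or not test:
        return None
    model = train_model(train)
    # Adjust the single tunable knob until the test set is covered well enough.
    for min_count in range(1, max_min_count + 1):
        templates = {to_template(line, model, min_count) for line in train}
        hits = sum(to_template(line, model, min_count) in templates for line in test)
        if hits / len(test) >= target_coverage:
            weights_db = {"token_counts": model, "min_count": min_count}
            template_db = templates
            return weights_db, template_db
    return None


history = [f"connection to 10.0.0.{i} timed out" for i in range(1, 11)]
weights_db, template_db = offline_stage(history)
print(weights_db["min_count"], template_db)
# 2 {'connection to <*> timed out'}
```

With `min_count=1` every training line is its own template and the unseen test lines match nothing, so the loop tightens the setting until the test lines fall into a stored template.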

Online stage (inference phase)

In this phase, the trained model is put online for real-time inference. Since the model is designed to work on huge volumes of logs in real time, a significant amount of work has gone into scaling it and making sure the deployment is fast, auto-scalable, and fault tolerant. The low-level deployment architecture is beyond the scope of this article, though. The workflow of this phase is as follows:

a real-time stream of logs is ingested and the saved word embedding model is applied
the output embedding vectors are compared to the saved templates; if there is a match, the log line is assigned to the corresponding template and the (log line, template) tuple is output
if an Out Of Vocabulary (OOV) vector is found (a vector that does not correspond to any template), further heuristic analysis is done for confirmation
if the OOV is confirmed, the corresponding template is saved into the templates database
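
A minimal sketch of this matching loop is shown below, under stated assumptions: a bag-of-tokens vector stands in for the saved embedding model, cosine similarity with a 0.7 threshold stands in for the unspecified matching rule, and `confirm_new_template` is a placeholder for the heuristic confirmation step. All names and sample lines are hypothetical.

```python
import math
from collections import Counter


def embed(line):
    """Stand-in for the saved embedding model: a bag-of-tokens vector."""
    return Counter(line.split())


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def confirm_new_template(line):
    """Placeholder for the further heuristic analysis; here it only rejects blank lines."""
    return bool(line.strip())


def process_line(line, templates, threshold=0.7):
    """Return (line, template_id); register a new template for a confirmed OOV."""
    vector = embed(line)
    best_id, best_sim = None, 0.0
    for template_id, template_vector in templates.items():
        sim = cosine(vector, template_vector)
        if sim > best_sim:
            best_id, best_sim = template_id, sim
    if best_id is not None and best_sim >= threshold:
        return line, best_id           # match: assign to the existing template
    if confirm_new_template(line):     # OOV: confirm before saving
        new_id = f"T{len(templates) + 1}"
        templates[new_id] = vector
        return line, new_id
    return line, None


# In-memory dict standing in for the templates database.
templates = {"T1": embed("connection to 10.0.0.7 timed out")}
stream = [
    "connection to 10.0.0.9 timed out",
    "user alice logged in",
    "user bob logged in",
]
results = [process_line(line, templates) for line in stream]
print([template_id for _, template_id in results])  # ['T1', 'T2', 'T2']
```

The second line matches nothing and becomes template T2; the third line is then close enough to T2 to be assigned to it rather than creating yet another template.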

The output of Log2template is then presented to the user in the PacketAI logs UI. You can give it a try in less than 5 minutes by registering here:

Conclusion

In this article, we presented our design for Log2template, a universal log parsing engine that is agnostic to log formats. At its core, Log2template is based on word embedding using neural networks, and it solves multiple log representation problems found in traditional observability tools, including:

manual creation and maintenance of regex rules
lack of semantic representations
high computation cost of regex-based systems

In the next part of this series, we will dig deeper into one of the anomaly detection methods we apply to logs, which is based on the Log2template representations.