Logs, Models and Metrics — How and why they need to be connected

Deploying applications with less bugs, in less time and with higher availability at a lower cost (DevOps philosophy). This has become an important business strategy for companies who are trying to get ahead or stay ahead of the competition. Measuring and monitoring faster with higher fidelity results in better management of business-critical services.

Marrying log data with metrics and configuration data has been seen by some in the IT Operations world as the Holy Grail to proactive monitoring, real time root cause analysis and higher quality service delivery at a lower cost.

Logs are often seen as the last defense when it comes to figuring out what went wrong. They are often revered as an immutable testimony as to what happened and when.

Each log is like the page of a book. When woven together, they can tell a story about what may happen and why.

Metrics tell the story about how the application is performing, how the server the app is running on is doing; but on occasion you may miss a metric or you may average a series of metrics for performance reporting reasons and lose fidelity along the way. This is perfectly acceptable, on the premise you have a snapshot of the event somewhere, if there is an issue.

If you mix logs and metrics together, you can use metrics to zero in on logs of interest for high fidelity root cause, i.e. high cpu utilization on your content delivery app may be caused by some underlying application issue such as a problem with handling HTTP requests — which would be in the application logs. On the flip side, you can use logs to zero in on possible or current performance issues. If you are seeing a bunch of errors in your logs, it may be caused by over-provisioned hardware (due to capacity constraints or a badly behaving application) which is often cited as a leading cause of service degradation/downtime.

Configuration data tells us what, when, who and why something changed. It’s often the missing piece to the puzzle when trying to figure why something went wrong.

So ideally, you want to connect these three disparate data sources together.

A model or schema that connects logs with metrics and config is ideal; however, you can use other techniques which are less costly to implement, though also less precise. I’ll cover both approaches and their tradeoffs later in the post.

Mashing up logs, metrics and config with a model makes a lot of sense but it isn’t very straightforward. Many have tried and failed in this endeavour to realize the benefits of mashing up these disparate data sources.

There are many reasons as to why. Some of the key ones are:

  1. Scale — Not able to deal with the volume of incoming data (metrics and/or log)
  2. Model — Static or stale models don’t reflect reality (out of date and not 100% correct). For example, if the data you are mashing up doesn’t represent the current truth, users will lose faith and abandon the solution.
  3. Complexity — Hard to setup and maintain, often requiring a rocket science degree to keep things going.
  4. Expensive — Encapsulating the points above, there are some companies that have built their own home-grown solutions and they work — with a substantial investment of resources, time and money.

Note: For this post we’ll be using Graylog as the log management tool for the graph tutorial, since it has Neo4j support and a built in random message generator.

If you want to jump straight in, check out the tutorial and the Neo4j plugin here.

So what can we do?

I am not going to try and achieve the Holy Grail in a single post, but I will provide a starting point so curious minds who are looking for the ultimate composed solution may get a little closer it.

Let’s start with Log + Model. The reason we want to start here is because a model allows connections with other types of data on the premise a single common key can be identified. In addition, the model, specifically the model relationship(s), allows enrichment of log data that can be actionable. For example, one can generate alerts based on some enumerated relationship that is fed back into the log management or SIEM tool, i.e. If I see a certain number of events occur (>100 per second) at the same time AND only between two related nodes (web server and a database server), I may want to generate an alert, as this behavior is out of the norm.

What is a model?

A model is essentially a set of relationships that describe how one entity is related to another and how relevant it is based on the type of relationship.

LinkedIn and Facebook are social platforms that follow a similar approach with natural examples of a model, relationship, relevancy of those relationships and the number/type of connections, which stipulates a schema of sorts. Both platforms represent social graphs with different types of nodes/edges, such as friends, co-workers, relationships and so on.

One can visualize how this particular model can be used in a social graph context, as shown below:

Social Graph

Why should I care?

We think in models all the time. When we are trying to solve a problem, we immediately think about what is related to the issue and ask ourselves if it is important. The root cause analysis process itself looks like a tree flowchart, which is another form of a directed graph.

Here is an example:

So it makes sense that our data is connected in the way we solve problems.

The Log-Graph Connection

The simplest method of connecting data between logs and other systems is to use normalized time buckets and specific timestamps (though this can be very expensive due to the number of nodes and connections), the second approach is to use a common key such as Hostname, IP address (IP), location etc. (similar to primary/foreign keys in a relational database). The third approach is based on actual dependencies and is often the most difficult one to build and maintain. The three aforementioned approaches are not the only ones available but are important ones to consider, since in many cases they are the most common ways of linking data.

Time Bucket Relationships

Time buckets are exactly that, they are slices of time used to contain common events together in a bucket. Time buckets may help solve the challenge of knowing when to search vs. where to search. Knowing when to search can be difficult when it comes to performing root cause analysis or even doing proactive monitoring across mountains (or lakes) of data with unclear boundaries as to when a problem started (or is about to start) and finished.

By aggregating events (logs, metrics and config CRUD operations fall under the events category) into a time bucket and then iterating over those events with a counter, i.e. count by type, one can quickly visualize where there are intersections of problems occurring.

What does a time bucket look like?

Example of a Time Bucket Graph (1s buckets) in Neo4j

It’s pretty simple. A time bucket is essentially a node type with a label and one or more properties, i.e. 0s-59s,1m-59m,0H-23H,1D-31D,1M-12M

An event may have one or all of the above relationships. Each time relationship would terminate in a unique time bucket node. In combination, they could represent a specific date/time, or separately, be used to identify seasonal trends. For example, event spikes on 30D (30 days) may indicate end-of-month processing jobs.

Given that every event has a timestamp at creation and optionally at ingest time, it would be an simple matter to map a relationship to each node. As part of the commit transaction to the graph (in this case we are using Neo4j as our GraphDB), a query could be executed that traverses all known relationships, computes a sum of all events in each time bucket and then triggers an external event into Graylog if some threshold is breached.

If timestamps are not aligned, can they still be used? The answer is yes, on the premise the time skew between events isn’t too significant. For example, analyzing all events in a minute bucket vs. a second bucket will result in more analysis work, but it sure beats looking at 5 minutes, 50 minutes or 5 hours worth of data.

A practical workaround for severely time-skewed event data is to use ingest time as a way of normalizing timestamps. The downside of this approach is that it may be wrong in certain circumstances and introduce greater uncertainty into the root cause analysis process.

Key Relationships

The next level of association is based on a common set of primitives that each event shares. For example, a log message may contain a timestamp, a Message string, Host/IP address and Process ID.

Log Events

Circular relationships (links) within log messages can be defined by linking common fields across messages. As mentioned earlier, fields such as Host/IP address, User, Error Codes, and a multitude of other possible fields extracted from the raw message that are relatively common to logs may be used to link vertically multiple log messages and log sources. Thus, it is possible to turn a series of log messages into a story. Many tools use these re-occurring or repetitive fields for log deduplication/aggregation to support SIEM or event management use cases. The model process is slightly different for mapping relationships between nodes in that a single key is sufficient for a relationship. Whereas event deduplication (often referred to as log aggregation) generally requires the maximum number of fields to be identical in order to be useful. This results in first seen/last seen and the deduplicated message.

An example of linking log events together:

Linked Logs

Non-Log Events

Metrics and config events are handled in a similar way to logs. Look for common keys starting with coarse grained relationships such as host, network and so on. In most cases, a metric event (datapoint) will contain a timestamp, unit of measure (sometimes), a value and a counter/gauge type, Host/IP address, application name/source (sometimes), component name/source (sometimes). In contrast, a config/model event may contain a timestamp, a configuration item reference — Host/IP address, a configuration item descriptor, location and whether the configuration item is in production or not.

Combining these events together, one can see the common keys are timestamp and Host/IP address. In addition, there could be many other common keys, which will result in better data alignment and more relevant relationships proportional to the uniqueness and number of connections.

Another benefit to key relationships is that they automatically create a meaningful connection — you know there is a relationship via the shared key. Also, if there are significant time skews, i.e. timestamps are not aligned at a second level or even minute level, key relationships can minimize the impact by creating at least one or more meaningful relationships, which then improve the relevancy of a time bucket relationship.

Directed Dependency Relationships

The next level of association is based on the direction of the relationship. This is the notion of a dependency. For example, one can state that a web server is dependent on the underlying database service (on the premise that it is required to function). Therefore, if the database service is unavailable, one could infer that the web service is also unavailable unless there is another dependency relationship to a secondary database service (a redundant data service).

This level of relationship may require a significant amount of effort to establish and maintain, at least an order of magnitude greater than time or key relationships since often the relationships themselves cannot be discovered easily. The common approach is to integrate with a CMDB and/or ITSM DB but often these tools present out of date or incomplete models, unless they are tightly integrated into the entire application delivery and monitoring stack.

It may also be possible to infer dependency relationships based on observations of time/key relationships over time. As mentioned earlier, it is possible to implement complex relationships in log messages and the tutorial at the end of this post will show you how to do this.

For example, if we see that a web service goes down shortly after a database service goes down (based on the event state) then we may infer those two are related. Moreover, if they happen to share a common network path then that increases the probability — especially if the shared network switch is also unavailable.

Dependency relationships can also help provide meaningful time bucket relationships. If you know a web service is dependent on a database service and if the database service is unavailable, there will often be a delay between the database service failing and the web service failing. This results in a time difference between both events. It is expected but hard to correlate using timestamps, and it may limit the resolution of the time bucket since you may only be able to-use a minute time bucket instead of a second or a 5 second time bucket due to the difference in timestamps. However, with a directed relationship, one could infer the correct order of timestamps and then use a simple counter* with a timestamp sequence coupled with the direction of the relationship to determine the possible root cause.

In essence, the greater the number of connections between two different nodes, the easier it is to reason about the direction and nature of the relationship between the two nodes.

*Event count deltas as a ratio between normal and abnormal on each event can be used statistically as a way of associating anomalies, i.e. normal is 5 events per second for a web server, abnormal is 10 events. We see the web server goes into an abnormal state 10 seconds after the database server goes into an abnormal state.

Dependency relationship (directed)

web.graylog.org→ depends_on→db.graylog.org

Time bucket relationship (undirected)

web.graylog.org↔(1)minute↔db.graylog.org

Key relationship (undirected)

web.graylog.org↔production↔db.graylog.org

Simple Example:

9:05:XX — 5 messages (events) received from web server (web.graylog.org)

9:05:XX — 10 messages (events) received from database server (db.graylog.org)

9:06:30–30 messages (events) received from database server (3X more than normal)

9:06:45–100 messages (events) received from web server (20X more than normal)

The last two event rollups indicate abnormal behavior. Moreover, the abnormal event volume for the database service occurred before the abnormal event volume for the web server and the web server depends on the database server. Both are in production and the events occurred during the same minute. We can deduce there is a high probability that they are related incidents and warrant further investigation. We can also deduce that the database server is the most likely cause.

Log, Model and Metric Undirected Topology Example (conceptual)

Red color= time bucket relationship

Blue color = timestamp relationship

Green color = key relationship

Note: Circular and directed relationships have been omitted for clarity.

In the example above, a combination of a common key, a time bucket relationship and a timestamp are used to connect different data sources. Each relationship can be reasoned about separately or together in a single query. A trigger may then be initiated to drive an external action based on some condition evaluation or relationship connection.

Deep Analysis

Simple graph theories can be applied for analysis, such as nearest neighbours, number of relationships and depth of connections. It is also possible to plug-in a machine learning library such as a bayesian network algorithm or a multitude of other machine learning algorithms. Neo4j even has an R plugin that can be used for statistical analysis.

Intelligent Feedback

It is conceivable that an event could be emitted via a transaction hook (Neo4j supports this useful feature) that would send a new event into Graylog as a special log type (or simply use the Graylog REST API and call upon any specific method as needed). The log/event may contain a new rule for further log analysis/enrichment. Thus, overtime, Graylog may become more intelligent by leveraging a graph data structure. These graph relationships can provide intuitive insight into flat log data by dynamically generating rules based on inferred relationships and intersections of data.

Also, an event may be used to augment existing log data in the form of additional fields to an existing log message, pointing at the original for future search/traversal requests.

Future versions of Graylog will support search result enrichment, making it possible to drill down on a search result for more context; for example, executing a query in Neo4j and returning a table of results (event view) and/or a graph view of underlying connections (topology view).

Other use cases may include identification of non-redundant services within a given service topology. If we believe we know about all infrastructure components, it should be possible then to reason about redundancy based on N tier connections within the same group. For example, I would expect that a database service would have a redundant peer that is connected by a special cluster relationship. If I don’t detect two or more of the same nodes based on relationship type traversal, then I can say that the service is at risk of becoming unavailable if the database node goes down.

Conclusion

There are many possibilities for integrating logs and other types of data via a graph database. By using time buckets, key relationships and directed dependencies, one can quickly magnify the value of the existing mass of logs that are housed within Graylog. The end result is improved service delivery, better root cause analysis and less application defects being found in production.

Getting Started

So where to start? First you’ll need the Neo4j output plugin which can be found here. I’ve also created a demo content pack with detailed deployment instructions here.

10 steps to link your log data

  1. Identify the goal — Is it to link log events together in a more meaningful fashion to expedite root cause analysis? Is it to marry metric data with logs? Config data with logs? Both? Is the reason to improve root cause? Improve service levels through better proactive monitoring?
  2. Identify data sources — What do you use for your log schema? What do you use for a CMDB? What do you use for orchestration? What do you use for performance and availability data, what are the common elements between them?
  3. Identify common keys between the data sources that you would like to integrate with. Do they have hostnames? FQDNs? network address? are they virtualized/physical?
  4. Extract a sample of common keys between each of the data sources and verify you can link each dataset together in a meaningful way.
  5. Export a CSV/JSON dump from each source (if necessary).
  6. Verify each CSV/JSON has the common key that makes sense, i.e. host, IP address etc.
  7. Import CSV into target database (Neo4j in this example and it has very good tools for bulk loading CSV files).
  8. Update any specific log inputs with appropriate tags that can be used to map to imported data.
  9. Implement a query that dumps select log messages into target database.
  10. Implement a database trigger that generates an event based on some condition, i.e. matching field.

Special thanks to Marius Sturm @ Graylog for developing the Neo4j output plugin, writing the initial getting started instructions and contributing content to this post. Also I would like to thank Henrik Johansen for inspiring me to write this long overdue post with his blog on the 3rd use case for modern log platforms.