SRE: Knowledge Graphs: Increased Context in Human-Involved Incident Response (IR)

Published in Dm03514 Tech Blog, May 29, 2019

Incident response that involves human responders requires context about the systems and services that are having issues. Getting this context becomes harder as an organization and its number of services grow. Often the incident responder bears the burden of forming complex mental models of the systems they are trying to assess. As an industry, our incident response tools aren’t keeping pace with our microservice architectures. A new class of tool is required to keep track of essential services in their business context. LinkedIn’s Third Eye proves that these tools are technically possible, but they carry a huge implementation overhead. This post proposes a simple, low-friction approach to centralizing critical events related to services (such as deploys), which reduces the burden on the IR engineer, reduces MTTR, and makes querying complex system data and event state trivial.

An Example

Imagine that you’re on call for a service and a PagerDuty alert fires: increased latency on all requests for the service! What do you do? What information do you need to begin understanding the system, its current state, and how it got there? One of the most common heuristics in this situation is to determine when the last deploy happened.

In common IR models, engineers may have to look at Slack or Jenkins to determine this information. Build information may also be stored in a metrics system (like Datadog) as events, which can be overlaid onto time series:

While this is extremely valuable, it is not dependency-aware and requires statically defining events (no dynamic queries), meaning there is no way to model more than a single-degree relationship using events and overlays. In comparison, IR Knowledge Graphs let engineers dynamically see events across their service and all of its dependencies, which provides the on-call incident responder with rich context around incidents and potential influences on those incidents. They provide a rich picture of the current, real-time state of the system and the events that contributed to that state. This differs significantly from overlaying known events on a graph!

So What is an Incident Response (IR) Knowledge Graph?

An IR Knowledge Graph (not to be confused with Google’s Knowledge Graph) stores structured, graph-based data and makes it easy to query that data and its relationships. It provides a centralized location to query the current system state and the events that led to that state. This is a graph in the traditional sense: it incorporates the information that is critical to developing an understanding of the system, along a dimension of time.

Storing the system state and its associated events over time makes it possible to understand which events led to the current state and to trace the causal events that affect state over time. As the system changes state, it’s critical to have as much context around those changes available as possible. In the current state of the industry, state changes are fractured across multiple tools, teams, and mental models (Jenkins, Jira, GitHub, Slack, PagerDuty, etc.!). Having so many disparate stores fractures system understanding, which handicaps human incident responders, discourages automated response, and prolongs incidents. In contrast, IR Knowledge Graphs centralize information from disparate sources in order to provide a sane view into system state while maintaining system structure and exposing it for ad hoc analysis.

To make this more concrete, the rest of this post builds an IR Knowledge Graph and shows how it can be used to make service events and system state available to human responders. This IR Knowledge Graph will support retrieving the latest deploy of a service (and of its dependencies), the status of those deploys (in progress vs. complete), and their durations. To do this it must support the following queries:

Most recent deploy and its state for an individual service
Most recent deploys of a given service’s dependencies

Components

The IR Knowledge Graph requires two components in order to support the above queries:

Logical Topology — structure of services and their dependencies
Events — Important temporal actions that target services

These two components can be visualized below:

What’s not included in this version of the IR Knowledge Graph are the physical components (hosts, VMs, load balancers, etc.) that underlie the logical components (shown above).

In Practice

Next we’ll define each IR Knowledge Graph component using our architecture.

Logical Topology

First, we’ll create a logical topology of core services and their common infrastructure-related dependencies:

Depending on how often these services and dependencies change, keeping the topology up to date may not even warrant an automated process.
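As a rough illustration of what a logical topology holds, here is a minimal in-memory sketch in Python. It is not the graph database used for the post's examples. Only the Users, flags and photos services come from the post; the `postgres` node, the `DEPENDS_ON` name and the `dependencies` helper are assumptions made for this example.

```python
from collections import deque

# Hypothetical topology: each key is a service, each value lists what it
# depends on. "postgres" is an illustrative infrastructure dependency that
# does not appear in the post.
DEPENDS_ON = {
    "users": ["flags", "photos"],
    "flags": ["postgres"],
    "photos": ["postgres"],
    "postgres": [],
}

def dependencies(service, topology):
    """Return every service reachable from `service`, breadth-first."""
    seen = []
    queue = deque(topology.get(service, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # shared dependencies are only reported once
        seen.append(node)
        queue.extend(topology.get(node, []))
    return seen

print(dependencies("users", DEPENDS_ON))  # ['flags', 'photos', 'postgres']
```

Because the walk follows edges transitively, it reaches second-degree dependencies such as `postgres`, which is the kind of multi-degree relationship that static event overlays cannot express.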

Events

The next step is overlaying events, which are temporal and reference nodes of the logical service topology:

Events can be tracked by submitting Jenkins build information (relative to a build id) to the knowledge graph.
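One way to picture a build-keyed event is the sketch below. The `DeployEvent` fields, the status strings and the `record_build` helper are all hypothetical. The post does not specify what Jenkins submits beyond build information tied to a build id.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional

@dataclass
class DeployEvent:
    # All field names are assumptions for this sketch.
    build_id: str
    service: str            # node in the logical topology this event targets
    status: str             # e.g. "in_progress" or "complete"
    started_at: datetime
    finished_at: Optional[datetime] = None

    @property
    def duration_seconds(self) -> Optional[float]:
        """Duration of the deploy, or None while it is still running."""
        if self.finished_at is None:
            return None
        return (self.finished_at - self.started_at).total_seconds()

def record_build(events: Dict[str, DeployEvent], build_id: str,
                 service: str, status: str, at: datetime) -> DeployEvent:
    """Create or update the event for `build_id` from a build notification."""
    event = events.get(build_id)
    if event is None:
        event = DeployEvent(build_id, service, status, started_at=at)
        events[build_id] = event
    else:
        event.status = status
    if status != "in_progress":
        event.finished_at = at  # any terminal status closes the event
    return event

events: Dict[str, DeployEvent] = {}
record_build(events, "42", "users", "in_progress", datetime(2019, 5, 29, 9, 30))
record_build(events, "42", "users", "complete", datetime(2019, 5, 29, 9, 35))
print(events["42"].status, events["42"].duration_seconds)
```

Keying on the build id means the "started" and "finished" notifications for the same build update a single event, so both status and duration can be read from one record.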

The Incident

Pretend that alerts begin to fire for “Users”: an increase in latency is detected. The IR Knowledge Graph supports querying for the service’s builds (and dynamically for all events associated with the service), which shows that there is currently a deploy in progress:
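The "most recent deploy and its state" query can be sketched in plain Python as a filter plus a max over start time. The event records, build ids and timestamps below are invented for illustration, and `latest_deploy` is a hypothetical helper rather than part of any graph database API.

```python
from datetime import datetime

# Hypothetical deploy events; build ids, timestamps and field names are
# illustrative and not taken from the post.
EVENTS = [
    {"service": "users", "build_id": 101, "status": "complete",
     "started_at": datetime(2019, 5, 28, 14, 0)},
    {"service": "users", "build_id": 102, "status": "in_progress",
     "started_at": datetime(2019, 5, 29, 9, 30)},
    {"service": "photos", "build_id": 77, "status": "complete",
     "started_at": datetime(2019, 5, 29, 9, 0)},
]

def latest_deploy(service, events):
    """Return the most recently started deploy event for `service`, or None."""
    matching = [e for e in events if e["service"] == service]
    return max(matching, key=lambda e: e["started_at"], default=None)

current = latest_deploy("users", EVENTS)
print(current["build_id"], current["status"])  # 102 in_progress
```

With this sample data the responder sees right away that the newest Users deploy has not finished, which is the signal described above.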

Users Dependencies

With a graph it’s also easy to query the deploy status for all components that Users depends on (components shown below):

Since relationships are encoded in the graph, recent events affecting each of these dependencies can also be dynamically queried:

Remember from the event graph that both flags and photos had deploys (shown in the image above), and both of those services are dependencies of Users.
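The dependency query combines the two components: walk the topology from Users, then look up the latest deploy for every service reached. The sketch below uses invented build ids and timestamps and a hypothetical `postgres` node. The helper names are assumptions for this example.

```python
from datetime import datetime

# Hypothetical topology and events; only users, flags and photos come from
# the post, everything else is illustrative.
DEPENDS_ON = {
    "users": ["flags", "photos"],
    "flags": ["postgres"],
    "photos": [],
    "postgres": [],
}
EVENTS = [
    {"service": "flags", "build_id": 12, "status": "complete",
     "started_at": datetime(2019, 5, 29, 9, 10)},
    {"service": "photos", "build_id": 76, "status": "complete",
     "started_at": datetime(2019, 5, 28, 16, 0)},
    {"service": "photos", "build_id": 77, "status": "complete",
     "started_at": datetime(2019, 5, 29, 9, 0)},
]

def transitive_dependencies(service, topology):
    """Return all direct and indirect dependencies of `service`."""
    found = []
    pending = list(topology.get(service, []))
    while pending:
        node = pending.pop(0)
        if node not in found:
            found.append(node)
            pending.extend(topology.get(node, []))
    return found

def dependency_deploys(service, topology, events):
    """Map each dependency of `service` to its latest deploy (or None)."""
    latest = {}
    for dep in transitive_dependencies(service, topology):
        deploys = [e for e in events if e["service"] == dep]
        latest[dep] = max(deploys, key=lambda e: e["started_at"], default=None)
    return latest

for dep, deploy in dependency_deploys("users", DEPENDS_ON, EVENTS).items():
    print(dep, deploy["build_id"] if deploy else "no deploys recorded")
```

A dependency with no recorded events maps to `None` instead of being dropped, so the responder can tell "nothing deployed" apart from "not a dependency".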

It is easy to imagine making all infrastructure events (e.g. AWS state changes) available for querying, as well as Jira or GitHub information for quickly determining the origin or composition of deploys. The events that are stored are constrained only by machine storage, and systems such as Dgraph support horizontally scalable distributed graphs.

Conclusion

While SRE is still a young discipline, LinkedIn has proved that it is technically feasible to centralize an entire company’s metrics in order to provide a uniform, contextually rich system view. This post proposes a low-friction way to begin centralizing system information in order to enhance the context that incident responders have available to them. The crazy thing is that these queries are exposed for free, just by mirroring the real-life graph structure in a graph database. This literally isn’t an application; it’s just data with relationships!

Graphs

The examples were created using Neo4j Desktop for Mac (https://neo4j.com/download/).
