Feeding Graph databases - a third use-case for modern log management platforms.

The motivation for collecting log data in a modern log management platform usually falls into two distinct categories — one being the ability to gain insight from the collected data (for purposes such as alerting, reporting, trending, root-cause analysis, security operations, etc.) and the other being compliance, or for some organizations even a combination of both.

The emerging third use-case would be linked-data analysis. Let me explain …

We collect tens of thousands of events per second from thousands of different sources across our entire infrastructure for all the various reasons stated above — a busy day for our Graylog cluster can be measured in hundreds of GBs of additional data gathered.

All of that delicious data can be harvested to drive different kinds of insights and we are always looking for more interesting stuff to discover in our data - stuff that can help us run our business as efficiently as possible.

I would like to share one such discovery that has the potential to profoundly impact the way we reason about our business operations.

A little story

We came to a sudden realization a couple of months ago; we knew that the logs being processed by Graylog held a lot of individual pieces of insight into our infrastructure operations, but it was really hard to reason about them at scale. We needed a way to look at those individual pieces of the puzzle in relation to each other in order to discover a more complete picture.

Graylog generally does a fine job of letting individuals or companies derive insights from their data using the built-in functionality, but on certain occasions other, more specialized tools are required to unlock the full potential that resides in the collected data.

We treat Graylog as our primary data hub for log data, which gives us a central viewpoint of all data while retaining the ability to selectively forward data to other systems via the excellent Graylog output plugins.

In order to formalize our initial experiment we wanted to find the answer to 3 different questions:

  1. What are the dependencies within and between different systems?
  2. What impact would changes to our infrastructure have?
  3. How can we track certain behavior to address potential security concerns?

To solve those particular problems we turned our attention to Graph databases. Why Graph databases you might ask? Well, we speculated that analyzing computer network relationships isn’t much different from Social Network Analysis which Graph databases have somewhat of a reputation for.

We chose to play with Neo4j — it’s a well known, well documented Graph database that ticked all our boxes, and thanks to the work of @mariusstorm, Graylog now has an experimental output plugin that can forward data into Neo4j, making our experiments much, much easier.

A quick disclaimer — I am by no means a subject matter expert on the topics below, just a curious individual.

The fun stuff

#1 : What are the dependencies within and between different systems?

Many different logs contain information about the communications between entities — anything that contains a source and destination IP for example is sufficient to indicate who’s talking to who.

Let’s start with something simple, such as Netflow data.

The basic setup is really simple — the Graylog Neo4j output is configured to issue a couple of Cypher queries:

MERGE (source:IP { address: '${netflow_ipv4_src_addr}' })
MERGE (destination:IP { address: '${netflow_ipv4_dst_addr}' })
MERGE (source)-[:CONNECTED_TO]->(destination)

This will create a Node in Neo4j for every source and destination address (individual fields in Graylog) and also create a relationship between the source and the destination.

We really wanted to showcase this to different interested parties within our organization, so we needed a way to expose the data in an easy-to-use interface without forcing users to write their own Cypher queries - Linkurious seemed the perfect candidate. It’s a really great interface for searching, exploring and analyzing Graph data in an intuitive way without requiring special domain knowledge.

In Linkurious that looks something like this:

Now we can search for any machine and use that machine as a starting point for exploring the Graph, unveiling the intricate relationships between machines.

Besides the actual IP addresses we store additional context as properties on our Nodes, such as:

  • the actual hostname.
  • a “last_seen” UNIXTIME timestamp.
  • one or more links to the documentation for that machine or system.
  • a link to the events section for that machine in our monitoring system.
  • a link to Graylog that displays the last 24 hours of logs from that machine.

These are available directly from the Linkurious interface and thus allow for fast and easy access to the most relevant contextual data for our Operations team.
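Keeping those properties up to date is a matter of extending the MERGE statements with a SET clause. A minimal sketch — the hostname, the URLs and the property names are all made-up placeholders, not the exact keys we use:

// Attach / refresh context properties on a machine's Node
MERGE (m:IP { address: '10.1.2.3' })
SET m.hostname = 'app01.example.com',
    m.last_seen = timestamp() / 1000,   // timestamp() is milliseconds since epoch, hence the division to UNIXTIME seconds
    m.docs_url = 'https://wiki.example.com/app01',
    m.events_url = 'https://monitoring.example.com/app01/events',
    m.graylog_url = 'https://graylog.example.com/search?q=source%3Aapp01&relative=86400'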

#2 : What impact would changes to our infrastructure have?

Answering this question lets us do some pretty interesting things, like validating the impact of an RFC or maintenance window, or simulating different failure scenarios. Another intriguing option is to ask questions like “What’s the most critical component for this system?” … or “What are the most critical dependencies in this datacenter?”
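The last question can already be approximated with a plain aggregation over the :CONNECTED_TO relationships from question #1. A minimal sketch, ranking machines by how many distinct machines connect to them:

// Machines with the most distinct inbound dependencies
MATCH (dependent:IP)-[:CONNECTED_TO]->(target:IP)
RETURN target.address AS machine, count(DISTINCT dependent) AS dependents
ORDER BY dependents DESC
LIMIT 10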

We thought that this would be easy once we had a solution to our first question — turns out that we were wrong :(

The fact that 2 machines were communicating wasn’t really enough — we wanted to know WHAT was communicating, not only WHO. We wanted to know exactly what applications were talking to each other.

So we turned to another log source in our Graylog cluster — the Sysmon logs from our machines. Sysmon actually tells you what executable initiated or responded to a network connection:

Sysmon has a native filtering mechanism so you can control or filter out unwanted data directly at the source — you might not be interested in broadcast traffic or chatter from your monitoring system.

Besides Sysmon data we also played with data from our Storage arrays — we have a lot of CIFS shares that various systems depend on.

To be honest, we’re still not sure how to model this in a way that is easily comprehensible when visualizing the relationships. The current experiment introduces another node & relationship type (a node :HAS an application that :TALKS to other applications).

We are now able to attach applications to our nodes and then create relationships between those different applications — for example, that ‘c:\app\fat_desktop_client.exe’ on machine X was initiating connections to ‘c:\server\app_server.exe’ on machine Y, that machine Y has an active connection to a fileshare named ‘fat_app_share’ on storage cluster Z, and a connection to ‘sqlserver.exe’ on machine Q.

The model looks something like this:
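Expressed in Cypher, the fat_desktop_client example could look roughly like the sketch below. The IP addresses are placeholders, and keying applications by image path alone means the same executable on different machines collapses into a single node — which may or may not be what you want:

// Machines
MERGE (x:IP { address: '10.0.1.10' })
MERGE (y:IP { address: '10.0.1.20' })
// Applications
MERGE (client:Application { image: 'c:\\app\\fat_desktop_client.exe' })
MERGE (server:Application { image: 'c:\\server\\app_server.exe' })
// A node :HAS an application that :TALKS to other applications
MERGE (x)-[:HAS]->(client)
MERGE (y)-[:HAS]->(server)
MERGE (client)-[:TALKS]->(server)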

Time will tell if we need to rethink this way of structuring our Graph; thankfully that’s pretty easy to do in Neo4j :)

There was another issue — we didn’t really model which system a given machine was part of. We run hundreds of different systems and keeping track of what’s part of what is challenging at times. We actually have a mapping of those relationships in our monitoring system, so we’re currently testing some glue code that lets us apply labels in Neo4j based on that data - each node will be labeled with the name of the system (or systems) that it’s part of.
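Cypher does not allow labels to be parameterized, so the glue code has to splice the system name into the statement itself. A hypothetical example for a system called ‘Billing’ (the hostnames are made up):

// Label every machine our monitoring system maps to the 'Billing' system
MATCH (m:IP)
WHERE m.hostname IN ['app01.example.com', 'db01.example.com']
SET m:Billing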

Once something like https://github.com/Graylog2/graylog2-server/issues/644 gets implemented, this sort of job could happen directly within Graylog without the need for special code.

#3 : How can we track certain behavior to address potential security concerns?

Information security is a real differentiator in almost all industries these days, and we pay close attention to that part of our business operations, so it was natural to look at our data from that perspective too. Since this is the youngest part of our research into Graph databases, it is also the least evolved at the moment.

The first use-case I would like to share with you exemplifies the notion that data can be reused for different purposes. To answer question #2 we utilized the information present in the Sysmon logs — the same data can be used for security-related purposes as well :)

The Sysmon logs that we ingest also record every binary that gets executed on our machines — this includes stuff like the full path to the image file, the parent process, the file hash, the user context, etc:

A really simple but useful tactic is to graph the parent-child relationships in a hierarchical Graph (I have removed the process names to protect the innocent):

This is very interesting as part of a potential incident investigation but it also forms the basis for playing with statistical algorithms to detect potential outliers across thousands of machines. You can also formalize a model where you’re looking for unusual relationships — such as notepad.exe spawning cmd.exe … or services.exe running under a user context.
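Assuming each execution is stored as a :Process node with a :SPAWNED relationship to its children — our naming and property names, not necessarily the final model — the notepad.exe example can be expressed directly:

// Flag machines where notepad.exe has spawned cmd.exe
MATCH (parent:Process)-[:SPAWNED]->(child:Process)
WHERE parent.image ENDS WITH 'notepad.exe'
  AND child.image ENDS WITH 'cmd.exe'
RETURN parent.host AS machine, parent.image, child.image, child.user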

Essentially that is a Graph model of what I wrote about yesterday:

https://gist.github.com/henrikjohansen/81d01b200ea3e58329ea

Another source of data that most people already have in their log management platform is information about user logins — it’s trivial to build a Graph that lets you model login behavior and potentially also detect behavior that should warrant further investigation.
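A minimal sketch, assuming each login event is turned into a (:User)-[:LOGGED_IN_TO]->(:IP) relationship and that the Graylog messages carry username and source_ip fields (both field names are illustrative):

MERGE (u:User { name: '${username}' })
MERGE (m:IP { address: '${source_ip}' })
MERGE (u)-[r:LOGGED_IN_TO]->(m)
SET r.last_seen = timestamp() / 1000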

The last example in this category is my personal favorite and something I am still reasoning about — given that most ransomware exhibits a behavioral pattern that’s … unusual … from a storage perspective, it should be possible to model audit data from a storage system and detect that behavior in real-time … or at least that’s what I hope. Anyways, it’s an interesting problem to tinker with :)

Additional findings

Besides being able to answer all 3 of our original questions, we discovered a couple of other interesting things along the way that might come in handy in the future :)

If you continuously maintain the nodes and relationships in your data, you also create a complete history of how those have evolved. In our case it is very beneficial to be able to look at that evolution and see exactly how the interactions between components have changed over time.
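Cypher’s MERGE makes that kind of bookkeeping cheap. A sketch using ON CREATE / ON MATCH (the first_seen/last_seen property names are our own choice):

MERGE (s:IP { address: '${netflow_ipv4_src_addr}' })
ON CREATE SET s.first_seen = timestamp() / 1000
ON MATCH SET s.last_seen = timestamp() / 1000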

Conclusion

This has been a very interesting project to play with - we will refine our models and probably put parts of it into full-blown production as early as Q1 2016.

We could have achieved the same results using something like Hadoop, Spark, etc. but that would require a lot more dedication and knowledge. Using tools that are inherently simpler to operate and run makes these techniques available to a lot more people.

The stuff outlined above might not represent a typical or complex Graph problem - this is somewhat deliberate, since we wanted to demonstrate what can be achieved without special domain knowledge of Graph theory. That does not mean, however, that Graph theory cannot be applied to these types of problems; anomaly detection using clustering algorithms is something we are already playing with (and with good results, too). Another typical use-case would be finding the shortest path between a user and a certain node … or any node of a given type.
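That shortest-path question is essentially a one-liner in Cypher. A sketch, reusing the :User modeling assumed earlier (the user name and address are placeholders):

// Shortest chain of relationships between a user and a given machine
MATCH p = shortestPath((u:User { name: 'jdoe' })-[*..10]-(m:IP { address: '10.0.1.20' }))
RETURN p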

The term “linked data” or “linked data analysis” has been gaining some traction in the industry - the videos below provide some additional insight and inspiration, specifically for information security purposes: