Network Graph Analysis for Suricata and Zeek using Brim and NetworkX
Welcome to our second article on Brim’s Data Science blog. In the first article in this series, we learned how to use Brim’s Python library to fetch Zeek data into Pandas.
Today we’re going to build on what we learned last time. Instead of just looking at Zeek data by itself, we’re going to fuse Zeek and Suricata data together. We’re also going to improve how we visualize our network graph to gain some useful insights.
About Brim
If you’re new to Brim, Zeek and Suricata:
- Brim is an open source tool to search and analyze pcaps, Zeek and Suricata logs.
- Zeek is the most popular open source platform for network security monitoring.
- Suricata is an open source threat detection engine, commonly used as an Intrusion Detection and Prevention System (IDS/IPS).
Brim can import raw pcaps, enrich and analyze them with embedded Zeek and Suricata engines, and make the results available for search and analysis in the Brim app. Brim also provides a Python library to support data science use cases and pipelines. We’ll be using Brim to create network graphs of network and threat activity.
Instructions and Prep
You can download Brim here
Installation instructions for Brim are here
Instructions for Brim’s python library can be found here
We’re going to be using NetworkX and Jupyter Notebook
Today’s malware sample (password: infected) is courtesy of Malware Traffic Analysis and contains a Trickbot infection
Jupyter Notebook
The full and functional Gist for the code in this article can be found here
Getting started
First we’ll import all of our required libraries
Next, we’ll select our Brim space to work with. You can find your space in the Brim app in the upper left corner. You can right-click the Space and then copy and paste the full name.
Z Queries
We’re also going to need to define the Z queries we want to use. The first query is similar to the one we used in the last article. We filter for Zeek’s “conn” stream, and then cut out the id.orig_h (source), id.resp_h (target), and id.resp_p (target port) and count for unique occurrences.
_path=conn | count() by id.orig_h, id.resp_h, id.resp_p | sort id.orig_h, id.resp_h, id.resp_p
Note that we’re using count() to aggregate the logs. While this means that we'll lose some of the fidelity for calculating graph attributes such as clustering, it also pushes some of the heavy processing to ZQ and Brim.
For fetching the Suricata data, we’ll be using the following query:
event_type=alert | count() by src_ip, dest_ip, dest_port, alert.severity, alert.signature | sort src_ip, dest_ip, dest_port, alert.severity, alert.signature
The query filters for Suricata alerts with event_type=alert, then counts and sorts by src_ip (source), dest_ip (target), dest_port (port), alert severity, and alert signature.
Creating two DataFrames
If the queries executed correctly, we now have two DataFrames, df containing the Zeek results, and df2 containing the Suricata alert data.
Prepping the DataFrames
Before we can use our data to create a network graph, we need to do some data preparation. First we’re going to prepare two DataFrames to merge the Zeek and Suricata data, called dfz and dfs.
We’ll assign id.orig_h and src_ip to a column named source to make indexing easier. We also need to change our count columns for both data sources, or we'll have duplicate fields.
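As a minimal sketch of this renaming step, the block below uses two tiny hand-made DataFrames in place of the real query results. The sample rows and the column names port, conncount and signature are assumptions for illustration; only source, target, alertcount and severity are named in this article.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Zeek and Suricata query results.
df = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1"],
    "id.resp_h": ["10.0.0.2", "198.51.100.9"],
    "id.resp_p": [53, 443],
    "count": [10, 1],
})
df2 = pd.DataFrame({
    "src_ip": ["192.0.2.7"],
    "dest_ip": ["10.0.0.1"],
    "dest_port": [445],
    "alert.severity": [1],
    "alert.signature": ["Example signature"],
    "count": [3],
})

# Give both sources the same key columns and distinct count columns.
dfz = df.rename(columns={
    "id.orig_h": "source", "id.resp_h": "target",
    "id.resp_p": "port", "count": "conncount",  # conncount is an assumed name
})
dfs = df2.rename(columns={
    "src_ip": "source", "dest_ip": "target", "dest_port": "port",
    "alert.severity": "severity", "alert.signature": "signature",
    "count": "alertcount",
})

print(dfz.columns.tolist())
print(dfs.columns.tolist())
```

With matching key columns and separate count columns, the two frames can be stacked without the counts colliding.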
Merging the DataFrames
Now we need to merge the two DataFrames, using pandas.concat(). We should end up with a merged data frame, indexed by source, target, and port, with the associated count of alerts and connection transactions attached to each connection. Also note we're setting ignore_index=True to maintain a continuous index value across the rows in the new appended data frame.
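A small sketch of the concatenation, assuming two already-renamed frames with made-up rows (conncount is an assumed column name):

```python
import pandas as pd

# Hypothetical, already-renamed Zeek (dfz) and Suricata (dfs) frames.
dfz = pd.DataFrame({
    "source": ["10.0.0.1"], "target": ["10.0.0.2"],
    "port": [53], "conncount": [10],
})
dfs = pd.DataFrame({
    "source": ["192.0.2.7"], "target": ["10.0.0.1"],
    "port": [445], "severity": [1], "alertcount": [3],
})

# Stack the rows; ignore_index=True renumbers them 0..n-1.
merged = pd.concat([dfz, dfs], ignore_index=True)
print(merged)
```

Columns that exist in only one of the inputs come out as NaN for the rows of the other input.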
Populate NaN fields
Because there are usually far more connections without corresponding Suricata alerts, we’ll end up with many records where alertcount and severity are unpopulated and filled with NaN. We're going to populate all NaN fields with 0.
Recast types
pandas.concat() will type the numeric columns that now contain NaN values as floats, so we'll recast these as int64.
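The NaN fill and the recast can be sketched together. The frame below is a hypothetical merged result, and conncount is an assumed column name:

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with the gaps left behind by the concatenation.
merged = pd.DataFrame({
    "source": ["10.0.0.1", "192.0.2.7"],
    "conncount": [10, np.nan],
    "alertcount": [np.nan, 3],
    "severity": [np.nan, 1],
})

# Replace NaN with 0, then cast the float columns back to integers.
numeric_cols = ["conncount", "alertcount", "severity"]
merged[numeric_cols] = merged[numeric_cols].fillna(0).astype("int64")

print(merged.dtypes)
```

The fill has to happen before the cast, because NaN cannot be stored in an int64 column.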
Calculate weights
We’re also going to calculate some weights based off of the connection and alert counts. It’s not a fantastically sophisticated calculation: we divide 10 by the maximum value for count, multiply it by the count and add 0.1 (so that no weight is ever zero, which avoids a divide by zero error). This will give us a range from 0.1–10.1.
We are going to calculate an alertweight for force-directed graphs. And we'll also be calculating a connweight to colorize the edges representing the Zeek conn transactions.
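The weight calculation described above can be sketched as follows. The helper name scale_weight, the sample counts and the conncount column name are assumptions; the formula is the one given in the text.

```python
import pandas as pd

# Hypothetical counts; each column needs at least one non-zero value.
merged = pd.DataFrame({"conncount": [0, 5, 20], "alertcount": [10, 0, 2]})

def scale_weight(counts: pd.Series) -> pd.Series:
    """Map counts to 0.1..10.1: 10 / max * count + 0.1."""
    return 10 / counts.max() * counts + 0.1

merged["alertweight"] = scale_weight(merged["alertcount"])
merged["connweight"] = scale_weight(merged["conncount"])

print(merged)
```

A row with a count of 0 ends up at the 0.1 floor, and the row with the largest count ends up at 10.1.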
Let’s print out some data about our DataFrame to validate that our calculations were successful and have been applied.
Creating our Graph
Our DataFrame is now ready to feed into a graph. We’re actually going to create two graphs. The first is a multi-directed graph (a NetworkX MultiDiGraph) that can store multiple parallel and directed edges, allowing us to model connections to different ports and with different Suricata signatures, as well as whether they were sent or received by a node.
The second graph is a standard Undirected graph which we’ll use for algorithmic graph analysis.
Note how we define port as the edge key, and we keep all of the edge attributes such as severity and alert weights by defining edge_attr=True.
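A minimal sketch of building both graphs with networkx.from_pandas_edgelist(), assuming a small hand-made edge list. The variable names G and U and the sample rows are illustrative, and the edge_key parameter needs a reasonably recent NetworkX release.

```python
import networkx as nx
import pandas as pd

# Hypothetical merged edge list.
merged = pd.DataFrame({
    "source": ["10.0.0.1", "10.0.0.1", "192.0.2.7"],
    "target": ["10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "port": [53, 443, 445],
    "severity": [0, 0, 1],
    "alertweight": [0.1, 0.1, 10.1],
})

# Directed multigraph: parallel edges between two hosts are keyed by port.
G = nx.from_pandas_edgelist(
    merged, source="source", target="target",
    edge_attr=True, edge_key="port", create_using=nx.MultiDiGraph(),
)

# Plain undirected graph for algorithms that do not accept multigraphs.
U = nx.from_pandas_edgelist(merged, source="source", target="target", edge_attr=True)

print(G.number_of_edges(), U.number_of_edges())
```

The two connections from 10.0.0.1 to 10.0.0.2 stay separate in the multigraph and collapse into a single edge in the undirected graph.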
Add node attributes
We also need to add attributes to our node list, as this is not done automatically by networkx.from_pandas_edgelist(). We'll add the alertcount, severity, alertweight, and connweight, so that we can use these as weights when we draw our nodes.
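One possible way to attach node attributes is sketched below for alertcount and severity. The aggregation rule (sum of alert counts over a node's edges, largest severity value seen) is an assumption and not necessarily what the notebook does; the same pattern would extend to alertweight and connweight.

```python
import networkx as nx

# Hypothetical graph with two edges.
G = nx.MultiDiGraph()
G.add_edge("192.0.2.7", "10.0.0.1", key=445, alertcount=3, severity=1)
G.add_edge("10.0.0.1", "10.0.0.2", key=53, alertcount=0, severity=0)

# Aggregate edge attributes onto both endpoints of every edge.
alertcount = {node: 0 for node in G.nodes}
severity = {node: 0 for node in G.nodes}
for u, v, data in G.edges(data=True):
    for node in (u, v):
        alertcount[node] += data["alertcount"]
        severity[node] = max(severity[node], data["severity"])

nx.set_node_attributes(G, alertcount, "alertcount")
nx.set_node_attributes(G, severity, "severity")

print(dict(G.nodes(data=True)))
```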
Adjust for graph size
NetworkX is not the ideal tool for visualizing network graphs with many nodes (it is designed primarily for graph analysis; see the Large Networks section at the end of the article).
1000 nodes is a safe limit for us to visualize.
Analyzing our graphs
Awesome — our graphs are constructed, so now we can start analyzing them to get a feel for what our data contains.
We’re going to look at a few different graph attributes and metrics.
Graph Density: A dense graph is a graph in which the number of edges is close to the maximal number of edges, i.e. with almost all nodes connected. Density is measured between 0 and 1.
Graph Transitivity: Transitivity is the overall probability for the network to have adjacent nodes interconnected.
Average Clustering: The average clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Nodes tend to create tightly knit groups characterised by a relatively high density of ties.
Greedy Modularity Communities: Find communities in graph using Clauset-Newman-Moore greedy modularity maximization.
We’re also verifying if the graph is directed, and if it is already weighted.
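The metrics listed above map onto NetworkX functions roughly as in this sketch. The four-node graph is hypothetical, and alertweight is used as the weight attribute name to match the rest of this article.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical undirected graph: a triangle (a, b, c) plus one extra host d.
U = nx.Graph()
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    U.add_edge(u, v, alertweight=1.0)

metrics = {
    "density": nx.density(U),
    "transitivity": nx.transitivity(U),
    "average_clustering": nx.average_clustering(U),
    "communities": len(greedy_modularity_communities(U)),
    "directed": nx.is_directed(U),
    # is_weighted looks for an attribute named "weight" unless told otherwise.
    "weighted": nx.is_weighted(U, weight="alertweight"),
}
for name, value in metrics.items():
    print(f"{name}: {value}")
```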
Drawing our Graph
Now that we’ve created our graph, we can draw it. We’ll just pass the graph to a few different drawing layouts to get a general feel for what’s in the dataset.
Improving our visualization
Looking at the graphs, they are pretty ugly, so we need to do some work to make our data more legible and more importantly, make the visualization show something useful.
We’re going to create a number of different lists and dictionaries to use when we plot our nodes and edges in groups based on weights and with differentiated labels.
We’ll start by creating lists containing nodes that have global, private or reserved IP addresses, so that we can draw these with different colors and node shapes. Next, we’re going to add some dictionaries of nodes based on Suricata alert severity. This will allow us to draw node labels by their severity.
We also need to create a list of weights to automatically adjust the node sizes based on the alertweight. This will make our nodes larger if they have a higher count of Suricata alerts.
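The grouping by address type can be sketched with Python's standard ipaddress module. The classify helper, the group names and the three sample addresses are assumptions for illustration.

```python
import ipaddress

# Hypothetical node list: one private, one public and one reserved address.
nodes = ["10.2.17.101", "8.8.8.8", "240.0.0.1"]

def classify(ip: str) -> str:
    """Return 'global', 'reserved', 'private' or 'other' for an IP address."""
    addr = ipaddress.ip_address(ip)
    if addr.is_global:
        return "global"
    # Checked before is_private, since reserved ranges also count as private.
    if addr.is_reserved:
        return "reserved"
    if addr.is_private:
        return "private"
    return "other"

groups = {"global": [], "private": [], "reserved": [], "other": []}
for node in nodes:
    groups[classify(node)].append(node)

print(groups)
```

Each list can then be passed to a separate draw call with its own color and node shape.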
Directed graph — In and Out edges
As we’re using a directed graph, our edges have a direction, In or Out, and we can enumerate them by direction via G.in_edges() and G.out_edges(). This allows us to draw incoming and outgoing connections individually.
We’re going to create a set of In and Out edges for each severity, 0–3, so we can draw these separately in different colors to help us identify critical connections.
We also need the list of weights for each of these groups, so that we can use these to draw our edges.
Lastly, we also need dictionaries containing our edge labels, so that we can draw the edge labels individually by severity.
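As a sketch of the In and Out grouping, the block below splits the edges of a single hypothetical host by severity and collects the matching weights. The helper name split_by_severity and the sample edges are assumptions.

```python
import networkx as nx

# Hypothetical graph around one internal host.
G = nx.MultiDiGraph()
G.add_edge("192.0.2.7", "10.0.0.1", key=445, severity=1, alertweight=10.1)
G.add_edge("10.0.0.1", "10.0.0.2", key=53, severity=0, alertweight=0.1)
G.add_edge("10.0.0.1", "198.51.100.9", key=443, severity=3, alertweight=2.1)

def split_by_severity(edges):
    """Group (u, v, data) triples into per-severity edge and weight lists."""
    edge_lists = {s: [] for s in range(4)}
    weights = {s: [] for s in range(4)}
    for u, v, data in edges:
        edge_lists[data["severity"]].append((u, v))
        weights[data["severity"]].append(data["alertweight"])
    return edge_lists, weights

host = "10.0.0.1"
in_edges, in_weights = split_by_severity(G.in_edges(host, data=True))
out_edges, out_weights = split_by_severity(G.out_edges(host, data=True))

print(in_edges)
print(out_edges)
```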
Drawing the graph
If your graph is a small graph, we can go ahead now and visualize it. If your graph is a large graph, there’s some help in the section Large Networks at the end of this article, but you may want to read through the following sections anyway, or you’ll miss some pretty graphs!
We’re going to be using NetworkX’s Spring Layout. It is based on the Fruchterman-Reingold force-directed algorithm, and simulates an anti-gravity force that repels nodes from each other.
You’ll see that we are using the lists and dictionaries we created earlier for the edgelist, nodelist, and edge_labels parameters. We also pass our calculated alertweight. The weight is passed on to the algorithm, where it sets the strength of the springs that pull connected nodes together: the larger the weight, the stronger the attraction.
Lastly, we also use the node_weights list we created to draw an outline for each node, with the size of a node determined by the weight. We also color the node outline based on weight.
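A stripped-down sketch of the layout and drawing step is shown below. The graph, the draw_graph helper, the colors and the size multiplier are all assumptions; the notebook's version draws each node and edge group separately.

```python
import networkx as nx

# Hypothetical graph: one internal host with three peers.
G = nx.MultiDiGraph()
G.add_edge("192.0.2.7", "10.0.0.1", key=445, alertweight=10.1)
G.add_edge("10.0.0.1", "10.0.0.2", key=53, alertweight=0.1)
G.add_edge("10.0.0.1", "198.51.100.9", key=443, alertweight=2.1)
nx.set_node_attributes(G, {"10.0.0.1": 10.1, "192.0.2.7": 10.1}, "alertweight")

# Force-directed layout; the seed only makes the result repeatable.
pos = nx.spring_layout(G, weight="alertweight", seed=42)

# Node sizes scale with alertweight (nodes without one get the 0.1 floor).
node_sizes = [G.nodes[n].get("alertweight", 0.1) * 100 for n in G]

def draw_graph(G, pos, node_sizes):
    """Draw nodes, edges and labels; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(8, 8))
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color="lightblue", ax=ax)
    nx.draw_networkx_edges(G, pos, edge_color="blue", ax=ax)
    nx.draw_networkx_labels(G, pos, font_size=8, ax=ax)
    ax.set_axis_off()
    return fig
```

Calling draw_graph(G, pos, node_sizes) in a notebook renders the figure. Passing nodelist and edgelist to the draw functions restricts each call to one group, which is how different colors per severity can be layered.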
Analyzing the graph
There we have it! Our contextualized Suricata and Zeek Network Graph.
We can clearly see nodes of interest, because they have a halo around them, indicating that a large number of alerts were seen against that host. And we can see if the nodes are internal, private IP addresses, or public and internet facing ones.
We can also see Suricata alerts between hosts, with different colors to show the severity. In addition, for hosts with many network connections the blue edges increase in intensity. This allows us to identify suspicious high-volume connections even if they don’t trigger a Suricata signature.
- Suricata Severity 3 = Red
- Suricata Severity 2 = Yellow
- Suricata Severity 1 = Green
- Network Connections = Blue
Different Graph layouts
We can experiment with different graph layouts to improve the legibility of the visualization and identify different patterns.
For example, below we first create a Circular Layout, positioning our nodes around a circle. We then pass that layout to the Spring layout to force the graph into a better order. You can immediately see that it bunches the nodes differently.
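Seeding the spring layout with circular positions can be sketched like this, using a hypothetical star-shaped graph as a stand-in for the real data:

```python
import networkx as nx

# Hypothetical stand-in: one busy host (node 0) with five peers.
G = nx.star_graph(5)

# Start from positions on a circle, then let the spring layout refine them.
circle = nx.circular_layout(G)
pos = nx.spring_layout(G, pos=circle, seed=7)

print(pos)
```

The pos argument only sets the starting positions, so the final result can still differ a lot from the initial circle.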
Shell Layout
Lastly, we can also use a Shell Layout. What’s nice about the shell layout is that it allows us to define groups, called “shells”, to plot our nodes in concentric circles. In our example, we’ll use the private and global IPs as shells, but you can create different lists, for example based on Suricata severities.
You can see the global IPs on the outside of the graph, and the private IPs towards the centre.
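A short sketch of the shell layout, using a handful of addresses taken from the output later in this article. The list names private_nodes and global_nodes are assumptions.

```python
import networkx as nx

# A few addresses from this article's sample output, grouped by type.
private_nodes = ["10.2.17.101", "10.2.17.2"]
global_nodes = ["40.122.160.14", "98.142.109.186", "179.191.108.58"]

G = nx.Graph()
G.add_edges_from(("10.2.17.101", other) for other in private_nodes[1:] + global_nodes)

# The first list in nlist becomes the innermost shell.
pos = nx.shell_layout(G, nlist=[private_nodes, global_nodes])

print(pos)
```

Every node in the graph should appear in exactly one of the shells passed to nlist.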
Conclusion
We’ve combined Zeek and Suricata data to create a unified network graph, and we’ve used weights based off of connection and alert volume to help identify suspicious nodes. You can quickly highlight nodes and communications of interest to guide your further investigation.
Of course, there are always some improvements we can make:
- Develop a better weight algorithm: right now we don’t take the severity into account for example
- Assign the count of severity 1,2, and 3 to every edge, instead of drawing them separately. We are actually only showing a subset of the connection for each severity, even though it’s sufficient to identify suspicious nodes and activity.
- Adjust the edge line widths for the Suricata alerts based on weight
- Create a function to draw the graph
- Only plot connections and nodes above or below a certain threshold, for example based on alertweight
Large Networks
While what we’ve done so far works really well for smaller networks, it’s the last point (5) that helps us plot larger networks in a meaningful way. We can create a list of edges based on weights quite easily, for example:
strong_edges = [(u, v) for (u, v, d) in G.edges(data=True) if d["alertweight"] >= 0.5]
If you do find yourself analyzing a large data set, you can still identify the more interesting nodes this way.
Strong Edges:
[((IPv4Address('179.191.108.58'), IPv4Address('10.2.17.101')), 10), ((IPv4Address('10.2.17.2'), IPv4Address('10.2.17.101')), 6), ((IPv4Address('177.87.0.7'), IPv4Address('10.2.17.101')), 2)]
Weak Edges:
{(IPv4Address('10.2.17.101'), IPv4Address('40.122.160.14')): 1, (IPv4Address('10.2.17.101'), IPv4Address('98.142.109.186')): 2, (IPv4Address('10.2.17.101'), IPv4Address('10.2.17.2')): 10}
Don’t forget that you can also offload a lot of the heavy lifting for aggregation and processing to Brim and ZQ, like we did for the alert counts. Z has a growing set of aggregator and processor functions.
Next time — Graph algorithms
But we’ve only just begun delving into network graph algorithms, such as local clustering and communities. It is these that will allow us to work with larger data sets, by identifying communities and creating subgraphs, and also by plotting nodes with higher centralities.
For example, we can look at centrality metrics:
Degree: Measures the number of connections a node has
Closeness: Measures how few steps a node needs to reach the other nodes in the network
Eigenvector: Measures a node’s connection to other nodes that are highly connected. A node with a high degree is a key node like a router (or victim X spreading malware).
You can get a hint of what we can do with these below. We could for example plot the Greedy Modularity communities separately if we have several, or we could also use the centralities to create thresholds.
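The three centrality metrics can be computed with NetworkX as in this sketch. The star-shaped graph is a hypothetical stand-in for a capture dominated by one host, and top_n is an illustrative helper.

```python
import networkx as nx

# Hypothetical stand-in: one hub (node 0) connected to four peers.
U = nx.star_graph(4)

degree = nx.degree_centrality(U)
closeness = nx.closeness_centrality(U)
eigenvector = nx.eigenvector_centrality(U, max_iter=1000)

def top_n(scores, n=3):
    """Return the n highest-scoring nodes as a dict, best first."""
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n])

print(top_n(degree))
print(top_n(closeness))
print(top_n(eigenvector))
```

Any of these scores could serve as a threshold for deciding which nodes to keep when plotting a large graph.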
Next time we’ll learn how to work with larger data sets using centrality metrics and communities, so stay tuned!
Graph Communities and Centralities
# of Greedy Modularity Communities: 1
Top 3 Nodes with highest Degree Centrality
{IPv4Address('10.2.17.101'): 1.0, IPv4Address('40.122.160.14'): 0.01639344262295082, IPv4Address('45.14.226.115'): 0.01639344262295082}
Bottom 3 Nodes by Degree Centrality
{IPv4Address('45.14.226.115'): 0.01639344262295082, IPv4Address('40.122.160.14'): 0.01639344262295082, IPv4Address('10.2.17.101'): 1.0}
Top 3 Nodes by Closeness Centrality
{IPv4Address('10.2.17.101'): 1.0, IPv4Address('40.122.160.14'): 0.5041322314049587, IPv4Address('45.14.226.115'): 0.5041322314049587}
Bottom 3 Nodes by Closeness Centrality
{IPv4Address('45.14.226.115'): 0.5041322314049587, IPv4Address('40.122.160.14'): 0.5041322314049587, IPv4Address('10.2.17.101'): 1.0}
Top 3 Nodes by Eigenvector Centrality
{IPv4Address('10.2.17.101'): 0.7071067811865475, IPv4Address('52.183.220.149'): 0.09053574604251859, IPv4Address('13.107.19.254'): 0.09053574604251857}
Bottom 3 Nodes by Eigenvector Centrality
{IPv4Address('13.107.19.254'): 0.09053574604251857, IPv4Address('52.183.220.149'): 0.09053574604251859, IPv4Address('10.2.17.101'): 0.7071067811865475}
Further Reading
Complex Network Analysis in Python: Recognize — Construct — Visualize — Analyze — Interpret by Dmitry Zinoviev
Network Science with Python and NetworkX Quick Start Guide: Explore and visualize network data effectively by Edward L. Platt
A First Course in Network Science by Filippo Menczer, Santo Fortunato, and Clayton A. Davis
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney