Visualizing IP Traffic with Brim, Zeek and NetworkX
Introduction
Network Graphs are a way of structuring, analyzing and visualizing data that represents complex networks, for example social relationships or information flows.
A typical application, and of special interest for threat hunters, modelers and analysts, is the modelling and analysis of TCP/IP network communications.
With the release into open beta of Brim’s Python library, it’s never been simpler to bring the world of Zeek and Network Graphs crashing together. Let’s do some Security Science!
Prerequisites
Brim
You will need to install Brim on the same workstation from which you will launch Jupyter.
TIP! You can find detailed installation instructions for Brim on Windows, Linux and macOS under https://github.com/brimsec/brim/wiki/Installation
Brim installs ZQD, the zqd daemon, which serves a REST API to manage and query log archives, and is the backend for the Brim application.
The Brim Python library connects to ZQD to send queries and fetch data.
Anaconda and Jupyter
You will also need Jupyter Notebook (https://jupyter.org/). Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
We recommend installing Anaconda (https://www.anaconda.com/), an open source Data Science platform that includes Jupyter Notebook, alongside a number of other useful applications and tools.
Python Modules
Our tutorial also leverages a number of different Python libraries.
Brim Python library
The Brim Python library is currently in open beta. You can install it via “pip”:
pip3 install "git+https://github.com/brimdata/zed@v0.29.0#subdirectory=python/zqd"
Note that the library requires Python 3.3 or higher.
Pandas
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Install it via pip:
pip3 install pandas
See https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html for more information.
Matplotlib
Matplotlib (https://matplotlib.org/) is used to visualize data, and NetworkX uses it to draw and plot our graphs. You can install it via “pip”:
pip3 install matplotlib
NetworkX
We are using NetworkX as our Network Graph library. NetworkX (https://networkx.org/) is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. You can also install it via “pip” from the command line:
pip3 install networkx
For more detailed information regarding the installation, see
https://networkx.org/documentation/stable/install.html
Sample Data Sources
We are working with two primary data sets in this tutorial.
TZNG files from https://github.com/brimsec/zq-sample-data/tree/master/tzng
The Emotet Malware sample from Malware Traffic Analysis Net we used for our “Hunting Emotet with Brim and Zeek” article:
https://www.malware-traffic-analysis.net/2020/09/02/2020-09-01-Emotet-epoch-3-infection-with-Trickbot-gtag-mor119.pcap.zip (password: infected)
Getting Zeek data into Pandas — via Brim!
TIP! The next section uses the accompanying Jupyter notebook: https://gist.github.com/orochford/4489198fd4d94b772fb8a0da8be3c315
Let’s start by ensuring that Brim and ZQD have been started. If Brim is not running, launch it; ZQD will be started automatically in the background. If you haven’t already done so, import the sample data into Brim. You can simply drag and drop a sample file into the Brim UI and a new Space will be created for it.
Connecting Jupyter with ZQD
TIP! You can follow along with the code in this tutorial in the accompanying Jupyter Notebook.
Let’s continue with a new Jupyter Notebook, in which we connect to ZQD, send a ZQL query, and load the returned data into a Pandas DataFrame.
After we import our dependent libraries, we do the following:
1. We define a variable “space” for the Brim Space we want to query, in this case ‘2020-09-01-Emotet-epoch-3-infection-with-Trickbot-gtag-mor119.pcap’. The Space shares the name of the imported sample file.
2. We also define the ZQL query (variable ‘zql’) we want to send to ZQD:
zql = '_path=conn | cut id.orig_h, id.resp_h, proto | sort id.orig_h, id.resp_h'
3. We create our client instance, and then open a connection to ZQD.
4. Lastly, we create a Pandas DataFrame “df” from the returned data, flattening any JSON or dictionary values.
5. Voila! We’re done.
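Step 4 can be sketched in isolation. The client call itself is left out here because its exact form depends on the version of the Brim Python library you installed. The `records` list below is a hand-written stand-in with made-up addresses, assuming the query returns each record as a nested dictionary:

```python
import pandas as pd

# Hypothetical stand-in for the records returned by the ZQL query above.
# Assumption: each record is a nested dict mirroring Zeek's conn fields.
records = [
    {"id": {"orig_h": "10.9.1.101", "resp_h": "10.9.1.1"}, "proto": "udp"},
    {"id": {"orig_h": "10.9.1.101", "resp_h": "192.0.2.10"}, "proto": "tcp"},
    {"id": {"orig_h": "10.9.1.101", "resp_h": "198.51.100.7"}, "proto": "tcp"},
]

# json_normalize flattens nested dicts into dotted column names,
# so {"id": {"orig_h": ...}} becomes the column "id.orig_h".
df = pd.json_normalize(records)

print(df.columns.tolist())
print(df.head())
```

The dotted column names match the field names used in the ZQL query, which keeps the later steps easy to follow.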
Validating our data
Now that we’ve imported our data into a Pandas DataFrame, we should conduct some validation to make sure that everything worked out as expected.
If everything went according to plan, you should now see a number of different metrics and characteristics, including how many records the DataFrame contains, the columns and data types.
We also checked if any fields exist with missing data, and dropped them if they do.
TIP! There are better practices for managing missing data, but for now we’re not expecting any, and for our purposes this will suffice to make the data suitable for what we want to do next.
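A minimal sketch of such a validation pass follows. It uses a small hand-made DataFrame with one deliberately missing value, and it assumes that “dropping” means removing every row that has a missing field:

```python
import pandas as pd

# Made-up sample with one missing source address, for illustration only.
df = pd.DataFrame({
    "id.orig_h": ["10.9.1.101", "10.9.1.101", None],
    "id.resp_h": ["10.9.1.1", "192.0.2.10", "198.51.100.7"],
    "proto": ["udp", "tcp", "tcp"],
})

# Record count, column names and data types in one overview.
df.info()

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)

# Assumption: drop every row that has at least one missing value.
df = df.dropna()
print(len(df), "records remain")
```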
Drawing our first network
You may have noticed that our ZQL query returned no quantitative values, for example the bytes sent per connection. In future articles, we will also discuss how Pandas can be used to analyze such metrics. But today, our focus is on using our connection data to create a network model. For this, we will create a network graph using NetworkX.
NetworkX is a Network Graph library that supports the generation, creation, manipulation and visualization of network graphs. Network Graphs are very useful to model and analyze data that represents flows, relationships or connections. This makes them especially useful for analyzing data from social networks, email communications, or in our example, network data.
Our data is now in a usable format to generate a network graph of the IP connections.
Network Graphs view the world through Nodes and Edges. Translating these to our network world, a Node is a host, and an Edge is a connection between two hosts. We can also dress the Edges (our connections) with data that describe them. In our example we will distinguish between TCP, UDP and ICMP traffic.
Because our ZQL query already returned our data in a format we can use directly, we can use the NetworkX “from_pandas_edgelist()” function.
networkx.from_pandas_edgelist() takes the DataFrame together with the names of the columns that hold the Source and Target Nodes. In our example, setting “edge_attr=True” means that all remaining columns in our Pandas DataFrame will be added as edge attributes.
TIP! NetworkX supports 4 basic Graph types for different types of complex networks (see https://networkx.org/documentation/stable/reference/classes/index.html for more information). For our purposes we want to use what is called a Directed Graph, so that we can map the direction of our connections.
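As a sketch of this step, the following builds a Directed Graph from a small hand-made DataFrame. The addresses are made up; the column names follow the ZQL query used earlier:

```python
import networkx as nx
import pandas as pd

# Made-up connection records, shaped like the output of the ZQL query.
df = pd.DataFrame({
    "id.orig_h": ["10.9.1.101", "10.9.1.101", "10.9.1.101"],
    "id.resp_h": ["10.9.1.1", "192.0.2.10", "198.51.100.7"],
    "proto": ["udp", "tcp", "tcp"],
})

# source/target name the columns holding the two ends of each edge;
# edge_attr=True attaches every remaining column (here: proto) to the edge;
# create_using=nx.DiGraph() keeps the direction of each connection.
G = nx.from_pandas_edgelist(
    df,
    source="id.orig_h",
    target="id.resp_h",
    edge_attr=True,
    create_using=nx.DiGraph(),
)

print(G.edges(data=True))
```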
Let’s start investigating the graph we’ve just created:
We can see that our graph has 2 unique source nodes and 29 destination hosts.
We also have 29 edges, and we can see that many of our nodes (hosts) have a “Degree” greater than one. A node’s Degree is the number of edges attached to it, which in our case is the number of unique host-to-host connections that host takes part in. G.degree() prints out a list of every node and how many edges it has. This already hints at the hidden power of network graphs, and we’ll be using that value again later. Note how the values are already correctly typed; this is due to the embedded data types of Brim’s ZNG data format.
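The kind of inspection described here can be sketched on a tiny hand-built graph. The addresses are made up, so the counts differ from the 2, 29 and 29 reported for the Emotet sample:

```python
import networkx as nx

# Small made-up Directed Graph: two clients talking to three servers.
G = nx.DiGraph()
G.add_edge("10.9.1.101", "10.9.1.1", proto="udp")
G.add_edge("10.9.1.101", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "198.51.100.7", proto="tcp")

# Unique hosts at either end of the edges.
sources = {u for u, _ in G.edges()}
destinations = {v for _, v in G.edges()}
print(len(sources), "sources,", len(destinations), "destinations")
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# In a Directed Graph, degree counts incoming plus outgoing edges.
degrees = dict(G.degree())
for host, degree in degrees.items():
    print(host, degree)
```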
Visualizing our Network Graph
Now that we have created our network graph, we can visualize it using NetworkX’s default settings. With a nice small dataset, this works quite well. While it’s not pretty, we can clearly see the connections emanating from the two central nodes.
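A default plot of this kind can be sketched as follows, using a small made-up graph rather than the Emotet data. The figure size of 8 by 8 inches is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Small made-up Directed Graph standing in for our connection data.
G = nx.DiGraph()
G.add_edge("10.9.1.101", "10.9.1.1", proto="udp")
G.add_edge("10.9.1.101", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "198.51.100.7", proto="tcp")

fig, ax = plt.subplots(figsize=(8, 8))

# With no positions given, draw_networkx computes a spring layout
# and labels every node with its name.
nx.draw_networkx(G, ax=ax)
ax.axis("off")
```

In Jupyter with inline plotting enabled, the figure is rendered when the cell finishes.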
Avoiding the Fuzzy Hairball and making things pretty!
TIP! The next section uses the accompanying Jupyter notebook: https://gist.github.com/orochford/4489198fd4d94b772fb8a0da8be3c315
While our code now works well for smaller data sets, the resulting output is pretty ugly. Everything seems overcrowded and it’s hard to make out the labels and relationships. Worse is that if we use a larger data set, for example our ‘tzng’ sample data with more Nodes and Edges, we get what’s affectionately called the “Fuzzy Hairball” by data scientists.
It’s all about the Style!
We’re going to do three different things to address these shortcomings:
1. Limiting the sample size
We’re going to set a value to use as an upper limit for what NetworkX can sensibly visualize for our use case, and check the count of records in our DataFrame against this. We’re also going to set a value to use as a sample size, in case our limit is exceeded. We can then use the sample size to fetch a random sample using Pandas. If you rerun the notebook, you will see a different constellation every time, but high outliers should reoccur frequently if the sample is representative.
TIP! Note that we could have also used ZQL’s “head” processor to limit the number of records to fetch from ZQD. This would limit the amount of data we have to work with in memory, but we would lose random sampling. Using Pandas’ sample() function instead provides us with a random sample to analyze.
As NetworkX is not primarily a visualization library, it is better suited for smaller or less complex visualizations. If you do find yourself needing more control over the plotting, NetworkX supports Graphviz output, amongst others, for heavier lifting.
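The limit-and-sample logic can be sketched like this. The names `MAX_RECORDS`, `SAMPLE_SIZE` and `limit_records` are hypothetical, and the two thresholds are arbitrary values you would tune for your own data:

```python
import pandas as pd

MAX_RECORDS = 500   # hypothetical upper limit for a readable plot
SAMPLE_SIZE = 200   # hypothetical number of records to keep if exceeded


def limit_records(df, max_records=MAX_RECORDS, sample_size=SAMPLE_SIZE):
    """Return df unchanged if it is small enough, else a random sample."""
    if len(df) > max_records:
        # No fixed random_state: every run draws a different sample.
        return df.sample(n=sample_size)
    return df


# Made-up data set with 1000 connection records.
big = pd.DataFrame({
    "id.orig_h": ["10.0.0.1"] * 1000,
    "id.resp_h": [f"192.0.2.{i % 250}" for i in range(1000)],
    "proto": ["tcp"] * 1000,
})

sampled = limit_records(big)
print(len(big), "records reduced to", len(sampled))
```

Passing a fixed `random_state` to `sample()` would make the plot reproducible, at the cost of seeing only one constellation.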
2. Adjusting the plot size
We can also quite easily adjust the output size of our NetworkX graph plot via the figure “figsize” parameter. The parameter expects a tuple with the width and height in inches.
plt.figure(figsize=(width, height))
For example:
plt.figure(figsize=(40, 40))
Plotting a larger figure will allow us to fit more nodes and edges on to our visualization.
The challenge we face is that we don’t know in advance how many Nodes and Edges our data contains, so we need a more dynamic approach here.
As we’re already checking how many records our sample has, we will add some logic to return a size. We can use this later to set the figure size and also apply specific styling depending on how many records our graph has.
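One way to sketch this logic is a small helper that maps the record count to a size class, and a lookup table that maps each class to a figure size. The function name, the thresholds and the sizes are all hypothetical values chosen for illustration:

```python
def graph_size(record_count):
    """Classify a data set by its number of records (hypothetical thresholds)."""
    if record_count <= 50:
        return "small"
    if record_count <= 200:
        return "medium"
    return "large"


# Hypothetical figure sizes (width, height) in inches per size class.
FIGSIZE = {
    "small": (12, 12),
    "medium": (24, 24),
    "large": (40, 40),
}

size = graph_size(120)
print(size, FIGSIZE[size])
```

Keeping the size class as a plain string makes it easy to reuse as a key for style options later on.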
3. Apply styling
Lastly, to make our visualization easier to read, and also to help visualize the flow and protocol composition, we’ll add some styling to our plot.
First, we add some code to create different Edge lists based on IP protocol (tcp_list, udp_list and icmp_list). We will use these to apply specific styling to visualize different IP protocols distinctly.
We’re also going to set the plot figure size based on the graph size we determined earlier, and we’ll apply adapted style options to adjust these for better legibility.
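A sketch of the per-protocol Edge lists and size-dependent style options follows. The graph is made up, and `edges_for`, `STYLE` and all values in it are hypothetical:

```python
import networkx as nx

# Small made-up Directed Graph with a protocol attribute on every edge.
G = nx.DiGraph()
G.add_edge("10.9.1.101", "10.9.1.1", proto="udp")
G.add_edge("10.9.1.101", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "10.9.1.1", proto="icmp")


def edges_for(graph, proto):
    """Return the (source, target) pairs whose proto attribute matches."""
    return [(u, v) for u, v, attrs in graph.edges(data=True)
            if attrs.get("proto") == proto]


tcp_list = edges_for(G, "tcp")
udp_list = edges_for(G, "udp")
icmp_list = edges_for(G, "icmp")

# Hypothetical style options keyed by graph size class.
STYLE = {
    "small": {"figsize": (12, 12), "font_size": 10, "node_scale": 300},
    "large": {"figsize": (40, 40), "font_size": 6, "node_scale": 100},
}
style = STYLE["small"]

print(len(tcp_list), len(udp_list), len(icmp_list))
```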
The Plot (line) thickens!
Instead of just using the Spring Layout, we’re going to plot our graph with several layouts at the same time.
We’ll be iterating through our layouts with a FOR loop to draw a subplot for each layout.
Also, instead of using the standard networkx.draw_networkx() function as we did last time, this time we’ll draw our network graph bit by bit, to have more control over what we draw.
You can see how instead of hardcoding the style attributes, we’re using the style variables we set earlier instead. Not only do we gain more control over how we visualize our graphs, we’re also making the code much easier to read and maintain.
TIP! You can find out more about the graph draw parameters and their meaning under: https://networkx.org/documentation/stable//reference/drawing.html
You can also see here that we are drawing our Edges group by group, to be able to apply very granular styling to each Edge type. We’re using the lists we created earlier based on the “proto” column to determine which Edges the styling should be applied to.
Lastly, to draw our nodes, we’re actually using a network graph function, dict(G.degree).values(), to dynamically change the size the nodes are plotted with.
G.degree() provides the number of edges adjacent to each node, or to phrase it another way, the number of connections a specific node takes part in. We will be using it as a multiplier to plot a node larger based on the number of its connections, essentially as a weight.
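The piece-by-piece drawing described above might look roughly like the following sketch. The graph is made up, and the choice of layouts, the colours in `EDGE_COLORS`, the node-size multiplier and the font sizes are all hypothetical styling values:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Small made-up Directed Graph with a protocol attribute on every edge.
G = nx.DiGraph()
G.add_edge("10.9.1.101", "10.9.1.1", proto="udp")
G.add_edge("10.9.1.101", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "192.0.2.10", proto="tcp")
G.add_edge("10.9.1.102", "10.9.1.1", proto="icmp")

# Hypothetical style variables.
EDGE_COLORS = {"tcp": "tab:blue", "udp": "tab:orange", "icmp": "tab:red"}
NODE_SCALE = 300
layouts = {
    "spring": nx.spring_layout,
    "circular": nx.circular_layout,
    "shell": nx.shell_layout,
}

# Node size grows with degree, so busy hosts appear larger.
node_sizes = [NODE_SCALE * degree for degree in dict(G.degree).values()]
edge_labels = nx.get_edge_attributes(G, "proto")

fig, axes = plt.subplots(1, len(layouts), figsize=(6 * len(layouts), 6))
for ax, (name, layout) in zip(axes, layouts.items()):
    pos = layout(G)
    nx.draw_networkx_nodes(G, pos, ax=ax, node_size=node_sizes)
    # Draw the edges one protocol at a time to style each group separately.
    for proto, color in EDGE_COLORS.items():
        edgelist = [(u, v) for u, v, d in G.edges(data=True)
                    if d.get("proto") == proto]
        if not edgelist:
            continue
        nx.draw_networkx_edges(G, pos, ax=ax, edgelist=edgelist,
                               edge_color=color, arrows=True,
                               node_size=node_sizes)
    nx.draw_networkx_labels(G, pos, ax=ax, font_size=8)
    nx.draw_networkx_edge_labels(G, pos, ax=ax, edge_labels=edge_labels,
                                 font_size=7)
    ax.set_title(name)
    ax.axis("off")
```

Passing the same `node_sizes` to the edge drawing call lets NetworkX stop the arrowheads at the border of the enlarged nodes instead of hiding them underneath.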
When we plot our visualization now, we see a very different picture. Our connections are color-coded and labelled by IP Protocol, and our Nodes appear larger if they possess a lot of connections. You can clearly see the knots, an aspect we’ll be exploring further in a future article. We also see the direction of the connections, indicated with arrows. With aesthetics being as subjective as they are, we encourage you to play around with the variables until you find a styling you find attractive.
Conclusion
We hope this gives you a good starting point to explore Zeek data using Brim, Pandas and NetworkX, and also some ideas of where to go next. One quick tip is to play around with the network visualization layouts. See https://networkx.org/documentation/stable//reference/drawing.html#module-networkx.drawing.layout for some more ideas.
In our next article, we will take a look at how we can apply network graph and visualization methods to hunt threats such as malware. In the meantime, download Brim, and join our Slack Channel.
Further Reading
Complex Network Analysis in Python: Recognize — Construct — Visualize — Analyze — Interpret by Dmitry Zinoviev
Network Science with Python and NetworkX Quick Start Guide: Explore and visualize network data effectively by Edward L. Platt
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney