Network Threats from Internet Traffic Patterns

Pranav Vijay
INST414: Data Science Techniques
Mar 22, 2024

Question: Based on internet traffic patterns, which IP address receives connection attempts from the largest number of other IP addresses?

I chose to focus on the number of different IP addresses targeting one single IP address.

The stakeholder asking this question is a cybersecurity professional who wants to identify the IP address that multiple hackers are attempting to get into.

The answer to this question will let the stakeholder increase security for the IP address that is being targeted by multiple hackers.

The data that can help answer this question is a dataset of source IP addresses and the corresponding destination IP addresses that were targeted on a network over the past year. This is relevant to my question because it lets me create a directed graph of nodes and edges and identify which node (IP address) has the most edges coming into it.

I collected the data from Kaggle, which hosts many datasets available for download. The dataset is called “BETH Dataset” and was uploaded by Kate Highnam. It contains 15 CSV files. I chose to focus on the “labelled_2021may-ip-10-100-1-26-dns.csv” file since I didn’t want my network to be too large and contain too many nodes. This subset only contained data for May 16, 2021.

I analyzed the CSV file in a Jupyter notebook with a Python kernel. After downloading the file from Kaggle, I read it into a Pandas dataframe using the read_csv() function. I used the networkx library to build the network: DiGraph() to create a directed graph, add_node() to create a node for each IP address, and add_edge() to create a connection from each source IP address to its destination IP address. I used the draw() function with a circular layout to space out the nodes, and the pyplot module from the matplotlib library to show the graph and add a title.
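The graph-building steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the rows in `sample` are hypothetical placeholder addresses rather than values from the BETH file, and `build_ip_graph` is a helper name I made up for this example. With the real file, the dataframe would come from read_csv() instead.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample rows standing in for the real CSV (assumption);
# the addresses are documentation-range placeholders, not BETH data.
sample = pd.DataFrame(
    {
        "SourceIP": ["192.0.2.10", "192.0.2.11", "192.0.2.10"],
        "DestinationIP": ["192.0.2.1", "192.0.2.1", "192.0.2.1"],
    }
)


def build_ip_graph(frame):
    """Build a directed graph with an edge from each source IP to its destination IP."""
    graph = nx.DiGraph()
    for source, destination in zip(frame["SourceIP"], frame["DestinationIP"]):
        graph.add_node(source)
        graph.add_node(destination)
        # A DiGraph keeps at most one edge per (source, destination) pair,
        # so repeated rows for the same pair collapse into a single edge.
        graph.add_edge(source, destination)
    return graph


ip_graph = build_ip_graph(sample)
print(ip_graph.number_of_nodes(), ip_graph.number_of_edges())
```

To draw a graph like this with a circular layout, one option is nx.draw(ip_graph, pos=nx.circular_layout(ip_graph), with_labels=True), followed by plt.title() and plt.show() from matplotlib.pyplot.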

Each IP address is represented by a node (vertex) in the graph. An edge represents a path that packets travel from one IP address to another, and it points from the source IP address to the destination IP address. I chose a directed graph in order to show which node is the source and which node is the destination. If my graph were undirected, I would not be able to tell which way an edge is going, or whether an attack was one-way or two-way between two nodes.

I define importance in my graph as the number of connections coming toward an IP address. The important nodes are the IP addresses 10.100.0.2, 10.100.1.186, 10.100.1.26, 10.100.1.4, 10.100.1.95, and 10.100.1.105, since each has at least 1 edge coming into it, or an in-degree of at least 1. The node with IP address 10.100.0.2 has an in-degree of 5. The nodes with IP addresses 10.100.1.186, 10.100.1.26, and 10.100.1.4 each have an in-degree of 4. The node with IP address 10.100.1.95 has an in-degree of 2. The node with IP address 10.100.1.105 has an in-degree of 1. The nodes of the remaining IP addresses have an in-degree of 0.

I used degree centrality to analyze this network and answer the question posed earlier. I focused on the in-degree of the nodes since the question asks which IP address has the most connections toward it from other IP addresses. From the graph, the node with the IP address 10.100.0.2 has the most edges pointing toward it. This means that this node was contacted by more distinct IP addresses than any other node in this network.
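As a rough sketch of the in-degree analysis, the snippet below ranks nodes by in-degree using networkx. The edge list is hypothetical (placeholder addresses, not the BETH data), and the variable names are my own for illustration.

```python
import networkx as nx

# Hypothetical (source, destination) edges; not taken from the dataset.
edges = [
    ("192.0.2.10", "192.0.2.1"),
    ("192.0.2.11", "192.0.2.1"),
    ("192.0.2.12", "192.0.2.1"),
    ("192.0.2.10", "192.0.2.2"),
]
graph = nx.DiGraph(edges)

# In-degree: the number of distinct nodes with an edge pointing at each node.
in_degrees = dict(graph.in_degree())
most_targeted = max(in_degrees, key=in_degrees.get)

# In-degree centrality normalizes in-degree by (number of nodes - 1).
centrality = nx.in_degree_centrality(graph)

print(most_targeted, in_degrees[most_targeted], centrality[most_targeted])
```

The raw in-degree is enough to answer the question within one graph. The normalized centrality is mainly useful when comparing graphs of different sizes.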

[Figure: visualization of the network of BETH IP addresses from the original CSV file]

I cleaned up the data in my dataframe by removing the columns that were not necessary for my analysis of the network: “Timestamp,” “DnsQuery,” “DnsAnswer,” “DnsAnswerTTL,” “DnsQueryNames,” “DnsQueryClass,” “DnsQueryType,” “NumberOfAnswers,” “DnsResponseCode,” “DnsOpCode,” “SensorId,” “sus,” and “evil.” I kept the “SourceIP” and “DestinationIP” columns since these were the columns I needed to create the network of IP addresses and find out which nodes were important according to my definition above. I also checked the dataset for missing values. There were missing values in the “DnsQuery” and “DnsAnswer” columns, but I was not using those columns. The two columns I focused on, “SourceIP” and “DestinationIP,” did not have any missing values.
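The cleaning step can be sketched along these lines. The dataframe `raw` below is a hypothetical stand-in with only a few of the columns named above and made-up values. Selecting the two needed columns gives the same result as dropping all the others.

```python
import pandas as pd

# Hypothetical rows using a subset of the column names listed above;
# the values are placeholders, not data from the BETH file.
raw = pd.DataFrame(
    {
        "Timestamp": ["t1", "t2"],
        "SourceIP": ["192.0.2.10", "192.0.2.11"],
        "DestinationIP": ["192.0.2.1", "192.0.2.1"],
        "DnsQuery": ["example.com", None],
        "sus": [0, 0],
        "evil": [0, 0],
    }
)

# Count missing values per column before trimming.
missing_before = raw.isna().sum()

# Keep only the two columns needed to build the graph.
ip_pairs = raw[["SourceIP", "DestinationIP"]]

# Confirm the remaining columns have no missing values.
missing_after = ip_pairs.isna().sum()
print(missing_before["DnsQuery"], int(missing_after.sum()))
```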

There are some limitations to my analysis. One limitation is that I only analyzed one of the 15 CSV files from the BETH Dataset. If I had compiled all of the CSV data together and analyzed the larger dataset, I may have reached a different answer for the IP address with the most hits.

All of the CSV files only had data from May 10, 2021. Also, May 10, 2021 was a Monday, and I am not sure if the day of the week has any impact on the internet traffic. If there were data for a full week, including weekends, the results might have been different. I would also be able to understand the normal traffic pattern and which days are anomalies.

My analysis might be biased since I am only focusing on the number of IP addresses that are targeting another IP address. I am not looking at the scenarios where one IP address is targeting another IP address multiple times. I believe that multiple people attacking an IP address is a bigger threat than one person attacking an IP address multiple times, because it shows that multiple hackers from different locations are attempting to hack into one IP address, which could contain valuable data.
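To make the last limitation concrete, here is a small sketch that contrasts the two ways of counting: total connection attempts per destination versus the number of distinct sources per destination. The log below is hypothetical, and the variable names are my own.

```python
from collections import Counter, defaultdict

# Hypothetical connection log as (source, destination) pairs.
attempts = [
    ("192.0.2.10", "192.0.2.1"),
    ("192.0.2.10", "192.0.2.1"),
    ("192.0.2.10", "192.0.2.1"),
    ("192.0.2.11", "192.0.2.2"),
    ("192.0.2.12", "192.0.2.2"),
]

# Total attempts per destination, counting every repeated row.
total_attempts = Counter(destination for _, destination in attempts)

# Distinct sources per destination, which is what in-degree measures.
sources_by_destination = defaultdict(set)
for source, destination in attempts:
    sources_by_destination[destination].add(source)
distinct_sources = {
    destination: len(sources)
    for destination, sources in sources_by_destination.items()
}

print(total_attempts.most_common(1)[0][0])
print(max(distinct_sources, key=distinct_sources.get))
```

In this made-up log, the destination with the most total attempts is not the one with the most distinct sources, which is why the choice of measure can change the answer.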

Here is a link to my GitHub repository that contains a Jupyter notebook I used to create the graph visualization. The GitHub repository also contains the CSV file of the original dataset that I analyzed.

Link: https://github.com/pvijay2024/module2
