Anomaly Detection On IP Address Data

A Simple Example using Gaussian Mixture Modeling (GMM) Clustering

Rajaram Suryanarayanan
Geek Culture
12 min read · Jun 23, 2021



In this article, we will try out unsupervised Machine Learning methods of clustering for grouping hosts in an IP network into different clusters based on the similarity in their IP addresses. Importantly, we will see the difference between K-Means and GMM (Gaussian Mixture Model) in their approaches to clustering and compare their results. We will also see how we can find the outliers/anomalies in the data more effectively using GMM.

Introduction

Clustering is an unsupervised class of ML algorithms which separate and group given data points into distinct groups in such a way that the data points in any cluster are similar to each other in features. Clustering uses the features we supply to learn the similarity patterns among the population of data points, and then groups them accordingly.

There are many different clustering algorithms, like K-Means clustering, DBSCAN and GMM, to name a few popular ones.
In the following exercise, we will extract source and destination IP addresses of packets from a huge data set of network packets and cluster the IPv4 addresses into different groups based on their four octets, using the unsupervised ML method of clustering.
Along the way, we will do some anomaly detection and also see how GMM and K-Means compare with each other. Anomalies are outliers (though they are members of the formed clusters). They are located relatively far away from the rest of the cluster population because of a large degree of dissimilarity, or in other words a very small degree of similarity, with others in the same cluster.

Basics Of K-Means and GMM Clustering Algorithms

We will not go deep into the details of how the K-Means and GMM algorithms work, as that would be too much for the scope of this writing. Instead, we will apply both to the task of clustering a population of IP addresses, note how the different approaches taken by the two algorithms affect the resulting cluster formations, and see how GMM appears to be better suited to the data set at hand.

Some basic information about both algorithms is as follows.

The K-Means algorithm, after choosing initial cluster centroids, iterates over two steps until the resulting clusters converge:

  1. Assign each data observation to the closest cluster centroid
  2. Update each centroid to the mean of the data points assigned to it.

So, in K-Means, every data point is assigned to exactly one cluster; this is known as Hard Clustering or hard cluster assignment.

Hard Clustering: In hard clustering, each data point is assigned entirely to one cluster. That means there is no consideration of uncertainty in the assignment of a data point.

Soft Clustering:
In soft clustering, the assignment of a data point to a cluster is based on the probability or likelihood of that data point belonging to that cluster.

For soft clustering, we have GMMs, or Gaussian Mixture Models.
GMM has two advantages over K-Means:

  1. GMM is a lot more flexible regarding cluster covariance and is better suited to ellipsoid-shaped blobs.
  2. The GMM model accommodates mixed membership.
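
To make the hard vs. soft distinction concrete, here is a minimal sketch (not from the original notebook) contrasting the two on a small synthetic data set; the data and parameter choices are purely illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Illustrative synthetic data: three 2D blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard clustering: K-Means assigns each point to exactly one cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
hard_labels = kmeans.fit_predict(X)
print(hard_labels[:5])          # one integer label per point

# Soft clustering: GMM gives a per-cluster membership probability for each point
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft_memberships = gmm.predict_proba(X)
print(soft_memberships[0])      # three probabilities summing to 1
```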

Let’s now get into the details of the example exercise.

Data Pre-Requisite

I took a publicly available pcap file containing millions of packets, extracted packet details like IP header fields, TCP header fields, etc., and saved them in a large csv file. I used the popular Python tool Scapy to extract the protocol data of each packet in the pcap file. So, we will start the exercise with a ready-made csv file containing millions of rows, with each row describing one network packet.
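
The extraction step itself is not part of the notebook, but a minimal Scapy sketch along these lines could produce such a csv. The file names are placeholders, and only the two IP columns used later (isrc and idst) are written here.

```python
import csv
from scapy.all import PcapReader
from scapy.layers.inet import IP

# Stream through the pcap; PcapReader avoids loading millions of packets into memory at once
with open("packets.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["isrc", "idst"])          # source and destination IP address columns
    for pkt in PcapReader("capture.pcap"):     # placeholder pcap file name
        if IP in pkt:                          # only IPv4 packets carry these fields
            writer.writerow([pkt[IP].src, pkt[IP].dst])
```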

Code For This Exercise

The full code for this exercise can be found at this GitHub link.
The notebook is mostly self-explanatory, so I will explain only the important steps and aspects of the exercise in this article. The reader is advised to walk through the full code for the other steps skipped here.

1. Data Preparation

Read the csv file and extract source and destination IP addresses

Above, we read the csv file containing the packet data, which has nearly 6 million rows and 39 columns. Each row contains the details of one packet. We are going to use only isrc (Source IP Address) and idst (Destination IP Address) of the packets for our further analysis.

After data cleaning and dropping duplicate entries, what we have are 187 Source IP addresses and 2673 Destination IP addresses for analysis.
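
As a rough sketch of this step (the csv file name is a placeholder; isrc and idst are the column names mentioned above):

```python
import pandas as pd

# Load the packet csv and keep only the two IP address columns
packets_df = pd.read_csv("packets.csv", usecols=["isrc", "idst"])

# Clean up and deduplicate: each distinct address is needed only once for clustering
src_ips = packets_df["isrc"].dropna().drop_duplicates().reset_index(drop=True)
dst_ips = packets_df["idst"].dropna().drop_duplicates().reset_index(drop=True)
print(len(src_ips), len(dst_ips))   # 187 source and 2673 destination addresses in this data set
```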

2. Features extraction from IP addresses

For clustering, we need to choose particular features of the data based on which the clustering algorithm can divide the data into clusters of similar members. Here, since we want the clustering to be based on IPv4 addresses, we need some way to encode each IP address into a vector of features.

We have chosen to split the four octets of each IPv4 address and use them as a vector of length 4, fed to the clustering algorithm to represent each data point (IP address). The reasoning behind this approach was taken from this research paper.
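
A minimal sketch of that encoding, continuing from the deduplicated source addresses in the previous sketch (src_ip_df and X_matrix are the names referred to below; the octet column names are illustrative):

```python
# Split each IPv4 address "a.b.c.d" into its four octets and use them as numeric features
src_ip_df = src_ips.to_frame(name="ip")
octets = src_ip_df["ip"].str.split(".", expand=True).astype(int)
octets.columns = ["oct1", "oct2", "oct3", "oct4"]
src_ip_df = src_ip_df.join(octets)

# Feature matrix fed to the clustering algorithm: one 4-dimensional vector per address
X_matrix = src_ip_df[["oct1", "oct2", "oct3", "oct4"]].to_numpy()
print(X_matrix.shape)   # (number_of_addresses, 4)
```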

We then apply PCA to the X_matrix to reduce the dimensions from 4 to 2, so that we can easily visualize the clusters with a 2-dimensional scatter plot.
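
A sketch of that PCA step; the pca1 and pca2 columns are the ones plotted next.

```python
from sklearn.decomposition import PCA

# Reduce the 4 octet features to 2 components purely for visualization
pca = PCA(n_components=2)
components = pca.fit_transform(X_matrix)
src_ip_df["pca1"] = components[:, 0]
src_ip_df["pca2"] = components[:, 1]
```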

When we scatter plot pca1 vs pca2 of src_ip_df, we get the following plot.

Visually, there seem to be 6 or 7 small clusters. Broadly, there are two big clusters with an ellipsoidal shape and similar orientation. The cluster patterns are clearly ellipsoidal, particularly the ones on the left.

3. Use Gaussian Mixture Modeling (GMM) Clustering algorithm On Source IP Address data

Gaussian Mixture Model (GMM) clustering suits ellipsoid-shaped clusters better than K-Means clustering, which suits spherical blobs. GMM is also a probabilistic clustering algorithm and provides an easier way to detect anomalies. Let us try GMM to cluster the source IP addresses.

We use the BayesianGaussianMixture class, which helps us find the optimum number of clusters needed for the given data. If we input a number of clusters larger than the optimum, it eliminates the unnecessary clusters by giving them weights equal to or close to zero.

We tried 7 clusters, but 3 seem enough, going by the weights given to the clusters. We then train BGM (Bayesian Gaussian Mixture) on the source IP data, asking for three clusters to be created. In the output seen above, we see the weights assigned by BGM to each of the three clusters created by the algorithm.
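
A sketch along those lines, with illustrative hyperparameters (the exact n_init, random_state, etc. used in the notebook may differ):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Ask for more clusters than we expect; BGM drives the weights of unneeded components towards 0
bgm_probe = BayesianGaussianMixture(n_components=7, n_init=10, random_state=42)
bgm_probe.fit(X_matrix)
print(np.round(bgm_probe.weights_, 2))   # only about 3 components get non-negligible weight

# Re-train, asking for the 3 clusters the weights suggest are enough
bgm = BayesianGaussianMixture(n_components=3, n_init=10, random_state=42)
bgm.fit(X_matrix)
print(np.round(bgm.weights_, 2))
```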

We can also get the Gaussian means of the three clusters and transform them via PCA, as seen below, to locate the means of the clusters during visualization.

Next, we predict the cluster for all the source IP data.

As seen above, we store the cluster label for each IP address as a separate column, ‘kcluster’. We also create separate data frames for each cluster, which helps in plotting the clusters.
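
Put together, those steps might look roughly like this, continuing with the fitted bgm and pca objects from the sketches above:

```python
# Project the Gaussian means of the 3 components into the same 2D PCA space for plotting
means_2d = pca.transform(bgm.means_)

# Hard-assign every source IP address to its most likely Gaussian component
src_ip_df["kcluster"] = bgm.predict(X_matrix)

# One dataframe per cluster makes the scatter plots straightforward
cluster_dfs = [src_ip_df[src_ip_df["kcluster"] == k] for k in range(3)]
for k, cdf in enumerate(cluster_dfs):
    print(f"cluster {k}: {len(cdf)} addresses")
```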

We plot the three clusters created by GMM as the below image shows.

We can see 3 clusters formed neatly, though they do contain some outliers/anomalies.

On analyzing the members of each cluster, we see that with GMM clustering, 3 clusters were clearly formed from the source IP addresses.

Cluster 0 consists mostly of 192.168.(21–27) subnets, with a couple of possible outliers/anomalies.

Cluster 1 consists mostly of 192.168.(200+) subnets. This cluster is not so tight; it has a sparsely populated range of third octets which seems to contain possible anomalies.

Cluster 2 consists of only two IP addresses, which clearly seem to be anomalies when compared to the cluster 0 and 1 addresses.

4. Find Anomalies/Outliers from the clusters of IP addresses

We use the score_samples() method of GMM to estimate the density at the location of each data point in the clusters. The greater the score, the higher the density of the cluster at the location of that data point. Any instance located in a low-density region can be considered an anomaly. We can define a density threshold of, say, 4%, and consider all data points lying in areas where the density is below the 4th percentile of the density values as anomalies. This threshold is arbitrary, and we can choose it at our discretion.
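
A sketch of that density-threshold step, using the 4% cut-off mentioned above and the fitted mixture from the earlier sketch:

```python
import numpy as np

# Log-density of each source IP under the fitted mixture; low density = likely anomaly
densities = bgm.score_samples(X_matrix)

# Flag everything below the 4th percentile of the density values as an anomaly
density_threshold = np.percentile(densities, 4)
anomalies = src_ip_df[densities < density_threshold]
print(anomalies["ip"].tolist())
```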

We have identified eight IP addresses as anomalies in the set of source IP addresses.

We plot the clusters again, this time marking anomaly data points.

The cluster diagram clearly shows the anomalies (in red) lying far away from the tight cluster groups. Also, if you compare the anomalous IP addresses with those commonly found in the clusters, you can see striking differences.

For example, many anomalies have a last octet greater than 250 (i.e., towards the end of the octet range 0–255). Both addresses of cluster 2, 172.19.2.66 and 0.0.0.0, are marked as anomalies. This is expected, looking at their values, which stand out from the 192.168. IP addresses in clusters 0 and 1.
It is the same case with 192.168.169.129, whose third octet 169 stands apart from the observed range.

So, we have successfully used Gaussian Mixture Modeling to cluster the source IP addresses in the packet data and also extracted the possible anomalies/outliers in the addresses used.

5. Repeat the same exercise with destination IP addresses and see how GMM fares.

Here, we repeat the same steps for destination IP addresses that we already followed for source IP addresses.
First we plot the IP address data using the two corresponding PCA components.

The destination IP addresses clearly seem to be more ellipsoidal, with at least two long clusters at the top and bottom having similar shapes and orientations.

Below we estimate the optimum number of clusters to be 4 and train accordingly with the destination IP data.

Similar to what we saw with source IP data in the preceding sections, we can predict the clusters for each destination IP address and plot the same as below.

We see GMM has decently clustered the IP addresses into three large clusters and one small cluster (cluster 2).

It is important to note that GMM has respected the ellipsoidal shapes and orientations of the clusters well. This is what differentiates GMM from K-Means, which assumes clusters to be spherical (as we will see subsequently below).

Then we analyze each cluster to see how similar the IP addresses are in each.

Cluster 0 appears very scattered, with addresses from a wide range of subnets.
We can also see this from the cluster plot above. It looks like it will contain a lot of anomalies.

Cluster 1 is very tight, containing almost only 192.168.[21–28] subnet addresses. This cluster is the most populated of all the four clusters. Only one outlier, a 192.168.10.* address, can be seen sitting far away in the cluster plot.

Cluster 2 is small and very tight, containing 173.194.43 subnet addresses. This cluster is the least populated of all the four clusters. One outlier, 183.182.82.14, is seen, which also stands out in the cluster plot above.

Cluster 3 is tight, containing mostly 192.168.[>200] subnet addresses.

So with GMM clustering, we saw the above four clusters clearly formed from the destination IP addresses.

We then separate out the anomalies from the data, using the same density threshold comparison that we did for the source IP address analysis.

We plot the 107 anomalies detected, along with the clusters as seen below.

We can see that all anomalies are part of the widely scattered cluster 0. We can see their locations plotted above in red.

So, we have successfully clustered both source IP addresses and destination IP addresses using GMM and also detected anomalies in the process.

6. Try K-Means Clustering On Destination IP addresses and compare with GMM.

First, we estimate an optimum number of clusters by trying K-Means with different numbers of clusters and checking for the ‘elbow’ point in the graph of cluster tightness versus the number of clusters.

There is no single clear elbow point. We will choose 5 as the optimum number of clusters and go ahead with clustering via K-Means fit_predict on the data points.
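
A sketch of the elbow search and the final K-Means fit, assuming the destination IP octet matrix and dataframe are prepared the same way as X_matrix and src_ip_df above (called X_dst and dst_ip_df here for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for a range of k values
k_values = range(1, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_dst)
    inertias.append(km.inertia_)

# The "elbow" is where adding more clusters stops reducing inertia sharply
plt.plot(k_values, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (cluster tightness)")
plt.show()

# No single sharp elbow here, so we proceed with k = 5
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
dst_ip_df["kmeans_cluster"] = kmeans.fit_predict(X_dst)
```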

We plot the clusters identified by K-Means.

We can see that the clustering done by K-Means is different from that by GMM.

  1. The ellipsoidal data points at the bottom have been split into 3 clusters, while GMM clustered them as a single big cluster.
  2. The clusters don’t seem to be clearly split and contain mixed data points from the scattered set.

Overall, the clustering is not as good or as clearly demarcated as that done by GMM. This is likely because K-Means is a centroid-based method, which suits spherical blobs of data better than ellipsoidal shapes of different sizes and orientations.

Looking at the first octets in cluster 0 alone, we can see that it contains a mix of subnets. Though it mostly consists of 192.168.* addresses, it also contains 208.*, 195.*, etc.

K-Means has broken a homogeneous cluster of 192.168.[21–28] subnet IP addresses into three smaller clusters and also mixed them up with other, dissimilar IP addresses from the scattered set.

But if you recall, GMM had finely segregated all 192.168.[21–28] subnet addresses into one cluster (Cluster 1) and did not mix in other IP addresses. So clustering by GMM delivered better results than K-Means for the ellipsoid-shaped groups, as expected.

Conclusion

We used both GMM and K-Means clustering algorithms to cluster IP addresses found in the data set.

We saw that GMM delivered better results, particularly for the destination IP addresses, by demarcating the ellipsoidal data blobs more clearly than K-Means did.

In the process, we also detected anomalies within the IP addresses used.

Thank you!
Hope this was interesting and helped with a bit of learning!
If you loved what you have just read, would you be able to buy me a cup of coffee? It’s okay if you can’t right now; you can please clap instead, as it will encourage me to write more on similar learnings!
