K-means Clustering Algorithm and Network Intrusion Detection System

Published in Nerd For Tech, Aug 10, 2021

Objective :

Intrusion detection (ID) is a kind of security management system for computers and networks. There are many approaches and methods used in ID. Each approach has merits and demerits. We’ll use the K-means approach for the Intrusion Detection System.

K-means Algorithm :

K-means algorithm is an iterative algorithm that tries to partition the dataset into ‘k’ pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The way the k-means algorithm works is as follows:

Specify the number of clusters K.
Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
Keep iterating until there is no change to the centroids i.e. assignment of data points to clusters isn’t changing.

Compute the sum of the squared distance between data points and all centroids.
Assign each data point to the closest cluster (centroid).
Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
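The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a tuned implementation: the function names (`kmeans`, `squared_distance`, `mean_point`), the `max_iter` and `seed` parameters, and the six sample points are all assumptions made up for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points (the centroid)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialize centroids with K data points drawn without replacement.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when the assignment of points to clusters no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, assignments

# Made-up sample data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Which cluster gets index 0 and which gets index 1 depends on the random initialization, so only the grouping of the points is meaningful, not the cluster numbers.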

Working of IDS :

IDS technology can be categorized into two approaches, based on the detection process:

Misuse/signature detection: This technology searches network traffic for the signatures of known attacks, which serve as a reference for detecting future attacks. Regularly updated databases are usually used to store the signatures of known attacks. The way this technology performs intrusion detection is similar to antivirus software. The advantage of this type of detection is that it can accurately and efficiently detect known attacks.

Anomaly detection: This technology is based on tracking traffic anomalies. The deviation between the monitored traffic and a profile of regular traffic is measured. Different implementations of this technology have been proposed, based on the metrics used to measure the deviation from traffic profiles. The advantage of this detection type is that it is well suited to detecting unknown attacks.

Network-based IDS (NIDS):

NIDS is a network-based approach that collects data directly from the monitored network in the form of packets, instead of collecting data from a particular host or agent. Most NIDS are operating-system independent and easy to deploy. Network-based IDS offers advantages such as low cost of ownership, easier deployment, detection of network attacks, preservation of evidence, real-time tracking and rapid response, and detection of failed attacks.

Our Approach :

Clustering is based on distance measurements performed on objects, and groups objects (intrusions) into clusters. Unlike classification, clustering is an unsupervised learning process, because there is no information about the labels of the training data. For anomaly detection, we can use clustering and in-depth analysis to guide the ID model. The measurement of distance or similarity plays an important role in grouping observations into homogeneous clusters. Measures such as Jaccard similarity and the longest common subsequence (LCS) are important for deciding whether an event is normal or abnormal. Euclidean distance, the distance between two vectors X and Y in n-dimensional Euclidean space, is a widely used distance measure for vector spaces. It can be defined as the square root of the sum of the squared differences between corresponding vector dimensions. Finally, clustering and classification algorithms need to be efficient and scalable, and able to handle the high dimensionality and heterogeneity of network data.
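As a rough illustration of two of the measures mentioned above, the snippet below computes the Euclidean distance between two numeric vectors and the Jaccard similarity between two sets of symbols. The function names and the sample inputs are assumptions made up for this sketch.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared differences between matching dimensions."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Made-up inputs for illustration only.
print(euclidean_distance([0, 0], [3, 4]))                                # 5.0
print(jaccard_similarity({"tcp", "http", "SF"}, {"tcp", "smtp", "SF"}))  # 0.5
```

K-means itself relies on the Euclidean measure, since each centroid is an arithmetic mean in the same vector space as the data points.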

The steps involved in the K-means algorithm are as follows:

1. K points are placed into the space represented by the data to be clustered. These points are the initial group centroids.
2. Each data point is assigned to the group with the closest centroid.
3. Once all the data points have been assigned, the positions of the K centroids are recalculated.
4. Steps 2 and 3 are repeated until the centroids no longer change.

KDDCup 99 Dataset :

Evaluating any intrusion detection approach on network data is extremely difficult, mainly because properly labeled network connection samples are hard to obtain. The KDDCup'99 dataset is therefore used as the sample to verify the performance of the detection model. The KDDCup'99 dataset, prepared at Columbia University, was built from intrusions simulated in a military network environment for the DARPA evaluation. Its network connections were captured from a network traffic sniffer in tcpdump format over a period of seven weeks.

The dataset has 41 features per record, and each record is labeled either as normal or as a specific type of attack. The features consist of 34 numeric features and 7 symbolic features, reflecting different properties of the attacks.

Pre-Processing :

The KDDCup 99 dataset is pre-processed to make it suitable for the clustering algorithm. Each record in the dataset consists of categorical as well as numeric features, and the categorical features are stored as plain text. The K-means algorithm needs numeric data (either discrete or continuous), so the first step in pre-processing is to convert the categorical attributes into numeric attributes. To do this, an integer code is assigned to each symbol. For instance, for the protocol type feature, 0 is assigned to tcp, 1 to udp, 2 to icmp, and so on. The dataset contains three text-valued categorical attributes (protocol type, service, and flag), while the remaining thirty-eight attributes are already numeric. Every category of an attribute is assigned a specific number. Because we use K-means to separate normal traffic from attacks, the whole dataset is converted to this K-means-compatible numeric format.
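The integer coding described above might look like the following sketch. The helper names (`build_codes`, `encode_records`) are hypothetical, and the four records are made-up, shortened rows containing only the first five KDDCup 99 columns (duration, protocol type, service, flag, source bytes). Codes are assigned in order of first appearance, which for this sample happens to reproduce the tcp = 0, udp = 1, icmp = 2 mapping.

```python
def build_codes(values):
    """Assign consecutive integer codes to symbols in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes

def encode_records(records, categorical_columns):
    """Replace the symbols in the given column indexes with integer codes."""
    mappings = {
        col: build_codes(row[col] for row in records)
        for col in categorical_columns
    }
    encoded = []
    for row in records:
        new_row = list(row)
        for col in categorical_columns:
            new_row[col] = mappings[col][row[col]]
        # K-means works on numbers, so cast every field to float.
        encoded.append([float(v) for v in new_row])
    return encoded, mappings

# Made-up rows: (duration, protocol_type, service, flag, src_bytes).
records = [
    (0, "tcp", "http", "SF", 181),
    (0, "udp", "domain_u", "SF", 105),
    (0, "icmp", "ecr_i", "SF", 1032),
    (0, "tcp", "smtp", "SF", 1684),
]
encoded, mappings = encode_records(records, categorical_columns=[1, 2, 3])
print(mappings[1])   # {'tcp': 0, 'udp': 1, 'icmp': 2}
print(encoded[2])    # [0.0, 2.0, 2.0, 0.0, 1032.0]
```

One caveat with this scheme is that integer codes impose an arbitrary order on the categories, and Euclidean distance treats that order as meaningful. One-hot encoding is a common alternative when this matters.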

Experimental Result :

The K-means algorithm is used to partition the heterogeneous dataset into nearly homogeneous clusters. The clustering results of the K-means algorithm are described in Tables III to VIII.

Analysis of the clustering results shows that Denial of Service (DoS) attacks are mostly grouped with one another in cluster-3, and closely resemble Probe attacks in cluster-1. Probe attacks are likewise mostly related to DoS attacks in cluster-1, and are close in nature to Normal traffic in cluster-5. Normal traffic is most similar to User-to-Root and Remote-to-Local attacks in cluster-4, and is related to Probe attacks in cluster-2 and cluster-5. Normal traffic appears alongside attacks in all 5 clusters because attacks mimic normal behavior. We then apply the Random Forest algorithm to distinguish intrusions from normal traffic. The performance for each attack category with the Random Forest algorithm in the 5 K-means clusters can be seen in Tables IX to XIV. The precision and recall for detecting normal traffic and attacks are good, and the false positive rate is nearly zero.
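For readers unfamiliar with the metrics named above, the sketch below shows how precision, recall, and false positive rate are derived from confusion-matrix counts. The function name `detection_metrics` and the counts passed to it are invented purely for illustration and are not results from the experiment.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (precision, recall, false positive rate) from confusion counts.

    tp: attacks correctly flagged      fp: normal traffic wrongly flagged
    tn: normal traffic correctly passed fn: attacks that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, false_positive_rate

# Invented counts, not taken from the KDDCup 99 experiment.
precision, recall, fpr = detection_metrics(tp=95, fp=1, tn=99, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f}")
```

A false positive rate near zero means that almost no normal connections are flagged as attacks.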

Conclusion :

This article presented an analysis of a hybrid machine learning technique, K-means clustering followed by Random Forest, for detecting Denial of Service (DoS) attacks, Probing (Probe) attacks, User-to-Root (U2R) attacks, and Remote-to-Local (R2L) attacks. The K-means algorithm reveals which attack groups are similar in nature. The experiments show that the KDDCup 99 dataset can serve as an effective benchmark to help researchers compare different intrusion detection models, and they illustrate the application of the K-means clustering algorithm in the network security domain.

References :

https://www.researchgate.net/publication/324155493_An_Analysis_of_K-means_Algorithm_Based_Network_Intrusion_Detection_System