Building an Intrusion Detection System using KDD Cup’99 Dataset

Saurabh Singh · Published in Analytics Vidhya · Jan 12, 2020

A network intrusion is any unauthorized activity on a computer network. Software to detect network intrusions aims at protecting a computer network from unauthorized users, including perhaps insiders.

Problem Statement:-

In this project, we will build a network intrusion detector: a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" or normal connections. The dataset includes a wide variety of intrusions simulated in a military network environment. The data used to build the intrusion detector was prepared and managed by MIT Lincoln Labs, with the objective of surveying and evaluating research in intrusion detection. For more details on how the data was collected, and a description of the various features involved, you can visit the link below:

Building an Intrusion Detection System:

(I) Importing Data:

import pandas as pd

# 'features' is the list of 42 column names built from the kddcup.names file
data = pd.read_csv('kddcup.data_10_percent_corrected', names=features, header=None)

print('The no of data points are:', data.shape[0])
print('='*40)
print('The no of features are:', data.shape[1])
print('='*40)
print('Some of the features are:', features[:10])
The no of data points are: 494021
========================================
The no of features are: 42
========================================
Some of the features are: ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot']

The dataset consists of 494,021 data points and 42 features. The list of features and their details can be obtained from the link below:

http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
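The features list passed to read_csv above holds these 42 column names. A minimal sketch of how it could be built from the kddcup.names file is shown below; this assumes the file's first line lists the attack labels and every following line has the form "name: type.", and the label column is named 'intrusion_type' to match the rest of the code.

# Sketch: build the 42 column names from kddcup.names (assumes the first line of the
# file holds the class labels and every other line looks like "duration: continuous.")
with open('kddcup.names') as f:
    lines = f.read().strip().split('\n')
features = [line.split(':')[0] for line in lines[1:]]  # 41 feature names
features.append('intrusion_type')                      # name for the output label column
print(len(features))  # expected: 42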

The different categories of output label are shown below:

output = data['intrusion_type'].values
labels = set(output)
print('The different type of output labels are:',labels)
print('='*125)
print('No. of different output labels are:', len(labels))
The different type of output labels are: {'neptune.', 'multihop.', 'warezmaster.', 'portsweep.', 'smurf.', 'land.', 'teardrop.', 'nmap.', 'guess_passwd.', 'normal.', 'perl.', 'spy.', 'satan.', 'ftp_write.', 'loadmodule.', 'pod.', 'back.', 'buffer_overflow.', 'phf.', 'rootkit.', 'warezclient.', 'imap.', 'ipsweep.'}
====================================================================
No. of different output labels are: 23

As we can see, the data has a total of 23 different output classes, out of which class “normal” represents the good connections, whereas the remaining 22 classes represent different types of bad connections.

(II) Data Cleaning:-

An important step involved while dealing with datasets is to clean the available data before using it for Data Analysis and building models. Some important steps involved in the data cleaning process are removing/imputing NULL values and removing duplicates from the dataset.

Checking for NULL values:-

print('Null values in the dataset are:', len(data[data.isnull().any(axis=1)]))
Null values in the dataset are: 0

Checking for DUPLICATE values:-

data.drop_duplicates(subset=features, keep='first', inplace=True)
data.shape
(145586, 42)
data.to_pickle('data.pkl')

In the last step, we have stored the data in a pickle file. This is done so that, in case we need the cleaned data again, we can load it directly from the pickle file without having to import the raw data and clean it all over again.
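For example, the cleaned data can later be reloaded with a single call (a small sketch using pandas):

# Reload the cleaned, de-duplicated data from the pickle file
data = pd.read_pickle('data.pkl')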

(III) Exploratory Data Analysis:-

Exploratory data analysis (EDA) is an approach for analyzing a dataset to summarize its main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is about seeing what the data can tell us beyond the formal modeling.

Below, we have used Python libraries like matplotlib, pandas, and seaborn for performing EDA. We have also built some utility functions that are used to create plots for bi-variate analysis.

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(20,15))
class_distribution = data['intrusion_type'].value_counts()
class_distribution.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in train data')
plt.grid()
plt.show()

# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(class_distribution.values): the minus sign gives us the counts in decreasing order
sorted_yi = np.argsort(-class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', class_distribution.index[i], ':', class_distribution.values[i],
          '(', np.round((class_distribution.values[i]/data.shape[0]*100), 3), '%)')
Number of data points in class normal : 87832 ( 60.33 %)
Number of data points in class neptune : 51820 ( 35.594 %)
Number of data points in class back : 968 ( 0.665 %)
Number of data points in class teardrop : 918 ( 0.631 %)
Number of data points in class satan : 906 ( 0.622 %)
Number of data points in class warezclient : 893 ( 0.613 %)
Number of data points in class ipsweep : 651 ( 0.447 %)
Number of data points in class smurf : 641 ( 0.44 %)
Number of data points in class portsweep : 416 ( 0.286 %)
Number of data points in class pod : 206 ( 0.141 %)
Number of data points in class nmap : 158 ( 0.109 %)
Number of data points in class guess_passwd : 53 ( 0.036 %)
Number of data points in class buffer_overflow : 30 ( 0.021 %)
Number of data points in class warezmaster : 20 ( 0.014 %)
Number of data points in class land : 19 ( 0.013 %)
Number of data points in class imap : 12 ( 0.008 %)
Number of data points in class rootkit : 10 ( 0.007 %)
Number of data points in class loadmodule : 9 ( 0.006 %)
Number of data points in class ftp_write : 8 ( 0.005 %)
Number of data points in class multihop : 7 ( 0.005 %)
Number of data points in class phf : 4 ( 0.003 %)
Number of data points in class perl: 3 ( 0.002 %)
Number of data points in class spy : 2 ( 0.001 %)

Observations:-

  • Most of the data points (around 60.33 %) belong to the "normal" category, i.e. good connections.
  • Among the categories that belong to bad connections, classes "neptune." (35.594 %) and "back." (0.665 %) have the highest no. of data points.
  • Classes "rootkit.", "loadmodule.", "ftp_write.", "multihop.", "phf.", "perl." and "spy." have the least no. of data points, with fewer than 10 data points per class.
  • The dataset is highly imbalanced, so we will need to build a model that can still classify the points belonging to all of these classes accurately.

Performance metrics for the problem:-

  • We will use the CONFUSION MATRIX as it will help us to determine how well a model has been able to classify the data points belonging to each of the 23 classes.
  • Along with the confusion matrix, we will also calculate precision, recall and weighted f1-score to determine the best model.

Another important metric:

For this problem, we want our FPR to be as low as possible, because a "Normal" connection getting dropped after being misclassified as a "Bad" connection is far less severe than a "Bad" connection getting misclassified as "Normal", which may result in a security threat.

  • For this Intrusion Detection problem, the TPR is the fraction of "Normal" connections correctly classified as "Normal", and the FPR is the fraction of "Bad" connections incorrectly classified as "Normal".
  • Thus, while applying different ML techniques to the data, along with the confusion matrix and f1-score, we will also calculate the TPR and FPR scores to help us choose the best model; a small example of this computation is shown below.
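As a quick illustration of these two scores, the toy snippet below computes TPR and FPR for the binary Normal-vs-Bad view of the predictions using scikit-learn's confusion_matrix; the y_true and y_pred arrays here are made-up examples, and the full function used in this project appears later in the post.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(['normal.', 'neptune.', 'normal.', 'smurf.'])  # hypothetical actual labels
y_pred = np.array(['normal.', 'normal.', 'normal.', 'smurf.'])   # hypothetical predicted labels

# Collapse to the binary Normal-vs-Bad view: "positive" = Normal connection
true_bin = (y_true == 'normal.')
pred_bin = (y_pred == 'normal.')

tn, fp, fn, tp = confusion_matrix(true_bin, pred_bin, labels=[False, True]).ravel()
tpr = tp / (tp + fn)  # fraction of Normal connections predicted as Normal
fpr = fp / (fp + tn)  # fraction of Bad connections predicted as Normal
print(tpr, fpr)       # 1.0 0.5 for this toy example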

Univariate Analysis:-

  1. Duration:-
import seaborn as sns

plt.figure(figsize=(20,16))
sns.set(style="whitegrid")
ax = sns.violinplot(x="intrusion_type", y="duration", data=data)
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'
)
plt.show()
  • The univariate analysis using box plots and violin plots does not give us any clear, satisfactory separation between the classes.
  • Thus, we will go with pair plots for bi-variate analysis, or we can use PCA / t-SNE to reduce the no. of dimensions and perform bi-variate / tri-variate analysis.

Bivariate Analysis using pairplot:-

def pairplot(data, label, features=[]):
    '''
    This function creates a pairplot taking 4 features from our dataset as default parameters along with the output variable
    '''
    sns.pairplot(data, hue=label, height=4, diag_kind='hist', vars=features,
                 plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'})

The above function takes 4 features from our dataset and plots 16 Bivariate plots with different combinations of 2 features in each of the 16 plots as shown below.

Similarly, many such pairplots with different combinations of the features can be plotted.
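For instance, the call below would produce one such 4x4 grid for four of the basic traffic features (the particular choice of features here is just an illustrative assumption):

pairplot(data, 'intrusion_type', features=['duration', 'src_bytes', 'dst_bytes', 'count'])
plt.show()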

Observation from Pairplots:-

  • None of the pair plots are able to show any linear separability/ almost linear separability between the different output categories.

TSNE for Bivariate Analysis:-

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or three dimensions suitable for human observation.

Below we have written a function that can be used to plot the TSNE plots by specifying the values for perplexity and no. of iterations as per our choice.

from sklearn.manifold import TSNE

def tsne_func(data, label, no_components, perplexity_value, n_iter_value):
    '''
    This function applies TSNE on the original dataset with no_components, perplexity_value, n_iter_value as the TSNE
    parameters, transforms the original dataset into a TSNE-transformed feature space with the number of features equal
    to the value specified for no_components, and plots a scatter plot of the transformed data points along with their
    class label.
    '''
    print('TSNE with perplexity={} and no. of iterations={}'.format(perplexity_value, n_iter_value))
    tsne = TSNE(n_components=no_components, perplexity=perplexity_value, n_iter=n_iter_value)
    tsne_df1 = tsne.fit_transform(data)
    print(tsne_df1.shape)
    tsne_df1 = np.vstack((tsne_df1.T, label)).T
    tsne_data1 = pd.DataFrame(data=tsne_df1, columns=['feature1', 'feature2', 'Output'])
    sns.FacetGrid(tsne_data1, hue='Output', size=6).map(plt.scatter, 'feature1', 'feature2').add_legend()
    plt.show()
  1. TSNE plot(2D) with perplexity_value=100, n_iter_value=500:

2. TSNE plot(2D) with perplexity_value=50, n_iter_value=1000:
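The two plots above could be produced with calls like the following; the names data_sample and Y_sample are assumptions here, since t-SNE is computationally expensive and is usually run on a standardized and/or sub-sampled copy of the data.

# Hypothetical usage of tsne_func on a (sub)sampled, numeric version of the data
tsne_func(data_sample, Y_sample, 2, 100, 500)   # perplexity=100, 500 iterations
tsne_func(data_sample, Y_sample, 2, 50, 1000)   # perplexity=50, 1000 iterations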

Observations:-

From the above 2 plots, we can conclude that there is no linear separability between any 2 or more categories in the TSNE transformed 2-D space.

(V) Train-Test Split:-

Below, we have performed the train-test split, dividing the dataset into two parts: the training data holds 75% of the points and the test data the remaining 25%.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data.drop('intrusion_type', axis=1), data['intrusion_type'],
                                                    stratify=data['intrusion_type'], test_size=0.25)
print('Train data')
print(X_train.shape)
print(Y_train.shape)
print('='*20)
print('Test data')
print(X_test.shape)
print(Y_test.shape)
Train data
(109189, 41)
(109189,)
====================
Test data
(36397, 41)
(36397,)

The reason why we have not split the train data further into train and cross-validation sets is that we will be using K-fold cross-validation on the train data using Grid-search CV while building our models.

(VI) Preprocessing features in our data:-

  1. Vectorizing categorical data using One-hot encoding:-

Our dataset has 3 categorical features, namely protocol_type, service, and flag, which we will vectorize using one-hot encoding as shown below.

1. Protocol_type:-
--------------------
protocol = list(X_train['protocol_type'].values)
protocol = list(set(protocol))
print('Protocol types are:', protocol)
Protocol types are: ['udp', 'tcp', 'icmp']

from sklearn.feature_extraction.text import CountVectorizer
one_hot = CountVectorizer(vocabulary=protocol, binary=True)
train_protocol = one_hot.fit_transform(X_train['protocol_type'].values)
test_protocol = one_hot.transform(X_test['protocol_type'].values)
print(train_protocol[1].toarray())
print(train_protocol.shape)
[[0 1 0]]
(109189, 3)

Similarly, we also apply one-hot encoding on service and flag features.
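A sketch of the same CountVectorizer-based encoding applied to the service and flag columns is shown below; the variable names train_service, test_service, train_flag and test_flag match the ones used in the merge step later (lowercase=False is passed because flag values such as 'SF' are upper-case).

service = list(set(X_train['service'].values))
one_hot_service = CountVectorizer(vocabulary=service, binary=True, lowercase=False)
train_service = one_hot_service.fit_transform(X_train['service'].values)
test_service = one_hot_service.transform(X_test['service'].values)

flag = list(set(X_train['flag'].values))
one_hot_flag = CountVectorizer(vocabulary=flag, binary=True, lowercase=False)
train_flag = one_hot_flag.fit_transform(X_train['flag'].values)
test_flag = one_hot_flag.transform(X_test['flag'].values)

print(train_service.shape, train_flag.shape)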

2. Standardizing the features:-

Data standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1.

Below is a function that takes one of the features as a parameter and applies standardization to it.

from sklearn.preprocessing import StandardScaler

def feature_scaling(X_train, X_test, feature_name):
    '''
    This function performs standardisation on the given feature
    '''
    scaler = StandardScaler()
    scaler1 = scaler.fit_transform(X_train[feature_name].values.reshape(-1,1))
    scaler2 = scaler.transform(X_test[feature_name].values.reshape(-1,1))

    return scaler1, scaler2
1. duration :-
---------------
duration1, duration2 = feature_scaling(X_train, X_test, 'duration')
print(duration1[1])
[-0.10631]

2. src_bytes :-
------------------
src_bytes1, src_bytes2 = feature_scaling(X_train, X_test, 'src_bytes')
print(src_bytes1[1])
[-0.02721124]

3. dst_bytes :-
------------------
dst_bytes1, dst_bytes2 = feature_scaling(X_train, X_test, 'dst_bytes')
print(dst_bytes1[1])
[-0.03568432]

Similarly, we will apply Standardization on other continuous features in our dataset.
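The remaining continuous columns follow the same two-line pattern; a brief sketch for three more of them is shown below (the same is repeated for every other numeric feature that appears in the merge step of the next section).

wrong_fragment1, wrong_fragment2 = feature_scaling(X_train, X_test, 'wrong_fragment')
urgent1, urgent2 = feature_scaling(X_train, X_test, 'urgent')
hot1, hot2 = feature_scaling(X_train, X_test, 'hot')
# ... and so on for count, srv_count, serror_rate, dst_host_count, etc.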

(VII) Merging the features to prepare final data:-

After vectorizing and standardizing the features, we will merge these features to obtain our final training and test datasets that will be fed to our ML models for training and evaluation purposes.

from scipy.sparse import hstack

X_train_1 = hstack((duration1, train_protocol, train_service, train_flag, src_bytes1, dst_bytes1, land1.T,
                    wrong_fragment1, urgent1, hot1, num_failed_logins1, logged_in1.T, num_compromised1, root_shell1,
                    su_attempted1, num_root1, num_file_creations1, num_shells1, num_access_files1, is_host_login1.T,
                    is_guest_login1.T, count1, srv_count1, serror_rate1, srv_serror_rate1, rerror_rate1,
                    srv_rerror_rate1, same_srv_rate1, diff_srv_rate1, srv_diff_host_rate1, dst_host_count1,
                    dst_host_srv_count1, dst_host_same_srv_rate1, dst_host_diff_srv_rate1, dst_host_same_src_port_rate1,
                    dst_host_srv_diff_host_rate1, dst_host_serror_rate1, dst_host_srv_serror_rate1,
                    dst_host_rerror_rate1, dst_host_srv_rerror_rate1))
X_train_1.shape
(109189, 116)
--------------------------------------------------------------------
X_test_1 = hstack((duration2, test_protocol, test_service, test_flag, src_bytes2, dst_bytes2, land2.T,
                   wrong_fragment2, urgent2, hot2, num_failed_logins2, logged_in2.T, num_compromised2, root_shell2,
                   su_attempted2, num_root2, num_file_creations2, num_shells2, num_access_files2, is_host_login2.T,
                   is_guest_login2.T, count2, srv_count2, serror_rate2, srv_serror_rate2, rerror_rate2,
                   srv_rerror_rate2, same_srv_rate2, diff_srv_rate2, srv_diff_host_rate2, dst_host_count2,
                   dst_host_srv_count2, dst_host_same_srv_rate2, dst_host_diff_srv_rate2, dst_host_same_src_port_rate2,
                   dst_host_srv_diff_host_rate2, dst_host_serror_rate2, dst_host_srv_serror_rate2,
                   dst_host_rerror_rate2, dst_host_srv_rerror_rate2))
X_test_1.shape
(36397, 116)

Further Approach to our problem:-

(i) We will apply the classifiers below on our dataset and evaluate their performance:-

1. Naive Bayes
2. Logistic Regression
3. SVM
4. Decision Tree
5. Random Forest
6. GBDT / XGBoost

(ii) Based on the performance metric scores we obtain from the above classifiers, we will apply the feature engineering techniques below on our dataset to get some additional features:

  • 1. Clustering features:- We will apply clustering on our dataset and add the cluster assignments as an additional feature.
  • 2. PCA-transformed features:- We will apply PCA on the dataset and add the top 5 principal components as additional features.
  • 3. Feature engineering using existing features:- We will create new features from the data as shown below:
    (i) Adding 2 features (e.g. new_feature_1 = src_bytes + dst_bytes)
    (ii) Subtracting 2 features (e.g. new_feature_2 = abs(src_bytes - dst_bytes))

(iii) We will then apply the best performing classifiers from dataset 1 on dataset 2 and evaluate their performance.

(VIII) Applying Machine Learning Models:-

Below, are some functions that will be used for different purposes during the model building stage.

Function_1 :-

The function below plots a confusion-matrix heatmap that will help us determine how well the model has classified the data points belonging to the different categories.

import datetime as dt
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

def confusion_matrix_func(Y_test, y_test_pred):
    '''
    This function computes the confusion matrix using the predicted and actual values and plots it as a heatmap
    '''
    C = confusion_matrix(Y_test, y_test_pred)
    cm_df = pd.DataFrame(C)
    labels = ['back', 'buffer_overflow', 'loadmodule', 'guess_passwd', 'imap', 'ipsweep', 'warezmaster', 'rootkit',
              'multihop', 'neptune', 'nmap', 'normal', 'phf', 'perl', 'pod', 'portsweep', 'ftp_write', 'satan',
              'smurf', 'teardrop', 'warezclient', 'land']
    plt.figure(figsize=(20,15))
    sns.set(font_scale=1.4)
    sns.heatmap(cm_df, annot=True, annot_kws={"size": 12}, fmt='g', xticklabels=labels, yticklabels=labels)
    plt.ylabel('Actual Class')
    plt.xlabel('Predicted Class')

    plt.show()

Function_2 :-

This function fits the grid-searched model on the train data, predicts on both the train and test data, and also reports the total time taken to fit the model and predict the output.

def model(model_name, X_train, Y_train, X_test, Y_test):
    '''
    Fits the model on the train data and evaluates the performance on the train and test data.
    '''
    print('Fitting the model and prediction on train data:')
    start = dt.datetime.now()
    model_name.fit(X_train, Y_train)
    y_tr_pred = model_name.predict(X_train)
    print('Completed')
    print('Time taken:', dt.datetime.now()-start)
    print('='*50)

    results_tr = dict()
    results_tr['precision'] = precision_score(Y_train, y_tr_pred, average='weighted')
    results_tr['recall'] = recall_score(Y_train, y_tr_pred, average='weighted')
    results_tr['f1_score'] = f1_score(Y_train, y_tr_pred, average='weighted')

    results_test = dict()
    print('Prediction on test data:')
    start = dt.datetime.now()
    y_test_pred = model_name.predict(X_test)
    print('Completed')
    print('Time taken:', dt.datetime.now()-start)
    print('='*50)

    print('Performance metrics:')
    print('='*50)
    print('Confusion Matrix is:')
    confusion_matrix_func(Y_test, y_test_pred)
    print('='*50)
    results_test['precision'] = precision_score(Y_test, y_test_pred, average='weighted')
    print('Precision score is:')
    print(results_test['precision'])
    print('='*50)
    results_test['recall'] = recall_score(Y_test, y_test_pred, average='weighted')
    print('Recall score is:')
    print(results_test['recall'])
    print('='*50)
    results_test['f1_score'] = f1_score(Y_test, y_test_pred, average='weighted')
    print('F1-score is:')
    print(results_test['f1_score'])
    # add the fitted model (e.g. the GridSearchCV object) to the results
    results_test['model'] = model_name

    return results_tr, results_test

Function_3 :-

def print_grid_search_attributes(model):
    '''
    This function prints all the grid search attributes
    '''
    print('---------------------------')
    print('|     Best Estimator      |')
    print('---------------------------')
    print('\n\t{}\n'.format(model.best_estimator_))

    # parameters that gave the best results while performing the grid search
    print('---------------------------')
    print('|     Best parameters     |')
    print('---------------------------')
    print('\tParameters of best estimator : \n\n\t{}\n'.format(model.best_params_))

    # number of cross-validation splits
    print('----------------------------------')
    print('|   No of CrossValidation sets   |')
    print('----------------------------------')
    print('\n\tTotal number of cross validation sets: {}\n'.format(model.n_splits_))

    # average cross-validated score of the best estimator, from the grid search
    print('---------------------------')
    print('|        Best Score       |')
    print('---------------------------')
    print('\n\tAverage Cross Validate scores of best estimator : \n\n\t{}\n'.format(model.best_score_))

The above function prints the grid-search related attributes like no_of_splits, best_estimator, best_parameters, and best_score.

def tpr_fpr_func(Y_tr, Y_pred):
    '''
    This function computes the TPR and FPR scores using the actual and predicted values.
    '''
    results = dict()
    Y_tr = Y_tr.to_list()
    tp = 0; fp = 0; positives = 0; negatives = 0

    for i in range(len(Y_tr)):
        if Y_tr[i] == 'normal.':
            positives += 1
        else:
            negatives += 1

    for i in range(len(Y_pred)):
        if Y_tr[i] == 'normal.' and Y_pred[i] == 'normal.':
            tp += 1
        elif Y_tr[i] != 'normal.' and Y_pred[i] == 'normal.':
            fp += 1

    tpr = tp/positives
    fpr = fp/negatives

    results['tp'] = tp; results['tpr'] = tpr; results['fp'] = fp; results['fpr'] = fpr

    return results

The above function computes TP, FP, TPR, FPR using the actual and predicted values.
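For example, the TPR/FPR dictionaries reported for each model below can be obtained with calls like these (a sketch; y_tr_pred and y_test_pred are assumed to hold the train and test predictions of the fitted model):

tpr_fpr_train = tpr_fpr_func(Y_train, y_tr_pred)
tpr_fpr_test = tpr_fpr_func(Y_test, y_test_pred)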

Model_1:- Gaussian Naive Bayes

The 1st model that we will apply to our dataset is the Gaussian Naive Bayes model. The hyperparameter involved is "var_smoothing", which adds a portion of the largest feature variance to the variances of all features in order to avoid the numerical instability caused by very small variance values.

from sklearn.naive_bayes import GaussianNB

hyperparameter = {'var_smoothing': [10**x for x in range(-9, 3)]}
nb = GaussianNB()
nb_grid = GridSearchCV(nb, param_grid=hyperparameter, cv=5, verbose=1, n_jobs=-1)
nb_grid_results_tr, nb_grid_results_test = model(nb_grid, X_train_1.toarray(), Y_train, X_test_1.toarray(), Y_test)
--------------------------------------------------------------------
Fitting the model and prediction on train data:
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 8.5s
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 12.1s finished
Completed
Time taken: 0:00:17.167590
==================================================
Prediction on test data:
Completed
Time taken: 0:00:00.712164
==================================================
Performance metrics:
==================================================
Precision score is:
0.9637974665033534
==================================================
Recall score is:
0.974201170426134
==================================================
F1-score is:
0.9679678294214985
NB Confusion Matrix
print_grid_search_attributes(nb_grid)
---------------------------
| Best Estimator |
---------------------------
GaussianNB(priors=None, var_smoothing=10)
---------------------------
| Best parameters |
---------------------------
Parameters of best estimator :
{'var_smoothing': 10}
----------------------------------
| No of CrossValidation sets |
----------------------------------
Total number of cross validation sets: 5
---------------------------
| Best Score |
---------------------------
Average Cross Validate scores of best estimator : 0.9729551511599154

The final results obtained from NB Classifier are as below:

Train results:-

tpr_fpr_train
{'fp': 2225,
'fpr': 0.051367886413482625,
'tp': 65483,
'tpr': 0.9940644260254425}
nb_grid_results_tr
{'f1_score': 0.9671813437943309,
'precision': 0.9632732426450655,
'recall': 0.9738984696260612}

Test results:-

tpr_fpr_test
{'fp': 710, 'fpr': 0.04917238035875061, 'tp': 21814, 'tpr': 0.9934420256853994}

nb_grid_results_test
{'f1_score': 0.9679678294214985,
'model': <function __main__.model(model_name, X_train, Y_train, X_test, Y_test)>,
'precision': 0.9637974665033534,
'recall': 0.974201170426134}

Observations from NB Classifier:-

  • The test data has 36397 points in total. Out of these, 21958 belong to Normal connections and the remaining 14439 belong to Bad connections.
  • Out of the 21958 Normal connection points, 21814 (99.34%) were classified correctly by the Naive Bayes Classifier.
  • Out of the 14439 points belonging to Bad connections, class neptune has the highest no. of data points, 12955, out of which 12954 (99.99%) were classified correctly.
  • Among the classes with very few data points, class guess_passwd was classified with (12/13) 92.30% accuracy, class buffer_overflow with (6/7) 85.71% accuracy, class warezmaster with (4/5) 80% accuracy, class land with (4/5) 80% accuracy, class imap with (0/3) 0% accuracy, class loadmodule with (1/2) 50% accuracy, class rootkit with (0/2) 0% accuracy, class multihop with (0/2) 0% accuracy, class ftp_write with (0/2) 0% accuracy, and classes phf and perl both with (1/1) 100% accuracy.
  • Although the Naive Bayes Classifier was able to classify points with a high f1-score of ~0.968, we will use more advanced linear and non-linear classifiers ahead and try to separate the Normal and Bad connections with an even higher f1-score.
  • False Positives: 710
  • False Positive Rate (FPR): 0.049
  • True Positives: 21814
  • True Positive Rate (TPR): 0.9934
  • As the train and test scores are high and almost identical, we can say that the model is NEITHER OVERFITTING NOR UNDERFITTING.

Model_2:- Decision Tree Classifier

A Decision Tree Classifier is a non-linear ML classifier that uses multiple lines/planes/hyperplanes to make decisions and classify points belonging to different categories, much like a nested if-else statement.

from sklearn.tree import DecisionTreeClassifier

hyperparameter = {'max_depth': [5, 10, 20, 50, 100, 500], 'min_samples_split': [5, 10, 100, 500]}
decision_tree = DecisionTreeClassifier(criterion='gini', splitter='best', class_weight='balanced')
decision_tree_grid = GridSearchCV(decision_tree, param_grid=hyperparameter, cv=3, verbose=1, n_jobs=-1)
decision_tree_grid_results_tr, decision_tree_grid_results_test = model(decision_tree_grid, X_train_1.toarray(), Y_train, X_test_1.toarray(), Y_test)
--------------------------------------------------------------------
Fitting the model and prediction on train data:
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 17.8s
[Parallel(n_jobs=-1)]: Done 72 out of 72 | elapsed: 36.3s finished
Completed
Time taken: 0:00:36.574308
==================================================
Prediction on test data:
Completed
Time taken: 0:00:00.018077
==================================================
Performance metrics:
==================================================
Precision score is:
0.9986638296866037
==================================================
Recall score is:
0.9985713108223205
==================================================
F1-score is:
0.9986068375429693
DT_1 Confusion Matrix
print_grid_search_attributes(decision_tree_grid)
---------------------------
| Best Estimator |
---------------------------
DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=50,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
---------------------------
| Best parameters |
---------------------------
Parameters of best estimator :
{'max_depth': 50, 'min_samples_split': 5}
----------------------------------
| No of CrossValidation sets |
----------------------------------
Total number of cross validation sets: 3
---------------------------
| Best Score |
---------------------------
Average Cross Validate scores of best estimator : 0.9983056901336215

The final results obtained from DT Classifier are as below:

Train results:-

decision_tree_grid_results_tr
{'f1_score': 0.9997583211262271,
'precision': 0.9997729384543836,
'recall': 0.9997527223438258}
dt_tpr_fpr_train
{'fp': 0, 'fpr': 0.0, 'tp': 65853, 'tpr': 0.9996812095819292}

Test results:-

decision_tree_grid_results_test
{'f1_score': 0.99860727686375,
'model': <function __main__.model(model_name, X_train, Y_train, X_test, Y_test)>,
'precision': 0.9986657857309603,
'recall': 0.9985713108223205}
dt_tpr_fpr_test
{'fp': 19, 'fpr': 0.001315880601149664, 'tp': 21937, 'tpr': 0.9990436287457874}

Observations from DT Classifier:-

  • Out of the 21958 Normal connection points, 21937 (99.90%) were correctly classified by the Decision Tree Classifier.
  • Out of the 14439 points belonging to Bad connections, class neptune has the highest no. of data points, 12955, out of which 12953 (99.98%) were classified correctly.
  • Among the classes with very few data points, class guess_passwd was classified with (13/13) 100% accuracy, class buffer_overflow with (6/7) 85.71% accuracy, class warezmaster with (5/5) 100% accuracy, class land with (4/5) 80% accuracy, class imap with (3/3) 100% accuracy, class loadmodule with (0/2) 0% accuracy, class rootkit with (1/2) 50% accuracy, class multihop with (1/2) 50% accuracy, class ftp_write with (0/2) 0% accuracy, class phf with (1/1) 100% accuracy and class perl with (1/1) 100% accuracy.
  • The Decision Tree Classifier was able to classify points with a higher f1-score of 0.9986 compared to all the previous classifiers.
  • True Positives = 21937
  • TPR = 0.9990
  • False Positives = 19
  • FPR = 0.0013
  • The DT Classifier has the lowest FPR and highest TPR compared to all of the above models.
  • Thus, we can say that a non-linear ML model like a Decision Tree is able to learn the patterns in this dataset better than linear classifiers like Logistic Regression.

Model_3 :- XGBoost Classifier

An XGBoost Classifier uses the concept of Boosting: it builds an ensemble of weak learners (shallow decision trees), where each new tree learns from the mistakes made by the previous ones, and hundreds of such weak learners are combined to form a strong final model.

from xgboost import XGBClassifier

hyperparameter = {'max_depth': [2, 3, 5, 7, 10], 'n_estimators': [10, 50, 100, 200, 500]}
xgb = XGBClassifier(objective='multi:softprob')
xgb_grid = GridSearchCV(xgb, param_grid=hyperparameter, cv=3, verbose=1, n_jobs=-1)
xgb_grid_results_tr, xgb_grid_results_test = model(xgb_grid, X_train_1.toarray(), Y_train, X_test_1.toarray(), Y_test)
--------------------------------------------------------------------
Fitting the model and prediction on train data:
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 51.3min
[Parallel(n_jobs=-1)]: Done 75 out of 75 | elapsed: 143.6min finished
Completed
Time taken: 0:19:06.398788
==================================================
Prediction on test data:
Completed
Time taken: 0:00:10.396326
==================================================
Performance metrics:
==================================================
Precision score is:
0.9993660928288938
==================================================
Recall score is:
0.9995054537461878
==================================================
F1-score is:
0.9994282483855045
XGB_1 Confusion Matrix
print_grid_search_attributes(xgb_grid)
---------------------------
| Best Estimator |
---------------------------
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
---------------------------
| Best parameters |
---------------------------
Parameters of best estimator :
{'max_depth': 5, 'n_estimators': 500}
----------------------------------
| No of CrossValidation sets |
----------------------------------
Total number of cross validation sets: 3
---------------------------
| Best Score |
---------------------------
Average Cross Validate scores of best estimator : 0.9992123748729268

The final results obtained from XGBoost Classifier are as below:

Train results:-

xgb_grid_results_tr
{'f1_score': 0.9999816952580419,
'precision': 0.9999817609521351,
'recall': 0.9999816831365796}
xgb_tpr_fpr_train
{'fp': 0, 'fpr': 0.0, 'tp': 65873, 'tpr': 0.9999848195039014}

Test results:-

xgb_grid_results_test
{'f1_score': 0.9994282483855045,
'model': <function __main__.model(model_name, X_train, Y_train, X_test, Y_test)>,
'precision': 0.9993660928288938,
'recall': 0.9995054537461878}
xgb_tpr_fpr_test
{'fp': 12,
'fpr': 0.0008310824849366299,
'tp': 21955,
'tpr': 0.9998633755351125}

Observations from XGBoost Classifier:-

  • Out of the 21958 Normal connection points, 21955 (99.98%) were correctly classified by the XGBoost Classifier.
  • Out of the 14439 points belonging to Bad connections, class neptune has the highest no. of data points, 12955, out of which 12955 (100.0%) were classified correctly.
  • Among the classes with very few data points, class guess_passwd was classified with (12/13) 92.30% accuracy, class buffer_overflow with (7/7) 100% accuracy, class warezmaster with (4/5) 80% accuracy, class land with (5/5) 100% accuracy, class imap with (3/3) 100% accuracy, class loadmodule with (0/2) 0% accuracy, class rootkit with (0/2) 0% accuracy, class multihop with (0/2) 0% accuracy, class ftp_write with (1/2) 50% accuracy, class phf with (1/1) 100% accuracy and class perl with (1/1) 100% accuracy.
  • The XGBoost Classifier was able to classify the different classes with the highest f1-score (0.9994) compared to all of the above models.
  • True Positives = 21955
  • TPR = 0.9998
  • False Positives = 12
  • FPR = 0.00083
  • The XGBoost Classifier has the highest TPR and lowest FPR compared to all of the previous models.
  • As the train and test metrics like f1-score, TPR, and FPR are almost identical, the model is NOT OVERFITTING.

Similarly, we applied the remaining models & below is the summary of their performance:

Results

Observation from ALL of the above classifiers:-

  • If we consider NORMAL connection points as one class and the points belonging to all the other 22 BAD connection classes as the second class, then the XGBoost Classifier is the best classifier, with a TPR of 0.9998 and an FPR of 0.00083.
  • Although the XGBoost Classifier had a better f1-score than the RF Classifier, if we go into the details of the confusion matrix scores, we can observe that both classifiers performed similarly across the different categories of attacks in our dataset.
  • The RF Classifier has a TPR of 0.9998 and an FPR of 0.0013.
  • The overall time taken for training + evaluation was lower for the RF and DT Classifiers compared to the XGBoost Classifier.
  • A common pattern shown by all of the classifiers is that classes rootkit, ftp_write, and loadmodule were misclassified as class Normal by most of the classifiers.
  • We will add more features to our dataset and try to improve classifier performance.
  • As DT, RF & XGBoost had the best performance, we will use these 3 classifiers ahead on the existing + feature engineered data.

As the train and test metrics like f1-score, TPR, and FPR have almost identical values on the train and test datasets, the models are NOT OVERFITTING.

(IX) Adding new features:-

  1. Clustering features (using MiniBatchKmeans):

Clustering is an unsupervised ML technique that groups similar (closer) points into the same cluster and dissimilar (farther) points into different clusters. The reason we are using clustering for this problem is that if the KMeans algorithm ends up grouping points belonging to the same category into the same cluster, our model gains a new feature that can be very informative for classifying the test data points.

from sklearn.cluster import MiniBatchKMeans
import numpy as np
kmeans = MiniBatchKMeans(n_clusters=23, random_state=0, batch_size=128, max_iter=100)
kmeans.fit(X_train_1)
MiniBatchKMeans(batch_size=128, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10,
n_clusters=23, n_init=3, random_state=0,reassignment_ratio=0.01, tol=0.0, verbose=0)
train_cluster = kmeans.predict(X_train_1)
test_cluster = kmeans.predict(X_test_1)
--------------------------------------------------------------------
print('Length of train cluster', len(train_cluster))
print(train_cluster)
Length of train cluster 109189
array([8, 0, 1, ..., 4, 4, 1], dtype=int32)
--------------------------------------------------------------------
print('Length of test cluster', len(test_cluster))
print(test_cluster)
Length of test cluster 36397
array([ 1, 22, 8, ..., 0, 17, 8], dtype=int32)
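Before adding the cluster IDs as a feature, it can be useful to check how well the clusters line up with the output classes, for example with a quick cross-tabulation (a sketch; Y_train is the label Series from the earlier split):

# Rows = cluster id, columns = intrusion_type; large, pure rows indicate useful clusters
cluster_vs_class = pd.crosstab(train_cluster, Y_train.values)
print(cluster_vs_class)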

2. PCA features:-

PCA is a dimensionality reduction technique that transforms the given d-dimensional dataset into d'-dimensions, where each new feature (principal component) is chosen based on the amount of variance, i.e. information, it carries. We will add the top 5 PCA features to our dataset. (We can add more or fewer and test whether they improve performance.)

from sklearn.decomposition import PCA

pca = PCA(n_components=5)
pca.fit(X_train_1.toarray())
pca_train = pca.transform(X_train_1.toarray())
pca_test = pca.transform(X_test_1.toarray())
--------------------------------------------------------------------
print(pca_train.shape)
print(pca_test.shape)
(109189, 5)
(36397, 5)
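How much information the top 5 components retain can be checked through their explained variance ratios (a small sketch):

print(pca.explained_variance_ratio_)        # variance share of each of the 5 components
print(pca.explained_variance_ratio_.sum())  # total fraction of variance captured by the top 5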

3. Additional feature engineering:-

  • We will create new features from the data like:
    (i) Adding 2 existing features (e.g. new_feature_1 = src_bytes + dst_bytes)
    (ii) Subtracting 2 existing features (e.g. new_feature_2 = abs(src_bytes - dst_bytes))

(a) src_bytes + dst_bytes

feature_src_dst_1 = src_bytes1 + dst_bytes1
feature_src_dst_2 = src_bytes2 + dst_bytes2
feature_src_dst_1.shape
(109189, 1)

(b) src_bytes - dst_bytes

feature_src_dst_3 = src_bytes1 - dst_bytes1
feature_src_dst_4 = src_bytes2 - dst_bytes2
feature_src_dst_3.shape
(109189, 1)

(c) same_srv_rate + diff_srv_rate :-

feature_5 = same_srv_rate1 + diff_srv_rate1
feature_6 = same_srv_rate2 + diff_srv_rate2
feature_5.shape
(109189, 1)

(d) dst_host_same_srv_rate + dst_host_diff_srv_rate :-

feature_7 = dst_host_same_srv_rate1 + dst_host_diff_srv_rate1
feature_8 = dst_host_same_srv_rate2 + dst_host_diff_srv_rate2
feature_7.shape
(109189, 1)

Adding clustering and PCA features to our dataset with the additional 4 features:-

X_train_2 = hstack((X_train_1, pca_train, train_cluster.T, feature_src_dst_1, feature_src_dst_3, feature_5, feature_7))
X_test_2 = hstack((X_test_1, pca_test, test_cluster.T, feature_src_dst_2, feature_src_dst_4, feature_6, feature_8))
print('Train data:')
print(X_train_2.shape)
print('='*30)
print('Test data:')
print(X_test_2.shape)
Train data:
(109189, 126)
==============================
Test data:
(36397, 126)

(X) Applying Machine Learning Models:-

We will apply the 3 models below on dataset 2, as they were the best performing models on dataset 1:

  1. Decision Tree
  2. Random Forest
  3. XGBoost

Model :- XGBoost Classifier

Below we have applied the XGBoost Classifier on dataset_2 and evaluated the performance.

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

hyperparameter = {'max_depth': [2, 3, 5, 7, 10], 'n_estimators': [10, 50, 100, 200, 500]}
xgb = XGBClassifier(objective='multi:softprob', n_jobs=-1)
xgb_grid = RandomizedSearchCV(xgb, param_distributions=hyperparameter, cv=3, verbose=1, n_jobs=-1)
xgb_grid_results2 = model(xgb_grid, X_train_2.toarray(), Y_train, X_test_2.toarray(), Y_test)
--------------------------------------------------------------------
Fitting the model and prediction on train data:
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 57.3min finished
Completed
Time taken: 2:00:09.666172
==================================================
Prediction on test data:
Completed
Time taken: 0:00:24.416710
==================================================
Performance metrics:
==================================================
==================================================
Precision score is:
0.9994189203758796
==================================================
Recall score is:
0.999450504162431
==================================================
F1-score is:
0.9994241935579559
XGB Confusion Matrix
print_grid_search_attributes(xgb_grid)
---------------------------
| Best Estimator |
---------------------------
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=200, n_jobs=-1,
nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
---------------------------
| Best parameters |
---------------------------
Parameters of best estimator :
{'n_estimators': 200, 'max_depth': 3}
----------------------------------
| No of CrossValidation sets |
----------------------------------
Total number of cross validation sets: 3
---------------------------
| Best Score |
---------------------------
Average Cross Validate scores of best estimator : 0.999166582714376

The final results obtained from XGBoost_2 Classifier are as below:

Train results:-

xgb_grid_results_tr
{'f1_score': 0.9999816952580419,
'precision': 0.9999817609521351,
'recall': 0.9999816831365796}
xgb_tpr_fpr_train
{'fp': 0, 'fpr': 0.0, 'tp': 65873, 'tpr': 0.9999848195039014}

Test results:-

xgb_grid_results_test
{'f1_score': 0.9994241935579559,
'model': <function __main__.model(model_name, X_train, Y_train, X_test, Y_test)>,
'precision': 0.9994189203758796,
'recall': 0.999450504162431}
xgb_tpr_fpr_test
{'fp': 12,
'fpr': 0.0008310824849366299,
'tp': 21955,
'tpr': 0.9998633755351125}

Observations from XGBoost_2 Classifier :-

  • This XGBoost Classifier was able to classify points with a high accuracy of ~99.94% and a high f1-score of ~0.9994, which is similar to the performance of the 1st XGBoost Classifier.
  • True Positives = 21955
  • TPR = 0.9998
  • False Positives = 12
  • FPR = 0.00083
  • This XGBoost Classifier has a TPR of 99.98% and an FPR of 0.083%, the same as the XGB_1 model.

Similarly, we applied the DT and RF models on the same dataset, and below are the results that were obtained.

Below are the results of the 3 models applied on Dataset 2.

Results

Important Observation from the above 3 models:-

  • From the performance scores obtained from the above 3 models, we can conclude that adding the new features has increased the TPR score, as the number of correctly classified "Normal" class points has increased; however, it has also resulted in an increase in the FPR score for all 3 models, which is not desirable.

(XI) Summarizing results and making Conclusion:-

Results
  • All the models have very close performance scores on train and test data, thus they are not OVERFITTING.

- The model XGBoost_1 is our best model for intrusion detection, as it has the highest test f1-score (0.9994) and TPR (99.98%) as well as the lowest FPR (0.083%).

This brings us to the end of this interesting case study, where we used the KDD Cup'99 dataset and applied different ML techniques to build a Network Intrusion Detection System that can distinguish Good connections from Bad ones with good precision while keeping the number of False Positives low.

To get the complete code, you can visit my GitHub repository below, where I have also tried solving this intrusion detection problem as a binary classification problem by combining the 22 BAD categories into a single 'BAD' category.
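As a sketch of that binary formulation (the exact mapping used in the repository may differ), the 23-class label can be collapsed into a two-class target like this:

# Collapse the 23 labels into a binary target: 'normal.' stays 'normal', every attack class becomes 'bad'
data['binary_label'] = np.where(data['intrusion_type'] == 'normal.', 'normal', 'bad')
print(data['binary_label'].value_counts())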


Thank You.
