The Art and Science of Cybersecurity Attack Detection: A Hybrid Approach

Combining Machine Learning and Rules for Cybersecurity.

Sina Nazeri
The Power of AI
16 min read · Apr 25, 2023


In today’s digital age, cyber attacks are a major concern for everyone. In fact, a recent poll showed that people are more afraid of cyber attacks than of nuclear war 🤯 (Gallup poll published in March 2023). According to IBM, ‘Cyber security is not a luxury, but a necessity in the digital age’. This project therefore aims to improve cyber security by using machine learning and rule-based methods to detect attacks in network data.

➜ Pro tip: If you would like to learn faster and run (or download) this project in a Jupyter Notebook for free, visit CognitiveClass.ai.

Libraries needed

pip install scikit-learn==1.0.0
pip install dtreeviz
pip install seaborn

Our main goal is to understand how attacks happen and what the important indicators of an attack are.

After completing this project you will be able to:

  • Understand how cyber attacks occur and identify important indicators of attacks.
  • Implement a monitoring system for attack detection using both rule-based and machine-learning approaches.
  • Learn how to visualize variables in network data.
  • Gain experience using machine learning algorithms such as Random Forest for classification and feature ranking.
  • Enhance your knowledge and skills in cybersecurity and get to know powerful tools for detecting and preventing cyber attacks.

Strategies to Detect Cyber Attacks

  1. The first approach to detecting cyber attacks is to use a rule-based system. These systems use a set of predefined rules to identify potential attacks based on known attack patterns.
  2. Another approach to detecting cyber attacks is to use machine learning algorithms, such as Random Forest and AdaBoost.
  3. In addition to these automated methods, human analysis can play a critical role in identifying cyber attacks.

image credit: https://pixabay.com/

Therefore, our strategy is to establish a rule-based system as the first layer of detection. Then, we use a machine learning algorithm to pinpoint attacks. Finally, we delve into the variables to understand their importance as indicators of cyber attacks.

Cyber Attack Data

The data was collected by the University of New South Wales (Australia) and includes records of different types of cyber attacks. The dataset contains network packets captured in the Cyber Range Lab of UNSW Australia. It is provided as separate training and testing sets, which we combine into one larger dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## loading the data
training = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0Q8REN/UNSW_NB15_training-set.csv")
testing = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0Q8REN/UNSW_NB15_testing-set.csv")
print("training ",training.shape)
print("testing ",testing.shape)


To achieve better performance, we combine both files into one larger dataset and later assign 70% of it to training and 30% to testing.

# checking if all the columns are similar
print(all(training.columns == testing.columns))

# creating one-whole dataframe which contains all data and drop the 'id' column
df = pd.concat([training,testing]).drop('id',axis=1)
df = df.reset_index(drop=True)

# print one attack sample
df.head(2)

The dataset includes 43 variables regarding monitoring the network and 2 variables that define if an attack happens (label) and the types of attacks (attack_cat). The description of all the variables is available at the end of this notebook.

# getting the attack category column 
df.attack_cat.unique()

The dataset includes nine types of attacks:

  1. Fuzzers: An attack that sends random data to a system to test its resilience and expose vulnerabilities.
  2. Analysis: An attack that probes the system to identify its weaknesses and potential targets for exploitation.
  3. Backdoors: An attack that creates a hidden entry point into a system for later use by the attacker.
  4. DoS (Denial of Service): An attack that aims to disrupt the normal functioning of a system, making it unavailable to its users.
  5. Exploits: An attack that leverages a vulnerability in a system to gain unauthorized access or control.
  6. Generic: In the UNSW-NB15 dataset, a technique that works against block ciphers in general, regardless of the structure of the specific cipher.
  7. Reconnaissance: An attack that gathers information about a target system, such as its vulnerabilities and potential entry points, in preparation for a future attack.
  8. Shellcode: An attack that runs a small piece of malicious code (a payload) on a target system, typically to gain control of it.
  9. Worms: A type of malware that spreads itself automatically to other systems, often causing harm in the process.

These nine categories cover a wide range of attack types that can be used to exploit a system, and it is important to be aware of them to protect against potential security threats.

Data Exploration

In this section, we briefly explore our dataset.

# exploring the types of variables
df.info()

As we can see, some categorical variables are stored as strings. In the following cell, we encode them as integer category codes using the pandas categorical type.

# some columns should be changed from string to categorical codes
for col in ['proto', 'service', 'state']:
    df[col] = df[col].astype('category').cat.codes

df['attack_cat'] = df['attack_cat'].astype('category') # keep the nominal info for the attack category
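
To make the encoding concrete, here is a small dependency-free sketch of what integer category codes look like. It assumes the categories are ordered lexically, which is how pandas orders string categories by default; the helper name `to_codes` and the sample protocol values are made up for illustration.

```python
def to_codes(values):
    """Map each distinct string to an integer code, with categories in lexical order."""
    categories = sorted(set(values))
    lookup = {cat: code for code, cat in enumerate(categories)}
    return categories, [lookup[v] for v in values]


# Toy protocol column (illustrative values, not rows from the dataset)
proto = ["tcp", "udp", "arp", "tcp"]
categories, codes = to_codes(proto)
print(categories)  # ['arp', 'tcp', 'udp']
print(codes)       # [1, 2, 0, 1]
```

The codes are only labels: their numeric order carries no meaning, which tree-based models tolerate reasonably well.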


Next, we explore how many records of each attack type are in the dataset.

# explore different types of attacks
print(df[df['label']==1]['attack_cat'].value_counts())

# plot the pie plot of attacks
df[df['label']==1]['attack_cat'].value_counts()\
    .plot\
    .pie(autopct='%1.1f%%', wedgeprops={'linewidth': 2, 'edgecolor': 'white', 'width': 0.50})

Implementing Rule-Based System

Both rule-based systems and machine learning systems have their own strengths and weaknesses, and using both together can provide a more comprehensive and effective approach to detecting cyber attacks. Here are a few reasons why:

  1. Explainability: Rule-based systems provide clear and concise rules that can be easily understood and interpreted by human experts. This makes it easier to understand how the system is making its predictions and to validate the results.
  2. Robustness: Rule-based systems are less likely to be affected by unexpected changes in the data distribution compared to machine learning models. They can still provide accurate results even when the data changes, as long as the rules remain valid.
  3. Speed: Rule-based systems can be much faster than machine learning models, especially for simple problems. This can be important in real-time monitoring systems where the response time needs to be fast.
  4. Complementary strengths: Rule-based systems and machine learning models can complement each other. Rule-based systems can be used to detect simple, well-defined attacks, while machine learning models can be used to detect more complex, subtle attacks.

In our project, we first employ a rule-based model and then a machine-learning model. Combining the two lets us take advantage of the strengths of each approach and build a more effective and comprehensive system for detecting cyber attacks.

Evaluation Metric

In the rule-based model, we aim for a high recall rate because this layer should raise an alarm on every potential threat, and we cannot afford to miss attacks (false negatives). Recall (or True Positive Rate) is calculated by dividing the true positives (detected attacks) by everything that should have been predicted as positive (detected plus non-detected attacks).

Learn more about confusion matrix (and image credit): https://keytodatascience.com/confusion-matrix/
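
As a quick illustration of the metric, the sketch below tallies the four confusion-matrix cells by hand and derives recall and precision from them. The label lists are toy values invented for this example (1 means attack), and the function names are our own, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Tally (TP, FP, FN, TN) for binary labels where 1 means attack."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of real attacks that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of raised alarms that were real attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels, invented for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)               # 4 2 1 1
print(recall(tp, fn))               # 0.8
print(round(precision(tp, fp), 3))  # 0.667
```

One missed attack out of five gives a recall of 0.8, while the two false alarms only affect precision. That is why a first-layer filter tuned for recall can tolerate some false alarms.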


# separating the target columns in the training and testing data 
from sklearn.model_selection import train_test_split

# Split the data into variables and target variables
# let's exclude label columns
X = df.loc[:, ~df.columns.isin(['attack_cat', 'label'])]
y = df['label'].values

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

# Getting the list of variables
feature_names = list(X.columns)

# print the shape of train and test data
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

We use a decision tree model to create a set of criteria for detecting cyber attacks in our rule-based system. The goal of this first layer of protection is to have a high recall rate, so we conduct a grid search to optimize the model toward maximizing recall.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# Create a decision tree classifier
dt = DecisionTreeClassifier()

# Use GridSearchCV to search for the best parameters
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='recall')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best recall score:", grid_search.best_score_)
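
To get a feel for how much work the grid search does, the following sketch expands the same parameter grid into its individual candidates using only the standard library. The helper `expand_grid` is a hypothetical name for this illustration; it mimics the enumeration step only and does not train anything.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
cv_folds = 5

print(len(candidates))             # 16 parameter combinations
print(len(candidates) * cv_folds)  # 80 cross-validation fits
print(candidates[0])
```

With 16 candidates and 5 folds, the search trains 80 trees during cross-validation and, with the default settings, refits the best candidate once more on the full training set.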

Using the parameters above, adjust the decision tree for a high recall rate.

from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

clf=grid_search.best_estimator_
#same as
#clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1, min_samples_split=2, criterion= 'entropy')
#clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate the recall of the model
recall = recall_score(y_test, y_pred)
print("Recall: ", recall)

# Recall: 1.0

One of the strengths of a decision tree is that it yields a set of rules that can be used in a rule-based system. Here, we visualize the rules.

# plot the tree 
from sklearn.tree import export_text
import dtreeviz

print(":::::::> The RULES FOR HIGH RECALL RATE <::::::: \n" ,export_text(clf,feature_names=feature_names))

# visualizing the tree
viz_model = dtreeviz.model(clf,
                           X_train=X_train, y_train=y_train,
                           feature_names=feature_names)

v = viz_model.view(fancy=True) # render as SVG into internal object
v

We create rules from the branches that the decision tree identifies as potential attacks (class 1). Then we apply these rules to the testing set and keep only the matching records, which we call X_test_2 and y_test_2.

X_test = X_test.reset_index(drop=True)

# filter out testing part based on our rules
rules= "(sttl <= 61.00 & sinpkt<= 0.00) | (sttl > 61.00 )"

# getting the index of records to keep
ind = X_test.query(rules).index

# filtering test set (both X_test and y_test)
X_test_2 = X_test.loc[ind,:]
y_test_2 = y_test[ind]

print(X_test.shape)
print(X_test_2.shape)
print("filtered data" , (1- np.round(X_test_2.shape[0] / X_test.shape[0],2))*100, "%")

Our simple rule-based system filtered out 23% of the network traffic as non-threatening and passed the rest on for further analysis. In practice, rule-based systems are more complex and can screen out the vast majority of non-threatening network traffic.
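
The same filter can be written as a plain Python predicate, which may be easier to read than the query string. This is only a sketch: the thresholds mirror the rule string above, while the function name `passes_rules` and the five sample records are invented for illustration.

```python
def passes_rules(record):
    """Return True when a record should go on to the machine-learning stage."""
    return (record["sttl"] <= 61 and record["sinpkt"] <= 0) or record["sttl"] > 61


# Toy records, invented for illustration
records = [
    {"sttl": 31, "sinpkt": 0.0},   # kept: low sttl and zero sinpkt
    {"sttl": 31, "sinpkt": 12.5},  # dropped
    {"sttl": 254, "sinpkt": 3.1},  # kept: sttl above the threshold
    {"sttl": 62, "sinpkt": 0.4},   # kept: sttl above the threshold
    {"sttl": 60, "sinpkt": 0.2},   # dropped
]

kept = [r for r in records if passes_rules(r)]
dropped_share = 1 - len(kept) / len(records)

print(len(kept))               # 3
print(f"{dropped_share:.0%}")  # 40%
```

Only the kept records reach the second layer, so every rule that safely drops traffic here reduces the load on the machine-learning model.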

The next step involves using machine learning to detect cyber attacks by applying the trained model to the filtered data (test_2) from the previous step. It may be useful to introduce Snort, which is a powerful open-source detection software that can be utilized for network security.

Introducing Snort For Rule-Based System

Snort is a free and open-source rule-based network intrusion detection and prevention system. It analyzes network traffic and identifies potential security threats based on specific patterns or behaviors. To use Snort, you install and configure it, create rules, start it, and analyze its alerts. Using it in a production environment requires a solid understanding of networking and security.

source: https://www.snort.org/

However, keep in mind that rule-based models may not be enough to protect against cyber attacks, especially in cloud services where more sophisticated strategies are needed. I will elaborate on a cloud security tool called QRadar in the IBM QRadar section below.

Machine Learning Model For Cyber Attack Detection

image credit: https://pixabay.com/

The combination of machine learning and rule-based models offers several advantages in detecting cyber attacks:

  1. Improved accuracy: Machine learning models can identify complex patterns and relationships in data, whereas rule-based models are limited by the explicit rules defined.
  2. Enhanced interpretability: Rule-based models are easier to understand and interpret, making it easier to validate the results generated by machine learning models.
  3. Increased speed: Machine learning models can quickly analyze large amounts of data, while rule-based models can make decisions faster in real-time.
  4. Better scalability: Machine learning models can be easily updated and retrained on new data, while rule-based models can be difficult to update as the threat landscape changes.
  5. Enriched data utilization: Both methods can complement each other by using different data sources and types, leading to a more comprehensive analysis.

Building a RandomForest Model

Random Forest is a good choice for cyber attack detection due to its high accuracy in classifying complex data patterns. The ability to interpret the results of Random Forest models also makes it easier to validate and understand the decisions it makes, leading to more effective and efficient cyber security measures.
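
Before training the real model, a toy sketch may help show the ensemble idea: several simple trees each cast a vote and the forest returns the majority. Everything here is hypothetical, including the three hand-written "trees" and their thresholds. Note also that scikit-learn's implementation averages the class probabilities of its trees rather than counting hard votes, so this is a simplification.

```python
from collections import Counter


# Three hand-written stand-ins for trained trees (hypothetical thresholds)
def tree_a(record):
    return 1 if record["sttl"] > 61 else 0


def tree_b(record):
    return 1 if record["rate"] > 1000 else 0


def tree_c(record):
    return 1 if record["swin"] == 0 else 0


def forest_predict(trees, record):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]
suspicious = {"sttl": 254, "rate": 50.0, "swin": 0}
ordinary = {"sttl": 31, "rate": 50.0, "swin": 255}

print(forest_predict(trees, suspicious))  # 1 (two of three trees vote "attack")
print(forest_predict(trees, ordinary))    # 0
```

Because each tree looks at the data from a slightly different angle, a single misleading feature is less likely to decide the outcome on its own.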

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

# Create a Random Forest model
rf = RandomForestClassifier(random_state=123)

# Train the model on the training data
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test_2)

# Calculate accuracy, recall, and precision on the filtered test set
acc = accuracy_score(y_test_2, y_pred)
rec = recall_score(y_test_2, y_pred)
per = precision_score(y_test_2, y_pred)
print("Recall: ", rec)
print("Precision: ", per)
print("Accuracy: ", acc)

As we can see, the random forest algorithm showed strong performance in cyber attack detection. To gain better insight into the performance of our prediction model, let’s plot a confusion matrix. It is important to note that the majority of our data contains actual attack information, as we filtered out some portion of non-threatening traffic in the previous step.

# plot confusion matrix
cross = pd.crosstab(pd.Series(y_test_2, name='Actual'), pd.Series(y_pred, name='Predicted'))
plt.figure(figsize=(5, 5))
sns.heatmap(cross, annot=True,fmt='d', cmap="YlGnBu")
plt.show()

To understand how a single tree in the random forest works, we print the rules of the 100th tree (the last one, given the default of 100 trees) to a file named Tree_output.txt. You can access the file by clicking the file browser in the left panel or by pressing ctrl + shift + f (on Windows) or command + shift + f (on Mac).

This gives us a text representation of the tree and helps us understand how the model makes decisions to detect cyber attacks. The rules in the tree can also serve as a reference for developing a rule-based system or for fine-tuning the model. The features that appear near the top of the tree also hint at the factors the model relies on most, which can be useful for further analysis and optimization.


# save the 100th tree sample in random forest in the file 
from sklearn.tree import export_text
feature_names = list(X.columns)

# Create a file and write to it
with open("Tree_output.txt", "w") as file:
    print(export_text(rf.estimators_[99],
                      spacing=3, decimals=2,
                      feature_names=feature_names), file=file)

Human Analysis

Alongside these automated methods, human analysis plays a critical role in identifying cyber attacks. Analysts use their expertise to interpret data and understand the context of an attack, and that requires a good understanding of the key variables in network data.

Correlations In The Dataset

To improve our understanding of the variables involved in cyber attack detection, we first need to analyze the network data. Correlation diagrams can be helpful in visualizing how different variables are associated with each other and with cyber attacks. Additionally, random forest models can help identify the importance of different features in predicting the target variable (cyber attacks). We can compare the feature rankings from the random forest with the results of the correlation analysis to gain a better understanding of the key features to focus on for effective cyber-attack detection.

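# creating the correlation matrix (numeric columns only)
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 10))
# mask the upper triangle so each pair of variables is shown once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, vmin=-1, vmax=1, cmap='BrBG', mask=mask)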

The heatmap visualizes the correlation between variables in the dataset. It shows that certain features are highly correlated, such as tcprtt with ackdat and synack. This is because these variables measure different aspects of the same TCP connection setup process. Specifically, tcprtt is the round-trip time it takes for the TCP connection to be established, while ackdat measures the time between the SYN_ACK and ACK packets, and synack measures the time between the SYN and SYN_ACK packets. Since these variables are all related to the same underlying process of establishing a TCP connection, they are highly correlated.
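
A tiny worked example shows why such variables move together. The sketch below computes the Pearson correlation by hand, assuming (as the UNSW-NB15 feature description suggests) that tcprtt is the sum of synack and ackdat. The timing values are toy numbers invented for this example, and `pearson` is our own helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy timings in milliseconds, invented for illustration
synack = [10, 20, 30, 40, 50]
ackdat = [20, 10, 40, 30, 60]
tcprtt = [s + a for s, a in zip(synack, ackdat)]  # assumed: sum of the two parts

print(round(pearson(tcprtt, synack), 2))  # 0.94
print(round(pearson(tcprtt, ackdat), 2))  # 0.96
```

Since tcprtt is built from the other two values, it cannot vary independently of them, and the heatmap shows this as a block of strong correlations.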

Let’s have a look at the correlation of variables with the cyber attack (label column):

The following variables are positively correlated with cyber attacks:

  • sttl: Source to destination time to live value. Attackers may use techniques such as packet fragmentation or tunneling to avoid detection or bypass security measures, which can increase the number of hops or decrease the TTL value. A higher value for sttl may be indicative of such techniques.
  • ct_state_ttl and state: These features reflect various stages of TCP connections and may be related to port scanning, SYN flood, or DDoS attacks. Attackers may exploit the state of TCP connections using different techniques, which may be reflected in the values of ct_state_ttl and state.
  • ct_dst_sport_ltm: This feature counts the connections that share the same destination address and source port among the most recent connections. Attackers may open many connections against the same target in a short time period to exploit vulnerabilities or attack a particular service or application, which may be reflected in a higher value for ct_dst_sport_ltm.
  • rate: This feature may represent various types of traffic rates or frequencies. Attackers may generate high traffic rates or bursts of traffic to overwhelm or bypass security measures, which may be reflected in a higher value for rate.

In contrast, the following variables are negatively correlated with cyber attacks:

  • swin: The size of the TCP window may decrease during an attack when attackers try to flood the network with traffic. A lower value for swin may be indicative of such attacks.
  • dload: A decrease in the download speed may be indicative of an attack that consumes network bandwidth, such as DDoS attacks or worm propagation. A lower value for dload may be reflective of such attacks.

Feature Ranking From Random Forest

The random forest provides a list of features based on their contributions to the prediction model. The feature ranking can be accessed through the RandomForest object (in our example rf) using feature_importances_ attribute.
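
The importances reported by scikit-learn are impurity-based: a feature scores highly when splitting on it makes the resulting groups purer. The sketch below shows that idea for a single split using the Gini impurity. The labels, feature values, and thresholds are toy numbers invented for illustration, and the helper names are our own.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(values, labels, threshold):
    """How much a split at `threshold` reduces the Gini impurity."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy data, invented for illustration (1 = attack)
labels = [0, 0, 1, 1, 1, 0]
sttl = [31, 31, 62, 254, 254, 31]  # separates the classes cleanly
noise = [1, 2, 1, 2, 1, 2]         # unrelated to the label

print(impurity_decrease(sttl, labels, 61))               # 0.5
print(round(impurity_decrease(noise, labels, 1.5), 3))   # 0.056
```

A forest accumulates such decreases over all splits and all trees and normalizes them, so a feature like sttl that separates the classes cleanly ends up near the top of the ranking.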

# creating of ranking data frame
feature_imp = pd.DataFrame({'Name':X.columns, 'Importance':rf.feature_importances_})

# sorting the features based on their importance value
feature_imp = feature_imp.sort_values('Importance',ascending=False).reset_index(drop=True)

# show only the 10 most important features, styled with a color gradient
feature_imp[:10].style.background_gradient()

# plot the important features
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh',color=['g','b']*5)

As we can see, the feature importance ranking is aligned with the correlation result. This highlights the importance of top features such as sttl, ct_state_ttl, rate, and dload.

Following is a brief description of some of these important features (a full list of features is available in the notebook file).

Let’s select only the top 10 features and find their associations with the type of cyber attack.

# get the names of top 10 features
top10= feature_imp.Name[:10].tolist()

# get the attack names
attack_names = np.array(df['attack_cat'].unique())

# selecting only the top 10 features, in the same order as the top10 list
X_top = df[top10]
# need to convert the categorical data into numbers (e.g. Normal -> 0, Backdoor -> 2)
y_top = pd.factorize(df['attack_cat'])[0]


# for the purpose of visualization we set max_depth to 6 in order to be shown in the notebook
clf_top10 = DecisionTreeClassifier(max_depth=6)

# Split the data into train and test sets
X_train_top, X_test_top, y_train_top, y_test_top = train_test_split(X_top, y_top, test_size=0.3, random_state=11)

# Train the model on the training data
clf_top10.fit(X_train_top, y_train_top)

# visualizing the tree
viz_model = dtreeviz.model(clf_top10,
                           X_train=X_train_top, y_train=y_train_top,
                           class_names=attack_names,
                           feature_names=top10)

v = viz_model.view(fancy=False,scale=1) # render as SVG into internal object
v
#v.save("The_100th_tree.svg") # uncomment if you want to save the visualization

For a better understanding, we can randomly select a point and visualize the path for prediction.

# get a random point from the top-10 feature table the tree was trained on
rand = np.random.randint(0, len(X_top))
sample_point = X_top.iloc[rand,:].values

# visualizing the path for the point
v = viz_model.view(fancy=True,scale=1.5,x=sample_point,show_just_path=True)
v

Please keep in mind that we use a simple decision tree for visualization (in the cells above); a random forest can outperform the decision tree in predicting the type of attack.
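
The idea of following one sample down a tree can also be sketched without any plotting library. The tree below is a hypothetical two-level tree written as nested dictionaries; its thresholds mirror the rule string used earlier in this article, and the `trace` helper and the sample values are invented for illustration.

```python
# Hypothetical two-level tree written as nested dicts
TREE = {
    "feature": "sttl", "threshold": 61.0,
    "left": {
        "feature": "sinpkt", "threshold": 0.0,
        "left": {"label": "attack"},
        "right": {"label": "normal"},
    },
    "right": {"label": "attack"},
}


def trace(tree, sample):
    """Walk from the root to a leaf and record every decision on the way."""
    path = []
    node = tree
    while "label" not in node:
        value = sample[node["feature"]]
        go_left = value <= node["threshold"]
        op = "<=" if go_left else ">"
        path.append(f"{node['feature']} = {value} {op} {node['threshold']}")
        node = node["left"] if go_left else node["right"]
    return path, node["label"]


path, label = trace(TREE, {"sttl": 31, "sinpkt": 4.2})
for step in path:
    print(step)
print("prediction:", label)
# sttl = 31 <= 61.0
# sinpkt = 4.2 > 0.0
# prediction: normal
```

Each printed line corresponds to one highlighted node in the path view, which makes it easy to explain an individual prediction to an analyst.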

Cyber Security for Cloud Services

We have only scratched the surface; as you start implementing your own system, you will inevitably encounter more complex issues. Fortunately, there are powerful cybersecurity tools available that you should consider.

The complexities of cybersecurity in cloud services include shared responsibility, data privacy, complex architecture, multi-tenancy, regulatory compliance, and vulnerability to attacks. To mitigate these risks, effective cybersecurity strategies must be in place.

Implementing cybersecurity measures for cloud computing can be particularly challenging for several reasons, such as:

  • Shared responsibility: In cloud computing, the responsibility for security is shared between the cloud provider and the customer, which can lead to confusion and a lack of clear ownership over security issues.
  • Complex architecture: Cloud environments typically have complex and dynamic architecture, making it difficult to implement and manage effective security controls.
  • Multi-tenancy: Cloud providers often use multi-tenant infrastructure, where multiple customers share the same physical and virtual resources. This can lead to security risks, such as the accidental or intentional exposure of one customer’s data to another.
  • Regulatory compliance: Organizations must comply with regulations such as the General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA), which can be difficult to achieve in a cloud environment.
  • Vulnerability to attacks: Cloud environments are vulnerable to attacks such as distributed denial of service (DDoS) attacks, malware, and unauthorized access, making it critical to implement appropriate measures to mitigate the risks.

Therefore, implementing effective cybersecurity measures in cloud computing requires a comprehensive and multi-layered approach to address these challenges and secure sensitive data and systems.

IBM QRadar

IBM Security® QRadar® Security Information and Event Management (SIEM) helps security teams detect, prioritize and respond to threats across the enterprise. As an integral part of your XDR and zero trust strategies, it automatically aggregates and analyzes log and flow data from thousands of devices, endpoints, and apps across your network, providing single, prioritized alerts to speed incident analysis and remediation. QRadar SIEM is available for on-premises and cloud environments.

Thanks for reading!

You can follow me on Medium or LinkedIn and stay tuned for more articles on Data Science, Machine Learning, and AI.

If you are interested in my project, here is my IBM skills network profile:

Sina Nazeri
The Power of AI

Data Scientist at IBM with broad ML skills: Classification, Clustering, CV, NLP, Generative AI. Strong academic background & research/work experience.