A Comparative Study of Algorithms for Phishing Website Classification — part 2

Mahesh Kumar SG
4 min read · Jul 14, 2023

In the previous part, we covered the exploratory data analysis; in this part, we will look at the performance of different algorithms on the phishing-classification task.

I tried five different classification algorithms for phishing classification:

  1. Logistic Regression.
  2. K-Nearest Neighbour.
  3. Random Forest.
  4. Kernel SVM.
  5. XGBoost Classifier.

Data scaling techniques like normalization or standardization are primarily used for numerical features that have a wide range of values. These techniques aim to rescale the values to a common scale to prevent any particular feature from dominating the learning process.

However, since our features consist only of the discrete values -1, 0, and 1, there is no need for scaling. Categorical or ordinal features already have a defined order or grouping, and their values carry a specific meaning; applying scaling would risk losing the inherent structure and meaning of the data.
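As a quick sanity check that scaling is unnecessary, one can inspect the distinct values across the feature columns. The DataFrame below is a tiny hypothetical stand-in for the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the UCI phishing features, where every
# column is already encoded as -1, 0, or 1.
df = pd.DataFrame({
    "having_IP_Address": [-1, 1, 1, -1],
    "URL_Length": [1, 0, -1, 1],
    "SSLfinal_State": [-1, 1, 0, 1],
})

# Collect the distinct values across all feature columns.
unique_values = set(np.unique(df.values))
print(unique_values)  # a subset of {-1, 0, 1}, so no rescaling is needed
```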

1. Logistic Regression

Logistic regression is a supervised learning algorithm commonly used for binary classification tasks. In this scenario, the goal is to predict whether a website is involved in phishing activities or not.

Hyperparameter tuning was done for Logistic Regression, and the best parameters were found to be:

Best parameters for Logistic Regression
{'C': 21.54434690031882, 'l1_ratio': 0.1, 'max_iter': 100, 'penalty': 'l1', 'solver': 'saga'}

The scores we got for Logistic Regression are:
AUC-ROC Score: 0.921479354548334
Accuracy Score: 0.9240162822252375
Precision Score: 0.9269442262372348
Recall Score: 0.9402390438247012
F1 Score: 0.9335443037974683

Logistic Regression performance was decent.
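A minimal sketch of refitting the model with these parameters and computing a few of the same metrics. The data here is synthetic; in the real project, X and y would be the 30 features and labels from the dataset in part 1. Note that scikit-learn only uses `l1_ratio` when `penalty='elasticnet'`, so it is omitted:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30 phishing features.
X, y = make_classification(n_samples=2000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Best parameters from the search (l1_ratio omitted; it only applies
# to penalty='elasticnet').
clf = LogisticRegression(C=21.54434690031882, penalty="l1",
                         solver="saga", max_iter=100)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(auc, acc, f1)
```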

2. KNN (K Nearest Neighbour)

The k-Nearest Neighbors (kNN) algorithm is a simple and intuitive non-parametric method used for both classification and regression tasks. For classification, it assigns a class label to a new data point based on the majority class among its k nearest neighbors in the feature space, on the assumption that similar data points tend to belong to the same class. The value of k, which represents the number of neighbors to consider, is a hyperparameter chosen based on the dataset and problem at hand.

Hyperparameter tuning was done for this algorithm.

Best parameters for KNN
{'n_neighbors': 7, 'p': 1, 'weights': 'distance'}
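With these values, `p=1` selects the Manhattan distance and `weights='distance'` gives closer neighbours a larger vote. A minimal sketch on synthetic stand-in data (the real features come from the dataset in part 1):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# p=1 -> Manhattan (L1) distance; weights='distance' -> neighbours are
# weighted by the inverse of their distance to the query point.
knn = KNeighborsClassifier(n_neighbors=7, p=1, weights="distance")
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
print(knn_accuracy)
```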

For KNN, the scores are:
AUC-ROC Score: 0.9588837120138692
Accuracy Score: 0.9611035730438715
Precision Score: 0.9569976544175137
Recall Score: 0.9752988047808765
F1 Score: 0.9660615627466456

3. Random Forest Classifier

Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. By training on a labeled dataset, where websites are categorized as phishing or non-phishing, the Random Forest model learns the relationships between these features and the target variable. This allows it to effectively classify new websites as either phishing or non-phishing based on the provided input features. The ensemble nature of Random Forest ensures robustness and generalization, making it a popular choice for classification tasks, including the identification of potential phishing websites.

Hyperparameter tuning was done on this model.

Best parameters for Random Forest Classifier
{'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 100}
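A sketch of the tuned model on synthetic stand-in data (in the real project, X and y would come from the phishing dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=10 limits each tree's depth; max_features='sqrt' makes each
# split consider a random subset of about sqrt(30) ~ 5 features, which
# decorrelates the trees in the ensemble.
rf = RandomForestClassifier(criterion="gini", max_depth=10,
                            max_features="sqrt", min_samples_leaf=1,
                            n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_accuracy = rf.score(X_test, y_test)
print(rf_accuracy)
```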

For the Random Forest Classifier, the scores are:
AUC-ROC Score: 0.9472528296854423
Accuracy Score: 0.9511533242876526
Precision Score: 0.9401381427475057
Recall Score: 0.9760956175298805
F1 Score: 0.9577795152462862

4. Support Vector Machine (SVM)

The Support Vector Machine (SVM) algorithm is a powerful supervised learning method commonly used for binary classification tasks. It works by finding an optimal hyperplane that separates the data into different classes while maximizing the margin between the hyperplane and the closest data points from each class. SVMs are particularly useful when dealing with high-dimensional feature spaces, making them suitable for classifying websites based on a set of 30 input features.

Hyperparameter tuning was done to find the best parameters.

Best parameters: {'C': 1, 'coef0': 1.0, 'gamma': 0.1, 'kernel': 'poly'}
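A sketch of the tuned model on synthetic stand-in data (the real X and y come from the phishing dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel='poly' with coef0=1.0 uses the kernel (gamma * <x, x'> + 1)^degree,
# with degree defaulting to 3; C=1 controls the soft-margin penalty.
svm = SVC(C=1, coef0=1.0, gamma=0.1, kernel="poly")
svm.fit(X_train, y_train)
svm_accuracy = svm.score(X_test, y_test)
print(svm_accuracy)
```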

For SVM, the scores are:
AUC-ROC Score: 0.9599051492773675
Accuracy Score: 0.9615558570782451
Precision Score: 0.9606299212598425
Recall Score: 0.9721115537848606
F1 Score: 0.9663366336633663

5. XGBoost

XGBoost, short for Extreme Gradient Boosting, is a powerful and widely used machine learning algorithm known for its exceptional performance in various domains, including classification tasks. The XGBoost classifier is an optimized implementation of the gradient boosting framework, which combines the strength of multiple weak prediction models, such as decision trees, to create a strong ensemble model. It leverages gradient boosting techniques to iteratively build new models that correct the mistakes made by the previous models, resulting in improved predictive accuracy.

Hyperparameter tuning was done to find the best parameters.

Best parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}

For XGBoost, the scores are:
AUC-ROC Score: 0.9637400190034838
Accuracy Score: 0.9656264133876075
Precision Score: 0.9623529411764706
Recall Score: 0.9776892430278884
F1 Score: 0.9699604743083005

XGBoost outperformed all other classification models.

In the end, we developed a Streamlit application that uses the XGBoost model to classify a website as phishing or not.

You can try the app here:

For testing, you can find sample data here: Test Data.

Link for the GitHub Repository: Streamlit-phishing-website

References:

  1. Mohammad, Rami and McCluskey, Lee. (2015). Phishing Websites. UCI Machine Learning Repository. https://doi.org/10.24432/C51W2X.
