Malicious webpage classification using machine learning

Girievasan
Published in Analytics Vidhya · 7 min read · Apr 17, 2021

Introduction

ML-based applications have been on the rise for a long time, and we see them in almost every area: predicting scores, recommender systems, stock markets, and the list goes on. We can also use machine learning to classify webpages as malicious or benign. This makes us aware of dangerous websites and helps keep our information from being stolen. In the next section, I explain the algorithm and code used for webpage classification.

Code explanation

importing necessary libraries

This block of code imports the necessary libraries and functionality required for preprocessing our dataset and predicting whether the webpage URLs are malicious or not.

· Pandas: required for loading the dataset from its directory, for the necessary preprocessing, and for segregating the train data, test data, and test labels.

· Matplotlib: required for data visualization, to understand the interesting trends underlying the data and to process it accordingly.

· re: short for regular expressions, which we use during preprocessing of our dataset to remove stop words, dots, 'www', etc.

· train_test_split: we import this to split our data into proper proportions of train and test data. The most commonly recommended proportion is 80% train and 20% test.

· TfidfVectorizer: TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common algorithm for transforming text into a meaningful numeric representation, which is then used to fit a machine learning model for prediction.

· CountVectorizer: used to convert a collection of text documents into a vector of term/token counts. It also enables preprocessing of the text data prior to generating the vector representation.

· LogisticRegression, MultinomialNB: ML model classes from the sklearn library that will be used to predict whether a webpage is malicious or not.

· confusion_matrix, classification_report: we use these to understand the performance of our ML algorithm over the dataset and to make the necessary amendments to improve it.
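The import block described above (shown as an image in the original article) would look roughly like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
```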

splitting of dataset

This block of code loads the dataset from the directory. We then pick one entry of the dataset's 'URLs' column as a test URL, to later check how well the preprocessing works. In the subsequent lines we use train_test_split from the sklearn library to split the dataset into 80% train and 20% test data. We assign the variable 'labels' to the 'class' column of the train data, and 'test_labels' to the 'class' column of the test data. Then we print the number of train and test samples in the console like this:
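A minimal sketch of this splitting step; the in-memory DataFrame here stands in for the article's CSV file, and the 'URLs'/'Class' column names are assumptions based on the text:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the article's dataset (normally: url_df = pd.read_csv(...))
url_df = pd.DataFrame({
    "URLs": [f"www.site{i}.com/page{i}" for i in range(10)],
    "Class": ["good", "bad"] * 5,
})

test_url = url_df["URLs"][0]  # one URL kept aside to inspect preprocessing later

# 80% train / 20% test split
train_df, test_df = train_test_split(url_df, test_size=0.2, random_state=42)

labels = train_df["Class"]
test_labels = test_df["Class"]

print("Number of training samples:", len(train_df))
print("Number of testing samples:", len(test_df))
```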

plotting data trends

This block of code visualizes the number of good and bad URLs in the train data and test data respectively. We use pandas functionality to do the counting: the upper portion of the block counts the good and bad URLs in the train data, while the lower portion does the same for the test data. The counting is carried out with 'value_counts'. Using '.plot' we depict the counts as a bar graph, passing 'kind="bar"' to select the chart type and parameters such as fontsize and rotation to style the headings. We do the same for the test data as well. This is the representation we get:

Test dataset
Train dataset
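The plotting block can be sketched as follows (the article calls 'pd.value_counts'; the 'Series.value_counts' method used here is equivalent and not deprecated, and the counts are hypothetical):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Stand-ins for the train/test splits from the previous step
train_df = pd.DataFrame({"Class": ["good"] * 8 + ["bad"] * 2})
test_df = pd.DataFrame({"Class": ["good", "bad", "good"]})

# Count and plot good vs. bad URLs in the training data
count_train = train_df["Class"].value_counts()
count_train.plot(kind="bar", fontsize=16, rot=0)
plt.title("Class distribution (train)", fontsize=20)
plt.savefig("train_counts.png")
plt.close()

# Same for the test data
count_test = test_df["Class"].value_counts()
count_test.plot(kind="bar", fontsize=16, rot=0)
plt.title("Class distribution (test)", fontsize=20)
plt.savefig("test_counts.png")
plt.close()
```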
preprocessing of url

We define a function for carrying out the necessary preprocessing of the text present in the URL.

· tokens = re.split('[/-]', url)

This line of code splits the URL wherever there is a '/' or '-'. Next, we have a for-loop that iterates over each of the resulting tokens.

· if i.find(".") >= 0: dot_split = i.split('.')

This line of code checks for dots in a token; if one is found, it splits the token at every dot, separating parts such as the domain name and its extension.

· if "com" in dot_split:
      dot_split.remove("com")
  if "www" in dot_split:
      dot_split.remove("www")

This block removes tokens such as 'www' and 'com' from the pre-processed URLs, returning only the name of the webpage, since these parts don't add any context.
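Putting the fragments above together, the whole preprocessing function looks roughly like this (the function name and the handling of tokens without dots are assumptions filled in around the quoted lines):

```python
import re

def tokenizer(url):
    """Split a URL into word-like tokens, dropping 'com' and 'www'."""
    tokens = re.split('[/-]', url)   # split on '/' and '-'
    total_tokens = []
    for i in tokens:
        if i.find(".") >= 0:
            dot_split = i.split('.')
            # 'www' and 'com' carry no signal, so drop them
            if "com" in dot_split:
                dot_split.remove("com")
            if "www" in dot_split:
                dot_split.remove("www")
            total_tokens += dot_split
        else:
            total_tokens.append(i)
    return total_tokens
```

For example, `tokenizer("www.example.com/login")` yields `["example", "login"]`.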

To test how well the preprocessing works, we call the function we defined in the last block on the test URL we declared earlier. We then introduce two functionalities: CountVectorizer and TfidfVectorizer. Tokenization is the process of splitting textual data into tokens (and dropping noise such as stop words) so that it is fit for predictive modelling. CountVectorizer converts a collection of text documents into a vector of term/token counts, while TfidfVectorizer transforms the textual data into numbers the ML model can understand, weighting each term by how informative it is.

We apply both vectorizers to the train and test data. Here, fit_transform() learns the vocabulary (and, for TF-IDF, the document-frequency statistics) from the training data and transforms it in one step, while transform() reuses what was learned from the training data to transform the test data. Using transform() on the test set ensures the test data is encoded with exactly the same vocabulary and statistics as the training data.
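A sketch of the vectorization step; the sample URLs are hypothetical, and for simplicity the default analyzer is used rather than the custom tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_urls = ["www.example.com/login", "badsite.ru/steal-info"]
test_urls = ["www.example.com/search"]

# CountVectorizer: raw token counts
count_vec = CountVectorizer()
count_train = count_vec.fit_transform(train_urls)  # learn vocabulary on train
count_test = count_vec.transform(test_urls)        # reuse it on test

# TfidfVectorizer: counts reweighted by inverse document frequency
tfidf_vec = TfidfVectorizer()
tfidf_train = tfidf_vec.fit_transform(train_urls)
tfidf_test = tfidf_vec.transform(test_urls)
```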

generating prediction report

To generate a report on the performance of the ML algorithm over the dataset, we define a function that draws the confusion matrix as a heat map. The confusion matrix tells us about recall and precision, and shows how many false negatives, false positives, true negatives, and true positives the model produced. In the title we print the score, i.e. the accuracy of our model, as shown below.
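A minimal sketch of such a report function (the function name and plot styling are assumptions; the article's version may differ):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

def plot_confusion(y_true, y_pred, path="confusion.png"):
    """Draw the confusion matrix as a heat map, with accuracy in the title."""
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    # Annotate each cell with its count
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha="center", va="center")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
    fig.savefig(path)
    plt.close(fig)
    return cm
```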

defining an object for the ML model

After all the necessary preprocessing, our data is ready to be fed into the ML model. We use Multinomial Naïve Bayes and Logistic Regression as our models to predict whether a URL is malicious or not. We instantiate MultinomialNB and fit it on the vectorized URLs as input and the labels as the expected output. Then we generate a score that tells us the accuracy of our model, predict whether the URLs are malicious or not, and finally call 'classification_report' to generate the corresponding report. We do the same for Logistic Regression as well.
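The fit-score-report sequence can be sketched like this, using a tiny hypothetical dataset in place of the article's vectorized URLs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_urls = ["www.example.com/login", "badsite.ru/steal-info",
              "www.news.com/today", "phish.tk/free-money"]
labels = ["good", "bad", "good", "bad"]
test_urls = ["www.example.com/search", "scam.tk/free-prize"]
test_labels = ["good", "bad"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_urls)
X_test = vec.transform(test_urls)

# Multinomial Naive Bayes: fit, score, predict, report
mnb = MultinomialNB()
mnb.fit(X_train, labels)
print("MNB accuracy:", mnb.score(X_test, test_labels))
print(classification_report(test_labels, mnb.predict(X_test)))

# Logistic Regression: same sequence
lr = LogisticRegression()
lr.fit(X_train, labels)
print("LR accuracy:", lr.score(X_test, test_labels))
print(classification_report(test_labels, lr.predict(X_test)))
```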

Code Conclusion

Report for Multinomial Naive Bayes
Report for Logistic regression.

Balancing between Recall and Precision

In cases such as credit card fraud detection or tumor classification, even a single misclassified case is a serious problem, since a lot of money or a life may be put in danger. In such cases we need to strike the right balance between recall and precision. We need a high recall score, as it tells us what fraction of the actual positives in the dataset our model correctly identified as true positives. The F1 score lets us balance precision and recall in a single metric.
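A small worked example of the three metrics, treating 'bad' as the positive class (the labels here are made up for illustration); F1 is the harmonic mean of precision and recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["bad", "bad", "good", "good", "bad"]
y_pred = ["bad", "good", "good", "good", "bad"]

p = precision_score(y_true, y_pred, pos_label="bad")  # TP / (TP + FP)
r = recall_score(y_true, y_pred, pos_label="bad")     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label="bad")        # 2*p*r / (p + r)

print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here the model catches 2 of 3 'bad' URLs (recall 2/3) with no false alarms (precision 1.0), giving an F1 of 0.8.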

Do hit the like button if you found this interesting :)
