vAlgo: The Malicious URL Model. A Machine Learning Project.
INFO:
The Dataset: URL Dataset SVM Light from UC San Diego.
The Algorithm: Decision Tree.
The Environment: Jupyter Notebook in an Ubuntu VM.
GitHub Repo: AsherNoor/vAlgo
This post is a little different from my normal posts: I’ve had requests to share my mistakes and how I solved them.
This post is a cleaned-up version of the notes I took while creating this model, which means there is a lot of back-and-forth movement between steps as I encounter errors and fix them. I tried my best to keep the information as clear and streamlined as possible.
I used the 7 steps of machine learning as a guide to further help keep track of where I am in the process. If you’re interested, I’ve written a previous post explaining the 7 steps in more detail.
But in short, these are the steps:
1: Import Data.
2: Clean the data.
3: Split the data.
4: Create the model.
5: Train the model.
6: Make a prediction.
7: Evaluate & Improve.
Alright, let’s do this.
THE LIBRARIES & FUNCTIONS
These are the libraries and pre-built functions that I imported into the environment.
Pandas: To read the files.
import pandas as pd
My algorithm of choice: Decision Tree
from sklearn.tree import DecisionTreeClassifier
The Train Test Split function.
from sklearn.model_selection import train_test_split
The Accuracy Score function.
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Now that we have imported what we need, let’s follow the 7-step process to make this model.
1 IMPORTING THE DATA
I downloaded the SVM Light dataset from the UCSD website.
It had multiple files, named and marked by Day.
Example: Day 0, Day 1, Day 2, etc.
In each of those files was a compiled list of Malicious and Benign URL data.
The Malicious were marked with “+1”.
The Benign were marked with “-1”.
The thing is, this was all in the same column as the URL data.
Example: [ +1 url data ] or [ -1 url data ]. Also, all the files were SVM files, not CSV files. Pandas was still able to read the SVM file, using the same command as for a CSV.
svm_data = pd.read_csv("Day0.svm")
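As an aside, scikit-learn also ships a dedicated loader for the SVM Light format. I didn’t use it here, but a minimal sketch would look like this:
# load_svmlight_file parses the format directly:
# X is a sparse feature matrix, y is the +1 / -1 label vector.
from sklearn.datasets import load_svmlight_file
X_sparse, y_labels = load_svmlight_file("Day0.svm")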
2 CLEANING THE DATA
I will be using the Decision Tree Classifier, which requires 2 parameters for training, “Input” & “Output” sets.
I needed to create a CSV file, and I wanted it to have 2 columns, to separate the data. One column for what I call the identifier “+1 / -1” and a 2nd column for its corresponding URL data. That way my CSV would look something like this:
[ ‘id’ , ‘url data’ ]
This will allow me to create the 2 parameters I need for the Decision Tree algorithm.
Luckily, the Pandas library can do both: separate the data into 2 columns AND save it as a CSV file. Looking at the Pandas examples, I noticed that the columns being split all have “names”; looking at my “Day0.svm” file, the single column does not have a name.
Any attempts to use Pandas to rename the column failed. I even tried the “df.columns” option to see if there was a name I had missed, but there wasn’t. Looks like I may have to do this part manually.
I opened up Day0.svm as a spreadsheet using LibreOffice. I added a top row and called it ‘mal_urls’. Saved the file as a CSV (might as well, since I can at this point).
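In hindsight, pandas can also assign a header while reading, which might have spared me the spreadsheet detour. A sketch, assuming the raw day-0 file (I haven’t verified this on the actual dataset):
# header=None tells pandas the file has no header row;
# names= supplies the column name up front.
csv_data = pd.read_csv("Day0.svm", header=None, names=["mal_urls"])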
Went back to the Jupyter notebook and created a new variable to read the CSV file I just created.
csv_data = pd.read_csv("Day0_1col.csv")
To check, I called for the columns.
csv_data.columns
And got this return:
Index(['mal_urls'], dtype='object')
Excellent. I tried some commands to split it into 2 columns, but they all gave me the same Syntax Error:
SyntaxError: cannot assign to literal.
I looked back at what df.columns returned and noticed (dtype='object'). I thought I might need to change it into a string, but even doing that didn’t work. Part of the problem was that I had used an equals sign incorrectly.
After some research, with a dash of trial and error, I found a command that worked. I assigned the csv data to a new variable.
df = csv_data
Then used this command.
df['id'], df['url_data'] = df['mal_urls'].str.split(' ', n=1).str
And it worked, but now I have 3 columns: ‘mal_urls’, ‘id’, and ‘url_data’.
mal_urls : has the original information.
id: has the identifier ‘+1 / -1’.
url_data: the URL data corresponding to the ‘id’.
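A caveat worth noting: the trailing .str unpacking trick has been deprecated in newer versions of pandas. If you are following along on a recent version, the expand=True form of split does the same job:
# expand=True returns a DataFrame with one column per split piece,
# so both new columns can be assigned in one go.
df[["id", "url_data"]] = df["mal_urls"].str.split(" ", n=1, expand=True)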
At this point, I need to do a couple of things:
(1) Drop the ‘mal_urls’ column.
• I was able to do that using this command:
df.drop(['mal_urls'], axis=1, inplace=True)
(2) Create a new CSV file with just the ‘id’ and ‘url_data’ columns.
• This command worked.
df.to_csv('day0-split.csv', index=False)
BUT the “df.to_csv” command didn’t work right away; it kept giving me an error. I looked up other examples of that command being used and noticed that the only difference in mine was the “df = csv_data” variable. That was the problem.
So I went back to the top, commented out the csv_data variable and created a new ‘df’ to read the old csv file:
df = pd.read_csv("Day0_1col.csv")
That way it’s all unified. I then ran all cells, and BOOM, my shiny new 2-column CSV was created successfully.
To make sure, I opened the file in LibreOffice AND ran the df.columns command; in both, I had only 2 columns, “id” and “url_data”. Looks like I’m all set.
A Breakdown of where we are right now.
I used LibreOffice to create a CSV with a title for the single column file.
Read the file: df = pd.read_csv("Day0_1col.csv")
I split the file: df['id'], df['url_data'] = df['mal_urls'].str.split(' ', n=1).str
I dropped the ‘mal_urls’ column: df.drop(['mal_urls'], axis=1, inplace=True)
Created a NEW 2-column CSV file: df.to_csv('day0-split.csv', index=False)
Read the NEW CSV file: df = pd.read_csv("day0-split.csv")
I checked it had only 2 columns: df.columns
df = pd.read_csv("Day0_1col.csv")
df['id'], df['url_data'] = df['mal_urls'].str.split(' ', n=1).str
df.drop(['mal_urls'], axis=1, inplace=True)
df.to_csv('day0-spit.csv', index=False)
df = pd.read_csv("day0-spit.csv")
df.columns
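If everything worked, that final df.columns call should return something like:
Index(['id', 'url_data'], dtype='object')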
Whoa! That was an adventure, but honestly, that was fun! I learned a few cool tricks that I will be sure to use in my next project.
Alright! Now we have a correct, working, 2-column CSV file. But we aren’t done with this step just yet!
CLEAN UP IN AISLE 2.
At this point I realized that before I could go any further, I had to clean up my Jupyter cells to make sure that I’m reading from the NEW CSV file to create my Input and Output sets.
I went back to Step 1 “Import data” and updated it to read my new CSV:
df = pd.read_csv("day0-split.csv")
CREATING THE INPUT / OUTPUT SETS.
We’ve already pointed ‘df’ at the new file; what we need now is to assign the 2 columns to 2 variables for the input and output.
‘X’ is the Input set.
‘y’ is the Output set.
This is how I did it:
X = df["id"]
y = df["url_data"]
Ran the cell, and no errors, so far so good.
On to step 3, splitting the data, finally.
3 SPLITTING THE DATA
I will be doing the recommended 80/20 split, using the Train Test Split function from the scikit-learn library (sklearn).
This one-liner splits everything up:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Ran the cell, and no errors.
Excellent.
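One optional tweak worth knowing about: passing random_state makes the split reproducible between runs, which helps when comparing changes later. A sketch (the 42 is an arbitrary seed):
# random_state pins the shuffle, so every run produces the same split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)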
Now we move on to creating the model.
4 CREATE THE MODEL
model = DecisionTreeClassifier()
Ran the cell, no errors.
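A side note for later: the empty parentheses mean the tree uses all default settings. The classifier does accept hyperparameters that step 7 could experiment with; a hedged sketch, with example values only:
# max_depth caps how deep the tree can grow (the default grows until the leaves are pure);
# random_state makes the fitted tree reproducible.
model = DecisionTreeClassifier(max_depth=10, random_state=42)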
My palms are getting sweaty, this is exciting!
5 TRAIN THE MODEL
I made sure I’m sending it all the “training” data.
model.fit(X_train, y_train)
Ran the cell, and got an error. I knew that was too easy.
A Value Error to be specific.
A trick I learned about Python is that when you get an error, read the last line first. This is what the last section said:
ValueError: Expected 2D array, got 1D array instead:
array=[-1. -1. -1. ... -1. 1. -1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Looking back at the upper parts of the error, it seems that the “X” parameter is not correct. It needs a 2D array, and I’m only giving it a 1D array, “id”. Luckily, the error also suggested a solution: for me to ‘reshape’ the array.
Looks like there are 2 reshaping options here: one if the data has a single feature, and one if it has a single sample.
I know the CSV has more than one sample, so I need to reshape for the ‘single feature’ option.
The example provided is: array.reshape(-1, 1).
Let’s see how we can make this adjustment to train the model.
In my research I came across an article called ‘Get Into Shape’ by Jeff Hale from Towards Data Science. One of the takeaways I got from it is this:
The predictive “X” data is expected to be in a 2D form.
The ‘y’ data is expected to be a 1D form.
This confirms that the issue is with my ‘X’ data variable, it needs to have ‘id’ and something else, or for me to reshape that array.
Now I return to Step 2 “Clean The Data”, because it’s this “X” that is causing the problem. Reshaping may require me to change it into a NumPy array, but Hale suggested passing the column as a list instead. This is how to do that:
X = df[["id"]]
Instead of X = df[“id”]: notice the double brackets around “id” in the new “X”.
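A quick way to see why the double brackets matter, straight from a notebook cell (the shapes shown are illustrative):
df["id"].shape    # (n,)   -> a 1D Series, the form 'y' expects
df[["id"]].shape  # (n, 1) -> a 2D DataFrame, the form 'X' expects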
Ran all the cells, and it worked.
Step 5 “Training the model” returned: DecisionTreeClassifier()
That was close.
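For completeness, the NumPy reshape route that the error message suggested would have looked something like this (an alternative I didn’t end up using):
# to_numpy() converts the column to a 1D NumPy array;
# reshape(-1, 1) turns it into a single-feature 2D array.
X = df["id"].to_numpy().reshape(-1, 1)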
Moving to the next step, making a prediction.
6 MAKE A PREDICTION
Here we make sure we use the test input, X_test, since that is new data the model has not seen. Then we check the accuracy of the prediction.
The prediction:
prediction = model.predict(X_test)
The accuracy, using the Accuracy Score function:
score = accuracy_score(y_test, prediction)
I wanted to see the score as a percentage, so I used this print option:
print("{:.2%}".format(score))
prediction = model.predict(X_test)  # predict labels for the unseen test set
score = accuracy_score(y_test, prediction)  # compare predictions to the true labels
print("{:.2%}".format(score))  # show the score as a percentage
Ran the cell, and got an answer.
0.03% accurate. So, very bad.
Well, I’m still all smiles.
Because I got it to work.
7 EVALUATE & IMPROVE
Now that I have a working (and very inaccurate) model, it’s time to experiment some more and see what else I can do to get that accuracy score up.
It could be adding more data to the dataset, trying a different algorithm altogether, or incorporating AdaBoost or XGBoost into the current model to see if that helps.
Honestly I’m excited to try all the above, and I will make sure to blog about those as well.
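To give a flavor of that last idea, here is a minimal AdaBoost sketch showing how it would slot into the same train/test variables. It is untested on this dataset and uses scikit-learn’s defaults:
# AdaBoost fits a sequence of small trees, each one paying extra
# attention to the samples the previous trees misclassified.
from sklearn.ensemble import AdaBoostClassifier
boosted = AdaBoostClassifier(n_estimators=100)
boosted.fit(X_train, y_train)
print("{:.2%}".format(accuracy_score(y_test, boosted.predict(X_test))))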
I hope the post was not too discombobulating, and that you found it helpful.
Thanks for reading.
Let’s code something cool.
Ash, The Machine Learner.
Support The Project.
Buy me a coffee | Become my GitHub Sponsor | Become a Patreon