MALICIOUS URL DETECTION WITH MACHINE LEARNING
Written by Jad Hamdouch & Ismael Bouarfa.
Introduction: Anatomy of a URL
URLs allow Internet users to navigate from one website to another. They represent access to content stored on servers somewhere in the world. A URL can be reached by a simple click on a link or an image, or by typing it directly into the browser. Here is the anatomy of a URL:
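As a quick illustration, Python's standard urllib.parse can split a URL into these components (the URL below is a made-up example):

```python
# Anatomy of a URL with Python's standard library.
from urllib.parse import urlparse

url = "https://www.example.com:443/path/page.html?q=test#section"
parts = urlparse(url)

print(parts.scheme)    # protocol: 'https'
print(parts.hostname)  # domain name: 'www.example.com'
print(parts.port)      # port: 443
print(parts.path)      # path to the resource: '/path/page.html'
print(parts.query)     # query string: 'q=test'
print(parts.fragment)  # fragment: 'section'
```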
The favorite technique of attackers and script kiddies is social engineering, because ordinary users still click on almost any link or URL they receive. Blacklisting known bad URLs is a basic but essential first layer of security.
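A blacklist check can be as simple as a set lookup on the hostname. This is only a sketch; the blacklist entries below are made up:

```python
# Minimal sketch of URL blacklisting (the blacklist entries are hypothetical).
from urllib.parse import urlparse

blacklist = {"malware-site.example", "phishing.example"}

def is_blacklisted(url: str) -> bool:
    # Extract the hostname and check it against the blacklist.
    host = urlparse(url).hostname or ""
    return host in blacklist

print(is_blacklisted("http://malware-site.example/login"))  # True
print(is_blacklisted("https://www.python.org/"))            # False
```

The obvious weakness, and the motivation for the machine learning approach that follows, is that a blacklist only catches URLs it has already seen.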
However, can we use machine learning to help us predict if a URL is malicious or not? Deploying new solutions such as artificial intelligence (AI) for infosec is a smart way to help IT professionals to detect breaches or intrusions and improve the ability to anticipate attacks as well as reducing delays in resolving incidents.
We decided to give it a try: we followed some tutorials, read some blogs, and tried a few machine learning algorithms to predict whether URLs were malicious or legitimate. The outcome was very interesting and gave pretty good results. Here is how we did it:
- Step 1: Data exploration and sanitization
- Step 2: Model Selection and applying Natural Language Processing
- Step 3: Model Training
- Step 4: Predictions
Step 1: Data exploration and sanitization
Our objective is to classify URLs given as inputs and predict whether they are dangerous or harmless. We selected good as the label for legitimate URLs and bad for malicious ones. We'll train our model on a dataset of URLs (as text), already labeled and stored in a CSV file.
We found an interesting dataset composed of 60K URLs.
import numpy as np
import pandas as pd
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
After importing NumPy, pandas and all our friends, we can explore the data and do some basic data sanitization (detect and delete NaN values).
df = pd.read_csv('test.txt', sep=',', error_bad_lines=False)
df = pd.DataFrame(df)
df = df.sample(n=10000)
from io import StringIO
col = ['label','url']
df = df[col]
# Deleting nulls
df = df[pd.notnull(df['url'])]
# more settings for our data manipulation
df.columns = ['label', 'url']
df['category_id'] = df['label'].factorize()[0]
category_id_df = df[['label', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'label']].values)

As we can see, we have a dataset composed of good and bad URLs. Now, how can we check whether we have more good URLs than bad ones in the dataset? The following code creates a simple bar chart of the class proportions.
import matplotlib.pyplot as plt
%matplotlib inline
BAD_len = df[df['label'] == 'bad'].shape[0]
GOOD_len = df[df['label'] == 'good'].shape[0]

plt.bar(10, BAD_len, 3, label="BAD URL")
plt.bar(15, GOOD_len, 3, label="GOOD URL")
plt.legend()
plt.ylabel('Number of examples')
plt.title('Proportion of examples')
plt.show()

As expected, we have more good URLs than bad ones. This class imbalance is worth keeping in mind: the model can learn more from the good URLs, which helps it avoid over-generalizing from words present in bad URLs and producing false positives.
Step 2: Model Selection and applying Natural language processing
Now that we have our dataset we’ll choose an approach:
What is NLP? It is the science of programming computers to process and analyze natural language. Some large and mature applications are speech recognition, translation, text classification, and so on. As URLs are full of words (domain name, path, file, extension…), we wanted to try it!
As learning algorithms work with numerical features, we need to convert words into numerical vectors. How does it work? The first entry-level technique is called "Bag of Words".
In Bag of Words, a good classifier detects patterns in the word distribution: which words occur, and how many times, for each kind of text. However, raw word counts are not always the best idea. As shown previously, a URL can be very long, which adds bias to our predictions.
The following code allows you to represent for each different length the number of URLs.
import matplotlib.pyplot as plt
%matplotlib inline
lens = df.url.str.len()
lens.hist(bins = np.arange(0,300,10))
# Tokenizer function for URLs by Faizan Ahmad, CEO of FSecurify
def getTokens(input):
    tokensBySlash = str(input).split('/')  # split by slash
    allTokens = []
    for i in tokensBySlash:
        tokens = str(i).split('-')  # split by dash
        tokensByDot = []
        for j in range(0, len(tokens)):
            tempTokens = str(tokens[j]).split('.')  # split by dot
            tokensByDot = tokensByDot + tempTokens
        allTokens = allTokens + tokens + tokensByDot
    allTokens = list(set(allTokens))  # remove duplicates
    if 'com' in allTokens:
        allTokens.remove('com')  # 'com' occurs so often it carries no signal
    return allTokens

We chose to skip Bag of Words and use another technique: we can improve our model using Term Frequency (TF) and Inverse Document Frequency (IDF):

This technique downscales the influence of words that appear very frequently across the corpus and therefore carry little discriminative information:
vectorizer = TfidfVectorizer(tokenizer=getTokens, use_idf=True, smooth_idf=True, sublinear_tf=False)
features = vectorizer.fit_transform(df.url).toarray()
labels = df.label
features.shape
Our method may offend NLP purists, since we skipped some important basic steps:
- Removing punctuations
- Removing stop words
- Stemming (reduce a word to its stem form: removing the ‘ing’, ‘ly’, and ‘s’)
- Data Lemmatizing (coming back to the root form of the words…)
Malicious attackers play with exactly this kind of slight modification to make a URL look legitimate. We also need every abbreviation to remain unchanged (e.g. stemming away the 's' of the https protocol would erase the information that it stands for Secure HTTP).
Step 3: Model Training
Logistic regression is used to model the probability of a certain class. Nowadays it is used in many fields, from predicting the risk of developing a disease based on test results to classifying whether an image contains a rose or a cactus.
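In a nutshell, logistic regression squashes a linear score through the sigmoid function to get P(class = 1 | x) = 1 / (1 + e^-(w·x + b)). A minimal sketch on made-up 1-D data:

```python
# Logistic regression toy example: one feature, two classes (data is made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # class flips between x=2 and x=3

model = LogisticRegression().fit(X, y)

# predict_proba returns P(class 0) and P(class 1) for each input.
proba = model.predict_proba([[2.5]])[0]
print(proba)  # roughly [0.5, 0.5]: x=2.5 sits on the decision boundary
```

The same predict_proba output is what lets us rank URLs by how confident the model is, rather than only getting a hard good/bad verdict.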
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model = LogisticRegression(random_state=0)
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.20, random_state=0)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)
y_pred = model.predict(X_test)

clf = LogisticRegression(random_state=0)
clf.fit(X_train,y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train accuracy =', train_score)
print('test accuracy =', test_score)
In the code above, we divided our dataset: the training set is used to fit the parameters and feed our model, and the test set then gives an unbiased evaluation on unseen data. The scores are good:
train accuracy = 0.843375
Based on the real class of the labels and the predicted class, we can introduce the following concepts from the confusion matrix: Type I and Type II errors. As you can imagine, when classifying examples into one category or another, two types of errors are possible:
- A Type I error occurs when something true is rejected; it is also known as a false positive or false alarm.
- A Type II error occurs when something false is accepted; it is a false negative.
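A toy confusion matrix makes the two error types concrete (the labels below are made up and unrelated to our dataset):

```python
# Toy confusion matrix: rows are actual classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_true = ["bad", "good", "good", "bad", "good"]
y_pred = ["bad", "good", "bad",  "bad", "good"]

cm = confusion_matrix(y_true, y_pred, labels=["bad", "good"])
print(cm)
# [[2 0]   <- actual 'bad':  2 caught, 0 missed (misses would be Type II errors)
#  [1 2]]  <- actual 'good': 1 false alarm (a Type I error), 2 correct
```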
from sklearn.metrics import confusion_matrix
import seaborn as sns

conf_mat = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.label.values, yticklabels=category_id_df.label.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
As shown, 1652 bad URLs have been correctly predicted as malicious. However, 261 legitimate URLs have been identified as malicious. These "false alarms" are Type I errors, and they are one reason why monitoring and human supervision remain necessary when applying machine learning to cybersecurity.

Step 4: Predictions
And then we made our predictions:
X_predict = ['yahoo.fr', 'www.radsport-voggel.de/wp-admin/includes/log.exe', 'hello.ru']
X_predict = vectorizer.transform(X_predict)
y_Predict = clf.predict(X_predict)
print(y_Predict)
The predictions are good: 'yahoo.fr' and 'hello.ru' are classified as good, while the strange URL ending with an .exe file (an executable) is classified as dangerous.
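Since the classifier also exposes probabilities through predict_proba, a natural refinement is to route uncertain predictions to a human analyst rather than trusting a hard good/bad verdict. A sketch, with made-up URLs, probabilities and thresholds:

```python
# Triage sketch: act on the model's confidence, not just its verdict.
# The URLs, probabilities, and thresholds below are all hypothetical.
import numpy as np

urls = ["yahoo.fr", "hello.ru", "suspicious.example/log.exe"]
proba_bad = np.array([0.02, 0.10, 0.55])  # hypothetical P(bad) per URL

for url, p in zip(urls, proba_bad):
    if p >= 0.90:
        verdict = "block"             # confidently malicious
    elif p >= 0.30:
        verdict = "send to analyst"   # uncertain: needs human review
    else:
        verdict = "allow"             # confidently legitimate
    print(url, "->", verdict)
```

This kind of thresholding is one way to keep humans in the loop, which matters given the false alarms discussed above.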
Conclusions
As we saw in our tutorial, we still have 261 false alarms that should be checked by security professionals. The idea is not to replace security professionals with AI but to improve performance across many aspects:
- Detection: check that it is not a false positive and escalate to the appropriate response team.
- Investigation: forensic analysis and identification of the impact.
- Response: find a solution and communicate about it.
- Mitigation: propose a security action plan to prevent recurrence.
And to make all of that work in harmony, it is essential to have a good monitoring approach.
AI and machine learning techniques are already used in finance, psychology, and economics, and will soon be present in many infosec processes. Cybersecurity and the many subsets of AI and computer science can work together to create intelligent and effective solutions to new threats. Automated detection of bad URLs based on machine learning rather than hand-written rules can be a small piece of this puzzle. However, machine learning is not a magic solution, and it comes with threats of its own.
Recommended websites
- Wheregoes: to see where the URL is taking you without having to click on it.
- Url2Png: to see a screenshot of a website without having to visit it.
- VirusTotal: Our favorite to analyze suspicious URLs
