Detecting Malicious URLs

A comprehensive guide using ML and NLP techniques

Data Dive
SFU Professional Computer Science
10 min read · Feb 12, 2022


Authors: Viddhi Lakhwara, Saurabh Singh, Geethika Choudary

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

Background and Motivation

Digital technologies have advanced faster than we could have fathomed, reaching almost half of the developing world's population in less than two decades. Although they have enhanced our quality of life, being oblivious to how we use them can lead to unwanted repercussions. Heedlessly scrolling through websites and clicking weaponized links can cause more damage than you may think.

The internet has opened up a world of knowledge and opportunity, giving people unprecedented access to information and a global audience. However, it has also become a hotspot for malicious activity, including cybercrime. One of the most common tools criminals use to conduct their online crimes is the malicious URL (Uniform Resource Locator), which plays a part in roughly 60% of cyber-attacks.

What do we mean by a malicious URL and why is it dangerous?

A malicious URL is a link created with the intent of promoting scams and fraud. Clicking on such a link can download malware that compromises your machine or network.

These URLs are well-known threats in cybersecurity and act as an efficient tool for propagating viruses, worms, and other types of malware online. They can be delivered via email links, text messages, browser pop-up ads, and more. Some even masquerade as legitimate links or websites, tricking unsuspecting users into infecting their machines. Eventually, this allows the attacker to access sensitive data such as banking information, social security numbers, and email passwords, making it a priority for cyber defenders to detect and stop these URLs efficiently.

An efficient way of steering clear of this extensive racket is by blocking or blacklisting such URLs, which is precisely what we aim to help with.

What we aim to do:

Our strategy is to devise an efficient system that classifies web URLs by extracting features from the link structure itself. In this approach, we use N-grams to mine the URL text and then classify it using SVMs (Support Vector Machines).

Sounded like gibberish? Let’s fix that by understanding a few concepts. Buckle-up. 🎢

Overview of Concepts

Before diving deeper into how we attempt to tackle this problem, let us understand the anatomy of a URL, along with some other related theory:

N-Gram Model

The N-Gram model is a popular model for text pre-processing. A traditional word-to-vector approach vectorizes a string one word at a time and can therefore miss the context of a sentence: the word "apple", for example, can refer to either the company or the fruit. N-grams were introduced precisely to address this, taking N adjacent words or characters at a time.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech.

Here, an item can be a letter, a word, or a syllable depending on the application. Based on its size, an n-gram is called a unigram (N = 1), a bigram (N = 2), or a trigram (N = 3).

When used for language modeling, the model assumes that each item depends only on the previous n − 1 items. We use N = 3 in our model, meaning we will be working with trigrams.
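
For instance, here is a minimal sketch of extracting character trigrams from a string in plain Python (the function name is ours):

```python
def char_trigrams(text, n=3):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_trigrams("paypal"))
# ['pay', 'ayp', 'ypa', 'pal']
```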

Tokenization

Tokenization is the act of breaking a stream of input text into smaller pieces of strings called tokens. Tokens are useful units for semantic processing and can comprise words, phrases, symbols, etc.

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters.

Tokens are the first step toward stemming and lemmatization. They are important because models, in general, have no knowledge of human language; to learn from text, they need the structure of that language made explicit.

There are different ways to tokenize a sentence: by words, by sub-words, or by individual characters, for example.

In our model, we will be using character tokens, drawn from the set of all letters and digits.
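
As a quick illustration, here is a hedged sketch of character-level tokenization of a URL, keeping only letters and digits as described above (the sample URL is made up):

```python
import re

def char_tokenize(url):
    """Keep only lowercase letters and digits as character tokens."""
    return re.findall(r"[a-z0-9]", url.lower())

print(char_tokenize("http://evil-site.ru/login"))
# ['h', 't', 't', 'p', 'e', 'v', 'i', 'l', 's', 'i', 't', 'e', 'r', 'u', 'l', 'o', 'g', 'i', 'n']
```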

Vectorization

Vectorization is the process of converting a raw input string into a numerical representation stored as a vector. Here, the conversion is achieved by tokenizing the input and applying an n-gram bag-of-words technique. The power of vectorization is that it represents text as numerical values that can be used for a wide range of applications.

Let’s say we have text reviews of a movie, of varying lengths, provided by customers. By converting each review from text to numbers, we can represent every one of them as a vector of the same fixed length, irrespective of the text length.

In this representation, each column of the vector corresponds to a word, and the value in each cell counts how many times that word occurs in a sentence.

Before we dive into vectorization, the text must be pre-processed: removing punctuation, converting all words to lowercase, and dropping unnecessary words that add no value.

The first step is to find the vocabulary of unique words. In our movie-review example, the vocabulary is:

[This, movie, is, good, sure, love, I, am, you, watch, it, will, like, too]

We have 14 unique words, so each movie review is represented by a 14-dimensional vector whose values count the occurrences of each word in that review. The sketch below reproduces this counting.
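
Here's a minimal sketch of that bag-of-words counting with scikit-learn's CountVectorizer; the two reviews are hypothetical stand-ins for the ones in the original table, chosen so that they yield the same 14-word vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is good, I am sure you will love it",
    "You will like this movie too, watch it",
]

# override the default token pattern so one-letter words like "I" survive
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the 14-word vocabulary
print(X.toarray())                         # one 14-dimensional count vector per review
```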

Classification

Support Vector Machines (SVMs) are a class of machine learning algorithms widely used for text classification. They work by finding the hyperplane that separates the two classes with the largest possible margin, which often yields better classification accuracy than simpler classifiers such as Naïve Bayes. SVMs are used in image classification, text classification, speech recognition, and many other fields.


Why did we choose SVMs?

Supervised machine learning algorithms, such as SVMs, can be used to detect malicious URLs by training them on a large dataset of known malicious URLs. Once the SVM has been sufficiently trained, it can be used to detect new malicious URLs with minimal human intervention.

SVMs can also use the kernel trick: the data is implicitly mapped into a richer feature space, so that problems which are not linearly separable in the original space can still be solved with a linear decision boundary in the lifted space.
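
To make the kernel trick concrete, here is a tiny sketch: four XOR-style points that no straight line can separate become separable once an RBF kernel lifts them implicitly:

```python
from sklearn.svm import SVC

# XOR-style data: not linearly separable in two dimensions
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = SVC(kernel="rbf", gamma=2.0)  # the RBF kernel lifts the data implicitly
clf.fit(X, y)
print(clf.predict([[0, 1], [1, 1]]))  # [1 0]
```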

To summarize, the pipeline below walks through how we will be utilizing the above concepts to tackle our problem statement.

ML Libraries:

These are some of the ML libraries we will be using:

  • Scikit-learn: features multiple classification and regression algorithms, including the support vector machines we will be utilizing.
  • NumPy: adds support for large multi-dimensional arrays and matrices, along with high-level mathematical functions that operate on them.
  • Pandas: used for data manipulation and analysis; it offers operations for manipulating numerical tables and time series.
  • NLTK: the Natural Language Toolkit contains a set of libraries and programs for natural language processing (NLP) of English text. We will use it for n-grams, tokenization, etc.

Data Science Pipeline

Our pipeline can be divided into four main steps:

Step 1: Data Sources and Collection

Here are a few sources that we used for collecting the dataset of malicious links:

  • Phishtank: Phishtank is a service website dedicated to sharing phishing URLs. Suspicious URLs can be submitted to Phishtank for verification, and its data is updated hourly.
  • URLhaus: URLhaus is a project from abuse.ch that aims at sharing hostile URLs being used for illegal software distribution.
  • Malicious_n_Non-Malicious URL: This is a data source that contains more than 400,000 labeled URLs. In this database, 82% of all URLs are safe, while the remaining 18% are malicious.

Step 2: Data Cleaning, Data Pre-Processing, and Feature Extraction

In this stage, we convert each input URL string into numerical values, since a raw string cannot be fed to the classifier (SVM). For this conversion we use techniques such as tokenization and vectorization with the N-Gram model (for example, TF-IDF as implemented in scikit-learn).
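
One hedged way to realize this step is with scikit-learn's TfidfVectorizer over character trigrams, mirroring the N = 3 choice above (the exact parameters in our notebook may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# character-level trigrams, matching the N = 3 choice made earlier
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=True)

urls = ["google.com/search?q=cats", "paypa1-secure-login.ru/verify"]  # made-up examples
X = vectorizer.fit_transform(urls)  # sparse TF-IDF matrix, one row per URL
print(X.shape)
```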

Step 3: Model Selection

We considered the following candidate pairs for our final model:

  • All-grams with a Naïve Bayes classifier
  • N-grams with a Naïve Bayes classifier
  • N-grams with an SVM classifier

After comparing and deliberating, we decided to go with “N-Gram with SVM Classifier” since it gave us the best results.

Step 4: Training and Prediction

After pre-processing, the feature dataset is ready. The next stage is training, for which we use SVMs since they provide good accuracy on large datasets like ours.

Once the model is trained, we obtain the parameter values that minimize the cost function, and the model is ready to make predictions. We can feed an input URL to the model (after pre-processing) and it will predict whether the link is malicious.

Implementation

Now that we’ve studied all the concepts and workflow in-depth, we can finally move on to the core of this article, the implementation. 🥳

1. Installation and dependencies: we start by importing all the libraries we will need:
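
A plausible import block covering the libraries listed earlier (the original notebook may import more or fewer):

```python
import itertools
import re
import string

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
```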

2. Creating all the possible combinations of n-grams using lowercase alphabets and digits and storing them in a dictionary:
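
A sketch of that dictionary, building on the imports from step 1 and assuming trigrams over the 36-character alphabet of lowercase letters and digits, with each trigram mapped to a feature index:

```python
ALPHABET = string.ascii_lowercase + string.digits  # 36 characters

# every possible trigram -> its column index in the feature vector
ngram_index = {
    "".join(combo): idx
    for idx, combo in enumerate(itertools.product(ALPHABET, repeat=3))
}
print(len(ngram_index))  # 36 ** 3 = 46656 possible trigrams
```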

3. Creating a function that takes a sentence as input and returns a list of respective n-grams:
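
A minimal version of such a function (essentially the trigram helper sketched earlier, with lowercasing first; the name is our guess):

```python
def get_ngrams(text, n=3):
    """Return the list of contiguous character n-grams in a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(get_ngrams("google"))  # ['goo', 'oog', 'ogl', 'gle']
```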

4. A function that accepts a dataframe along with two arrays, X and Y, where X holds the features used for training and Y holds the labels. Each URL is stripped down to a suitable format for pre-processing, removing unwanted punctuation and tokens (dots, slashes, "www.", ".com", etc.). This data is then passed to the pre-processing module for conversion into a format suitable for our model:
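
A hedged sketch of that routine. The stripping rules follow the description above; the function name and the column names ("url", "label") are hypothetical:

```python
def build_features(df, ngram_index, n=3):
    """Turn a dataframe of URLs into a count matrix X and a label array Y.

    Assumes df has a 'url' column and a 'label' column (1 = malicious).
    """
    X = np.zeros((len(df), len(ngram_index)), dtype=np.float32)
    Y = df["label"].to_numpy()

    for row, url in enumerate(df["url"]):
        # strip the scheme, 'www.' and '.com', then drop all punctuation
        cleaned = re.sub(r"https?://|www\.|\.com", "", url.lower())
        cleaned = re.sub(r"[^a-z0-9]", "", cleaned)
        for i in range(len(cleaned) - n + 1):
            X[row, ngram_index[cleaned[i:i + n]]] += 1
    return X, Y
```

For the full million-row dataset, a dense matrix this wide would be far too large; a scipy.sparse matrix would be the practical choice there.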

5. Loading the dataset and splitting it into training and testing sets:
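
Loading and splitting might look like this; the CSV file name is hypothetical:

```python
df = pd.read_csv("urls.csv")  # hypothetical file with 'url' and 'label' columns

X, Y = build_features(df, ngram_index)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```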

6. The dataset has 1 million rows, so we will be using SGDClassifier, which by default fits a linear SVM. This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning-rate schedule. SGD allows minibatch (online, out-of-core) learning via the partial_fit method:
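
A sketch of minibatch training with SGDClassifier; the batch size is an arbitrary choice of ours:

```python
clf = SGDClassifier(loss="hinge")  # hinge loss => a linear SVM

# stream the training data through partial_fit in minibatches
classes = np.unique(y_train)
batch_size = 10_000
for start in range(0, len(X_train), batch_size):
    end = start + batch_size
    clf.partial_fit(X_train[start:end], y_train[start:end], classes=classes)
```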

7. Testing the model against the test set and getting the count of correct and incorrect predictions:
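
Counting correct and incorrect predictions on the held-out set could be as simple as:

```python
y_pred = clf.predict(X_test)

correct = int((y_pred == y_test).sum())
incorrect = len(y_test) - correct
```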

8. We can now display the result and accuracy of our model:
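
And reporting the result:

```python
accuracy = correct / len(y_test)
print(f"Correct predictions:   {correct}")
print(f"Incorrect predictions: {incorrect}")
print(f"Accuracy: {accuracy:.2%}")
```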

9. Let’s enter a few sample URLs to check if all this work was even worth it: 🥲
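
A hedged wrapper for single-URL checks; both sample URLs below are made up for illustration:

```python
def predict_url(url):
    """Run one raw URL through the same pipeline and return a verdict."""
    sample = pd.DataFrame({"url": [url], "label": [0]})  # dummy label
    X_sample, _ = build_features(sample, ngram_index)
    return "malicious" if clf.predict(X_sample)[0] == 1 else "safe"

print(predict_url("free-prizes-now.xyz/claim-reward"))  # expect: malicious
print(predict_url("https://www.wikipedia.org"))         # expect: safe
```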

(In our run, the first link was correctly identified as malicious, while the second was recognized as a safe URL.)

Want to try this on your own? We’ve got you covered. Here’s a link to the entire GitHub repository for reference. Happy Coding! 🐱‍💻

Optimization and Future Scope

Future work could involve fine-tuning the model so that it becomes more accurate and nuanced, able to capture subtle variations in the text and make fuller use of the given feature set. That leads us to an open question: how can we handle a huge number of URLs whose features will evolve over time?

Effort should also go toward effective feature extraction via deep learning approaches. Emerging challenges such as domain changes can be handled by acquiring labeled data and user feedback and integrating them in an online active-learning approach.

Glad you made it this far! We hope you got the in-depth understanding you were looking for. Feel free to connect with us on LinkedIn; we're keen to hear your suggestions and feedback.
