Supervised Learning to detect Phishing URLs
I recently enrolled in the Metis Data Science Bootcamp (Singapore) offered through Kaplan. My ten classmates, who come from varied backgrounds, and I make up the first cohort, and we have passed the midway mark! It has been a pretty intensive but awesome experience so far. This is one of the projects I have completed.
Background
For my third project (of five), I chose to use supervised learning to detect whether a URL is a phishing URL or a legitimate one.
Why is this important? I have a client who nearly had his email account hijacked through phishing. His supplier’s email was compromised, and he received an invoice asking for payment to be deposited to a different bank account number. Fortunately, he called his supplier for verification and discovered the fraud. The threat of phishing is very real and not to be underestimated.
Here are some examples of phishing websites. In general, they want your credentials and passwords.
Preliminary Analysis
There are some phishing datasets on Kaggle, but I wanted to try generating my own dataset for this project. I relied on these two sources for my list of URLs:
- Legit URLs: Ebubekir Büber (github.com/ebubekirbbr)
- Phish URLs: phishtank.com
With a bit of domain knowledge and analysis done on some phished and legit websites, I was able to shortlist the following categories to extract information:
Some useful information:
- “Phishers” usually hack into legitimate websites to insert phishing pages, rather than set up a domain exclusively for phishing. While this can make it difficult to identify a phishing website through its domain, I understand registrars and hosting companies act swiftly to get website owners to remove those phishing pages to prevent a negative impact on their ranking. This means we could see empty registrar or name-server fields for compromised domains over time.
- One precaution: some phishing websites might contain malware, so instead of loading those URLs directly in my browser, I did the following: (a) viewed screenshots of the pages using tools like https://web-capture.net, and (b) analysed the HTML content in a text editor.
Data Acquisition
Conceptually, my scraper process looks like this.
The idea is to keep the code modular so that I can keep adding new categories when necessary. Every page I scraped is saved as a file on my local drive for reference, in case it no longer exists in the future.
I used BeautifulSoup to extract attributes in tags. Although not foolproof, it helps to set a (random) valid “User-Agent” header so that servers are less likely to reject your requests on the assumption that you are a bot.
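The user-agent rotation can be sketched along these lines (a minimal stdlib sketch; the user-agent strings and the `fetch_html` helper are illustrative, not the project’s actual code):

```python
import random
import urllib.request

# A small, non-exhaustive pool of valid browser user-agent strings (illustrative)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def random_user_agent() -> str:
    """Pick a random, valid-looking user-agent for each request."""
    return random.choice(USER_AGENTS)

def fetch_html(url: str, timeout: int = 10) -> str:
    """Fetch a page with a randomised User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": random_user_agent()})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The returned HTML can then be handed to BeautifulSoup for attribute extraction.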
Basic preprocessing on the URL (i.e. remove www, trailing slash) is done to ensure consistency.
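A normalisation step like the one described can be sketched as follows (a hypothetical helper; the exact rules the project used may differ):

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Normalise a URL for consistency: lower-case the host,
    strip a leading 'www.' and any trailing slash."""
    # Ensure urlparse sees a netloc even when no scheme is given
    parsed = urlparse(url if "//" in url else "//" + url, scheme="http")
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.rstrip("/")
    return f"{parsed.scheme}://{host}{path}"

# normalize_url("https://www.Example.com/login/") -> "https://example.com/login"
```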
Refer to Annex for a brief description of the 22 features extracted after processing the JSON files for each URL.
Exploratory Data Analysis
Since scraping data is really time consuming, I decided to start my EDA concurrently to get an early feel for the data. After analysing about 1,817 URLs (930 phishing, 887 legitimate) and their features, I dropped some generally empty or nominal features and worked with these 15:
URL            Domain          Network      Page       Whois
-------------- --------------- ------------ ---------- ---------
length         len_subdomain   len_cookie   length     w_score
special_char   is_https                     anchors
depth                                       form
                                            email
                                            password
                                            signin
                                            hidden
                                            popup
Feature Selection
I used LASSO regularization to identify the important features. Even with a small value of alpha, the 5 dominant features already stand out.
[('len', 0.0006821926601753635),
('count_s', 0.0),
('depth', 0.0),
('len_subdomain', 0.0),
('is_https', 0.0),
('len_cookie', -0.0002472539769316538),
('page_length', -2.4074484401619206e-07),
('page_num_anchor', -0.0006943876695101922),
('page_num_form', -0.0),
('page_num_email', -0.0),
('page_num_password', 0.0),
('page_num_signin', 0.0),
('page_num_hidden', -0.00041105959874092535),
('page_num_popup', -0.0),
('w_score', -0.0)]
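A feature-ranking pass like the one above can be sketched as follows (synthetic data as a stand-in for the scraped feature matrix; the feature names follow the list above, and the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

feature_names = ["len", "count_s", "depth", "len_subdomain", "is_https",
                 "len_cookie", "page_length", "page_num_anchor", "page_num_form",
                 "page_num_email", "page_num_password", "page_num_signin",
                 "page_num_hidden", "page_num_popup", "w_score"]

# Synthetic stand-in: labels driven by two of the features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] - X[:, 7] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# With a small alpha, most coefficients shrink to (near) zero
# and only the dominant features keep sizeable weights
lasso = Lasso(alpha=0.01).fit(StandardScaler().fit_transform(X), y)
ranked = sorted(zip(feature_names, lasso.coef_), key=lambda t: -abs(t[1]))
for name, coef in ranked[:5]:
    print(f"{name:18s} {coef:+.4f}")
```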
Frankly, I am a little surprised that w_score had so little significance. Given more time, I might have been able to engineer one or two features from it, but with the deadline drawing near, I decided to drop the whois category (i.e. suspend the ExtractorWhois module) and focus on these 5:
URL      Domain   Network      Page      Whois
-------- -------- ------------ --------- -------
length            len_cookie   length
                               anchors
                               hidden
I then built a simple KNN classifier to get a baseline. The optimal K is 3, with a decent accuracy score of 0.793.
Models
I ended up with 6,906 observations (3,501 legitimate, 3,455 phishing) after scraping. Not surprisingly, many of the phishing pages no longer exist.
Type #URL processed #Pages available
------- ---------------- ------------------
Legit 4,000 3,501
Phish 6,000 3,455
Some servers or Content Management Systems (CMS, e.g. WordPress, Joomla) are configured improperly and return HTTP response 200 (instead of 404) even though the page is not found. I assumed these pages are in the minority and have minimal impact on the eventual accuracy.
I repeated the feature selection process with the 6,906 observations and, once again, the same 5 dominant features surfaced. The optimal K for KNN is still 3. Good!
Here are the results for the following models.
Model Accuracy
------------------- ----------
Naive Bayes 0.757
SVC 0.760
KNN (K=3) 0.791
Log. Reg. 0.822
Decision Tree 0.836
KNN (K=3, scaled) 0.845
Random Forest 0.885
Side Note:
Since I had already scraped the data for 14 features, I fitted them to the different models as well. Random Forest gave the best accuracy (92.1%), but I would later find out with live data that it generally performs worse, likely due to overfitting.
Demo
I built a website using Flask, the Spectre CSS framework and native JavaScript on an Amazon EC2 instance.
Let’s try some URLs not found in the dataset. How about “https://metis.kaplan.com.sg”?
How about a phishing link? Visit http://phishtank.com/phish_archive.php to find a “valid phish”.
Copy and paste the URL and voilà!
Conclusion
What a journey! I started with analysing phished and legit web pages, scraping them, doing feature selection, building the classification models and finished with a website to show the results.
Although the accuracy of 88.5% for Random Forest is pretty high, I find that on live data the results still fall short of expectations.
There is definitely room for more improvements. I believe further efforts on the following would make the classifier more robust and accurate:
- more observations: if I were to increase the number of pages to scrape to, say, 50,000, I would improve the scraper by tracking scraping state (done or not) for each URL in PostgreSQL, so that I could resume scraping whenever it gets interrupted for whatever reason (e.g. a network error)
- check for the presence of URL shorteners
- implement feature engineering with whois or even SSL parameters
- use ensemble models in sklearn to get a single prediction based on different forms of “averaging” the models
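The URL-shortener check from the list above could be as simple as a domain lookup (a sketch; the shortener list is a hypothetical, non-exhaustive example):

```python
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of known URL-shortener domains
SHORTENER_DOMAINS = {"bit.ly", "t.co", "tinyurl.com", "goo.gl", "ow.ly", "is.gd"}

def is_shortened(url: str) -> bool:
    """Flag URLs whose host matches a known URL-shortener domain."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_DOMAINS
```

A boolean feature like this could then be added alongside the existing URL features.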