I recently enrolled in the Metis Data Science Bootcamp (Singapore) offered through Kaplan. My ten classmates and I, from varied backgrounds, are the first cohort, and we have survived the midway mark! It has been a pretty intensive but awesome experience so far. This is one of the projects I have completed.
For my third project (of five), I chose to use supervised learning to detect whether a URL is phished or legit.
Why is this important? I have a client who nearly had his email account hijacked through phishing. In fact, his supplier’s email was compromised and he received an invoice asking for payment to be deposited to another bank account number. Fortunately, he called his supplier for verification and discovered the fraud. The threat of phishing is very real and not to be underestimated.
Here are some examples of phishing websites. In general, they want your credentials and passwords.
There are some phishing datasets on Kaggle, but I wanted to try generating my own dataset for this project. I relied on these two sources for my list of URLs:
- Legit URLs: Ebubekir Büber (github.com/ebubekirbbr)
- Phish URLs: phishtank.com
With a bit of domain knowledge and analysis done on some phished and legit websites, I was able to shortlist the following categories to extract information:
Some useful information:
- “Phishers” usually hack into legit websites to insert phishing web pages, rather than set up a domain exclusively for phishing. While this could make it difficult to identify a phishing website through its domain, I understand registrars and hosting companies act swiftly to get website owners to remove those phishing pages in order to prevent negative impact to their ranking. This means we could possibly see empty registrar or name servers for compromised domains over time.
- One precaution to take. Some phished websites might contain malware so instead of loading those URLs directly in my browser, I did the following: (a) view screenshot of these pages using tools like https://web-capture.net (b) analyse the HTML content using text editors
Conceptually, my scraper process looks like this.
The idea is to keep the code modular so that I can keep adding new categories when necessary. Every page that I scraped is saved as a file on my local drive for reference, in case it no longer exists in the future.
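A minimal sketch of how such a modular scraper could be organised, using only the standard library. The class names `ExtractorUrl` and `ExtractorPage`, the function `process_page`, the file naming scheme and the exact feature definitions are all assumptions made for illustration; only the idea of one extractor per category, and of saving every page locally, comes from the description above.

```python
import hashlib
import json
from pathlib import Path


class ExtractorUrl:
    """Hypothetical extractor for the URL category."""

    category = "url"

    def extract(self, url, html):
        return {
            "length": len(url),
            # Assumed definition: characters that are neither letters nor digits.
            "special_char": sum(not ch.isalnum() for ch in url),
        }


class ExtractorPage:
    """Hypothetical extractor for the Page category."""

    category = "page"

    def extract(self, url, html):
        return {
            "length": len(html),
            # Crude stand-in for a real HTML parser: count opening <a> tags.
            "anchors": html.lower().count("<a "),
        }


def process_page(url, html, extractors, out_dir):
    """Save the raw page, run every extractor, and store the result as JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()  # filesystem-safe name

    # Keep a local copy of the page in case it disappears later.
    (out_dir / f"{key}.html").write_text(html, encoding="utf-8")

    record = {"url": url, "features": {}}
    for extractor in extractors:
        record["features"][extractor.category] = extractor.extract(url, html)

    (out_dir / f"{key}.json").write_text(json.dumps(record), encoding="utf-8")
    return record
```

With this layout, adding a new category only means writing another class with a `category` name and an `extract` method, and appending an instance to the list passed to `process_page`.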
I used BeautifulSoup to extract attributes in tags. Although it is not foolproof, it should help to set a (random) valid “user-agent” so that servers are less likely to reject your requests thinking you are a bot.
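The user-agent idea can be sketched as follows. This is an illustration rather than the code used in the project: it builds the request with the standard library's `urllib.request` (the HTTP client actually used is not stated above), and the `USER_AGENTS` list and the helper names `random_headers`, `build_request` and `fetch` are assumptions.

```python
import random
import urllib.request

# A small pool of browser-like user-agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Return request headers with a randomly chosen user-agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def build_request(url):
    """Create a request object that carries the random user-agent."""
    return urllib.request.Request(url, headers=random_headers())


def fetch(url, timeout=10):
    """Download a page as text; network errors propagate to the caller."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The text returned by `fetch` is what would then be handed to an HTML parser such as BeautifulSoup to pull out tag attributes.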
Basic preprocessing on the URL (i.e. remove www, trailing slash) is done to ensure consistency.
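As a rough sketch of that preprocessing step, the function below strips a leading `www.` and any trailing slash from the path using `urllib.parse`. The function name `normalize_url` is hypothetical, and the real preprocessing may handle more cases than these two.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Remove a leading 'www.' and trailing slashes so equivalent URLs match."""
    parts = urlsplit(url.strip())

    netloc = parts.netloc
    if netloc.lower().startswith("www."):
        netloc = netloc[4:]

    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))
```

For instance, `http://www.example.com/` and `http://example.com` both normalise to the same string, so they are not counted as two different observations.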
Refer to Annex for a brief description of the 22 features extracted after processing the JSON files for each URL.
Exploratory Data Analysis
Since scraping data is really time-consuming, I decided to start my EDA concurrently to get an early feel for the data. After analysing 1817 URLs (930 phish, 887 legit) and their features, I chose to drop some generally empty or nominal features and work on these 15.
URL Domain Network Page Whois
-------------- --------------- ------------ ---------- ---------
length len_subdomain len_cookie length w_score
special_char is_https anchors
I used LASSO Regularization to identify the important features. Even with a small value of alpha, I can already see the 5 dominant features.
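A sketch of how LASSO can surface dominant features, using scikit-learn's `Lasso` on synthetic data. Everything here is an assumption for illustration: the data is randomly generated (not the scraped dataset), the feature names `feat_0` to `feat_5` are placeholders, and fitting a linear LASSO directly on a 0/1 label is only one possible setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def lasso_selected_features(X, y, names, alpha=0.05):
    """Return (name, coefficient) pairs with non-zero weight, largest first."""
    # Standardise so the L1 penalty treats every feature on the same scale.
    X_scaled = StandardScaler().fit_transform(X)
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    ranked = sorted(zip(names, model.coef_), key=lambda p: abs(p[1]), reverse=True)
    return [(name, coef) for name, coef in ranked if coef != 0.0]


# Synthetic stand-in data: only feat_0 and feat_3 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] > 0).astype(float)
names = [f"feat_{i}" for i in range(6)]

selected = lasso_selected_features(X, y, names)
for name, coef in selected:
    print(f"{name}: {coef:+.3f}")
```

The L1 penalty shrinks the coefficients of uninformative features to exactly zero, which is why the surviving features can be read off as the dominant ones.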
Frankly, I am a little surprised that the w_score had little significance. I believe that, given enough time, I might be able to engineer one or two features, but with the deadline drawing near, I decided to drop the whois category (i.e. suspend the ExtractorWhois module) to focus on these 5.
URL Domain Network Page Whois
-------- -------- ------------ --------- -------
length len_cookie length
I then built a simple classifier with KNN to get a baseline. The optimal K is 3 with a decent accuracy score of 0.793.
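One common way to pick K is to compare cross-validated accuracy over a handful of candidate values. The sketch below does this with scikit-learn on synthetic data from `make_classification`. The helper name `best_k`, the candidate range and the 5-fold setting are assumptions, and the printed numbers will not match the scores reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def best_k(X, y, candidates=range(1, 16, 2), cv=5):
    """Return the K with the highest mean cross-validated accuracy."""
    scores = {}
    for k in candidates:
        model = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return max(scores, key=scores.get), scores


# Synthetic stand-in for the five selected features.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
k, scores = best_k(X, y)
print(f"best K = {k}, accuracy = {scores[k]:.3f}")
```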
I ended up with 6906 (3501 legit, 3455 phished) observations after scraping. Not surprisingly, many of the phished pages no longer exist.
Type     #URLs processed   #Pages available
-------  ----------------  ------------------
Legit    4,000             3,501
Phish    6,000             3,455
There could be some servers or Content Management Systems (CMS, e.g. WordPress, Joomla) that are configured improperly and return HTTP response 200 (instead of 404) even though the page is not found. I assumed these pages are in the minority and have minimal impact on the eventual accuracy.
I repeated the feature selection process with the 6906 observations and once again, the same 5 dominant features surfaced. The optimum K for KNN is still 3. Good!
Here are the accuracy scores for the models I tried.
Model               Accuracy
------------------  --------
Naive Bayes         0.757
KNN (K=3)           0.791
Log. Reg.           0.822
Decision Tree       0.836
KNN (K=3, scaled)   0.845
Random Forest       0.885
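The gap between the two KNN rows is what one would expect from a distance-based model when features live on very different scales (a URL length versus a page length, for example). The sketch below illustrates the effect with scikit-learn on synthetic data. The data, the variable names and the 5-fold setting are assumptions, and it is not the code behind the scores reported here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature with a small range and one
# uninformative feature with a very large range.
rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)
noise = rng.normal(scale=1000.0, size=n)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

raw_knn = KNeighborsClassifier(n_neighbors=3)
# Putting the scaler inside the pipeline means it is fitted on the
# training folds only, so no information leaks from the test folds.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()
print(f"unscaled: {raw_acc:.3f}, scaled: {scaled_acc:.3f}")
```

Without scaling, the large-range feature dominates the Euclidean distance and the informative one is effectively ignored.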
Since I have already scraped the data for 14 features, I fitted them to the different models as well. I was able to get the best accuracy from Random Forest (92.1%) but I would later find out using live data that it generally performs worse, likely due to overfitting.
Let’s try some URLs not found in the dataset. How about “https://metis.kaplan.com.sg”?
How about a phished link? Visit http://phishtank.com/phish_archive.php to find a “valid phish”.
Copy and paste the URL and voilà!
What a journey! I started with analysing phished and legit web pages, scraping them, doing feature selection, building the classification models and finished with a website to show the results.
Although the accuracy of 88.5% for Random Forest is pretty high, I find that when it comes to live data, the results still fall short of expectations.
There is definitely room for more improvements. I believe further efforts on the following would make the classifier more robust and accurate:
- more observations; if I were to increase the number of pages to scrape to, say, 50,000, I would improve the scraper by implementing some form of state (done or not) in PostgreSQL for each URL so that I will be able to resume scraping anytime it gets interrupted for whatever reason (e.g. network error)
- check for presence of URL shorteners
- implement feature engineering with whois or even SSL parameters
- use ensemble models in sklearn to get a single prediction based on different forms of “averaging” the models
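The first item in the list above (tracking a done/not-done state per URL so that scraping can resume) could be sketched like this. To keep the example self-contained it uses the standard library's `sqlite3` as a stand-in for PostgreSQL. The table name `scrape_state` and all function names are hypothetical. The `INSERT OR IGNORE` statement is SQLite syntax, and in PostgreSQL the equivalent would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the table that records which URLs are done."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scrape_state ("
        " url TEXT PRIMARY KEY,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn, urls):
    """Register URLs; ones already known keep their current state."""
    conn.executemany(
        "INSERT OR IGNORE INTO scrape_state (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def pending(conn):
    """Return the URLs that have not been scraped successfully yet."""
    rows = conn.execute("SELECT url FROM scrape_state WHERE done = 0 ORDER BY url")
    return [row[0] for row in rows]


def mark_done(conn, url):
    conn.execute("UPDATE scrape_state SET done = 1 WHERE url = ?", (url,))
    conn.commit()


def run(conn, scrape):
    """Scrape every pending URL; failures stay pending for the next run."""
    for url in pending(conn):
        try:
            scrape(url)
        except Exception:
            continue  # leave done = 0 so a later run retries this URL
        mark_done(conn, url)
```

Because the state is committed after each URL, an interrupted run can simply be started again and will pick up only the URLs that are still pending.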