Supervised Learning to detect Phishing URLs

SingTat
Sep 8, 2019 · 6 min read

I recently enrolled in the Metis Data Science Bootcamp (Singapore) offered through Kaplan. My ten classmates and I, who come from varied backgrounds, are the first cohort, and we have made it past the midway mark! It has been an intensive but awesome experience so far. This is one of the projects I have completed.

Photo by John Sekutowski on Unsplash

Background

Phishing page requesting PayPal credentials
Phishing page requesting gamer credentials

Preliminary Analysis

5 Categories associated with web pages
Parameters to extract for a web page fall under 5 categories

Data Acquisition

Exploratory Data Analysis

  URL            Domain          Network      Page       Whois
  -------------- --------------- ------------ ---------- ---------
  length         len_subdomain   len_cookie   length     w_score
  special_char   is_https                     anchors
  depth                                       form
                                              email
                                              password
                                              signin
                                              hidden
                                              popup
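
As a rough illustration of how the URL and Domain columns in the table above could be computed, here is a minimal sketch using only the Python standard library. The function name `url_features`, the set of characters treated as "special", the definition of depth as the number of path segments, and the naive subdomain split are all assumptions made for this example; the project's actual definitions may differ.

```python
from urllib.parse import urlparse

# Characters counted as "special"; this set is an illustrative assumption.
SPECIAL_CHARS = set("@-_~%=&?")


def url_features(url: str) -> dict:
    """Compute URL- and domain-level features for one URL (hypothetical definitions)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    # Naive assumption: the last two labels form the registered domain.
    # This is wrong for suffixes such as .co.uk.
    subdomain = ".".join(labels[:-2]) if len(labels) > 2 else ""
    # Depth is taken here as the number of non-empty path segments.
    depth = len([segment for segment in parts.path.split("/") if segment])
    return {
        "length": len(url),
        "special_char": sum(ch in SPECIAL_CHARS for ch in url),
        "depth": depth,
        "len_subdomain": len(subdomain),
        "is_https": int(parts.scheme == "https"),
    }


# A made-up URL, used only to exercise the function.
example = url_features("http://paypal.secure-login.example.com/account/signin?id=1")
print(example)
```

The Network, Page and Whois features would need an HTTP request, an HTML parse and a Whois lookup respectively, so they are left out of this sketch.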

Feature Selection

[('len', 0.0006821926601753635),
('count_s', 0.0),
('depth', 0.0),
('len_subdomain', 0.0),
('is_https', 0.0),
('len_cookie', -0.0002472539769316538),
('page_length', -2.4074484401619206e-07),
('page_num_anchor', -0.0006943876695101922),
('page_num_form', -0.0),
('page_num_email', -0.0),
('page_num_password', 0.0),
('page_num_signin', 0.0),
('page_num_hidden', -0.00041105959874092535),
('page_num_popup', -0.0),
('w_score', -0.0)]

  URL      Domain   Network      Page      Whois
  -------- -------- ------------ --------- -------
  length            len_cookie   length
                    anchors
                    hidden
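
The step from the coefficient list to the reduced feature table can be sketched as a simple filter. This assumes the pairs are per-feature coefficients from a fitted model and that features with a coefficient of exactly zero were dropped. That reading is consistent with the five features that remain, but it is an inference rather than something the listing states. The variable names are illustrative.

```python
# (feature, coefficient) pairs copied from the listing above.
coefficients = [
    ("len", 0.0006821926601753635),
    ("count_s", 0.0),
    ("depth", 0.0),
    ("len_subdomain", 0.0),
    ("is_https", 0.0),
    ("len_cookie", -0.0002472539769316538),
    ("page_length", -2.4074484401619206e-07),
    ("page_num_anchor", -0.0006943876695101922),
    ("page_num_form", -0.0),
    ("page_num_email", -0.0),
    ("page_num_password", 0.0),
    ("page_num_signin", 0.0),
    ("page_num_hidden", -0.00041105959874092535),
    ("page_num_popup", -0.0),
    ("w_score", -0.0),
]

# Keep only features whose coefficient is non-zero.
# In Python, -0.0 == 0 is True, so negative zeros are dropped as well.
selected = [name for name, coef in coefficients if coef != 0]
print(selected)
```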

Models

  Type    #URL processed   #Pages available
  ------- ---------------- ------------------
  Legit   4,000            3,501
  Phish   6,000            3,455

  Model               Accuracy
  ------------------- ----------
  Naive Bayes         0.757
  SVC                 0.760
  KNN (K=3)           0.791
  Logistic Regression 0.822
  Decision Tree       0.836
  KNN (K=3, scaled)   0.845
  Random Forest       0.885
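
To put the accuracy figures in context, it helps to work out how many pages were actually retrievable and what a trivial majority-class guess would score. The short calculation below uses only the counts from the first table. It assumes, without confirmation from the tables themselves, that the models were trained and evaluated on the available pages only.

```python
# (URLs processed, pages available) per class, taken from the table above.
counts = {"Legit": (4000, 3501), "Phish": (6000, 3455)}

# Fraction of processed URLs whose page could still be retrieved.
availability = {
    label: available / processed
    for label, (processed, available) in counts.items()
}

# Majority-class baseline: always predict the class with more available pages.
total_pages = sum(available for _, available in counts.values())
baseline = max(available for _, available in counts.values()) / total_pages

for label, rate in availability.items():
    print(f"{label}: {rate:.1%} of processed URLs had a page available")
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Under that assumption the two classes are almost evenly balanced, so the accuracies in the second table can be compared against a baseline of roughly one half.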

Demo

Classifying a legit URL — Oops, could be better
Find a phished URL from phishtank.com
Classifying a phished URL — not too bad

Conclusion

References

Annex

Initial 22 fields during Preliminary Analysis
Description of the initial 22 features
