Automatic Classification of Phishing URLs

Phishing websites, which pretend to be a trusted third party to gain access to private data, continue to cost Internet users over a billion dollars each year. According to a recent survey, around 80,000 people around the world fall victim to phishing attacks every day and share their private data. Many techniques are used to fight these attacks, but they do not seem to be keeping pace with them, and this number is certainly not small enough to be ignored.

A lot of research is going on in this field for the betterment of society, and here is one more suggestion for how these attacks could be mitigated. There are certain things to check before entering personal information on any website, and most people are not aware of them. What if this whole verification process were automated, with recent techniques such as machine learning applied, so that any user opening a website or link, whether aware of these scenarios or not, stays safe from these attacks?

Common validations that could possibly be made are:

  • Checking for a valid server-side SSL certificate

Any website you open should be served over the HTTPS protocol with an SSL certificate. This seemed like a very feasible defence against these attacks, but only until fake websites started obtaining certificates as well. That does not mean SSL is of no use, but further validation steps have to be introduced for better judgement.

SSL Certificate Verification
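
As an illustration, here is a minimal sketch of this first check in Python, using only the standard library: it attempts a TLS handshake with full certificate verification and treats any failure as a red flag. The hostname below is just an example.

```python
# Minimal sketch: does the server present a certificate that passes standard
# verification (valid chain, not expired, hostname matches)?
import socket
import ssl

def has_valid_certificate(hostname: str, port: int = 443) -> bool:
    """Return True if a TLS handshake with certificate verification succeeds."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((hostname, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return tls.getpeercert() is not None
    except (ssl.SSLError, ssl.CertificateError, OSError):
        return False

print(has_valid_certificate("www.example.com"))  # illustrative hostname
```
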
  • Validating website names

We all know that some of these sites use a misspelling or a shortened form of the original website name to fool people. If we already have a list of legitimate site names, we can apply an algorithm that finds possible abbreviations or similar names; whenever a given website name matches one of them beyond a certain percentage, the user can be shown the legitimate sites they probably meant to visit, and will thus be aware that the site they are currently on may not be the one they assumed. For example, www.facebook.com versus www.facebok.com.
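
As a rough sketch, assuming we already hold a small whitelist of legitimate hostnames, a plain string-similarity ratio (here difflib's SequenceMatcher with an arbitrary 0.8 threshold) is enough to surface such suggestions:

```python
# Sketch: suggest legitimate hostnames that closely resemble the one the user
# is visiting. The whitelist and the threshold are illustrative assumptions.
from difflib import SequenceMatcher

LEGITIMATE_HOSTS = ["www.facebook.com", "www.amazon.com", "www.paypal.com"]

def similar_legitimate_hosts(hostname: str, threshold: float = 0.8):
    """Return (legitimate host, similarity) pairs above the threshold."""
    suggestions = []
    for legit in LEGITIMATE_HOSTS:
        ratio = SequenceMatcher(None, hostname.lower(), legit.lower()).ratio()
        if ratio >= threshold and hostname.lower() != legit.lower():
            suggestions.append((legit, round(ratio, 2)))
    return suggestions

print(similar_legitimate_hosts("www.facebok.com"))
# e.g. [('www.facebook.com', 0.97)]
```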

  • Comparing TLDs with others

A TLD (Top-Level Domain) normally comes directly after the main domain name in a URL. Spoofed URLs sometimes place digits or extra text between the familiar name and the real registered domain, for example amazon.123.com, where "amazon" is only a subdomain of 123.com. There is a high chance that such URLs are not genuine, so a check in our algorithm that distinguishes them would have a significant impact.
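
A sketch of this check, using the third-party tldextract package (an assumption: it would have to be installed) to split a URL into subdomain, registered domain, and public suffix; the brand list is illustrative:

```python
# Sketch: flag URLs where a well-known brand appears only as a subdomain of an
# unrelated registered domain, e.g. amazon.123.com.
import tldextract  # assumption: pip install tldextract

KNOWN_BRANDS = {"amazon", "facebook", "paypal"}  # illustrative list

def brand_in_subdomain(url: str) -> bool:
    parts = tldextract.extract(url)  # -> subdomain, domain, suffix
    subdomain_labels = set(parts.subdomain.lower().split("."))
    return bool(subdomain_labels & KNOWN_BRANDS) and parts.domain.lower() not in KNOWN_BRANDS

print(brand_in_subdomain("http://amazon.123.com"))   # True  -> suspicious
print(brand_in_subdomain("https://www.amazon.com"))  # False
```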

  • Comparing the page's source code with existing ones

As the name suggests, comparing the source code of a given website with pages already stored in our database would be a great validation step, since many phishing attacks work by copying content from a legitimate site and changing a small amount of code. If the correlation between the two is above a certain threshold, that is a red flag. This is simply an additional functionality on top of the domain-name validation in the previous step, using a crawler that fetches the source code of the web page.
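
A crude sketch of this comparison, assuming a reference snapshot of the legitimate page is already stored in our database; requests and a character-level similarity ratio stand in for a real crawler and a smarter correlation measure, and the 0.9 threshold is illustrative:

```python
# Sketch: how similar is the live page's HTML to a stored reference copy?
import difflib
import requests  # assumption: the requests package is installed

def source_similarity(url: str, reference_html: str) -> float:
    """Return a 0..1 similarity score between the live page and a reference."""
    live_html = requests.get(url, timeout=10).text
    return difflib.SequenceMatcher(None, live_html, reference_html).ratio()

def looks_like_clone(url: str, reference_html: str, threshold: float = 0.9) -> bool:
    # High similarity to a legitimate page served from a different domain is a red flag.
    return source_similarity(url, reference_html) >= threshold
```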

  • Adding Punycode validation

Punycode is a special encoding used by web browsers to convert Unicode characters into the limited ASCII character set (A-Z, 0-9) supported by the Internationalized Domain Names (IDN) system. The loophole relies on the fact that if someone chooses all the characters of a domain name from a single foreign-language character set, so that it looks exactly like the targeted domain, browsers will render it in that language instead of in Punycode format. This loophole allowed a researcher to register the domain name xn--80ak6aa92e.com and bypass protections: it appeared as "apple.com" in all vulnerable web browsers, including Chrome, Firefox, and Opera, while Internet Explorer, Microsoft Edge, Apple Safari, Brave, and Vivaldi were not vulnerable. If we add this validation as well, we can be sure that a domain's characters do not come from another character set.
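
A small sketch of such a check: it flags any Punycode ("xn--") label in a hostname and decodes it so the result can be reviewed against the whitelist. A full homograph analysis (mixed scripts, confusable characters) would need more than this.

```python
# Sketch: detect and decode Punycode labels so IDN domains can be reviewed.
def decode_idn_labels(hostname: str):
    """Return (label, decoded text) pairs for every Punycode ("xn--") label."""
    decoded = []
    for label in hostname.split("."):
        if label.lower().startswith("xn--"):
            decoded.append((label, label[4:].encode("ascii").decode("punycode")))
    return decoded

def contains_idn(hostname: str) -> bool:
    return bool(decode_idn_labels(hostname))

print(contains_idn("xn--80ak6aa92e.com"))       # True -> warn the user
print(decode_idn_labels("xn--80ak6aa92e.com"))  # shows the Cyrillic "apple" lookalike
```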

  • Validating website trust seal

A trust seal is generally placed at a visible location on a website and indicates that third-party transactions on that site have been verified, so it can be trusted. With the amount of technology available today, this validation alone is not sufficient, but it is still necessary. Not everyone is aware of trust seals, so if our validation tells users whether a site carries such a seal before they access it, they will be better informed.
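
As a very rough sketch, one could scan the page's HTML for links or images pointing to an assumed list of seal providers; the provider list below is illustrative only, and a genuine check would also have to verify that the seal links back to the issuer's verification page:

```python
# Sketch: does the page appear to carry a trust seal from a known provider?
import requests                 # assumption: requests is installed
from bs4 import BeautifulSoup   # assumption: beautifulsoup4 is installed

SEAL_PROVIDER_HINTS = ["trustedsite.com", "digicert.com", "trustpilot.com"]  # illustrative

def mentions_trust_seal(url: str) -> bool:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    targets = [tag.get("href") or tag.get("src") or "" for tag in soup.find_all(["a", "img"])]
    return any(hint in target for target in targets for hint in SEAL_PROVIDER_HINTS)
```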

Many more factors that I am not aware of could be added to this list, and although these validations and functionalities may already work individually in one way or another, making them work together as a single unit could be far more beneficial. One rule goes without saying and has to run first: if a URL is already in our pool of spoofed or black-listed URLs, this whole process is unnecessary and the user should be warned right away.
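
To make this concrete, here is a rough sketch of how the blacklist check and the validations above could be chained together. It reuses the hypothetical functions from the earlier sketches, and the simple two-red-flag rule is purely illustrative, not a tuned model.

```python
# Sketch: run the blacklist check first, then combine the individual validations.
BLACKLIST = set()   # previously classified spoofed URLs
WHITELIST = set()   # previously classified legitimate URLs

def classify_url(url: str, hostname: str) -> str:
    if url in BLACKLIST:
        return "spoofed"      # already known bad: stop immediately
    if url in WHITELIST:
        return "legitimate"   # already known good

    red_flags = 0
    if not has_valid_certificate(hostname):
        red_flags += 1
    if similar_legitimate_hosts(hostname):
        red_flags += 1        # confusingly close to a known site
    if brand_in_subdomain(url):
        red_flags += 1
    if contains_idn(hostname):
        red_flags += 1        # needs a homograph review
    # The source-code and trust-seal checks would slot in here the same way.

    verdict = "spoofed" if red_flags >= 2 else "legitimate"
    (BLACKLIST if verdict == "spoofed" else WHITELIST).add(url)  # grow the lists
    return verdict
```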

A similarity algorithm could be built to run through all these validations and report the result, making the user aware of facts that a normal non-technical person might not know. A high-level overview of this process is shown in the flowchart below:

Workflow for automatic classification of phishing URLs

As you can see in the figure, once all these validations are done, we categorize the URL as either a legitimate URL or a spoofed URL and add it to our database for future use. Here the machine-learning concept comes in: every URL gets added to one of our lists, legitimate or spoofed, which improves future results. If an incoming URL is already on one of the lists, we can return the result directly, and the similar-URL comparison also gets better with every iteration as the lists grow. We could use a DynamoDB table to save this information, with data such as the URL name, SSL certificate details, source code destination, and so on.
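
For example, with boto3 and an assumed table called classified_urls keyed on the URL, persisting one result might look like this sketch:

```python
# Sketch: store a classified URL in DynamoDB. Table name, key, and attribute
# names are assumptions that mirror the fields mentioned above.
import boto3  # assumption: boto3 is installed and AWS credentials are configured

table = boto3.resource("dynamodb").Table("classified_urls")  # assumed table

def save_classification(url: str, verdict: str, ssl_details: dict, source_code_location: str):
    table.put_item(
        Item={
            "url": url,                                       # partition key (assumption)
            "verdict": verdict,                               # "legitimate" or "spoofed"
            "ssl_certificate": ssl_details,                   # e.g. issuer, expiry
            "source_code_destination": source_code_location,  # e.g. an S3 key
        }
    )
```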

There are still some open questions to be answered, since this may serve as a solution from a theoretical perspective but more detail is needed to make it work in a practical scenario:
  1. What is the authoritative, legal source of data about which SSL certificate has been issued to which website?
  2. Which characteristics do we need to take into consideration when selecting similar URLs from our database (comparing only their domain name or hostname would be a very naive solution!)?

Any suggestions or thoughts? Let me know:
LinkedIn | Medium