URL Feature Engineering and Classification

Published in

Nerd For Tech

10 min read · May 21, 2021

In this article, we will develop a full-fledged GUI for URL classification, built with Streamlit and backed by a simple Deep Learning model, and convert the TensorFlow neural network into a lighter, portable model for microprocessors and IoT devices.

Cybersecurity is a term whose use has been steadily increasing ever since the evolution of the World Wide Web. It can be defined as the practice of securing a device connected to the Internet from malicious attacks. As the volume and sophistication of cyber attacks grow, companies and organizations, especially those tasked with safeguarding information relating to national security, health, or financial records, need to take steps to protect their sensitive business and personnel information. As early as March 2013, the nation’s top intelligence officials cautioned that cyber attacks and digital spying are the top threat to national security, eclipsing even terrorism.

Malicious websites are the prime culprits in the modern era of cyber threats. They are a front for transmitting viruses, trojans, and other malicious code that may infect the user's device, resulting in data theft and money laundering. These URLs are often propagated via e-mail links, pop-up ads, and embedded downloads.

To develop a Deep Learning neural network model, the initial step is to build a feature extraction pipeline that distinguishes URLs by numerous unique traits derived from the URL's string, domain, and page characteristics. N features will be extracted from each URL obtained for training, resulting in an M x N matrix, where M is the number of collected URLs.

About the Data

The data used for this analysis was collected from https://www.unb.ca/cic/datasets/url-2016.html.

Benign URLs: Over 35,300 benign URLs were collected from Alexa top websites. The domains were passed through a Heritrix web crawler to extract the URLs. Around half a million unique URLs were crawled initially and then filtered to remove duplicate and domain-only URLs. The extracted URLs were then checked through VirusTotal to keep only the benign ones.
Spam URLs: Around 12,000 spam URLs were collected from the publicly available WEBSPAM-UK2007 dataset.
Phishing URLs: Around 10,000 phishing URLs were taken from OpenPhish, which is a repository of active phishing sites.
Malware URLs: More than 11,500 URLs related to malware websites were obtained from DNS-BH, which is a project that maintains a list of malware sites.
Defacement URLs: More than 45,450 URLs belong to the Defacement URL category. These are Alexa-ranked trusted websites that host fraudulent or hidden URLs containing malicious web pages.

Each class of URLs is stored in a separate file, so the extraction is performed separately for each file and the results are combined for the final training.

The full research paper outlining the details of the dataset and its underlying principles is mentioned in the Reference Section.

Feature Extraction from URLs

A URL (an acronym for Uniform Resource Locator) is nothing more than the address of a given unique resource on the Web. In theory, each valid URL points to a unique resource. Such resources can be an HTML page, a CSS document, an image, etc. In practice, there are some exceptions, the most common being a URL pointing to a resource that no longer exists or that has moved. As the resource represented by the URL and the URL itself are handled by the web server, it is up to the owner of the web server to carefully manage that resource and its associated URL.

Image Credits: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL

Each of these anatomical features plays a crucial role in distinguishing URLs.

Numerous techniques for malicious/spam/phishing URL detection exist, including blacklisting and AI-aided techniques. Blacklisting is the process of maintaining a database of known malicious domains and comparing the hostname of a new URL to the hostnames in that database. The primary disadvantage of such a system is its inability to detect new and unseen malicious URLs, which are only added to the blacklist after they have been observed as malicious by a victim. AI-aided Machine/Deep Learning approaches offer a predictive alternative that generalizes across platforms and does not depend on prior knowledge of known signatures. Given a sample of malicious and benign URLs, ML techniques extract features of known good and bad URLs and generalize from them to identify new and unseen good or bad URLs.
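To make the blacklisting idea concrete, here is a minimal sketch using only the Python standard library. The blacklist entries and the function name is_blacklisted are made up for illustration; a real system would query a maintained database.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries, for illustration only.
BLACKLISTED_HOSTS = {"malicious.example", "phish.example"}


def is_blacklisted(url, blacklist=BLACKLISTED_HOSTS):
    """Return True if the URL's hostname appears in the blacklist."""
    hostname = urlparse(url).hostname  # lower-cased host, or None if absent
    return hostname is not None and hostname in blacklist


# A host that is already in the database is caught...
print(is_blacklisted("http://malicious.example/login.php"))
# ...but a host nobody has reported yet slips through.
print(is_blacklisted("http://new-threat.example/login.php"))
```

The second call shows the limitation described above: a lookup can only flag hosts that someone has already reported.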

The external libraries used for the extraction are whois, PyQuery, and interruptingcow.

Using the whois package, we can extract features such as Days Since Registration and Days Until Expiration. PyQuery can be used to extract features from the webpage itself, which include but are not limited to the number of links, number of titles, number of images, etc.

The feature-extraction code takes a single URL and returns a dictionary with feature names as keys and the corresponding values. This procedure is iterated over the different CSV files.
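As a rough illustration of the "one URL in, one dictionary out" shape, here is a minimal sketch that computes a few string-based features with the standard library only. The function name and the feature names are assumptions made for this example; they are not the feature set used in the article, and the WHOIS and page-based features are left out.

```python
from urllib.parse import urlparse


def extract_lexical_features(url):
    """Return a dict of simple string-based features for one URL."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Count "key=value" pairs in the query string, if there is one.
        "num_query_params": len(parsed.query.split("&")) if parsed.query else 0,
        "uses_https": parsed.scheme == "https",
    }


features = extract_lexical_features("https://www.example.com/a/b.html?x=1&y=2")
print(features)
```

Calling such a function on every row of a CSV file and collecting the dictionaries yields the M x N feature matrix described earlier.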

The final feature extracted data along with the class labels can be found here.

Extracting the features was extremely time-consuming, with a runtime of approximately 96 hours, owing to the size of the dataset.

Developing a Predictive Model

Based on the extracted features, we can develop a predictive model that classifies an unseen URL. Numerous Deep Learning approaches are available; for simplicity, a Sequential model with only Dense layers will be trained. You are welcome to experiment with other techniques. The dense layer is the regular, fully connected neural network layer, and it is the most commonly used one. It computes a weighted sum of its inputs, adds a bias, applies an activation function, and returns the result.
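To show what a single dense layer computes, here is a small sketch in plain Python. It is not how Keras implements the layer; the weights, biases, and inputs are made-up numbers, and the names dense_forward and relu are chosen for this example.

```python
def relu(values):
    """ReLU activation: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense_forward(inputs, weights, biases, activation=relu):
    """Compute activation(W.x + b) for one input vector.

    weights[j] holds the weights of output unit j.
    """
    pre_activation = [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]
    return activation(pre_activation)


x = [1.0, 2.0]                 # two input features
W = [[0.5, -1.0], [1.0, 1.0]]  # two output units
b = [0.0, 0.5]
print(dense_forward(x, W, b))  # [0.0, 3.5]
```

The first unit's weighted sum is negative (-1.5), so ReLU clips it to 0; the second unit passes its value of 3.5 through unchanged.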

What is a Neural Network?

It’s a technique for building a computer program that learns from data. It is based very loosely on how we think the human brain works. First, a collection of software “neurons” are created and connected together, allowing them to send messages to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.

Keras Sequential Model

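from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# input_dim is the number of extracted features (columns of X)
model = Sequential()
model.add(Dense(256, input_dim=input_dim, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
# Output layer: one unit per class, softmax gives class probabilities
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])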

There are numerous permutations and combinations for creating a Neural Network, which is later fine-tuned for better performance. This implementation is meant for educational purposes alone and is not fine-tuned for any metrics. Our model will be predicting 5 different labels — Benign, Malware, Phishing, Spam, and Defacement. A multi-class classification model will input the given data and predict the above-defined labels as outputs.
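The final layer uses softmax to turn five raw scores into class probabilities, and the predicted label is the class with the highest probability. The sketch below illustrates this in plain Python. The scores are made up, and the class order is an assumption (alphabetical, matching the target names used in the testing section).

```python
import math

# Assumed class order (alphabetical); the scores below are made up.
CLASS_NAMES = ["Benign", "Defacement", "Malware", "Phishing", "Spam"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 0.5, 0.1, 1.0, -1.0]
probs = softmax(scores)
predicted = CLASS_NAMES[probs.index(max(probs))]
print(predicted)  # Benign
```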

Preparing the Data

One of the best practices in Machine or Deep Learning is to normalize the data when each feature has a varied range. The objective of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
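As a quick illustration of what min-max normalization does to one feature column, here is a sketch in plain Python. It applies the usual formula (x - min) / (max - min); the function name and the sample values are made up for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        # to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


url_lengths = [20, 45, 120]
print(min_max_scale(url_lengths))  # [0.0, 0.25, 1.0]
```

The spacing between values is preserved, but every feature ends up on the same 0 to 1 scale.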

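import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

data = pd.read_csv("feature.csv")
# Convert boolean feature values to 1/0
data.replace(True, 1, inplace=True)
data.replace(False, 0, inplace=True)

# "File" holds the class label of each URL
y = data["File"]
data = data.drop(columns="File")

# Encode the string labels as integers
encoder = LabelEncoder()
encoder.fit(y)
Y = encoder.transform(y)

# Scale every feature to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
X = pd.DataFrame(scaler.fit_transform(data))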

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. Non-numerical values such as True and False are converted to 1 and 0. For the class labels, LabelEncoder converts the string labels into integers, assigned in sorted order of the label names: for example, Benign: 0, Defacement: 1, Malware: 2, Phishing: 3, and Spam: 4. These integers are then transformed into one-hot vectors of 0s and 1s, for which we can import to_categorical() from tensorflow.keras.utils.

Benign : 10000
Defacement : 01000
Malware : 00100
Phishing : 00010
Spam : 00001
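The two encoding steps can be sketched in plain Python. The helpers below only mimic the behaviour of LabelEncoder (which assigns integer codes in sorted order of the label strings) and to_categorical (which places a 1 at the index of the class code). They are illustrative stand-ins, not the library implementations.

```python
def label_encode(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


def one_hot(codes, num_classes):
    """Turn integer codes into vectors with a single 1 at the code's index."""
    return [[1 if i == code else 0 for i in range(num_classes)] for code in codes]


labels = ["Spam", "Benign", "Phishing", "Malware", "Defacement", "Benign"]
classes, codes = label_encode(labels)
print(classes)  # ['Benign', 'Defacement', 'Malware', 'Phishing', 'Spam']
print(codes)    # [4, 0, 3, 2, 1, 0]
print(one_hot(codes, len(classes))[0])  # [0, 0, 0, 0, 1]
```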

We can now train the model on 80% of the original sample and hold out the remaining 20% to evaluate the model’s performance on new data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Training the Model

With the previously defined Sequential dense network, we will now fit (train) the model on X_train and y_train.

from tensorflow.keras.utils import to_categorical
model.fit(X_train, to_categorical(y_train), epochs=50, validation_split=0.3, batch_size=128)

This is relatively light training that can be performed on either a CPU or a GPU. All tests were run on a 2.4 GHz quad-core Intel Core i5 with 8 GB of 2133 MHz LPDDR3 memory running macOS 11.3.1; training took approximately 3 minutes.

Testing the Model

There are numerous statistical measures for testing the performance of any given Machine / Deep Learning model. A few notable ones are:

Accuracy

Accuracy measures the fraction of the model's predictions that are correct. The higher the accuracy, the better. However, it is not the only metric that matters when you estimate performance.

Precision

The precision metric measures how often the model is correct when it predicts the positive class. For example, of the patients the model diagnoses with cancer, how many really have cancer.

Recall

This metric is the number of correctly predicted positive results divided by the number of results that should have been predicted as positive. In other words, it is the percentage of all relevant results that your algorithm classifies correctly.
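The three metrics can be computed by hand for a single class, as in the sketch below. The labels and predictions are made up, and "Spam" is treated as the positive class. For a multi-class problem, the same computation is repeated with each class in turn as the positive one.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


y_true = ["Spam", "Spam", "Benign", "Benign", "Spam", "Benign"]
y_pred = ["Spam", "Benign", "Benign", "Spam", "Spam", "Benign"]
accuracy, precision, recall = binary_metrics(y_true, y_pred, positive="Spam")
print(accuracy, precision, recall)
```

Here 4 of 6 predictions are correct (accuracy 2/3), 2 of the 3 URLs predicted as Spam really are Spam (precision 2/3), and 2 of the 3 actual Spam URLs were found (recall 2/3).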

import numpy as np
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
predicted = np.argmax(y_pred, axis=1)
target_names = ['Benign', 'Defacement', 'Malware', 'Phishing', 'Spam']
print(classification_report(y_test, predicted, target_names=target_names))

The plot below shows the mean normalized value of each feature for each class, illustrating how the classes differ with respect to each feature.

TensorFlow Lite Model

TensorFlow Lite is an open-source deep learning framework for on-device inference and consists of tools that enable on-device machine learning by helping developers run their models on mobile, embedded, and IoT devices.

Here are the pros of conversion to TF-Lite:

Light-weight: Edge devices have limited resources in terms of storage and computation capacity. Deep learning models are resource-intensive, so the models we deploy on edge devices should be lightweight with smaller binary sizes.
Low Latency: Deep Learning models at the Edge should make faster inferences irrespective of network connectivity. As the inferences are made on the Edge device, a round trip from the device to the server will be eliminated, making inferences faster.

Source: https://www.tensorflow.org/lite/convert/index

The process of conversion is easier than you think:

Load the previously trained model
Choose the optimization method
Voila, a Lite model (.tflite) is born

import pathlib
import tensorflow as tf

# Load the previously trained model
model = tf.keras.models.load_model('Model_v2.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Optimisation constraints
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Conversion and creation of the .tflite file
tflite_quant_model = converter.convert()
tflite_model_file = pathlib.Path('tflite_quant_model.tflite')
tflite_model_file.write_bytes(tflite_quant_model)

This conversion does produce a lighter model, but nothing comes for free: it may cost some of the model's performance.

GUI for URL Classification

We humans get information about the world around us through our senses, and vision is one of the most important ways we interact with the things around us. Humans learn to interact with different objects throughout their lives: some take a few seconds to learn (like a doorbell), while others take a bit more time (like driving a car).

With computers, however, there was a problem: people were forced to think abstractly and to deal with a large number of commands that they could not even remember.

Hence a Graphical User Interface (GUI) is the preferred way to make the interaction between a human and a computer as easy as possible.

Streamlit is an open-source app framework for Machine Learning and Data Science teams. Create beautiful data apps in hours, not weeks. All in pure Python.

Streamlit provides numerous APIs that make it possible to build and deploy apps without knowledge of front-end technologies such as CSS and JavaScript.

The initial step is to develop a Python script that accepts a URL string and predicts its class.

The pseudo-code for the same would be:

Input URL String from User
Load the Keras Model, Scaler, and Label Encoder
Apply Feature Extraction and Normalisation on the given URL
Predict the class

Command Line Argument based input

The above-mentioned code will take the URL as a Command-Line Argument and display the Class of URL as shown below.

$ python predict_args.py -i http://astore.amazon.co.uk/allezvinsfrenchr/detail/1904010202/026-8324244-9330038
Output : Spam

This will act as a skeleton for our GUI implementation via Streamlit.
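A minimal sketch of such a command-line entry point is shown here, using only argparse from the standard library. The -i flag comes from the command above; the long option name --input, the function names, and the stand-in classifier are assumptions for illustration. In the real script, the classifier would load the Keras model, scaler, and label encoder, then extract features and predict.

```python
import argparse


def build_parser():
    """Create the command-line parser with a required -i URL argument."""
    parser = argparse.ArgumentParser(description="Classify a URL.")
    parser.add_argument("-i", "--input", required=True, help="URL string to classify")
    return parser


def main(argv, classify):
    """Parse the arguments and format the prediction returned by `classify`."""
    args = build_parser().parse_args(argv)
    return f"Output : {classify(args.input)}"


# Stand-in classifier so the sketch runs without a trained model.
print(main(["-i", "http://example.com/login"], classify=lambda url: "<predicted class>"))
```

In an actual script, main would be called with sys.argv[1:] and a function that wraps the feature extraction, scaling, and model prediction steps.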

A few Streamlit APIs we will be using:

streamlit.text("Hi-Output")
streamlit.title("Title") / streamlit.header("Header") / streamlit.subheader("Subheader")
streamlit.image(Image.open(loc))

Streamlit's output GUI consists of two sections: the sidebar and the main screen. The 'streamlit.text()' API prints the output on the main screen, whereas 'streamlit.sidebar.text()' displays the same in the sidebar.

Simple Streamlit GUI implementation

The above script can be run locally with the command streamlit run <name>.py; the picture below shows the resulting GUI.

The final GUI code is an extension of the above simple implementation coupled with a few if-else conditions to display more information on the type of URL.

Link to the GUI : https://share.streamlit.io/rohith-2/url_classification_dl/main/GUI/gui.py
All of the code above can be found in my GitHub repository.

Co-Authors: Anirudh Bhaskar and Srikanth Ankam

References:

Mohammad Saiful Islam Mamun, Mohammad Ahmad Rathore, Arash Habibi Lashkari, Natalia Stakhanova, and Ali A. Ghorbani, “Detecting Malicious URLs Using Lexical Analysis”, Network and System Security, Springer International Publishing, pp. 467–482, 2016.