What’s in a domain?

John Hawkins
Towards Data Engineering
4 min read · Jun 6, 2024


The Uniform Resource Locator (URL) is central to the digital world. It is a single piece of data: a variable-length string (as shown above) that contains a mixture of natural language, machine-specific protocols, and sometimes application logic. Many of the structural components illustrated above are optional, yet the overall structure and arrangement of these elements is pre-determined.

We use URLs whenever we read the news, conduct business, or interact with other people on social applications. Below the surface, many systems use URLs internally to communicate with other services. These factors make the URL a fundamental data point for understanding anything that happens on the internet.

In many data science projects the URL is the core data point. This includes use cases like malicious website detection and internet content classification, where the URL defines the target of the prediction. In many instances the URL is also the source of most of the features. This means that data engineering for internet-based problems can benefit from a deeper understanding of how to extract signals from URLs.

At the core of any URL is the domain. The domain may contain natural language information about the intended purpose of a URL, but it also carries signals about its legitimacy and likely country of origin. The domain is also a potential source of additional information through requests to a Domain Name System (DNS) server, which can reveal both its history and the structure of the underlying network topology and resources. Beyond the domain, the URL consists of a sequence of words, categories, identifiers, dates or domain-specific abbreviations. This sequence can indicate the psychological elements of how information is categorised, or the internal logic of an application that generates or presents content dynamically.
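The structural components described above can be pulled apart with Python's standard library alone. The sketch below uses `urllib.parse` on a made-up example URL to show where the protocol, host (subdomain, registered domain and TLD), path, and query string live:

```python
# Minimal sketch using only the standard library; the example URL is hypothetical.
from urllib.parse import urlparse

url = "https://news.example.co.uk/sport/2024/06/06/match-report.html?ref=home"
parts = urlparse(url)

print(parts.scheme)  # protocol: "https"
print(parts.netloc)  # host: subdomain + registered domain + TLD
print(parts.path)    # path between host and file
print(parts.query)   # query string parameters
```

Note that `urlparse` does not split the host into subdomain, registered domain and TLD for you; that requires a public-suffix-aware library or lookup table, which is part of what makes dedicated feature extraction useful.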

The complexity of URLs offers enormous opportunity for data engineering. However, extracting many of the specific pieces of information mentioned above requires knowledge about how the URL is structured for a specific application. This means you need to investigate the practices of URL construction that are common in a particular use case, and then exploit that understanding in your data processing.
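As a toy illustration of this kind of application-specific processing, the sketch below classifies each path segment of a URL as date-like, a numeric identifier, or a natural-language token. The rules are invented for illustration; a real project would tune them to the URL conventions of the sites being analysed:

```python
# Hypothetical sketch: label each path segment of a URL.
# The classification rules are illustrative only.
import re
from urllib.parse import urlparse

def classify_segments(url):
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    labels = []
    for seg in segments:
        if re.fullmatch(r"\d{4}|\d{2}", seg):
            labels.append((seg, "date-like"))   # e.g. year or month
        elif re.fullmatch(r"\d+", seg):
            labels.append((seg, "identifier"))  # other numeric ID
        else:
            labels.append((seg, "word"))        # natural-language token
    return labels

classify_segments("https://example.com/blog/2024/06/post-12345")
# [('blog', 'word'), ('2024', 'date-like'), ('06', 'date-like'), ('post-12345', 'word')]
```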

To facilitate rapid experimentation, we have developed a Python library for generating many different kinds of URL-based features. The snippet below shows how to install the library and its CLI, which can process a CSV and add URL-derived features as a new set of columns.

> pip install url2features

The features can be generated in blocks that correspond to structural components of the URL, so that you need only process those that have proven useful for your project. The full list of options is shown by running the command without any arguments (as shown below):

> url2features
ERROR: MISSING ARGUMENTS
USAGE
url2features [ARGS] <PATH TO DATASET>
<PATH TO DATASET> - Supported file types: csv, tsv, xls, xlsx, odf
[ARGS] In most cases these are switches that turn on the feature type
-columns=<COMMA SEPARATED LIST>. REQUIRED
-simple Default: False. Features derived from the URL string: length, depth, components
-host Default: False. Features about the host including subdomain and registration (requires internet).
-tld Default: False. Features about the top level domain (TLD)
-protocol Default: False. Features from the URL protocol.
-path Default: False. Features derived from the path between host and file
-file Default: False. Features derived from the final file type
-params Default: False. Features derived from any query string parameters in the URL
-dns Default: False. Features derived from the DNS records (requires internet).
-np Deactivate use of column name prefix. Only works for a single column.

So if you want to process a file called `input.csv` and just add features for the host and top level domain, you would use the following syntax:

> url2features -columns=url -host -tld input.csv > output.csv

The columns parameter is required so the program knows which column contains your URLs. The switches tell it which features to add, and finally the data is provided as a path to the CSV. By default the program writes the new file to the system's standard output stream, so you can either chain it with other tools or redirect it to a new file, as shown above.

We have used this library to conduct experiments on the impact of different regions within the URL on a variety of URL classification problems. By grouping features extracted from different URL regions we can measure the URL Structure Feature Impact, defined as the total feature impact of all features from that region, as shown in the figure below.

The code to generate this plot is available in the experiments section of the url2features source code. It involves running multiple machine learning experiments and assigning each feature to a named region of the URL. We then look at the total weight of feature importance that comes from that URL region for the specific problem. We can see that each class of problems extracts signals from different parts of the URL, but the domain retains central importance in the majority of instances.
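The grouping step can be sketched in a few lines: sum per-feature importances into their assigned URL region. The feature names, region assignments, and importance values below are made up for illustration; in practice they would come from a trained model and the region each feature was extracted from:

```python
# Illustrative sketch of URL Structure Feature Impact: total the importance
# weights of all features belonging to each URL region. All values are invented.
def region_impact(importances, feature_regions):
    totals = {}
    for feat, weight in importances.items():
        region = feature_regions.get(feat, "other")
        totals[region] = totals.get(region, 0.0) + weight
    return totals

importances = {"host_length": 0.20, "tld_rank": 0.15,
               "path_depth": 0.40, "params_count": 0.25}
regions = {"host_length": "host", "tld_rank": "tld",
           "path_depth": "path", "params_count": "params"}
region_impact(importances, regions)
# {'host': 0.2, 'tld': 0.15, 'path': 0.4, 'params': 0.25}
```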

For more information you can check out the associated conference paper for the url2features library. If you are interested in contributing to developing this library further, please reach out.
