Predicting domain name category for multiple languages
Accurate classification of websites is vital at Jamf, since it supports several of our customer workflows. Protection against potentially harmful content, management of traffic routing, or insight-driven device management are just some examples of applications that rely on domain classification. Having a global customer base poses an intriguing challenge — we need to classify website domains in (almost) any language.
One of the ways in which we classify domains at Jamf is to use Data Science. In this blog post, we zoom in on the problem of how to extract features from domain names when they’re hosted in languages other than English, so that our Machine Learning (ML) pipelines can make correct classification predictions.
So what’s the problem exactly?
Let’s give a simple example to describe the problem at hand in more detail.
Imagine a fictional website with the domain name bestbetworld100.com. The goal is to analyse this domain name and classify the site (assign it a category) as Gambling. For an English-speaking reader, the task is most likely trivial. You scan the domain name, quickly notice the terms best and bet, and make an educated guess about the gambling nature of such a site.
But what happens when we ask you to categorise a website called ベットベスト賭.com? If you are at least vaguely familiar with the Japanese script, you can again identify the words bet and best in the domain name and come to the same conclusion as before. On the other hand, if these characters are completely foreign to you, it’s a bit of a challenge.
Obviously, due to the multicultural nature of the Internet, the task of domain classification is not limited to only two languages. In fact, at Jamf the domain names visited by our customers come from a variety of different languages, as showcased by the chart below. Overall, almost half of the traffic we observe is from non-English speaking countries.
So the question is: how do we build a domain classification ML model that is capable of handling not just two, but any number of the languages and scripts we require?
Towards becoming language agnostic
We attempt to reduce the complexity of multi-lingual classification by splitting the process into two stages:
- Unify all domain names under one language — English
- Train an ML classifier capable of processing the English inputs to predict category labels
English is the most common language used across Internet websites and is therefore our unification language of choice. As a result, we gain a single unified English-based model instead of numerous classifiers (one per language), perhaps combined in an ensemble.
While this decision keeps the final classifier relatively simple, it shifts the complexity towards the first stage, responsible for language consolidation. Translating a domain name from any language to English essentially requires the following steps:
- Detect the language of a domain name
- Split (segment) the domain into meaningful terms
- Translate the extracted terms into English
A detailed description of each stage follows. Ideally, after performing the steps above for any selected domain, we should obtain output similar to the example below:
Detecting & translating the language
Once the domain name is in a proper format, we can attempt to identify the language it is written in. To infer (and later translate) the language, we use a combination of two inputs:
- Information about a country from the Top Level Domain (TLD); for example, the TLD .fr suggests the French language
- Output of a pre-trained language detection/translation model offered by Google
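The way these two inputs might be combined can be sketched as follows. This is a minimal illustration, not our production logic: the TLD_LANGUAGE_HINTS table, the function names, and the 0.8 confidence threshold are all assumptions made for the example, and the model's guess is passed in as plain arguments rather than fetched from any real API.

```python
from typing import Optional

# Hypothetical, deliberately tiny mapping from country-code TLDs to
# language codes. A real table would cover far more TLDs.
TLD_LANGUAGE_HINTS = {"fr": "fr", "de": "de", "jp": "ja", "es": "es", "it": "it"}


def tld_language_hint(domain: str) -> Optional[str]:
    """Return a language hint based on the TLD, or None if there is none."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_LANGUAGE_HINTS.get(tld)


def pick_language(
    domain: str,
    model_guess: Optional[str],
    model_confidence: float,
    threshold: float = 0.8,  # assumed cut-off, chosen only for illustration
) -> Optional[str]:
    """Prefer a confident model guess; otherwise fall back to the TLD hint."""
    if model_guess is not None and model_confidence >= threshold:
        return model_guess
    # Low-confidence model output: use the TLD hint if one exists,
    # else keep whatever the model suggested.
    return tld_language_hint(domain) or model_guess
```

With these assumptions, pick_language("exemple.fr", "en", 0.4) falls back to the TLD hint and returns "fr", while a generic TLD such as .com gives no hint and leaves the decision to the model.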
Segmenting the domain name
After identifying the language, we need to segment the domain name. This will allow us to extract meaningful terms (words) from an otherwise non-structured sequence of characters that a domain name often is.
We achieve the domain-to-words split by creating a so-called Bag-of-Words representation for the previously identified language. Simply put, this representation is a model that allows us to determine the most likely split of words for a selected domain and language.
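One common way to find the "most likely split" is dynamic programming over word frequencies. The sketch below shows that general technique and is not necessarily the exact model we use: the WORD_COUNTS table holds made-up toy counts, and the penalty for unknown strings is an arbitrary assumption.

```python
import math

# Toy word counts for one language (hypothetical values, illustration only).
WORD_COUNTS = {"best": 50, "bet": 40, "world": 60, "be": 80, "or": 70, "100": 5}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in WORD_COUNTS)


def word_log_prob(word: str) -> float:
    """Log-probability of a word; unknown strings get a length-based penalty."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text: str) -> list:
    """Split text into the sequence of words with the highest total score."""
    n = len(text)
    # best[i] = (score of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the stored back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("bestbetworld100"))  # ['best', 'bet', 'world', '100']
```

Shorter candidates such as "be" are in the table too, but splitting "best" into "be" plus the unknown "st" scores far lower than the single known word, so the longer reading wins.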
Putting it all together
Putting the whole pipeline to work, we can extract meaningful words from domain names in almost any language in our dataset. We later use the terms obtained in this way to learn the characteristics of website domain names in certain categories, such as the Gambling category mentioned in the introductory example. The learned features are one of the inputs used during the domain classification process.
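As a rough illustration of the last step, the sketch below maps already-segmented terms to English and counts how often each term appears per category. Everything here is an assumption made for the example: TOY_TRANSLATIONS is a three-entry dictionary standing in for a real translation model, and plain term counts stand in for whatever features the classifier actually consumes.

```python
from collections import Counter

# Hypothetical stand-in for a translation model (toy dictionary only).
TOY_TRANSLATIONS = {"ベット": "bet", "ベスト": "best", "賭": "wager"}


def to_english_terms(segments):
    """Translate segmented terms to English; unknown terms pass through."""
    return [TOY_TRANSLATIONS.get(term, term).lower() for term in segments]


def term_counts_by_category(labelled_terms):
    """Count English terms per category from (terms, category) pairs."""
    counts = {}
    for terms, category in labelled_terms:
        counts.setdefault(category, Counter()).update(terms)
    return counts


# Terms from the two introductory examples end up in the same English space.
training = [
    (to_english_terms(["best", "bet", "world", "100"]), "Gambling"),
    (to_english_terms(["ベット", "ベスト", "賭"]), "Gambling"),
]
print(term_counts_by_category(training)["Gambling"].most_common(2))
```

Because both domains are reduced to English terms first, "bet" and "best" each count twice for Gambling, regardless of the script the domain was originally written in.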
Closing remarks
Dealing with multi-lingual classification is a challenging problem. Our approach, at least for domain names, is to unite the languages of the inputs and build a single classifier on top of extracted English-based language features.
As a result, the final classifier can be less convoluted. Moreover, the model can also use a more concise feature set, only derived from one language. Inevitably, the data pre-processing required to achieve such feature uniformity is more intricate, but well worth it for our use case due to the reusability of obtained features for different applications.
If you want to read more about multilingual modelling, see:
- [1] Short introductory InfoQ news article about large scale multilingual models
- [2] The BLOOM research project, aimed mostly at multi-lingual text generation
- [3] NY Times piece The Great A.I. Awakening detailing how the usage of A.I. changed Google Translate