Predicting domain name category for multiple languages

Published in Jamf Engineering, Dec 9, 2022 (4 min read)

Accurate classification of websites is vital at Jamf, since it supports several of our customer workflows. Protection against potentially harmful content, management of traffic routing, and insight-driven device management are just some examples of applications that rely on domain classification. Having a global customer base poses an intriguing challenge: we need to classify website domains in (almost) any language.

One of the ways in which we classify domains at Jamf is to use Data Science. In this blog post, we zoom in on the problem of how to extract features from domain names written in languages other than English, so that our Machine Learning (ML) pipelines can make correct classification predictions.

So what’s the problem exactly?

Let’s give a simple example to describe the problem at hand in more detail.

Imagine a fictional website with the domain name bestbetworld100.com. The goal is to analyse this domain name and assign the site a category, in this case Gambling. For an English-speaking reader, the task is most likely trivial. You scan the domain name, quickly notice the terms best and bet, and make an educated guess about the gambling nature of such a site.

But what happens when we ask you to categorise a website called ベットベスト賭.com? If you are at least vaguely familiar with the Japanese script, you can again identify the words bet and best in the domain name and come to the same conclusion as before. On the other hand, if these characters are completely foreign to you, it’s a bit of a challenge.

Obviously, due to the multicultural nature of the Internet, the task of domain classification is not limited to only two languages. In fact, at Jamf the domain names visited by our customers come from a variety of different languages, as showcased by the chart below. Overall, almost half of the traffic we observe is from non-English speaking countries.

Percentage of unique domain names per language for weekly online traffic from the 20 most frequent countries

So the question is: how do we build a domain classification ML model that is capable of handling not just two, but any number of different languages and scripts we require?

Towards becoming language agnostic

We attempt to reduce the complexity of multi-lingual classification by splitting the process into two stages:

Unify all domain names under one language — English
Train an ML classifier capable of processing the English inputs to predict category labels

English is the most common language used across Internet websites and is therefore our unification language of choice. As a result, we gain a single unified English-based model instead of numerous classifiers (one per language), perhaps combined in an ensemble.

While this decision keeps the final classifier relatively simple, it shifts the complexity towards the first stage, responsible for language consolidation. Translating a domain name from any language to English essentially requires the following steps:

Detect the language of a domain name
Split (segment) the domain into meaningful terms
Translate the extracted terms into English

A detailed description of each stage follows. Ideally, after performing the actions above for any selected domain, we should obtain output similar to the example below:

Example pre-processing of a non-English domain name
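The three steps listed above can be chained roughly as in the following Python sketch. It is an illustration under stated assumptions, not our production code: `ProcessedDomain`, `preprocess`, and the toy stand-ins are hypothetical names, and the French example domain with its tiny dictionary is made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessedDomain:
    """Hypothetical container for the output of the pre-processing stage."""
    domain: str               # original domain name
    language: str             # detected language code, e.g. "fr"
    terms: List[str]          # terms in the original language
    english_terms: List[str]  # the same terms translated into English


def preprocess(
    domain: str,
    detect: Callable[[str], str],
    segment: Callable[[str, str], List[str]],
    translate: Callable[[str, str], str],
) -> ProcessedDomain:
    """Chain the three steps: detect the language, segment, translate."""
    language = detect(domain)
    name = domain.rsplit(".", 1)[0]  # naive: drop only the last label (the TLD)
    terms = segment(name, language)
    english = [translate(term, language) for term in terms]
    return ProcessedDomain(domain, language, terms, english)


# Toy stand-ins so the sketch runs; real components would replace them.
TOY_SEGMENTS = {("meilleurpari", "fr"): ["meilleur", "pari"]}
TOY_DICTIONARY = {("meilleur", "fr"): "best", ("pari", "fr"): "bet"}


def toy_detect(domain: str) -> str:
    return "fr" if domain.endswith(".fr") else "en"


def toy_segment(name: str, language: str) -> List[str]:
    return TOY_SEGMENTS.get((name, language), [name])


def toy_translate(term: str, language: str) -> str:
    return TOY_DICTIONARY.get((term, language), term)


result = preprocess("meilleurpari.fr", toy_detect, toy_segment, toy_translate)
print(result.english_terms)
```

Passing the three steps in as functions keeps the skeleton independent of any particular detection, segmentation, or translation component.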

Detecting & translating the language

Once the domain name is in a proper format, we can attempt to identify the language it is written in. To infer (and later translate) the language, we use a combination of two inputs:

Information about a country from the Top Level Domain (TLD) — for example, TLD .fr suggests the French language
Output of a pre-trained language detection/translation model offered by Google
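One way the two inputs could be combined is sketched below in Python. The rule used here (trust a confident model prediction, otherwise fall back to the TLD hint) is an assumption made for illustration, as are the names `TLD_LANGUAGE_HINTS`, `guess_language`, and the `min_confidence` threshold. The `toy_model` function is only a stand-in for a real pre-trained detector and does not represent any actual Google API.

```python
import unicodedata
from typing import Callable, Optional, Tuple

# Hypothetical, deliberately tiny TLD -> language-code table.
TLD_LANGUAGE_HINTS = {"fr": "fr", "de": "de", "jp": "ja", "es": "es", "it": "it"}


def to_unicode(domain: str) -> str:
    """Decode a punycode ("xn--") domain to its Unicode form, if possible."""
    try:
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        return domain  # already Unicode, or not valid IDNA


def tld_hint(domain: str) -> Optional[str]:
    """Return the language suggested by the TLD, or None if there is no hint."""
    tld = domain.rsplit(".", 1)[-1].lower()
    return TLD_LANGUAGE_HINTS.get(tld)


def guess_language(
    domain: str,
    model: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.8,
) -> str:
    """Prefer a confident model prediction, else fall back to the TLD hint."""
    name = to_unicode(domain)
    language, confidence = model(name)
    if confidence >= min_confidence:
        return language
    return tld_hint(name) or language


def toy_model(text: str) -> Tuple[str, float]:
    """Stand-in detector: confident about Japanese kana, unsure otherwise."""
    for ch in text:
        if unicodedata.name(ch, "").startswith(("HIRAGANA", "KATAKANA")):
            return "ja", 0.95
    return "en", 0.3


print(guess_language("ベットベスト賭.com", toy_model))
print(guess_language("meilleurpari.fr", toy_model))
```

In this sketch the TLD only breaks ties when the model is unsure, since a country-code TLD suggests a language but does not guarantee it.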

Segmenting the domain name

After identifying the language, we need to segment the domain name. This allows us to extract meaningful terms (words) from what is often an otherwise unstructured sequence of characters.

We achieve the domain-to-words split by creating a so-called Bag-of-Words representation of the previously identified language. Simply put, this representation is a model that allows us to determine the most likely split of words for a selected domain and language.
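To make "the most likely split" concrete, here is a minimal Python sketch of one common way to do it: score every candidate split with per-word frequencies and keep the best one using dynamic programming. This is an assumed reading of the approach, not a description of our exact model. The word counts in `WORD_COUNTS` are invented, and the length penalty for unknown strings is an arbitrary choice.

```python
import math
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical word counts; a real vocabulary would come from a large corpus
# in the detected language.
WORD_COUNTS: Dict[str, int] = {
    "best": 500, "bet": 300, "world": 400, "be": 800, "or": 700, "st": 5,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(map(len, WORD_COUNTS))


def word_log_prob(word: str) -> float:
    """Log-probability of a word; unknown strings are penalised by length."""
    if word in WORD_COUNTS:
        return math.log(WORD_COUNTS[word] / TOTAL)
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def segment(text: str) -> List[str]:
    """Return the most likely split of `text` under a unigram word model."""

    @lru_cache(maxsize=None)
    def best_split(start: int) -> Tuple[float, Tuple[str, ...]]:
        # Best (score, words) for the suffix of `text` beginning at `start`.
        if start == len(text):
            return 0.0, ()
        candidates = []
        last = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            rest_score, rest_words = best_split(end)
            candidates.append(
                (word_log_prob(word) + rest_score, (word,) + rest_words)
            )
        return max(candidates)

    return list(best_split(0)[1])


print(segment("bestbetworld100"))  # ['best', 'bet', 'world', '100']
```

Because the function works on characters rather than on whitespace, the same idea applies to scripts without word separators, provided a vocabulary for that language is available.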

Putting it all together

Putting the whole pipeline to work, we can extract meaningful words from domain names in almost any language in our dataset. We later use the terms obtained in this way to acquire characteristics of website domain names in certain categories, such as Gambling mentioned in the introductory example. The learned features are one of the inputs used during the domain classification process.
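As a rough illustration of how extracted terms can characterise a category, the Python sketch below counts terms per category and scores a new set of terms with a small Naive Bayes model. Naive Bayes is swapped in here purely as a simple example and is not a description of our classifier. The `TRAINING` examples, the News category, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical labelled examples: (English terms, category).
TRAINING: List[Tuple[List[str], str]] = [
    (["best", "bet", "world"], "Gambling"),
    (["casino", "bet", "win"], "Gambling"),
    (["daily", "news", "world"], "News"),
    (["breaking", "news"], "News"),
]


def fit(examples: List[Tuple[List[str], str]]):
    """Count how often each term appears in each category."""
    term_counts: Dict[str, Counter] = defaultdict(Counter)
    category_counts: Counter = Counter()
    for terms, category in examples:
        term_counts[category].update(terms)
        category_counts[category] += 1
    return term_counts, category_counts


def predict(terms: List[str], term_counts, category_counts) -> str:
    """Return the category with the highest Naive Bayes log-score."""
    vocabulary = {t for counts in term_counts.values() for t in counts}
    total_docs = sum(category_counts.values())
    scores = {}
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(category_counts[category] / total_docs)
        for term in terms:
            # Add-one smoothing keeps unseen terms from zeroing the score.
            score += math.log(
                (counts[term] + 1) / (total_terms + len(vocabulary))
            )
        scores[category] = score
    return max(scores, key=scores.get)


term_counts, category_counts = fit(TRAINING)
print(predict(["bet", "best"], term_counts, category_counts))
```

Since all terms are already in English at this point, a single vocabulary serves domains from every source language.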

Closing remarks

Dealing with multi-lingual classification is a challenging problem. Our approach, at least for domain names, is to unify the languages of the inputs and build a single classifier on top of the extracted English-based features.

As a result, the final classifier can be less convoluted. Moreover, the model can use a more concise feature set, derived from only one language. Inevitably, the data pre-processing required to achieve such feature uniformity is more intricate, but well worth it for our use case due to the reusability of the obtained features for different applications.

If you want to read more about multilingual modelling, see:

[1] Short introductory InfoQ news article about large scale multilingual models
[2] The BLOOM research project, aimed mostly at multi-lingual text generation
[3] NY Times piece The Great A.I. Awakening detailing how the usage of A.I. changed Google Translate

Written by Cvincekova Martina