Website categorization

8 min read · Mar 2, 2022

In this article, we introduce our website categorization tool, which classifies domains and URLs with high accuracy into 703 distinct categories of the widely used IAB taxonomy. It also supports additional taxonomies, such as Shopify, Google Shopping and eBay.

Our classifiers are used successfully by unicorns, multinational companies, startups, analytics platforms and individuals for AdTech, SaaS, web content filtering, cybersecurity, ecommerce and other needs.

You can try out our tool and API for IAB categorization for free at

https://www.websitecategorizationapi.com

Here is an example IAB classification of www.apple.com. Note how the classifier returns several relevant categories:

In addition to our tool and API service, we also provide an offline IAB categorization database of the 30 million most popular domains, already categorized using the IAB taxonomy.

This database covers more than 99.5% of the active internet and includes almost all domains from the Google Chrome UX Report (CrUX).

Our categorization database is used by AdTech companies, for niche research, content filtering and many other purposes.

Check it out here: https://www.websitecategorizationapi.com/url_database.php.

Explainability of machine learning results

One of the tool's most useful features is the explainability of the results produced by the machine learning model used for categorization.

Let us take the website www.allure.com as an example. The classifier assigns it to /Style & Beauty/Beauty (the top category is thus “Style & Beauty”, while the subcategory is “Beauty”).

In addition, our service produces an “explanation” by colouring the words that contributed most to the resulting classification, as can be seen here:

The words “beauty”, “makeup”, “care”, “hair” and “skin” are those on which the classifier based its decision to assign www.allure.com to the category /Style & Beauty/Beauty.

Here is another example — www.cnn.com:

We can see that the presence of the words “CNN”, “news”, “weather”, “international”, “politics” and “world” all contributed to the classification of the website as “News and Politics”.
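One simple way to obtain word-level explanations of this kind, assuming a linear bag-of-words classifier, is to score each word by its count in the text multiplied by the model weight for the predicted category. The sketch below is only an illustration of that idea and not necessarily the method behind the tool; the weights in `HYPOTHETICAL_WEIGHTS` are made up.

```python
import re
from collections import Counter

# Hypothetical per-word weights of a linear model for one category
# ("/Style & Beauty/Beauty"). Real weights would come from a trained model.
HYPOTHETICAL_WEIGHTS = {
    "beauty": 2.1,
    "makeup": 1.8,
    "skin": 1.5,
    "hair": 1.2,
    "care": 0.5,
    "news": -1.0,
}


def word_contributions(text, weights):
    """Return (word, contribution) pairs, most influential first.

    For a linear bag-of-words model, the contribution of a word to the
    category score is its count in the text times its weight.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    contributions = {
        word: count * weights[word]
        for word, count in counts.items()
        if word in weights
    }
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


sample = "Makeup tips, skin care and hair care: the latest beauty news"
top_words = word_contributions(sample, HYPOTHETICAL_WEIGHTS)
for word, score in top_words:
    print(f"{word:8s} {score:+.1f}")
```

Words with the largest positive scores are the ones a tool could highlight; words with negative scores speak against the category.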

You can visit the website categorization tool and try out your own texts and websites.

In this article, we provide detailed information on how to implement website categorization and on its typical use cases.

Use cases — cybersecurity, real-time bidding and others

Website categorization has many use cases across a wide range of fields. One important application is cybersecurity, where websites are classified as potential spam, phishing or other types of “problematic” websites that should not be visited by, for example, workers or clients on our networks.

Another important use case is marketing. If we want to place our ads on publishers' websites, we want them on webpages that are in the same category or vertical as the products or services we advertise. To do this efficiently, we need potential partner websites to be properly categorized.

Website categorization thus plays an important role in real-time bidding (RTB), the process in which a publisher makes ad inventory available to the “eligible” advertiser that posts the highest bid. If an advertiser is relevant, it can bid via a Demand Side Platform (DSP) and send its advertising content.

Taxonomies or categories used for web site classification

In the context of ads and marketing, websites are most often categorized using the taxonomy of the Interactive Advertising Bureau (IAB), which was developed with marketing and ads in mind. Note that the IAB regularly revises its taxonomies, so you should select the latest version when employing it.

Here is an example of a few categories of websites from IAB taxonomy:

If your website is focused on ecommerce, a different, product-oriented taxonomy may be more appropriate for product classification. The best-known ones in this segment are those from Google:

https://www.google.com/basepages/producttype/taxonomy.en-US.txt

and Facebook:

https://developers.facebook.com/docs/marketing-api/catalog/guides/product-categories/

This does not exhaust all possibilities, though. Many online stores, especially e-commerce giants like Walmart, Target and Rakuten, opt for their own custom taxonomies tailored to their needs.

Benefits of URL categorization — ecommerce use case

Proper ecommerce categorization can have many benefits for online stores. In terms of user experience, it allows users to more quickly find desired items through better search and filtering options.

When grouping products by category, online stores can generate more subpages for indexing in search engines, which can lead to more visits from them.

As rankings are improved by topical content keywords, adding categories and their keywords can also give better signals to search engine ranking algorithms. In this context, tagging, that is, adding one or more labels to products, can also be beneficial, as it adds more than one relevant descriptor to your subpage.

How to approach automated website categorization or how to categorize your website content

Automated website categorization is usually done using a supervised machine learning (ML) model developed specifically for this purpose.

Work on an ML solution, however, starts with the training data, whose quality and size are crucial if you want to achieve accuracy high enough to deploy the website categorization model to production.

A very important part of preparing the training data set is choosing a taxonomy that suits your task. You can choose a standard one, like the already mentioned taxonomies of IAB, Google or Facebook, or develop a custom one tailored to your use case.

It is also beneficial to have several levels, or so-called Tiers, in your taxonomy, ranging from general ones like Apparel to more detailed ones like Dresses.

Going from general categories to more detailed ones is also known as a taxonomy path. Here is an example of it:
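In code, a taxonomy path can be handled as a list of tiers. The following minimal sketch splits a path string into tiers and looks up the category at a given tier; the helper names `split_taxonomy_path` and `tier_label` are ours, and the separators ("/" as used in the examples earlier in this article, ">" as in the Google product taxonomy file) are passed in as a parameter.

```python
def split_taxonomy_path(path, separator="/"):
    """Split a taxonomy path such as '/Style & Beauty/Beauty' into its tiers."""
    return [tier.strip() for tier in path.split(separator) if tier.strip()]


def tier_label(path, tier, separator="/"):
    """Return the category at the given tier (1 = most general).

    Returns None if the path is shallower than the requested tier.
    """
    tiers = split_taxonomy_path(path, separator)
    return tiers[tier - 1] if 1 <= tier <= len(tiers) else None


iab_path = "/Style & Beauty/Beauty"
product_path = "Apparel & Accessories > Clothing > Dresses"

print(split_taxonomy_path(iab_path))
print(tier_label(iab_path, 1))
print(tier_label(product_path, 3, separator=">"))
```

The same helper can produce training labels either for a single tier or for the complete path, depending on which objective the model is built for.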

Text pre-processing needed to categorize websites

Pre-processing is an important part of the data pipeline when you want to categorize websites. As we are dealing with websites, the first step is to extract the relevant text from them.

Most websites consist of some central, relevant content (e.g. the text of an article) and supplementary parts, like menus, sidebars, footers and similar. In most cases, we do not want the latter to be part of the text used in website categorization.

A typical example is the categorization of news articles, where the menus and footer may be shared across pages while the topic changes from article to article. We thus want to remove all non-article elements of the webpage as part of so-called article extraction.

Article extractors used in web classification

There has been a lot of research over the years on article extraction, which is an important part of web classification. An interesting approach was published in the widely cited paper “Boilerplate Detection using Shallow Text Features”.

The method uses shallow properties of the text blocks on a website as features for the machine learning model. Some of the features are:

link density
average sentence length
average word length
number of words in block
relative position in website

Note that link density is generally much higher in menus than in the main text, so this feature can be effective at distinguishing between the two.
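To make the link density feature concrete, here is a minimal sketch that computes it for a block of HTML as the share of words that sit inside `<a>` tags. It uses only Python's standard `html.parser`; the class and function names are ours, and a real boilerplate detector would first segment the page into blocks and combine this with the other features.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count words inside and outside <a> tags for one block of HTML."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_words = 0
        self.linked_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n_words = len(data.split())
        self.total_words += n_words
        if self.anchor_depth > 0:
            self.linked_words += n_words


def link_density(html_block):
    """Return linked words / all words for the block (0.0 if it has no text)."""
    parser = LinkDensityParser()
    parser.feed(html_block)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.linked_words / parser.total_words


menu = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
article = '<p>Skin care routines vary, see <a href="/guide">our guide</a> for details.</p>'

print(link_density(menu))
print(round(link_density(article), 3))
```

In this toy input every word of the menu is linked, while only two of the nine words in the article paragraph are, which is the contrast the feature relies on.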

For those interested in this paper, it is available at: https://www.researchgate.net/publication/221519989_Boilerplate_Detection_Using_Shallow_Text_Features

and is implemented in a Java library: https://code.google.com/archive/p/boilerpipe/.

If you are more interested in Python libraries for article extraction, we achieved good results in past projects with these two:

goose3 (https://github.com/goose3/goose3)
newspaper (https://github.com/codelucas/newspaper)

More article extractors, together with an evaluation of their performance, can be found in this benchmark study: https://trafilatura.readthedocs.io/en/latest/evaluation.html

Note that in some use cases it is more appropriate to develop your own article extractor that is tailored to your use case.

Machine learning models for website categorization

Once you have prepared the training data set, the next step is to select the machine learning model to be used.

A good approach is to select a few simple machine learning models as baselines (e.g. Support Vector Machines, SVM) together with several more complex machine and deep learning models, which may require more training time but have potentially higher accuracy.

In our past projects, we have considered both Recurrent Neural Nets and Convolutional Neural Nets for this purpose. Interesting results can also be achieved by combining several neural net models in an ensemble (example implementation: https://arxiv.org/abs/1805.01890).

When defining the objective of the ML models, one can build them to predict a specific Tier (n) category, or one can select the prediction of the complete taxonomy path as the objective.

How to deal with localization in context of website classification

If you intend to deploy your website classification solution on websites in different languages, there are several ways to address this.

One possible approach is to build your ML model on a training data set that consists of texts in English. Then, when applying the ML model to a non-English website, you translate the website into English using neural machine translation (NMT) models and send the translated text to the website categorization API.
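The translate-then-classify approach can be sketched as a thin wrapper around two components. In the sketch below, `translate` and `classify` are hypothetical callables supplied by the caller (for example, wrappers around an NMT model and a categorization API); the `fake_*` functions are toy stand-ins so the example runs on its own and do not represent real translation or classification.

```python
def classify_multilingual(text, language, translate, classify):
    """Classify text in any language with an English-only classifier.

    `translate(text, source_language)` must return English text and
    `classify(english_text)` must return a category; both are assumptions
    of this sketch, not part of any specific library.
    """
    if language.lower() != "en":
        text = translate(text, language)
    return classify(text)


# Toy stand-ins, for illustration only.
def fake_translate(text, source_language):
    dictionary = {"nachrichten": "news", "politik": "politics"}
    return " ".join(dictionary.get(word, word) for word in text.lower().split())


def fake_classify(english_text):
    words = english_text.lower().split()
    return "News and Politics" if "news" in words else "Other"


print(classify_multilingual("Nachrichten Politik", "de", fake_translate, fake_classify))
print(classify_multilingual("World news", "en", fake_translate, fake_classify))
```

Keeping translation and classification as separate, injected steps means the English-only model does not need to change when a new language is added.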

Free website categorization API

If you are looking for a free website categorization API, we invite you to check https://www.websitecategorizationapi.com.

Conclusion

Website categorization is an important field in machine learning and natural language processing. It has many use cases, ranging from cybersecurity to the categorization of online stores.

An important part of website categorization is the extraction of relevant text from websites (by removing boilerplate elements), for which special machine learning models can be used.

For text classification itself, a wide range of machine learning models can be used, from standard ones like SVM to more complex ones, like LSTM or transformer models.

Frequently asked questions

1. How do you categorize a website?

The main pipeline is: fetch the website -> extract the content -> perform text pre-processing -> send the text to the machine learning classifier -> display the results, either the main category or the full JSON tree structure with probabilities for each category.
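The pipeline can be sketched end to end with the Python standard library. This is a simplified illustration under several assumptions: `fetch` handles static pages only, `extract_text` merely drops script and style content rather than performing real article extraction, and `classify` is a placeholder keyword counter with made-up keywords standing in for a trained model.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download the raw HTML of a page (static pages only, no JavaScript)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def preprocess(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def classify(tokens, keyword_map):
    """Placeholder classifier: share of keyword hits per category."""
    hits = {cat: sum(t in kws for t in tokens) for cat, kws in keyword_map.items()}
    total = sum(hits.values())
    return {cat: (n / total if total else 0.0) for cat, n in hits.items()}


# Hypothetical keywords, for illustration only.
HYPOTHETICAL_KEYWORDS = {
    "News and Politics": {"news", "politics", "world"},
    "Style & Beauty": {"beauty", "makeup"},
}

# A fixed HTML string is used instead of fetch(...) so the example runs offline.
sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>World news and politics</p><script>var x=1;</script></body></html>"
)
tokens = preprocess(extract_text(sample_html))
probabilities = classify(tokens, HYPOTHETICAL_KEYWORDS)
main_category = max(probabilities, key=probabilities.get)
print(main_category, probabilities)
```

In a real deployment, `sample_html` would come from `fetch(url)` or a browser-driven fetcher, and `classify` would call the trained model.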

To fetch and parse a website in Python, one can download the HTML with an HTTP client and parse it with BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which parses HTML but does not download it. Many websites, however, generate their pages dynamically, in which case a better option is to use Selenium: https://www.selenium.dev.

2. What are website categories?

Website categories are a list of categories from which one or more are selected to denote a website's content. The most common list of website categories is the one from IAB, which is also used by our classifiers. If the website is from the e-commerce domain, however, the list of categories from the Google or Facebook product taxonomy is more appropriate.

Written by SeniorQuant