Website categorization API

Website categorization can generally be defined as classifying a website into one or more categories, usually with an automated machine learning solution.

In this article, we will explain how to implement website categorization, describe its typical use cases and point you to a free website categorization tool.

Use cases — cybersecurity, real-time bidding and others

Website categorization has use cases in a wide range of fields. One important application is cybersecurity, where we classify websites as potential spam, phishing or other types of “problematic” websites that we do not want employees or clients on our networks to visit.

Another important use case of website categorization is marketing. If we want to place our ads on publishers' websites, we want them on webpages that are in the same category / vertical as the products or services we advertise. To do this efficiently, we need potential partner websites to be properly categorized.

Website categorization thus plays an important role in real-time bidding (RTB), the process by which a publisher makes ad inventory available to the “eligible” advertiser that posts the highest bid. If an advertiser is relevant, it can bid via a Demand Side Platform (DSP) and deliver its advertising content.

Taxonomies

In the context of ads and marketing, websites are most often categorized using the taxonomy of the Interactive Advertising Bureau (IAB), which was developed with marketing/ads in mind. Note that the IAB regularly revises its taxonomies, so you should select the latest version when employing it.

A few example Tier 1 categories from the IAB taxonomy are Automotive, Business and Finance, Food & Drink, Style & Fashion, Technology & Computing, and Travel.

If your website is focused on ecommerce, then a different, product-oriented taxonomy may be more appropriate. The best-known ones in this segment are those from Google:

https://www.google.com/basepages/producttype/taxonomy.en-US.txt

and Facebook:

https://developers.facebook.com/docs/marketing-api/catalog/guides/product-categories/

This does not exhaust all possibilities, though. Many online stores, especially e-commerce giants like Walmart, Target and Rakuten, opt for their own custom taxonomies, tailored to their needs.
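As an illustration, the Google product taxonomy linked above is published as a plain-text file with one category path per line, which makes it easy to load programmatically. Below is a minimal Python sketch (assuming the requests library is installed) that downloads the file and splits each path into its tiers:

```python
import requests

# URL of the Google product taxonomy (plain text, one category path per line)
TAXONOMY_URL = "https://www.google.com/basepages/producttype/taxonomy.en-US.txt"

response = requests.get(TAXONOMY_URL, timeout=30)
response.raise_for_status()

# The first line is a version comment starting with '#', so we skip it
lines = [line for line in response.text.splitlines() if line and not line.startswith("#")]

# Each line is a full taxonomy path, e.g. "Apparel & Accessories > Clothing > Dresses"
taxonomy = [line.split(" > ") for line in lines]

print(f"Loaded {len(taxonomy)} category paths")
print(taxonomy[0])  # first path as a list of tiers
tier1 = sorted({path[0] for path in taxonomy})
print(f"{len(tier1)} Tier 1 categories, e.g. {tier1[:3]}")
```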

Benefits of website categorization — ecommerce use case

Proper ecommerce categorization can have many benefits for online stores. In terms of user experience, it allows users to more quickly find desired items through better search and filtering options.

By grouping products into categories, online stores can generate more subpages for indexing in search engines, which can lead to more organic visits.

Since rankings are influenced by topical content keywords, adding categories and their keywords can also send better signals to search engine ranking algorithms. In this context, tagging, i.e. adding one or more labels to products, can also be beneficial, as it attaches more than one relevant descriptor to a subpage.

How to approach automated website categorization

Automated website categorization is usually done with a supervised machine learning (ML) model developed specifically for this purpose.

Work on an ML solution, however, starts with the training data, whose quality and size are crucial if you want to achieve high enough accuracy to deploy the website categorization model to production.

A very important part of preparing the training data set is choosing a taxonomy suited to your task. You can choose from standard ones, like the already mentioned taxonomies of the IAB, Google or Facebook, or develop a custom one tailored to your use case.

It is also beneficial to have several levels, or so-called tiers, in your taxonomy, ranging from general categories like Apparel to more detailed ones like Dresses.

The sequence from a general category down to more detailed ones is known as a taxonomy path, for example: Apparel & Accessories > Clothing > Dresses.
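For labelled training data, such paths can be stored as-is and split into tier-level labels when needed. Below is a minimal sketch assuming pandas is available; the texts, paths and column names are purely illustrative:

```python
import pandas as pd

# Illustrative training data with full taxonomy paths as labels
df = pd.DataFrame({
    "text": [
        "Floral summer dress with short sleeves",
        "Running shoes with breathable mesh upper",
    ],
    "path": [
        "Apparel & Accessories > Clothing > Dresses",
        "Apparel & Accessories > Shoes",
    ],
})

# Split each path into tiers; tier 1 is the most general category
tiers = df["path"].str.split(" > ")
df["tier1"] = tiers.str[0]   # e.g. "Apparel & Accessories"
df["tier2"] = tiers.str[1]   # e.g. "Clothing" (may be missing for short paths)

print(df[["tier1", "tier2", "path"]])
```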

Text pre-processing

Pre-processing is an important part of the data pipeline for website categorization models. Since we are dealing with websites, the first step consists of extracting the relevant text from them.

Most websites consist of central, relevant content (e.g. the text of an article) and supplementary parts, like menus, sidebars and footers. In most cases, we do not want the latter to be part of the text used for website categorization.

A typical example is the categorization of news articles, where the menus and footer are shared across pages but the topic changes from article to article. We thus want to remove all non-article elements of the webpage as part of so-called article extraction.

Article extractors

There has been a lot of research over the years on article extraction. An interesting approach was published in the widely cited paper “Boilerplate Detection Using Shallow Text Features”.

The method consists of using shallow features of a website's text blocks as input to a machine learning model. Some of the features are:

  • link density
  • average sentence length
  • average word length
  • number of words in block
  • relative position in website

Note that link density is generally much higher in menus than in the main text, so this feature alone can be effective at distinguishing between the two (see the sketch below).
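As a rough illustration of the link-density idea (not the full method from the paper), the sketch below computes the ratio of linked text to total text for each block of a small example HTML page, assuming BeautifulSoup is installed; blocks with a high ratio are likely navigation or other boilerplate:

```python
from bs4 import BeautifulSoup

html = """
<nav><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></nav>
<article><p>The new model improves categorization accuracy by combining
several shallow text features of each block.</p> <a href="/more">Read more</a></article>
"""

soup = BeautifulSoup(html, "html.parser")

def link_density(block) -> float:
    """Fraction of a block's text that sits inside <a> tags (0.0 - 1.0)."""
    total = len(block.get_text(" ", strip=True))
    linked = sum(len(a.get_text(" ", strip=True)) for a in block.find_all("a"))
    return linked / total if total else 0.0

for block in soup.find_all(["nav", "article"]):
    print(f"{block.name}: link density = {link_density(block):.2f}")
# The <nav> block scores close to 1.0 (pure boilerplate), the <article> block much lower.
```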

For those interested, the paper is available at: https://www.researchgate.net/publication/221519989_Boilerplate_Detection_Using_Shallow_Text_Features

and the method is implemented in a (Java) library, boilerpipe: https://code.google.com/archive/p/boilerpipe/.

If you are more interested in Python libraries for article extraction, we have achieved good results in past projects with these two:

More article extractors can be found and their performance evaluated in this benchmark study: https://trafilatura.readthedocs.io/en/latest/evaluation.html
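For example, one of the Python extractors evaluated in the benchmark above, trafilatura, can be used roughly as follows (a minimal sketch with a placeholder URL; error handling and configuration are omitted):

```python
import trafilatura

# Download the page and strip boilerplate (menus, sidebars, footer),
# keeping only the main article text
url = "https://example.com/some-article"
downloaded = trafilatura.fetch_url(url)

if downloaded is not None:
    main_text = trafilatura.extract(downloaded)
    print(main_text)
```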

Note that in some cases it is more appropriate to develop your own article extractor, tailored to your use case.

Machine Learning models

Once you have prepared the training data set, the next step is to select the machine learning model to be used.

A good approach is to select a few machine learning models as baselines (e.g. Support Vector Machines, SVM) alongside several more complex machine and deep learning models that may require more training time but offer potentially higher accuracy.
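A minimal baseline sketch using scikit-learn is shown below; the example texts and category labels are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative training data: extracted website texts and their categories
texts = [
    "Latest football scores and match reports from the premier league",
    "Buy summer dresses, shoes and accessories online",
    "How to secure your network against phishing attacks",
    "New smartphone review: camera, battery and performance tested",
]
labels = ["Sports", "Style & Fashion", "Technology & Computing", "Technology & Computing"]

# TF-IDF features + linear SVM as a simple, fast baseline
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LinearSVC(),
)
baseline.fit(texts, labels)

print(baseline.predict(["Live tennis results and tournament schedule"]))
```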

In our past projects, we have considered both recurrent neural nets and convolutional neural nets for this purpose. Interesting results can also be achieved by combining several neural net models in an ensemble (example implementation: https://arxiv.org/abs/1805.01890).

When considering the objective of the ML model, one can build it to predict a specific Tier n category, or one can select the prediction of the complete taxonomy path as the objective.

How to deal with localization

If you intend to deploy your website categorization solution on websites in different languages, there are several approaches to address this.

One possible approach is to build your ML model on a training data set that consists of English texts. Then, when applying the model to a non-English website, you first translate the website text to English using a neural machine translation (NMT) model and send the translated text to the website categorization API afterwards.
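A rough sketch of that pipeline is shown below; translate_to_english is a placeholder for whichever NMT model or translation service you use, and the API endpoint and payload format are illustrative assumptions, not an actual specification:

```python
import requests

API_URL = "https://api.example.com/website-categorization"  # illustrative endpoint
API_KEY = "YOUR_API_KEY"

def translate_to_english(text: str, source_lang: str) -> str:
    """Placeholder for a neural machine translation (NMT) model or service."""
    raise NotImplementedError("Plug in your NMT model or translation API here")

def categorize_non_english_site(extracted_text: str, source_lang: str) -> dict:
    # 1. Translate the extracted website text to English
    english_text = translate_to_english(extracted_text, source_lang)
    # 2. Send the translated text to the (English-only) categorization API
    response = requests.post(
        API_URL,
        json={"text": english_text},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```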

Free website categorization API

If you are looking for a free website categorization API, we invite you to check https://www.websitecategorizationapi.com.

Conclusion

Website categorization is an important application of machine learning and natural language processing. It has many use cases, ranging from cybersecurity to the categorization of online stores.

An important part of website categorization is the extraction of relevant text from websites (by removing boilerplate elements), for which dedicated machine learning models can be used.

For text classification itself, a wide range of machine learning models can be used, from standard ones like SVM to more complex ones, like LSTM or transformer models.
