A factor that may make your AI/ML application great

This factor is an opportunity to learn from (almost) unlimited data

Maxim Kolosovskiy, PhD
3 min read · May 13, 2022

Mining labels in ‘unlabelled’ data

AI (making a machine behave intelligently), or ML specifically, works better when more training data is available, and better still when that data is more accurate. Normally, a model is learnt from examples labelled by humans (a training dataset), which is often expensive. However, in some cases an ML model can mine labelled examples on its own. If a model can be learnt from ‘unlabelled’ data (i.e. data that doesn’t require humans to label it first), then it can make itself impressively intelligent at a lower cost.

Examples

Let’s consider a couple of examples:

  1. The PageRank algorithm: better websites are “labelled” by a larger number of inbound links. Thus, a basic web crawler can collect such “labels”.
  2. A video hosting platform and its recommended videos: when a user watches videos on the platform, the user “labels” relevant ones. Thus, such a platform has tons of labels for recommending the next videos (a minimal sketch of this idea follows below).
  3. AI-based grammar checking: many humans have created billions of examples of what a sentence in a given language can look like.
  4. An AI-based chess or Go player: the best players of these games have ‘labelled’ strong moves in billions of positions.

In all these examples, humans have naturally ‘labelled’ the data:

  • by adding links between websites;
  • by choosing what video to watch next;
  • by writing a text;
  • by playing a game.

These “free labels” are hidden treasures. It would be wise to utilise them.
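To make these free labels concrete, here is a minimal sketch of the video example as a naive item-to-item counter (in Python). All session data and video names are invented, and a production recommender would be far more elaborate; the point is that even this toy learns purely from labels users created by choosing what to watch next.

    from collections import Counter, defaultdict

    # Hypothetical watch sessions: each list is the order in which one user
    # watched videos. Nobody annotated anything; the sequence itself is the label.
    sessions = [
        ["intro_to_ml", "gradient_descent", "backprop"],
        ["intro_to_ml", "gradient_descent", "cnn_basics"],
        ["cat_video", "dog_video", "cat_video"],
    ]

    # Count how often video `nxt` is watched right after video `cur`.
    next_counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            next_counts[cur][nxt] += 1

    def recommend_next(video, k=2):
        """Return the k videos most often watched right after `video`."""
        return [v for v, _ in next_counts[video].most_common(k)]

    print(recommend_next("gradient_descent"))  # ['backprop', 'cnn_basics']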

Comparison with a classic ML-based approach

Nowadays, these examples seem trivial. Why discuss them at all? I believe the importance of such hidden free labels is underestimated. The classic approach to ML is to manually create a training dataset and train a model on it with a state-of-the-art ML algorithm. On the other hand, one could mine such labels and get access to the knowledge hidden in the system itself (e.g. the web). To mine this hidden knowledge, one needs to look into the “essence” of the given system.

Let’s come back to the PageRank example and brainstorm a bit about what the best websites should look like:

  • They attract many visits, and users spend a lot of time on them. A fair point, but a search engine cannot directly track users’ activity to measure these metrics.
  • Just as a good research article is one that is cited by many other articles, a good website is one that has many inbound links from other websites. Fortunately for a search engine, this can be measured by a basic web crawler. We can go a bit further: a citation by the journal Science and a citation by an unknown new small journal differ, and similarly so do “citations” by websites.
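As a toy illustration of this weighting, below is a minimal PageRank-style power iteration in Python. The domain names and link graph are made up, and real PageRank also has to deal with dangling pages, spam, and web scale; the point is only that a page’s score is fed by the scores of the pages linking to it, so a “citation” from an important page is worth more.

    # Toy link graph: each page lists the pages it links to.
    links = {
        "science.example": ["blog.example", "news.example"],
        "news.example":    ["science.example"],
        "blog.example":    ["science.example", "news.example"],
    }

    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start with uniform scores
    damping = 0.85

    for _ in range(50):  # power iteration; 50 rounds is plenty for 3 pages
        new_rank = {}
        for p in pages:
            # Each page shares its current score equally among its outbound links.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))  # science.example ranks first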

Isn’t this just unsupervised learning?

Basically, yes. My point is that we sometimes miss the opportunity to convert a supervised learning problem into an unsupervised one, and instead collect a training dataset and leave it to an ML algorithm to figure things out. Returning to the PageRank example: one could collect a dataset of websites, rate how good they are, extract features that may affect the rating, and train an ML model to generalise how the features affect the rating.

Seems straightforward, doesn’t it? However, to train such a model for just one language, one may need tens of thousands of examples at a minimum.
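For contrast, here is what that classic supervised route might look like in Python; the features, numbers, and ratings are all invented. Every entry of y is a label a human had to produce, and that is exactly the cost the mined labels avoid.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical hand-crafted features per website:
    # [page load time in seconds, word count / 1000, number of images]
    X = np.array([
        [0.5, 2.0, 10],
        [3.0, 0.3,  1],
        [1.0, 5.0, 25],
    ])
    # Human-assigned quality ratings -- the expensive part. A realistic model
    # would need many thousands of such rows, not three.
    y = np.array([4.5, 1.0, 4.8])

    model = Ridge(alpha=1.0).fit(X, y)
    print(model.predict([[1.2, 3.0, 12]]))  # predicted rating for an unseen site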

Summary

Often a system, or its users, ‘labels’ its data for free. Figuring out how to mine these labels might be very hard, but I am pretty sure finding these hidden treasures is worth the effort, because they let one push the boundaries of what is currently possible in the AI domain.


Maxim Kolosovskiy, PhD

SWE & Automation Enthusiast @ Google | PhD | ACM ICPC medalist [The opinions stated here are my own, not necessarily those of my employer.]