Supervised, unsupervised and more: An introduction to lesser-known machine learning methods and when to use them

By Jamila Rejeb, Data Scientist at LittleBigCode 🚀

At LittleBigCode, we strongly believe that there is more to machine learning than just using supervised and unsupervised learning algorithms. That’s why we decided to dedicate this post to exploring the lesser-known learning methods that can be incredibly useful when used wisely. In this article, we will go through the main types of machine learning algorithms, explain each of them, and discuss the best ways to use them.

Introduction

Machine learning techniques are used nowadays in multiple fields, from fraud detection to TikTok video recommendation, and they are constantly being improved and adapted.

At its most basic, machine learning uses programmed algorithms that receive and analyse input data to predict output values within an acceptable range. As new data is fed to these algorithms, they learn and optimize their operations to improve performance, developing ‘intelligence’ over time.

To do so, engineers use different types of machine learning algorithms, including supervised, semi-supervised, unsupervised, and reinforcement learning. In this article, we will take a use case and adapt it to each algorithm to better understand when to use each one and grasp the difference between each method. For this, let’s assume we have a dataset of documents that we need to group. We will see, depending on the metadata we have, how each algorithm can be used.

0. Supervised and unsupervised learning

In every supervised learning project, we have a set of labeled data. The goal is to train an algorithm to generalize from these labels to create a function that can automatically label new data. In our use case, documents will have labels that represent the name of the group the data belongs to, and the goal would be to learn how to map each new document to one of these groups.
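As a minimal, illustrative sketch (the documents, labels, and model choice below are our own toy examples, not from any specific project), such a supervised document classifier could look like this in scikit-learn:

```python
# Supervised learning sketch: learn a mapping from documents to known groups.
# Documents and labels are toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "The team won the championship game",
    "New particle discovered at the collider",
    "Quarterly earnings beat market expectations",
]
labels = ["sports", "science", "finance"]  # one known group per document

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(documents, labels)

# The learned function maps a new, unseen document to one of the groups.
print(model.predict(["Physicists confirm the new measurement"]))
```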

On the other hand, unsupervised learning aims to discover hidden patterns or data groupings without the need for human intervention. Therefore, the data we have will not have any tag or label. Let’s go back to our use case and assume that we don’t have any predefined groups to use to classify new data. In that case, unsupervised learning can help us cluster the documents based on their similarities.
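Again as a toy sketch with made-up documents, clustering them without any labels could look like:

```python
# Unsupervised learning sketch: group documents by similarity, no labels needed.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The team won the championship game",
    "The striker scored twice in the final",
    "New particle discovered at the collider",
    "Physicists confirm the new measurement",
]

X = TfidfVectorizer().fit_transform(documents)
clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(clustering.labels_)  # cluster id per document, discovered from the data
```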

1. Reinforcement learning

Reinforcement learning is a field of machine learning in which intelligent agents take actions to maximize a certain reward. Unlike supervised learning, the agents don’t have a labeled dataset. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

Agent-environment interaction loop

To understand the terminology, let’s take the example of training a dog to do tricks. Your intelligent agent would be your dog, and the environment would be, for example, the garden where the agent learns and decides what actions to perform. The action is what the dog can do, such as running or jumping. The state is where the dog is, and it changes every time your agent takes an action: if your dog is under a tree and then starts running, this leads to a new state. If your dog manages to do the trick you want to teach it, you reward it with a good meal. During this process, your dog learns a decision-making function (control strategy) that maps situations to actions. This function is called the policy.

There are different types of reinforcement learning algorithms, but they are mostly divided into two families: model-based RL and model-free RL. Model-based learning attempts to model the environment, and then chooses the optimal policy based on its learned model. In model-free learning, on the other hand, the agent relies on trial-and-error experience to build the optimal policy without trying to model the environment.
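As an illustration of the model-free family, here is a minimal tabular Q-learning sketch on a toy corridor environment (entirely invented for this example): the agent learns, by trial and error and without any model of the environment, that moving right leads to the reward.

```python
# Model-free RL sketch: tabular Q-learning on a 5-state corridor.
# The agent starts at state 0 and is rewarded only upon reaching state 4.
import random

N_STATES = 5
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; 0 = left, 1 = right

def greedy(q_values):
    # Exploitation with random tie-breaking, so early exploration is unbiased.
    best = max(q_values)
    return random.choice([a for a, q in enumerate(q_values) if q == best])

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Exploration vs exploitation trade-off (epsilon-greedy policy).
        action = random.randrange(2) if random.random() < epsilon else greedy(Q[state])
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update, learned purely from trial-and-error experience.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

# Learned policy (one action per state): move right (1) in all non-terminal states.
print([greedy(q) for q in Q])
```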

In order to get a deeper understanding of RL algorithms and how to use them, we suggest reading the OpenAI website, which provides a comprehensive introduction to the main RL algorithms.

Use case:

If you recall our previous project, we were tasked with grouping documents into different classes. In this example, our data is not labeled, meaning we don’t have predefined groups, but we do have a set of features that describe every document, as well as a reader who reads every document and rewards the algorithm when the predicted group is accurate. The actions in this case would be to predict whether a document belongs to class C1, C2, … Cn.

2. Semi-supervised learning

Semi-supervised learning is a learning problem that involves a small number of labeled examples and a large number of unlabeled examples.

Learning problems of this type are challenging as neither supervised nor unsupervised learning algorithms are able to make effective use of the mixtures of labeled and unlabeled data. As such, specialized semi-supervised learning algorithms are required.

The intuition behind this type of learning is to label more data automatically, using self-labeling techniques or the help of a human expert.

Use case:

Back to our document classification task, let’s imagine that only 20% of the documents are labeled. We cannot use unsupervised learning, since we already have the groups we want to classify our data into, and we cannot use supervised learning, because the labeled data we have is not sufficient to train a supervised model. In this case, semi-supervised techniques will help us label more data based on the tags we already have, so that the model generalizes well.
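As a minimal sketch of the self-labeling idea, scikit-learn ships a SelfTrainingClassifier that pseudo-labels the unlabeled points its base model is confident about. The toy documents below are invented; unlabeled examples are marked with -1, as the API expects.

```python
# Semi-supervised learning sketch: self-training with scikit-learn (>= 0.24).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

documents = [
    "particle physics experiment results",       # labeled: science (0)
    "stock market closes higher today",          # labeled: finance (1)
    "scientists publish collider measurements",  # unlabeled
    "investors react to the earnings report",    # unlabeled
    "new physics model explains the data",       # unlabeled
]
y = np.array([0, 1, -1, -1, -1])  # only 2 of 5 documents are labeled

X = TfidfVectorizer().fit_transform(documents)
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
model.fit(X, y)  # confident predictions become pseudo-labels, then it retrains
print(model.predict(X))
```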

3. Active learning

Active learning is a form of semi-supervised learning. Unlike fully supervised learning, the ML algorithm is only given an initial subset of human-labeled data out of a larger, unlabeled dataset. The algorithm processes that data and produces predictions, each with a certain confidence level; any prediction below a chosen confidence threshold signals that more data is needed. These low-confidence samples are sent to a person, who labels them and feeds them back to the algorithm. The cycle repeats until the algorithm is trained and operating at the desired prediction accuracy.

This iterative human-in-the-loop method is built on the idea that not all samples are equally valuable for learning, so the algorithm chooses the data it learns from. A key ingredient in active learning is the sampling method used, which significantly affects how the model performs; data scientists can test different sampling methods to select the one that produces the most precise results.

Overall, active learning relies less on data annotation by people than fully supervised learning does, because not all of the dataset requires annotation: only the data points requested by the machine.

Use case:

Just like in the previous example, in an active learning situation the quantity of labeled data is not sufficient and we need to label more. The algorithm then samples some documents and sends them to a human expert, who provides the labels as feedback.

Active learning loop [source]
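To make the loop concrete, here is a minimal uncertainty-sampling sketch in plain scikit-learn and NumPy (the data is synthetic, and the ground-truth array stands in for the human expert):

```python
# Active learning sketch: pool-based uncertainty sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:20] = True  # small initial subset of human-labeled data

model = LogisticRegression()
for _ in range(5):  # a few query rounds
    model.fit(X[labeled], y[labeled])
    pool = np.where(~labeled)[0]
    confidence = model.predict_proba(X[pool]).max(axis=1)
    queries = pool[np.argsort(confidence)[:10]]  # least confident samples
    labeled[queries] = True  # the "expert" (ground truth here) labels them

print(f"accuracy after {labeled.sum()} labels requested: {model.score(X, y):.3f}")
```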

4. Weakly supervised learning

Weak supervision has some similarities to rule-based classifiers, as well as some very important differences. The obvious similarity is that the inputs to each look like rules (i.e., simple functions that output labels or predictions). The important difference is that the rule-based classifier stops there: the rules are the classifier. Such systems are generally brittle because they do not generalize to other examples, even ones that are very similar to those labeled by one or more rules.

With weak supervision, on the other hand, the rules or “labeling functions” are used to create a training set for a machine-learning-based model. That model can be much more powerful, utilize a much richer feature set, and take advantage of other state-of-the-art techniques in machine learning, such as transfer learning from foundation models. As a result, the model is generally much more robust than a corresponding rule-based classifier.

The primary difference, though, is that semi-supervised learning propagates knowledge (“based on what is already labeled, label some more”), whereas weak supervision injects knowledge (“based on your knowledge, label some more”).

Snorkel is a labeling tool that uses weak supervision to “accelerate time to value with a transformative approach to data-centric AI powered by programmatic labeling”.

Weakly supervised learning pipeline [source]

Use case:

Let’s go back to our document classification task. In a weakly supervised learning context, the approach would involve using a set of rules to generate labels for our data. For example, all documents containing the words “science,” “scientists,” and “physics” could be labeled as “scientific article.” Although this rule may not always be accurate, it could still provide the correct label for about 80% of the documents.
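Here is a minimal, Snorkel-style sketch of that idea (the labeling functions and documents are invented for illustration): each rule votes or abstains, and the majority vote becomes a noisy training label for a downstream model.

```python
# Weak supervision sketch: labeling functions vote, producing noisy labels
# that would then be used to train a regular ML classifier (not shown here).
SCIENCE, OTHER, ABSTAIN = 1, 0, -1

def lf_science_keywords(doc):
    words = ("science", "scientists", "physics")
    return SCIENCE if any(w in doc.lower() for w in words) else ABSTAIN

def lf_market_keywords(doc):
    return OTHER if "market" in doc.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_science_keywords, lf_market_keywords]

def weak_label(doc):
    votes = [v for v in (lf(doc) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

docs = ["Scientists measure a new physics constant", "Markets rally on good news"]
print([weak_label(d) for d in docs])  # [1, 0]: noisy labels derived from rules
```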

5. Online learning

Most machine learning models use batch learning (also called offline learning) when training: data is sent in bulk through the model, and not point by point. In each learning step, the algorithm examines all the data points of the batch. However, when using online learning, models see one data point at a time.

Online learning is a common technique used in areas of machine learning where it is computationally expensive or infeasible to train over the entire dataset, requiring the use of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, such as in stock price prediction. Online learning algorithms may be prone to catastrophic interference.
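As a small sketch, scikit-learn’s partial_fit interface supports this out-of-core, one-point-at-a-time regime (the stream below is synthetic):

```python
# Online learning sketch: the model sees one data point at a time and is
# updated incrementally via partial_fit, never touching the full dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front

rng = np.random.default_rng(0)
for _ in range(1000):  # a synthetic, potentially endless data stream
    X_point = rng.normal(size=(1, 4))              # one point at a time
    y_point = (X_point.sum(axis=1) > 0).astype(int)
    model.partial_fit(X_point, y_point, classes=classes)

print(model.predict(rng.normal(size=(5, 4))))
```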

Let’s take the example of a neural network. As the network goes through the data batches, it dynamically builds pathways between nodes during the training phase. These pathways are built from the data that is being fed to the machine.

Therefore, when you feed it new information, new pathways are formed, sometimes causing the algorithm to “forget” the previous tasks it was trained for. Sometimes, the margin of error increases, but other times, the machine completely forgets the task. This is what’s called Catastrophic Forgetting or Catastrophic Interference. This phenomenon is more frequent in an online learning context.

6. Incremental learning

During incremental learning, the model learns from each example as it arrives. As explained above, classical batch machine learning approaches, in which all data is accessed simultaneously, cannot keep up with the volume of incoming data in the time available, leading to more and more accumulated unprocessed data. Furthermore, they do not continuously integrate new information into already constructed models; instead, they regularly reconstruct new models from scratch, which leads to potentially outdated models. Incremental learning aims to learn from data continuously, as soon as it arrives, leading to better prediction quality.

There is a lot of ambiguity in the definitions of incremental and online learning in the literature. Some authors use them interchangeably, while others distinguish them in different ways. Additional terms such as lifelong or evolutionary learning are also used synonymously. We define an incremental learning algorithm as one that builds and updates its model from a given stream of training data, whereas online learning refers more to how we go through the dataset: one example at a time.

Multiple incremental learning algorithms are implemented in the scikit-multiflow package, which we invite you to play with.

Incremental vs static learning [source]
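For instance, a prequential (“test-then-train”) loop with scikit-multiflow might look like the following. This is a sketch based on the scikit-multiflow 0.5 API as we understand it, and the SEA generator is a synthetic stream rather than document data:

```python
# Incremental learning sketch with scikit-multiflow: a Hoeffding tree learns
# from a stream one example at a time, being tested before each update.
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=1)
model = HoeffdingTreeClassifier()

X, y = stream.next_sample()
model.partial_fit(X, y)  # prime the model with a first example

n_samples, n_correct = 2000, 0
for _ in range(n_samples):
    X, y = stream.next_sample()                    # a new example arrives
    n_correct += int(model.predict(X)[0] == y[0])  # test first ...
    model.partial_fit(X, y)                        # ... then train on it

print(f"prequential accuracy: {n_correct / n_samples:.3f}")
```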

Use case:
Returning to our document classification project: the model is already in production and new data is received every day. Some of that data carries labels with new classes that we didn’t have before. In this case, to maintain the accuracy of our model’s predictions, we can use incremental learning to continuously update the model.

7. Transfer learning

Transfer learning involves taking the knowledge gained from one task and applying it to another task. It is a common approach used in pre-trained models that need to be fine-tuned for specific purposes. For example, if a model needs to learn how to recognize dogs, the knowledge gained from a model trained to detect cats can be used. The major advantage of transfer learning is the reduced training time and generalization error, as the pre-trained weights are included in the new model. This also helps reduce the dataset size needed for training, reducing the resources required to train the model end-to-end.

However, pre-trained models may perform poorly in some cases, such as when the datasets are not similar, features transfer poorly, or the high-level features learned by the pre-trained model are not sufficient to differentiate the classes in the new problem. In such cases, fine-tuning is necessary. Fine-tuning involves unfreezing the entire model (or a part of it) and retraining it on the new data with a very low learning rate, which increases the model’s performance on the new dataset while preventing overfitting.

Feature transferring can also be used: the input and feature-extraction layers trained on a given dataset (with their weights and structure frozen) are reused to train a new classification layer for a related problem domain, or additional feature layers are added on top of the existing ones. This method is ideal if the two problem domains are similar. Generally, if new target labels are scarce, we avoid fine-tuning to prevent overfitting.

Transfer learning in a CNN-based image classifier
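As a minimal Keras sketch of both regimes discussed above (the base model, input size, and class count are placeholder choices, and train_ds stands for your own dataset): freeze a pre-trained base to transfer its features, then optionally unfreeze it and fine-tune with a very low learning rate.

```python
# Transfer learning sketch with Keras: feature transfer, then fine-tuning.
import tensorflow as tf

# Pre-trained base trained on ImageNet, without its original classifier head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # feature transfer: pre-trained weights stay frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # placeholder: 3 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head on the target data

# Fine-tuning: unfreeze the base and retrain with a very low learning rate
# to adapt the pre-trained features without destroying them (or overfitting).
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=5)
```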

Summary

If you stumbled upon this article, you are probably looking for new methods to tackle a new ML project or are curious about other machine learning methods. This article aims to broaden a data scientist’s knowledge of machine learning techniques. We have deliberately chosen not to detail every method to keep the article easy to read.

We hope this quick overview helps the reader explore which tools might be most suitable for their problem. As each technique has its own characteristics, advantages, and drawbacks, we urge you to test each of them in a small notebook to better understand their differences, strengths, and even experiment with different combinations as described in this article.

Consult all the articles of LittleBigCode by clicking here: https://medium.com/hub-by-littlebigcode

Follow us on LinkedIn & YouTube + https://LittleBigCode.fr/en
