What is Semi-Supervised Learning? A Guide for Beginners

Data Science Wizards
7 min read · Apr 14, 2023


Imagine a dataset of one million records of users’ demographics, from which you want to build a machine-learning model that predicts each user’s sex. In this data, only 100k records come with sex labels, and labelling the rest would be time-consuming.

What should we do in such a scenario? We can train a model using the 100k labelled records and then use this trained model to predict labels for the remaining unlabelled records. This way, we sidestep the problem of labelling a vast amount of data by hand, which is expensive and time-consuming.

The solution above is a type of machine learning called semi-supervised learning. This article discusses this type of machine learning in more detail, covering the points below.

Table of Contents

  1. What is Semi-Supervised Learning(SSL)?
  2. Working behind Semi-Supervised Learning
  3. Techniques Used in Semi-Supervised Learning

What is Semi-Supervised Learning(SSL)?

Semi-supervised learning is a machine learning method that uses supervised techniques to label unlabelled data. As discussed above, it uses a small amount of labelled data to predict the labels of unlabelled data, which distinguishes it from purely supervised and purely unsupervised methods. The image below illustrates the basic difference between supervised, unsupervised and semi-supervised machine learning.

Let’s look into a basic introduction to these types of machine learning methods.

Supervised learning

Supervised learning trains machine learning models and algorithms on labelled data and makes predictions based on that training; its reliance on labels is what distinguishes it from the other machine learning methods.

This type of learning has its limitations: it works slowly and depends on manual human work to label the data. It is also a costly procedure, since it requires substantial computation power, and costs rise further because humans are needed to annotate the data.

Unsupervised Learning

In this type of learning, we try to find hidden patterns, differences and similarities between records using algorithms and models. There are two main kinds of unsupervised methods: clustering, which groups data records, and dimensionality reduction, which compresses the data so that most of its variation can be viewed in fewer dimensions.

Unsupervised learning methods are cheaper and more affordable to implement, but they tend to produce less accurate results. When applying these methods and models, the process needs careful handling to reach an optimum result.
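A quick sketch of the two unsupervised approaches mentioned above, using scikit-learn on synthetic data (the dataset and parameter choices here are illustrative, not from the article):

```python
# Minimal illustration of the two unsupervised methods described above:
# clustering (KMeans) and dimensionality reduction (PCA).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data with 3 natural groups and no labels used.
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Clustering: group the records into 3 clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: view the 10-dimensional data in 2 dimensions.
X_2d = PCA(n_components=2).fit_transform(X)

print(clusters.shape, X_2d.shape)  # (300,) (300, 2)
```

Neither step touches any labels, which is exactly what keeps unsupervised learning cheap and what limits how much it can tell us about a prediction target.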

Semi-Supervised Learning

We were introduced to the two main types of learning above, and as the name suggests, semi-supervised learning can be thought of as a bridge between them. With this approach, we train a model on the labelled portion using a supervised learning approach and then apply that model to label the larger dataset.

One requirement of this type of learning is that some of the data must already be labelled before supervised approaches can be applied. To get there, we first run a manual labelling procedure on a small subset and then use supervised machine learning with it. In this way, the cost of manually annotating the rest of the data can be greatly reduced.

Unlike unsupervised learning, these methods apply to a variety of problems, from clustering and association to classification and regression. Since unlabelled data is plentiful and inexpensive to collect, semi-supervised learning is often the best solution for many applications, and it can deliver high accuracy when its parameters are chosen appropriately.

Working behind Semi-Supervised Learning

To work with unlabelled data in semi-supervised learning, we need some relationship between the variables in the dataset. The following assumptions are commonly made when applying semi-supervised methods:

  • Continuity (smoothness): according to this assumption, data points that are near each other are likely to share the same label. This assumption is also used in supervised learning. In semi-supervised learning, it is extended so that decision boundaries are placed in regions where the density of data points is low.
  • Cluster: according to this assumption, the data tends to form distinct groups, and data points within the same cluster are likely to share the same label.
  • Manifold: The manifold assumption posits that the data, which may exist in high-dimensional input space, actually resides on a lower-dimensional manifold and that distances and densities can be effectively measured on this manifold.
  • Generative simplicity: the process that generates the data has relatively few degrees of freedom, so even data that looks hard to model in its raw, high-dimensional form can be captured by a simpler underlying process.

Techniques Used in Semi-Supervised Learning

There are various techniques for performing semi-supervised learning; some of them are as follows:

Pseudo-Labelling

With this technique, we first train a model on the small labelled dataset and then use that model to approximate labels for the unlabelled data. Let’s take a look at the image below.

As the image shows, pseudo-labelling proceeds in three stages:

  1. Train a model with some labelled data.
  2. Predict the labels using the trained model to generate pseudo labels.
  3. Finally, retrain the model using the pseudo-labelled data and original labelled data.

To get more accurate results, we perform this process iteratively; with each iteration, the model’s accuracy can improve, so the overall process reaches a higher degree of accuracy.
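The three stages above can be sketched in a few lines with scikit-learn. This is a minimal single-pass illustration on synthetic data (the dataset, model and split sizes are assumptions for the example, not from the article):

```python
# Minimal pseudo-labelling sketch: train on a small labelled set,
# label the rest with the model, then retrain on the combined data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 1,000 records, only the first 100 labelled.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_lab, y_lab = X[:100], y[:100]
X_unlab = X[100:]

# Stage 1: train a model with the labelled data.
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Stage 2: predict pseudo-labels for the unlabelled records.
pseudo = model.predict(X_unlab)

# Stage 3: retrain on the original labels plus the pseudo-labels.
X_all = np.vstack([X_lab, X_unlab])
y_all = np.concatenate([y_lab, pseudo])
model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```

In practice, stages 2 and 3 would be repeated for several iterations rather than run once, as the paragraph above describes.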

Self-training

We can think of this process as a variation of pseudo-labelling: plain pseudo-labelling defines no confidence boundary, whereas here we accept only predicted labels with a high confidence level, and by iterating many times, we obtain increasingly accurate labels.

This process also goes through three steps. In the first step, we follow the same procedure as in pseudo-labelling: a model trained on the labelled data makes predictions on the unlabelled data, and those predictions become candidate labels. We then take only the most confidently predicted labels and add them to our training data along with our original labelled data.

After combining the labelled and pseudo-labelled data, we retrain our model on the augmented dataset. We can repeat this process several times, gradually increasing the amount of labelled data and improving the model’s performance on the task at hand. By iterating in this way, we can obtain highly confident labels and train a more accurate model.

As noted, the process runs through many iterations; a common default is to stop after about 10 iterations. As the amount of labelled data increases, the performance of the model also improves.

Label Propagation

We can think of this method as a graph-based transductive approach to generating pseudo-labels for unlabelled data: unlabelled data points adopt labels by looking at the labels of their neighbours in a graph structure.

There are a few assumptions we need to consider while using this process:

  • Information about all classes should be available in the labelled data.
  • Close data points should have similar labels.
  • Data points sharing the same cluster will have the same label.

In this process, a fully connected graph is generated in which the nodes are all the data points, labelled and unlabelled alike. Each edge between two nodes carries a weight derived from their distance, with weight and distance inversely proportional: a large weight (short distance) lets a label travel easily along that edge.

Here we can see a simple explanation of this method.

The workflow of the process is as follows:

  • All the data points (nodes) are assigned an initial label based on their distribution in space; these are called soft labels.
  • Via edges, node labels propagate through all the other nodes.
  • The label of each node is updated every time a label reaches it, and after many iterations, each node adopts a final label based on the majority label among its neighbours.

The process stops when every unlabelled node holds the majority label of its neighbours or when the iteration count reaches a predefined limit.
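The workflow above is implemented by scikit-learn’s `LabelPropagation`. A minimal sketch on the classic two-circles toy dataset, where just one labelled point per circle is enough for the labels to spread along the graph (the dataset and kernel settings are assumptions for illustration):

```python
# Graph-based label propagation: labels spread from two labelled points
# through a k-nearest-neighbour graph until every node is labelled.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelPropagation

# Two concentric circles; with shuffle=False the outer circle comes first.
X, y = make_circles(n_samples=200, shuffle=False, random_state=0)
y_partial = np.full(200, -1)   # -1 marks unlabelled nodes
y_partial[0] = y[0]            # one labelled point on the outer circle
y_partial[-1] = y[-1]          # one labelled point on the inner circle

lp = LabelPropagation(kernel="knn", n_neighbors=5, max_iter=1000)
lp.fit(X, y_partial)

# transduction_ holds the propagated label for every node in the graph.
print((lp.transduction_ == y).mean())
```

Because neighbouring points sit on the same circle, the propagated labels recover the two rings almost perfectly, which is exactly the cluster assumption listed earlier at work.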

Final words

We have now covered the essential points about semi-supervised learning. Since a massive amount of data is generated regularly today, it is important to use it appropriately. Semi-supervised learning finds application in a broad range of areas because clean, labelled, valuable data is always in demand in the data science space.

One thing that makes this type of learning unique and essential in the machine learning domain is that it can utilise almost any supervised machine learning method with only a small number of modifications. Only a few things need to be considered before applying semi-supervised learning, such as ensuring that the labelled portion of the data represents the distribution of all classes.

About DSW

DSW, specializing in Artificial Intelligence and Data Science, provides platforms and solutions for leveraging data through AI and advanced analytics. With offices located in Mumbai, India, and Dublin, Ireland, the company serves a broad range of customers across the globe.

Our mission is to democratize AI and Data Science, empowering customers with informed decision-making. Through fostering the AI ecosystem with data-driven, open-source technology solutions, we aim to benefit businesses, customers, and stakeholders and make AI available for everyone.

Our flagship platform ‘UnifyAI’ aims to streamline the data engineering process, provide a unified pipeline, and integrate AI capabilities to support businesses in transitioning from experimentation to full-scale production, ultimately enhancing operational efficiency and driving growth.
