Semi-Supervised Learning for Classification with Code

Understanding self-learning algorithm with example

Mehul Gupta
Data Science in your pocket

--


In the real world, supervised problems like classification and regression face a significant limitation: the availability of labeled data to train a model.


Labeled data isn’t as easily available as one might think. Hence, you either need to

Employ a team for data labelling

Or label the data yourself

In this post, we will walk through how to train a classifier when we have a few labeled samples but mostly unlabeled data.

Semi-supervised learning has been around for some time now and is mainly used for tasks where we have ample unlabeled data alongside a small set of labeled samples.

The algorithm we will be discussing is the self-learning (self-training) algorithm. Throughout, I will denote the labeled data as X and the unlabeled data as Y.

Self-Learning

The idea is straightforward,

Train some Classifier C using X as training data

Get prediction probabilities for Y using C

Get the top ’n’ samples from Y for which C is very confident (the probability for one class is very high, say >0.9)

Append these ’n’ samples to X and remove these samples from Y

For example, if X has 300 samples, Y has 1000 samples, and C is confident about 50 samples from Y, the updated X has 350 samples and Y has 950 samples

Repeat the above 4 steps until C is not confident about any remaining sample in Y, i.e. every maximum class probability is below the threshold

Finally, train the model using the updated X and get metrics on the validation set

So, if you notice, at every iteration we label some portion of Y using a model trained on X, assuming that when the model is very confident about a sample, the predicted class is its actual class.

Would this improve results? Let’s check out the demo code below.

  1. Load a dataset. For this exercise, I am generating a fake dataset using sklearn.datasets for a multi-class classification problem with 4 classes and 10 features
from sklearn.datasets import make_classification

# Generate fake classification data
n_samples = 1000  # Number of samples
n_features = 10   # Number of features
n_classes = 4     # Number of classes

X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_features,
    n_redundant=0,
    n_classes=n_classes,
    random_state=42
)

2. Doing the train-test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train_true, y_test_true = train_test_split(X, y, test_size=0.2, random_state=42)

3. Next, we will remove labels for half of the samples in the training dataset for the sake of demonstration. Hence, we will have 400 labeled samples and 400 unlabeled samples

X_label, y_label = X_train[:400], y_train_true[:400]
X_unlabelled = X_train[400:]

4. Apply Self Learning algorithm

import pandas as pd
import numpy as np
import xgboost as xgb

# sklearn.datasets generates np.arrays. Converting to pandas
X_label = pd.DataFrame(X_label)
y_label = pd.Series(y_label)
X_unlabelled = pd.DataFrame(X_unlabelled)

# Self-learning algorithm
while True:
    model = xgb.XGBClassifier(
        objective='multi:softmax',
        num_class=n_classes,
        random_state=42)

    # Train the model on the current labeled set
    model.fit(X_label, y_label)

    # Get probability predictions on the unlabeled data
    X_unlabelled.reset_index(drop=True, inplace=True)
    y_pred = model.predict_proba(X_unlabelled)

    # Get samples where the probability is > 0.9 for at least one class
    index = [i for i, x in enumerate(np.max(y_pred, axis=1)) if x > 0.90]
    if len(index) == 0:
        break

    temp = X_unlabelled.iloc[index]

    # Drop the high-probability samples from the unlabeled data
    X_unlabelled.drop(index, inplace=True)

    # Use the model's predictions as labels and append them to the labeled data
    pred = pd.Series(model.predict(temp))
    X_label = pd.concat([X_label, temp], ignore_index=True)
    y_label = pd.concat([y_label, pred], ignore_index=True)

5. Final training on updated labeled dataset

from sklearn.metrics import accuracy_score

model = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=n_classes,
    random_state=42
)

# Train the model on the expanded labeled dataset
model.fit(X_label, y_label)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test_true, y_pred)
print("Accuracy:", accuracy)

In comparison, training with the self-learning algorithm improved accuracy on the held-out test set by about 2% over training on the original labeled samples alone.
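For reference, the baseline number comes from training the same classifier on only the original 400 labeled samples. A minimal sketch of that comparison, reusing the variables defined above (exact numbers will vary by run and data split):

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Baseline: train on only the original 400 labeled samples, no self-learning
baseline_model = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=n_classes,
    random_state=42
)
baseline_model.fit(X_train[:400], y_train_true[:400])

# Evaluate on the same test set for a fair comparison
baseline_accuracy = accuracy_score(y_test_true, baseline_model.predict(X_test))
print("Baseline accuracy (labeled data only):", baseline_accuracy)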

Also, sklearn has its own implementation of the self-learning algorithm, SelfTrainingClassifier in sklearn.semi_supervised, which can be explored in the official docs.
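Below is a minimal sketch of how it could be used on the same fake dataset. Unlabeled samples are marked with -1 in the target vector, as sklearn’s semi-supervised API expects; the logistic regression base estimator and the 0.9 threshold are just example choices, not tuned settings.

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Mark the unlabeled half of the training set with -1
y_train_semi = np.copy(y_train_true)
y_train_semi[400:] = -1

# Wrap any classifier exposing predict_proba; threshold plays the role of our 0.9 cutoff
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_training.fit(X_train, y_train_semi)

print("Accuracy:", accuracy_score(y_test_true, self_training.predict(X_test)))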

Before wrapping up, another semi-supervised algorithm that is quite popular is

Label Propagation

I have already covered this in one of my previous posts on graph algorithms, where we assign labels to unlabeled data using majority voting by neighboring labeled samples. You can check how label propagation works for graphs in that video; the same logic applies to tabular classification data as well. Sklearn’s implementation (LabelPropagation in sklearn.semi_supervised) can be found in the official docs.
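For completeness, here is a minimal sketch of sklearn’s LabelPropagation on the same dataset, again using -1 to mark unlabeled samples (the knn kernel and neighbor count are example settings, not tuned choices):

import numpy as np
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score

# Mark the unlabeled half of the training set with -1, as before
y_train_semi = np.copy(y_train_true)
y_train_semi[400:] = -1

label_prop = LabelPropagation(kernel='knn', n_neighbors=7)
label_prop.fit(X_train, y_train_semi)

# transduction_ holds the labels inferred for every training sample, including the unlabeled half
print("First few inferred labels:", label_prop.transduction_[400:405])
print("Test accuracy:", accuracy_score(y_test_true, label_prop.predict(X_test)))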

With this, it’s a wrap.
