Introduction to Semi-supervised Learning [H2O.ai][Python]

Published in Tech Vision, Jan 28, 2017 (4 min read)

In supervised machine learning for classification, we use data-sets with a labeled response variable. But in big data analytics, labeled data-sets are hard to find, because labeling by hand can take humans a lot of time. So, in this article I am going to explain how we can use semi-supervised learning to overcome this problem.

What is Semi-Supervised Learning?

In simple terms, it is a combination of supervised and unsupervised learning: it uses a small amount of labeled data together with a larger amount of unlabeled data.

How to label data?

First, we use K-Means clustering to cluster the labeled data-set into as many clusters as there are label types in it. Then, based on the labeled data, we identify which cluster belongs to which label. Next, we label the rest of the unlabeled data-set by submitting it to the clustering model and giving each row the label of its cluster. Now we have a labeled data-set that can be used for supervised learning.
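
The labeling idea described above can be sketched in a few lines of plain Python. This is only an illustration, not the tutorial code: the cluster ids and class names are made-up values, and `map_clusters_to_labels` is a hypothetical helper that uses a simple majority vote.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, labels):
    """Map each cluster id to the most common label among its labeled rows."""
    votes = {}
    for cluster_id, label in zip(cluster_ids, labels):
        votes.setdefault(cluster_id, Counter())[label] += 1
    return {cluster_id: counter.most_common(1)[0][0]
            for cluster_id, counter in votes.items()}

# Hypothetical cluster assignments for seven labeled rows
labeled_clusters = [0, 0, 1, 1, 2, 2, 2]
labeled_classes = ['setosa', 'setosa', 'versicolor', 'versicolor',
                   'virginica', 'virginica', 'versicolor']

cluster_to_label = map_clusters_to_labels(labeled_clusters, labeled_classes)

# Hypothetical cluster assignments for four unlabeled rows
unlabeled_clusters = [1, 0, 2, 2]
pseudo_labels = [cluster_to_label[c] for c in unlabeled_clusters]
```

Here every unlabeled row simply inherits the label of the cluster it falls into, so one misplaced row in cluster 2 does not change that cluster's label.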

Problems With this Approach

We have labeled the data-set using clusters. But can we be sure that the clusters separate the labels correctly? No. Since clustering is unsupervised learning, it can separate the data based on some logic that is totally different from the one we want.

So, there is a risk of things going wrong. If we have identified the features that actually drive the separation we want, there is no problem. But if we introduce a feature that does not convey the original idea, or that is irrelevant to separating the labels as we want, it can mask the other important features and might generate invalid results.

Identifying the correct features is a must.
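
To see how an irrelevant feature can mask the useful ones, here is a small sketch using plain Euclidean distance on raw, unscaled values. The flower measurements and the extra third feature are invented for illustration. Standardizing the features reduces the effect of scale, but an irrelevant feature can still add noise to the distances.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (petal_length, petal_width), illustrative values
setosa = (1.4, 0.2)
other_setosa = (1.5, 0.3)
virginica = (6.0, 2.5)

# The same flowers with a hypothetical irrelevant feature appended
setosa_noisy = setosa + (120.0,)
other_setosa_noisy = other_setosa + (10.0,)
virginica_noisy = virginica + (118.0,)

# Without the extra feature the two setosa flowers are neighbours
same_class_is_closer = (euclidean(setosa, other_setosa)
                        < euclidean(setosa, virginica))

# With it, the setosa flower ends up closer to the virginica one
irrelevant_feature_dominates = (euclidean(setosa_noisy, virginica_noisy)
                                < euclidean(setosa_noisy, other_setosa_noisy))
```

Both flags come out true here, which is exactly the situation where clustering groups rows by the wrong logic.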

Tutorial with Iris Data-set

Let’s get some hands-on experience with the iris data-set. First, add the following header row to the data-set file (iris.csv).

sepal_length,sepal_width,petal_length,petal_width,class

Prerequisites

Python 2.7
Pandas [install relevant dependencies]
H2O.ai [install relevant dependencies]

01 Split Frame

import h2o

# Initialize server
h2o.init()

# Load data-set
data = h2o.import_file('iris.csv')

# Split input data frame into train, test and validate
train, test, validate = data.split_frame(ratios=[0.1, 0.8])

# Save train, test and validate data-sets as csv files
h2o.export_file(frame=train, force=True, path='train.csv')
h2o.export_file(frame=test, force=True, path='test.csv')
h2o.export_file(frame=validate, force=True, path='validate.csv')
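
Passing two ratios to split_frame produces three frames: roughly 10% of the rows for train, 80% for test, and the remaining 10% for validate. The split sizes are approximate rather than exact. The sketch below imitates that behaviour in plain Python to make the idea concrete. `split_rows` is a hypothetical helper, not part of H2O, and the random draw is only an assumption about how such a split can be done.

```python
import random

def split_rows(rows, ratios, seed=None):
    """Randomly split rows into len(ratios) + 1 parts; the last part gets the remainder."""
    rng = random.Random(seed)

    # Turn ratios such as [0.1, 0.8] into cumulative boundaries [0.1, 0.9]
    boundaries = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        boundaries.append(total)

    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        for i, boundary in enumerate(boundaries):
            if draw < boundary:
                splits[i].append(row)
                break
        else:
            # Draw fell past the last boundary: the row goes to the remainder split
            splits[-1].append(row)
    return splits

rows = list(range(150))  # stand-in for the 150 iris rows
train_rows, test_rows, validate_rows = split_rows(rows, ratios=[0.1, 0.8], seed=42)
```

Because each row is assigned by a random draw, the three sizes only come close to 15, 120 and 15, but every row lands in exactly one split.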

02 Create Machine Learning Model for Clustering

import h2o
import pandas as pd
from h2o.estimators import H2OKMeansEstimator

# Initialize server
h2o.init()

# Predefined variables
response_column = 'class'

# Import training dataset
input_data = pd.read_csv('train.csv')
del input_data[response_column]
input_frame = h2o.H2OFrame(input_data)
columns = list(input_frame.col_names)

# Define H2O model
model = H2OKMeansEstimator(k=3)
model.train(x=columns, training_frame=input_frame)

h2o.save_model(model=model, path='', force=True)
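
For readers who want to see what a K-Means model does conceptually, here is a minimal pure-Python sketch of the classic assign-and-update loop. It is not H2O's implementation. The points, the starting centroids and the helper names are all illustrative assumptions.

```python
def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    assignments = []
    for point in points:
        distances = [sum((a - b) ** 2 for a, b in zip(point, centroid))
                     for centroid in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments

def update(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == k]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append(tuple(
            sum(m[d] for m in members) / float(len(members))
            for d in range(len(old))))
    return new_centroids

def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and update steps until assignments stop changing."""
    assignments = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = update(points, assignments, centroids)
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return centroids, assignments

# (petal_length, petal_width), illustrative values
points = [(1.4, 0.2), (1.5, 0.2), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5), (5.9, 2.1)]
initial_centroids = [(1.0, 0.0), (4.0, 1.0), (6.5, 2.5)]
centroids, assignments = kmeans(points, initial_centroids)
```

Note that the cluster indexes carry no meaning by themselves, which is why the next step has to work out which label each cluster stands for.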

03 Cluster Labeling

import h2o
import pandas as pd

# Enter path to the model [Generated by you]
model_path = 'KMeans_model_python_1484507298817_1' 

# Calculate weight per label
def calculate_label_value(labels):
    label_value = {}
    for label in labels:
        if label in label_value:
            label_value[label] += 1
        else:
            label_value[label] = 1

    # Balancing
    for key in label_value.keys():
        label_value[key] = 1.0 / label_value[key]

    return label_value

# Create a panda frame including cluster data
def generate_cluster_data(model_path, input_data, response_data):
    h2o.init()

    input_data = pd.DataFrame(input_data)
    input_frame = h2o.H2OFrame(input_data)
    response_data = pd.DataFrame(response_data)

    cluster_data = pd.DataFrame()

    model = h2o.load_model(model_path)
    predictions = model.predict(test_data=input_frame)
    predictions = predictions.as_data_frame(use_pandas=True)

    cluster_data['cluster_index'] = predictions
    cluster_data['cluster_label'] = response_data

    return cluster_data

# Label cluster based on max vote
def label_cluster(cluster_data, label_values):
    label_values = dict(label_values)
    cluster_data = pd.DataFrame(cluster_data)

    cluster_vote = {}
    for i in range(len(cluster_data.index)):
        index = cluster_data.iloc[i, 0]
        name = cluster_data.iloc[i, 1]

        if index in cluster_vote:
            if name in cluster_vote[index]:
                cluster_vote[index][name] += label_values[name]
            else:
                cluster_vote[index][name] = label_values[name]
        else:
            # First row seen for this cluster: count its vote as well
            cluster_vote[index] = {name: label_values[name]}

    labeled = [] # Contains all the used labels
    for index in cluster_vote.keys():
        max_vote = 0.0
        max_label = None
        for label in cluster_vote[index].keys():
            if label in labeled:
                continue
            current_vote = cluster_vote[index][label]
            if current_vote > max_vote:
                max_vote = current_vote
                max_label = label
        cluster_vote[index] = max_label
        labeled.append(max_label)

    return cluster_vote


response_column = 'class'
validation_frame = pd.read_csv('validate.csv')
response_column_data = validation_frame[response_column]

# Calculate label values
label_values = calculate_label_value(response_column_data)

# Generate cluster data
input_data = pd.read_csv('validate.csv')
response_data = validation_frame[response_column]
del input_data[response_column]
cluster_data = generate_cluster_data(model_path=model_path, input_data=input_data, response_data=response_data)

# Label cluster
cluster_labels = label_cluster(cluster_data, label_values)
print cluster_labels

# Performance test
h2o.init()
input_data = pd.read_csv('test.csv')
response_data = list(input_data[response_column])
del input_data[response_column]
input_frame = h2o.H2OFrame(input_data)

model = h2o.load_model(model_path)
predictions = model.predict(test_data=input_frame)
h2o.export_file(frame=predictions, path='prediction.csv', force=True)
predictions = list(predictions.as_data_frame(use_pandas=True)['predict'])

for i in range(len(predictions)):
    predictions[i] = cluster_labels[predictions[i]]

match_count = 0
for i in range(len(predictions)):
    if predictions[i] == response_data[i]:
        match_count += 1
    else:
        print 'actual :', response_data[i], ' predicted :', predictions[i]

# Display performance result
print 'match', match_count
print 'mismatch', len(predictions) - match_count
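
The balancing step in calculate_label_value gives each label a weight of 1 / count, so a label that is frequent in the labeled sample cannot win a cluster just by being frequent. The standalone sketch below compares a raw majority vote with the weighted vote. The label counts and cluster ids are made up for illustration and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def inverse_frequency_weights(labels):
    """Weight each label by 1 / (number of times it appears)."""
    counts = Counter(labels)
    return {label: 1.0 / count for label, count in counts.items()}

def weighted_votes(cluster_ids, labels, weights):
    """Sum the label weights per cluster."""
    votes = defaultdict(lambda: defaultdict(float))
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += weights[label]
    return votes

# Imbalanced labeled sample: 8 versicolor rows, 2 virginica rows
labels = ['versicolor'] * 8 + ['virginica'] * 2
# Cluster 0 holds 5 versicolor rows; cluster 1 holds 3 versicolor and both virginica rows
cluster_ids = [0] * 5 + [1] * 3 + [1] * 2

weights = inverse_frequency_weights(labels)
votes = weighted_votes(cluster_ids, labels, weights)

weighted_winner = {c: max(v, key=v.get) for c, v in votes.items()}

# Raw majority vote, for comparison
raw_counts = defaultdict(Counter)
for cluster_id, label in zip(cluster_ids, labels):
    raw_counts[cluster_id][label] += 1
raw_winner = {c: counter.most_common(1)[0][0] for c, counter in raw_counts.items()}
```

With a raw count, cluster 1 would be labeled versicolor (3 rows against 2). With the weights, it is labeled virginica, because it contains every virginica row in the sample.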

Challenges

Sometimes it is hard to separate the clusters the way we want. In semi-supervised learning we used unsupervised learning to divide the data into groups, so, unlike in supervised learning, there might not be features significant enough to drive the decision.

If we can tune the clustering model well enough for a particular problem, this approach can come close to the strength of supervised learning.