Decoding PU Learning: A Fusion of Statistics and Machine Learning

Aniruddha Mitra
Apr 17, 2023 · 8 min read


Let’s start with a few blunt questions:

Scenario-1: Despite the prevalence of diabetes, many patients go undiagnosed for years after the onset of the disease. This presents a challenge for healthcare professionals seeking to identify patients who may benefit from early intervention. One potential solution is to develop a classifier that can accurately distinguish between patients who have diabetes and those who do not. However, negative examples are difficult to define, as a wide range of factors could contribute to a patient not having diabetes. To address this, it may be necessary to focus on the core characteristics of diabetes in order to develop a generalizable solution that can reliably separate likely cases from unlikely ones. So essentially, what you have is a set of patients who have diabetes for sure, and a much larger set of patients who may or may not have diabetes.
The challenge then becomes how to identify potential cases of diabetes in this larger pool, which is a problem that can be tackled using machine learning techniques.

Scenario-2: Social media platforms like Facebook are designed to show users a personalized selection of content based on their interests and past behavior. However, this means that users are only exposed to a subset of the available content, and may miss out on items that they would have enjoyed if they had encountered them. To address this, it can be useful to consider the hypothetical scenario in which a user has seen all of the available content. This raises the question of which items the user would have liked, given the opportunity.

Scenario-3: Spam email classifiers are a common example of machine learning applications in the real world. However, spammers are aware of these classifiers and may attempt to craft their messages in a way that allows them to bypass the existing filters. This can lead to a situation where the spam messages fall on the other side of the decision boundary used by the classifier, requiring a re-tagging and re-training of the model to accurately identify the new spam messages. However, this process can be time-consuming and impractical to do regularly. Any intelligent alternative?

There could be multiple other scenarios which are built around one common core of having a set of data with a common label (by design positive), and a larger set of data that is completely unlabeled. In this case, the goal is to identify the labels for the unlabeled data, which can be critical for building accurate models and making informed decisions.

There are several techniques that can be used to tackle this problem, such as semi/self-supervised learning, active learning, and transfer learning, which can help to identify and prioritize the most informative examples for labeling.

Today, we’ll delve into the elegant and effective approach of PU-learning, which is specifically designed to tackle the challenge of labeling data in a simple yet powerful way.

Over-simplified representation of the data. Remember, during modelling we don’t know the label of green (1) or grey (0). All we know is that points from the yellow circles are labeled 1.

Now, given any point x, and assuming the labeled examples are a random sample of the positives, the key relationship is:

P(y=1 | x) = P(s=1 | x) / P(s=1 | y=1)

i.e. the probability that x is truly positive equals the probability that x is labeled, divided by the probability that a positive data-point is labeled.

We have now deconstructed the complex problem of identifying labels from a pool of unlabeled data into two separate, but still challenging, problems. Let’s see.

(1) P( a data-point is labeled) : This is a relatively straightforward task as we have a small portion of the dataset that is labeled and the majority of it that is unlabeled. We can build a classifier to predict the probability of any data point being labeled based on the labeled subset of data.

(2) P (a positive data-point is labeled): This is not as simple. But a little bit of complexity can open the door to simplicity. Stay with me.
Let’s play a trick now:

The only difference is that we hold out a small sample from the labeled (positive) data-set while training classifier-(1). “If” the hold-out is truly random, i.e. “if” it is a true representative of the yellow circle, then ideally leaving it out should not affect the model training (practically it does, and we’ll come to that later).
[The multiple ‘if’s and ‘representative’s are there to avoid invoking ‘Expectation’ :P ]

Classifier-(1) estimates P(a data-point is labeled). If we use it to predict the probability for the held-out positive-labeled data-points, the average of those predictions is an estimate of the average probability that a positive data-point is labeled, i.e. P(a positive data-point is labeled).

Thus, dividing (1) by (2), we get the intended estimate of the probability that any given point is positive.
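As a quick sanity check with made-up numbers (these are not from any real dataset): if classifier-(1) scores a point x at 0.09, and the held-out positives score 0.30 on average, the corrected estimate for x is 0.09 / 0.30 = 0.30.

# Toy numbers only, to illustrate the division trick
p_labeled_given_x = 0.09     # classifier-(1)'s score for a point x, i.e. P(x is labeled)
p_labeled_given_pos = 0.30   # average score on the held-out positives, i.e. P(a positive is labeled)
p_pos_given_x = p_labeled_given_x / p_labeled_given_pos
print(p_pos_given_x)         # ~0.30: the estimated probability that x is truly positive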

Well, are there any problems, or some scope for playing around?

Where can we make adjustments to maximize the benefits of this approach?

(1) The label-classifier (classifier-(1) in this article): This could be any model, and as good as you can make it. From LR to a DNN; alongside raw performance, proper ‘generalization’ would play a role.

(2) Hold-out from the positive-labeled data: This aspect is intriguing because it involves two opposing forces. If we hold out a significant portion of the labeled data, classifier-(1) may lose its effectiveness. Conversely, if we hold out only a small sample, classifier-(1) may learn the correct pattern, but the estimate of the probability that a positive data-point is labeled may still generalize poorly. (Heard about variance?) A small sketch after this list illustrates the trade-off.

(3) To tackle challenge-(2), or in general, what we can do:
- Try running multiple iterations keeping the set-up the same. Then what changes? Every time, the hold-out sample changes. This way, both classifier-(1) and the estimated probability that a positive data-point is labeled converge towards the truth.

(4) After completing the PU-learning epochs, some unlabeled data-points may be identified as confident positives. In such cases, these data-points can be added to the original positive-data pool and the algorithm re-run for further refinement (a sketch of this step appears after the worked example below).
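To make the trade-off in point (2) concrete, here is a minimal, self-contained sketch of my own (not part of the original recipe): on a synthetic dataset it repeats the hold-out step for a few hold-out fractions and prints how much the P(s=1|y=1) estimate fluctuates. The dataset, the 25% labeling fraction, the logistic-regression estimator and the fractions swept are all arbitrary illustrative choices; expect the spread to grow as the hold-out gets smaller.

# A toy experiment (illustrative assumptions throughout): how noisy is the
# P(s=1|y=1) estimate for different hold-out fractions?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y_true = make_classification(n_samples=2000, n_features=4, n_informative=3,
                                n_redundant=1, random_state=0)

# Simulate the PU setting: only 25% of the true positives carry a label (s=1)
s = np.zeros_like(y_true)
pos_idx = np.where(y_true == 1)[0]
s[rng.choice(pos_idx, size=int(0.25 * len(pos_idx)), replace=False)] = 1

for hold_out_ratio in [0.5, 0.25, 0.05]:
    estimates = []
    for _ in range(20):
        labeled = np.where(s == 1)[0]
        hold = rng.choice(labeled, size=max(1, int(hold_out_ratio * len(labeled))), replace=False)
        train = np.setdiff1d(np.arange(len(s)), hold)
        clf = LogisticRegression(max_iter=1000).fit(X[train], s[train])
        estimates.append(clf.predict_proba(X[hold])[:, 1].mean())
    print(f"hold_out_ratio={hold_out_ratio}: mean={np.mean(estimates):.3f}, std={np.std(estimates):.3f}")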

Let us now utilize a deliberately oversimplified example of data and code to fully grasp the concept:

# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb

# Load the data
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
data = pd.read_csv(url, header=None)

# Assign column names to the data
data.columns = ['F1','F2','F3','F4','Target']

# Define the features
features = ['F1','F2','F3','F4']

# Print the first five rows of the data
print(data.head())

# Split the data into training and testing sets
x_data = data[features]
y_data = data['Target']
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

# Train a classifier to create a baseline
model = xgb.XGBClassifier()
model.fit(x_train, y_train)

# Define a function to evaluate the results
def evaluate_results(y_test, y_predict):
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0))
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0))
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0))
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0))

# Make predictions on the testing set and evaluate the results
y_predict = model.predict(x_test)
evaluate_results(y_test, y_predict)
'''
Classification results:
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%
'''
# Create a copy of the data for PU learning
mod_data = data.copy()

# Randomly sample 25% of the positive examples and keep their indices
index_pos = mod_data[mod_data['Target']==1].sample(frac=0.25).index

# Create a new column as PU_Target
mod_data['PU_Target'] = "Unlabeled"

# Label only 25% of the positive examples as Positive (Rest remains as Unlabeled)
mod_data.loc[index_pos,'PU_Target'] = 'Positive'

# Print the cross-tabulation of the Target and PU_Target columns
print(pd.crosstab(mod_data['Target'], mod_data['PU_Target'], margins=True))
'''
PU_Target  Positive  Unlabeled   All
Target
0                 0        762   762
1               152        458   610
All             152       1220  1372
'''
# Our purpose is to identify the 458 hidden positives among the 1220 unlabeled points

# Define a function to fit a PU estimator
def fit_PU_estimator_AM(X, y, hold_out_ratio, estimator):

    # Extract the labeled positive elements and keep a random fraction aside
    X_labeled_pos = X[y==1]
    X_hold_out = X_labeled_pos.sample(frac=hold_out_ratio)

    # Extract the indices of the non-held-out elements
    idx_non_hold = list(set(X.index) - set(X_hold_out.index))

    # Remove the held-out elements from X and y
    X_non_hold = X.loc[idx_non_hold]
    y_non_hold = y.loc[idx_non_hold]

    # Fit the estimator on the unlabeled samples and part of the positive labeled ones,
    # in order to estimate P(s=1|X), i.e. the probability that an element is *labeled*
    estimator.fit(X_non_hold, y_non_hold)

    # Use the estimator to predict on the positive held-out set,
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)[:,1]

    # Save the mean probability
    prob_s1y1 = hold_out_predictions.mean()
    return estimator, prob_s1y1

def predict_PU_prob_AM(X, estimator, prob_s1y1):
    predicted_s = estimator.predict_proba(X)[:,1]
    return predicted_s / prob_s1y1

# test the PU estimation approach
report = []

predicted = np.zeros(len(x_data))
learning_iterations = 1001
for index in range(learning_iterations):
    # In each iteration only the hold-out sample is different, so pu_estimator & probs1y1 differ
    pu_estimator, probs1y1 = fit_PU_estimator_AM(X = mod_data[features],
                                                 y = mod_data['PU_Target'].map({'Unlabeled':0,'Positive':1}).astype('int'),
                                                 hold_out_ratio = 0.25,
                                                 estimator = xgb.XGBClassifier())
    predicted_index = predict_PU_prob_AM(mod_data[features], pu_estimator, probs1y1)
    #**** predicted_s and probs1y1 are 'uncalibrated' model outputs, i.e. not true probabilities,
    # so (predicted_s / probs1y1) is not guaranteed to stay within [0,1].
    # For us, comparison is good enough, thus we move ahead with rescaling.
    # A calibrated output would give a better result.
    predicted_index_scaled = MinMaxScaler().fit_transform(predicted_index.reshape(-1,1)).reshape(-1)
    predicted += predicted_index_scaled

    if(index % 100 == 0):
        print(f'Learning Iteration::{index}/{learning_iterations} => P(s=1|y=1)={round(probs1y1,2)}')

# In every iteration, the learnt classifier-(1) (estimator) is different & so is the P(s=1|y=1)
'''
Learning Iteration::0/1001 => P(s=1|y=1)=0.23000000417232513
Learning Iteration::100/1001 => P(s=1|y=1)=0.20000000298023224
Learning Iteration::200/1001 => P(s=1|y=1)=0.1599999964237213
Learning Iteration::300/1001 => P(s=1|y=1)=0.20999999344348907
Learning Iteration::400/1001 => P(s=1|y=1)=0.15000000596046448
Learning Iteration::500/1001 => P(s=1|y=1)=0.17000000178813934
Learning Iteration::600/1001 => P(s=1|y=1)=0.25999999046325684
Learning Iteration::700/1001 => P(s=1|y=1)=0.15000000596046448
Learning Iteration::800/1001 => P(s=1|y=1)=0.11999999731779099
Learning Iteration::900/1001 => P(s=1|y=1)=0.17000000178813934
Learning Iteration::1000/1001 => P(s=1|y=1)=0.18000000715255737
'''
# Taking the average over multiple iterations
mod_data['y_pos_pred_proba'] = predicted/(index+1)

# Checking the final probability
pd.pivot_table(mod_data, index='Target', columns='PU_Target', values='y_pos_pred_proba', aggfunc='median')
'''
PU_Target  Positive  Unlabeled
Target
0               NaN   0.000349
1           0.65017   0.032135
'''
# For the masked (unlabeled) positives, the model emits roughly 100 times higher probability than for the negatives
report = []

# It's interesting to check the model performance at different thresholds
for thre in np.linspace(0.2, 0.9, 100):
    p = precision_score(mod_data['Target'], mod_data['y_pos_pred_proba'] > thre)
    r = recall_score(mod_data['Target'], mod_data['y_pos_pred_proba'] > thre)
    f = f1_score(mod_data['Target'], mod_data['y_pos_pred_proba'] > thre)
    a = accuracy_score(mod_data['Target'], mod_data['y_pos_pred_proba'] > thre)

    report.append([thre, p, r, f, a])

report = pd.DataFrame(report, columns=['thre','P','R','F','A'])

import matplotlib.pyplot as plt
plt.plot(report['thre'],report['R'], label ='Recall')
plt.legend()
plt.show()
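Finally, picking up point (4) from earlier: once the averaged probabilities are in hand, the most confidently scored unlabeled points can be promoted into the positive pool and the whole loop re-run. The sketch below is my own illustration of that step, and the 0.6 cut-off is an arbitrary choice, not a tuned or recommended value.

# Promote confident unlabeled points to the positive pool (threshold is illustrative only)
confident_threshold = 0.6
confident_idx = mod_data[(mod_data['PU_Target'] == 'Unlabeled') &
                         (mod_data['y_pos_pred_proba'] > confident_threshold)].index

mod_data.loc[confident_idx, 'PU_Target'] = 'Positive'
print(f"Promoted {len(confident_idx)} unlabeled points to the positive pool")
# ...then re-run the fit_PU_estimator_AM / predict_PU_prob_AM loop on the updated PU_Target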

Credit: https://towardsdatascience.com/semi-supervised-classification-of-unlabeled-data-pu-learning-81f96e96f7cb This article served as a helpful guide for me to better understand this complex topic. With its help, I was able to simplify the concepts further and gain a deeper understanding of the code. I hope that by taking this step to further clarify and simplify the concept, I can help a wider audience understand and apply the principles of PU learning in their own work.

In conclusion, I acknowledge that the way I have expressed things in this article may have cost some coherence in the storyline I originally intended. However, I am hopeful that with your support and feedback, I will improve and continue to deliver content that is clear and easy to understand. Thank you for reading.

Be in touch!
LinkedIn: https://www.linkedin.com/in/aniruddhamitra/
