Using suitable machine learning algorithms to predict disease phenotypes: supervised learning, part 1

James Dai
6 min read · Feb 4, 2023


Artificial intelligence has revolutionized healthcare over the past five years. As a medical practitioner and a coder, I believe this momentum will continue for decades. For doctors who want to make a small contribution to this field, I will share my experience here.

Introduction

In recent years, I have contributed to several articles published on PubMed that use machine learning (ML) models for disease prediction. Besides an ongoing international deep learning project, my previous work includes: 1. using radiomics to predict local response to stereotactic body radiotherapy in hepatocellular carcinoma patients; and 2. using well-established feature selection methods (Cox regression filtering, LASSO, Ridge, ElasticNet, and others) to select candidate variables for model construction. Beyond case-control imbalance and small sample sizes, which had to be addressed by identifying or “generating” more data, the main problem I faced was choosing an appropriate ML algorithm for a specific research purpose or data structure/dimension. In this series, I will therefore walk through some common ML algorithms, discuss their strengths and weaknesses, and introduce some newer models suited to biomedical research.

Supervised ML algorithms

ML is a family of algorithms that learn and improve from input data, leading to more accurate predictions over time. These algorithms can be divided into three categories: supervised, unsupervised, and semi-supervised.

In supervised machine learning, the algorithm is first trained on a labeled training dataset. The trained algorithm is then applied to an unlabeled test dataset to assign each instance to a group. The figure below demonstrates how supervised machine learning separates lupus nephritis from non-lupus nephritis patients. This type of machine learning suits two kinds of problems: classification and regression. In classification, the output variable is discrete, like “mutation” or “wild-type”. In regression, the output is a real value, like the risk of developing a disease or the risk of death within a certain period of time.

Image by Author

Logistic regression

Logistic Regression (LR) is a commonly used method for supervised classification in ML. It extends regular regression to model binary variables, such as the occurrence or non-occurrence of an event. LR calculates the probability that a new instance belongs to a particular class. The result is a value between 0 and 1, which can be used to classify the instance as class A or class B by setting a threshold (e.g., if the probability is above 0.50, the instance is classified as class A; otherwise, class B). LR can also be generalized to model categorical variables with more than two values, which is called multinomial logistic regression.

To understand LR, we first need the concept of maximum likelihood, which I covered in my previous article.

Basically, using LR to predict classes means identifying the sigmoid curve that best fits the data: the parameters are chosen so that the likelihood function, i.e., the probability of generating the observed data, is maximized.
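For reference, the model and the quantity it maximizes can be written compactly (standard notation, with coefficients $\beta$ and labels $y_i \in \{0, 1\}$):

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$

$$\ell(\beta) = \sum_{i=1}^{n} \left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$$

Fitting the model means choosing the $\beta$ values that maximize $\ell(\beta)$.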

LR is simple to explain, easy to use, and a staple of ML. Let’s dive straight into an example to see how well it predicts our targets.

Using inflammasome-related genes to predict survival in early-stage non-small cell lung cancer (NSCLC) in R

# Point R to the folder containing the data ('data_pathway' is a placeholder)
setwd('data_pathway')
# Load the expression matrix; the first column holds sample IDs as row names
df <- read.csv('Discovery_IFS_death.csv', row.names = 1)

This dataset comprises 467 resected lung tumor samples from the Gene Expression Omnibus (GEO), with normalized expression of 138 core inflammasome genes derived from a DNA microarray platform.

For the target definition, 0 means the patient was alive at two years and 1 means the patient died within two years. Our aim is to construct a multivariable LR model that predicts whether a patient survives.
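Before fitting anything, it is worth a quick sanity check of the data (a minimal sketch, assuming df and its dat_group column are loaded as above):

dim(df)               # expect 467 samples by 139 columns (138 genes + dat_group)
table(df$dat_group)   # class balance: 0 = alive at two years, 1 = dead
anyNA(df)             # confirm there are no missing values before calling glm()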

Here is the code:

# Fit a logistic regression of two-year survival on all 138 genes
logistic_lung <- glm(dat_group ~ ., data = df, family = 'binomial')
summary(logistic_lung)
  1. glm stands for generalized linear model.
  2. Setting the family to “binomial” tells glm that we are doing logistic regression.
  3. The formula dat_group ~ . means we use all remaining variables to predict survival.

Here is what we get:

Notes: the first estimate, the intercept, is the log-odds (logit-transformed risk) of death when all predictors equal zero.

Deviance residuals represent the signed square root of each data point’s contribution to the overall residual deviance.

The second through final estimates are the slopes (log-odds coefficients) for each variable.

We can see the deviance residuals are centered around zero with a fairly symmetric distribution, which suggests a reasonable fit.
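If your version of R no longer prints the deviance-residual block in the summary, you can reproduce it directly (a one-line sketch):

# Five-number summary of the deviance residuals of the fitted model
summary(residuals(logistic_lung, type = "deviance"))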

Then we get a table of the variables and their significance. EREG, NPY1R, RAB32, PTGER4, STEAP4, LACC1, PPP1R12B, ATP10D, NNT, CAMK2B, NLRC5 and MEFV emerge as inflammasome genes significantly associated with survival.
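Rather than reading the significant genes off the printout, you can extract them programmatically (a minimal sketch; the 0.05 cutoff is my assumption):

# Pull out coefficients with p < 0.05 from the fitted model
coefs <- summary(logistic_lung)$coefficients
sig <- coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]
sig[order(sig[, "Pr(>|z|)"]), ]   # most significant first
exp(sig[, "Estimate"])            # odds ratios per unit change in expression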

head(predict(logistic_lung, type = "response"))
#type="response" option tells R to output probabilities of the form P(Y = 1|X)

If we want the output to be “1” or “0”:

logistic_lung_pred = ifelse(predict(logistic_lung, type = "response") > 0.5, 1, 0)

Once we have classifications, we can calculate metrics such as the classification error rate.

calc_survival_err = function(actual, predicted) {
  # proportion of wrong predictions
  mean(actual != predicted)
}

calc_survival_err(actual = df$dat_group, predicted = logistic_lung_pred)

We get 0.1991435, which is not bad! (Keep in mind this is computed on the same data the model was trained on, so it is likely optimistic.)
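A fairer check is to hold out part of the cohort before fitting. Here is a hedged sketch (the 70/30 split and the seed are my arbitrary choices):

# Split samples into training and held-out test sets
set.seed(42)
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
fit <- glm(dat_group ~ ., data = df[train_idx, ], family = "binomial")
# Evaluate only on samples the model has never seen
test_prob <- predict(fit, newdata = df[-train_idx, ], type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)
calc_survival_err(actual = df$dat_group[-train_idx], predicted = test_pred)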

Then we can obtain more metrics:

train_tab = table(predicted = logistic_lung_pred, actual = df$dat_group)

library(caret)
train_con_mat = confusionMatrix(train_tab)

c(train_con_mat$overall["Accuracy"],
  train_con_mat$byClass["Sensitivity"],
  train_con_mat$byClass["Specificity"])

What if we use only one variable for prediction?

We first check whether MEFV alone yields a significant result:

logistic_lung_mefv <- glm(dat_group~MEFV, data = df, family = 'binomial')
summary(logistic_lung_mefv)

It seems that this gene might have potential for predicting survival in NSCLC patients.

logistic_lung_pred = ifelse(predict(logistic_lung_mefv, type = "response") > 0.5, 1, 0)

calc_survival_err(actual = df$dat_group, predicted = logistic_lung_pred)

train_tab = table(predicted = logistic_lung_pred, actual = df$dat_group)

train_con_mat = confusionMatrix(train_tab)

c(train_con_mat$overall["Accuracy"],
  train_con_mat$byClass["Sensitivity"],
  train_con_mat$byClass["Specificity"])

However, even though sensitivity improves slightly, accuracy and specificity drop dramatically. This suggests that incorporating all significant factors is important for model performance.

We can then visualize how poorly the LR model predicts the target from MEFV alone:

# Plot the outcome (0/1) against MEFV expression
plot(dat_group ~ MEFV, data = df,
     col = "darkorange", pch = "|", ylim = c(-0.2, 1),
     main = "Using Logistic Regression for Classification")
abline(h = 0, lty = 3)    # reference lines at the two outcome values
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)  # the 0.5 classification threshold
# Overlay the fitted sigmoid curve
curve(predict(logistic_lung_mefv, data.frame(MEFV = x), type = "response"),
      add = TRUE, lwd = 3, col = "dodgerblue")
# Vertical line at the decision boundary, where predicted probability = 0.5
abline(v = -coef(logistic_lung_mefv)[1] / coef(logistic_lung_mefv)[2], lwd = 2)

The fitted sigmoid curve looks poor!

Advantages and disadvantages of LR

Advantages:

  1. Very easy to implement, explain and train.
  2. High accuracy on simple datasets; it performs well when the data are linearly separable.

Disadvantages:

  1. If the number of features is much larger than the number of samples, LR should be avoided to prevent overfitting. Our dataset does not have this problem, but if I reduced the sample count drastically (say, to fewer than 10 samples per feature), overfitting could happen, as sketched below.
  2. Non-linear problems cannot be solved with LR, so we have to examine the structure of the dataset at the outset.
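To make point 1 concrete, here is a hedged illustration: refitting all 138 genes on a tiny random subsample lets glm() essentially memorize the training data (the subsample size of 30 is my arbitrary choice; expect warnings about fitted probabilities of 0 or 1):

# Far fewer samples than features: 30 samples vs. 138 genes
set.seed(1)
small_df <- df[sample(nrow(df), 30), ]
overfit <- glm(dat_group ~ ., data = small_df, family = "binomial")
# Training error collapses toward 0: the model memorized, not learned
overfit_pred <- ifelse(predict(overfit, type = "response") > 0.5, 1, 0)
calc_survival_err(actual = small_df$dat_group, predicted = overfit_pred)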

If you like my article, please give me some support! We need your claps for more motivation!
