Bioinformatics #20: Myelodysplastic syndromes gene expression prediction using a machine learning linear model

Michael Anekson
7 min read · Apr 9, 2023


This week, I am finally going to write about a bioinformatics coding project. This project has been planned for a long time, because it would be weird to keep discussing bioinformatics without any coding activity.

I don’t want to write about bioinformatics web applications forever, because there is a lot more that can be done with programming. Without further ado, here is my first bioinformatics coding project that uses machine learning. I know it is a very simple project, but I think this article will be useful for anyone who wants to start a pilot bioinformatics programming project. You can use it as one of your references when learning bioinformatics.

Gene prediction analysis

There are many ways to explore bioinformatics through code, with inputs ranging from genomics data to clinical data. In my case, I am interested in transcriptomics data and machine learning applications. So I wondered: how can I integrate these two domains into something valuable for human life?

I searched through several articles and found an interesting one from Liu’s research team, who in 2019 wrote about predicting gene expression using a generalized linear regression model (GLM). Seeing that their paper has been cited more than a hundred times, I concluded that gene expression prediction is a useful application in biomedical science. A prediction model can be used to examine the relationship between a gene or a combination of genes (the input data) and a phenotype of interest, such as a promoter region or even another gene’s expression (the output data).

One of the machine learning models I want to apply is the linear model, which requires continuous input data and continuous output data. Since Liu’s research was built on a linear model, I am interested in recapitulating their modelling approach, but with a different dataset. The dataset I found is a myelodysplastic syndromes study that used microarrays to measure gene expression (Kim, et al., 2015).
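To make the idea concrete for anyone starting their first project, here is a minimal sketch of a linear model with continuous inputs and a continuous output in scikit-learn. The data here is toy data generated on the spot, not the MDS dataset.

# minimal linear regression sketch on toy continuous data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # 50 samples, 3 continuous "gene expression" features
y = 2.0*X[:, 0] - 1.5*X[:, 1] + 0.5*X[:, 2] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_)        # recovered weights, close to [2.0, -1.5, 0.5]
print(model.score(X, y))  # R-squared on the training data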

Myelodysplastic syndromes

In that study, Kim and colleagues investigated gene expression in several myelodysplastic syndromes patients because they wanted to compare the bone marrow environment of myelodysplastic syndromes patients with that of normal patients. To characterize the bone marrow environment, they took mesenchymal stem cells (MSCs) from each patient and measured gene expression through microarray analysis.

I checked their data on NCBI GEO to get the microarray data (input data), and I also checked the patients’ platelet counts (PLT) (output data). My goal is to find out whether the top 20 significant genes can be used to predict a patient’s platelet count. Hopefully, if the machine learning (ML) model performs well, the top 20 significant genes in this dataset could serve as biomarkers for platelet levels in myelodysplastic syndromes.

Analysis procedure

The analysis starts with differentially expressed genes analysis, because we want to find the 20 most significant genes in the microarray data when comparing myelodysplastic syndromes patients to normal patients. Then, we read the original paper (Kim, et al., 2015) to find each patient’s platelet count as the output data. Next, we integrate them into one spreadsheet and analyze it using the scikit-learn package in Python. For the scikit-learn instructions, I give credit to Robert Thas John’s article, which explains how to apply linear regression models in machine learning (John, 2018).

1. Differentially expressed genes analysis

Figure 1. The GEO2R interface, which allows users to analyze microarray data without programming knowledge. Source: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE61853

To perform differentially expressed genes (DEGs) analysis, we use GEO2R, which analyzes microarray data automatically (Barrett, et al., 2013). A nice feature of GEO2R is that it also provides the R source code, which we can modify further (Figure 1). In this case, I selected the samples and classified them into “normal” and “sick” patients. Then, I modified the code to obtain the raw gene expression data that will serve as input for the linear regression machine learning model.

Figure 2. Boxplot of the expression data from each sample. Source: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE61853

Before retrieving the gene expression data, we have to check whether the data has been normalized. To do so, we can look at the boxplot, where the medians of all samples are roughly equal (Figure 2). Then, we can extract the gene expression values of the top 20 DEGs. You can see the code here for further detail.

# load required libraries
library(GEOquery)   # provides getGEO()
library(limma)      # provides lmFit(), eBayes(), topTable()

# load series and platform data from GEO
gset <- getGEO("GSE61853", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL10558", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group membership for all samples
gsms <- "11111110000000"
sml <- strsplit(gsms, split="")[[1]]

# log2 transformation
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6] - qx[1] > 50 && qx[2] > 0)
if (LogC) {
  ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex)
}

# assign samples to groups and set up design matrix
gs <- factor(sml)
groups <- make.names(c("SYNDROMES","CONTROL"))
levels(gs) <- groups
gset$group <- gs
design <- model.matrix(~group + 0, gset)
colnames(design) <- levels(gs)

fit <- lmFit(gset, design) # fit linear model

# set up contrasts of interest and recalculate model coefficients
cts <- paste(groups[1], groups[2], sep="-")
cont.matrix <- makeContrasts(contrasts=cts, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)

# compute statistics and table of top significant genes
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol"))
tT[tT == ""] <- NA
tT <- tT[!is.na(tT$Gene.symbol),]
write.table(tT, file=stdout(), row.names=F, sep="\t")


# Top 20 significant DEGs
top_20_genes <- tT[1:20,]
top_20_genes_code <- rownames(top_20_genes) # extract the top 20 Illumina probe IDs for raw expression extraction

top_20_genes_expression <- ex[rownames(ex) %in% top_20_genes_code,]
top_20_genes_code %in% rownames(top_20_genes_expression) # check whether every gene expression ID was extracted correctly

2. Extracting the information from the original paper

After getting the top 20 expression data, we have to find the output that we want. In this case, I want to use the platelet count and determine whether the top 20 genes can predict it. In the paper, you will see a table that describes the patient information in detail; I extracted the data from there. I then combined the gene expression data and the output data into one integrated spreadsheet (Figure 3).

Figure 3. The integrated spreadsheet, consisting of the top 20 DEGs and the platelet counts as the output.
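Once the integrated spreadsheet is exported as a file, it can be loaded in Python with pandas. This is a minimal sketch; the file name and the PLT column name are assumptions based on Figure 3, not the exact names in my repository.

# load the integrated spreadsheet (hypothetical file and column names)
import pandas as pd

df = pd.read_csv("integrated_top20_plt.csv")

print(df.shape)   # samples x (20 gene columns + 1 PLT column)
print(df.head())  # quick look at the merged table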

3. Machine learning linear regression model

In this machine learning analysis, we use the integrated table directly. First, we split the data into a training set and a test set; a minimal split sketch follows below. Second, we fit a linear regression model.
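The split itself is not shown in the modelling snippet, so here is a minimal sketch, reusing the df from the loading sketch above and assuming the PLT column holds the platelet counts.

# split the integrated table into training and test sets
from sklearn.model_selection import train_test_split

X = df.drop(columns=["PLT"])   # top-20 gene expression columns
y = df["PLT"]                  # platelet count (output)

# hold out ~30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)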


#Linear regression modelling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

steps = [
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', LinearRegression())
]

pipeline = Pipeline(steps)

pipeline.fit(X_train, y_train)

print('Training score: {}'.format(pipeline.score(X_train, y_train)))
print('Test score: {}'.format(pipeline.score(X_test, y_test)))

We build a pipeline to make it easy to operate the machine learning workflow (John, 2018). I included polynomial features because I want to capture the contribution of each of the 20 genes, and their interactions, to the linear regression model. The result is quite bad: the training score is 100% while the test score is negative. This means something is wrong with the input data, which I will explain later. For now, we can use regularization to try to improve the test score.
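One way to see why this pipeline overfits so easily: PolynomialFeatures(degree=2) expands the 20 gene columns into 231 features (one bias term, 20 linear terms, 20 squared terms, and 190 pairwise interactions), far more than the handful of patients in this dataset, so the model can fit the training set perfectly while generalizing poorly. A quick check:

# count how many features a degree-2 polynomial expansion of 20 genes produces
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
n_features = poly.fit_transform(np.zeros((1, 20))).shape[1]
print(n_features)  # 231 = 1 bias + 20 linear + 20 squared + 190 interaction terms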

4. Regularization

Regularization is a machine learning technique for reducing overfitting. Overfitting usually happens when the training score is much higher than the test score (John, 2018), or when the fitted curve aligns too closely with the individual data points.

#Regularization using lasso regression
from sklearn.linear_model import Lasso

steps = [
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Lasso(alpha=0.2, fit_intercept=True))
]

lasso_pipe = Pipeline(steps)

lasso_pipe.fit(X_train, y_train)

print('Training score: {}'.format(lasso_pipe.score(X_train, y_train)))
print('Test score: {}'.format(lasso_pipe.score(X_test, y_test)))

In this article, I only show the lasso regularization technique, but there are several other techniques that you can see in my complete code. In this lasso regression result, the test score is 99% while the training score is 67%. Although the test score is now better than in the previous model, it is still strange, because the test score should not be higher than the training score. I will try to address this issue next week.
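For reference, ridge regression is one common alternative to lasso. The snippet below is only a sketch of the same pipeline pattern with an arbitrary, untuned alpha; it is not necessarily identical to what is in my repository.

#Regularization using ridge regression (illustrative sketch, alpha not tuned)
from sklearn.linear_model import Ridge

ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge(alpha=0.2, fit_intercept=True))
])

ridge_pipe.fit(X_train, y_train)

print('Training score: {}'.format(ridge_pipe.score(X_train, y_train)))
print('Test score: {}'.format(ridge_pipe.score(X_test, y_test)))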

References

  1. Liu, S., Lu, M., Li, H., & Zuo, Y. (2019). Prediction of Gene Expression Patterns With Generalized Linear Regression Model. Frontiers in genetics, 10, 120. https://doi.org/10.3389/fgene.2019.00120
  2. Kim, M., Hwang, S., Park, K., Kim, S. Y., Lee, Y. K., & Lee, D. S. (2015). Increased expression of interferon signaling genes in the bone marrow microenvironment of myelodysplastic syndromes. PloS one, 10(3), e0120602. https://doi.org/10.1371/journal.pone.0120602
  3. John, R. T. (2018). Regularization of Linear Models with SKlearn. Medium. https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2
  4. Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C. L., Serova, N., Davis, S., & Soboleva, A. (2013). NCBI GEO: archive for functional genomics data sets — update. Nucleic Acids Research, 41(D1), D991–D995. https://doi.org/10.1093/nar/gks1193
  5. Complete code: https://github.com/michaelanekson/gene-expression-prediction-project
