Seeking signal in the midst of noise with R

Yen-Chung Chen
Jul 14, 2018 · 6 min read
Photo by Martin Barák on Unsplash
# Generating subject characteristics# Disease state
disease <- c(0,0,0,0,1,1,1,1)
# Sex
set.seed(1629)
sex <- as.numeric(runif(n = 8, min = 0, max = 1) > 0.5)
# Hidden confounding factor for each N
set.seed(714)
hcf <- runif(n = 8, min = 0, max = 1)
# Age
set.seed(2018)
age <- round(runif(n = 8, min = 0, max = 1), digits = 2)
Basic characteristics of the simulated subjects
Simulating expression that correlates with multiple factors
# Saving the genes and the random generated coefficients in two lists
feature <- list()
coef <- list()
# Generating 100 genes
for (i in seq(100)) {
set.seed(i)
# Generate 10 genes that is related to disease
if (i <= 10) {
coef_sex <- runif(n = 1, min = 0, max = (1 - i/10))
coef_hcf <- runif(n = 1, min = 0, max = (1 - i/10 - coef_sex))
coef_age <- 1 - (i/10 + coef_sex + coef_hcf)
noise <- runif(1, min = 0, max = 0.5)
feature[[i]] <- disease * (i / 10) + sex * coef_sex + age * coef_age + hcf * coef_hcf + noise
coef[[i]] <- c(i/10, coef_sex, coef_age, coef_hcf, noise)
next
}
# The rest 90 genes are no related to disease state
coef_sex <- runif(n = 1, min = 0, max = 1)
coef_hcf <- runif(n = 1, min = 0, max = (1 - coef_sex))
coef_age <- 1 - (coef_sex + coef_hcf)
noise <- runif(1, min = 0, max = 0.5)
feature[[i]] <- sex * coef_sex + age * coef_age + hcf * coef_hcf + noise
coef[[i]] <- c(0, coef_sex, coef_age, coef_hcf, noise)
}
# Formatting the expression matrix
expression <- do.call("rbind", feature)
coefficent <- do.call("rbind", coef)
colnames(expression) <- c("wt1", "wt2", "wt3", "wt4", "pt1", "pt2", "pt3", "pt4")
colnames(coefficent) <- c("Disease", "Sex", "Age", "HCF", "Noise")
expression <- data.frame(expression)
expression$gene <- c(
sapply(seq(10), function(x) paste("real", x, sep = "_")),
sapply(seq(90), function(x) paste("noise", x, sep = "_")))
The script above generated something like this.
library(limma)log_exp <- log2(expression[, -9])
log_exp$gene <- expression$gene
design <- model.matrix(~ 0+factor(c(0,0,0,0,1,1,1,1)))
colnames(design) <- c("Ctrl","Patient")
fit <- lmFit(log_exp[, -9], design)
fit$genes$ID <- log_exp$gene
cont.matrix <- makeContrasts(PatientvsCtrl = Patient-Ctrl, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
volcanoplot(fit2, highlight = 5)
tt <- topTable(fit2, number = 20, adjust = "BH")
# Reverse feature selection with caret
library(caret)
exp_rfe <- t(expression[, -9])
colnames(exp_rfe) <- expression$gene
ctrl <- rfeControl(functions = rfFuncs,
method = "repeatedCV",
repeats = 10, verbose = TRUE)
type_exp <- factor(c(0,0,0,0,1,1,1,1), labels = c("0", "1"))
lmProfile_exp <- rfe(x=exp_rfe, y=type_exp,
sizes = c(1:30),
rfeControl = ctrl)
plot(lmProfile_exp, type = c("g", "o"))
predictors(lmProfile_exp)
Reverse feature selection with caret (N = 80)
  1. With a small sample size, result of reverse feature selection with caret and limma are not much different than each other.
  2. When effect size is small, increasing sample number helps a bit in limma (adjusted p value goes down with increased sample number). caret seems to benefit more from larger sample number (because it’s machine learning?), but even with more samples, false positive is quite concerning.

biosyntax

Notes, thoughts, and random experiments in life science.

Yen-Chung Chen

Written by

A learning developmental biologist

biosyntax

biosyntax

Notes, thoughts, and random experiments in life science.