DESeq2 and edgeR should no longer be the default choices for large-sample differential gene expression analysis

Complexities should not be added unless necessary.

Jingyi Jessica Li
Towards Data Science

--

Assumptions may not be needed for large-sample data.

Author: fei.zhao (思考问题的熊)

Original post: https://kaopubear.top/blog/2022-03-20-donot-use-deseq2-edger-in-human-population-samples

Translated by Xinzhou Ge and Jingyi Jessica Li (UCLA) with the author’s permission

For our shorter introduction to our Genome Biology article, please see https://towardsdatascience.com/a-large-sample-crisis-or-not-640224020757

During my Ph.D. years, I enjoyed evaluating and comparing various bioinformatics software packages when I was not busy. Sometimes I would apply two or three differential analysis methods to transcriptome samples. When I wanted to be rigorous, I would take the intersection of these methods’ identified differentially expressed genes (DEGs); when I wanted to be “savvy,” I would pick the method whose results seemed to be the “best” (presumably you did the same).

In the year 2021, after I switched my research field from plants to cancer, when I analyzed large-sample datasets such as the TCGA data, I often directly used the Wilcoxon rank-sum test to compare tumor samples with normal samples to find DEGs. The reason was nothing but this classic test’s indefectible computational efficiency. Of course, for most of the time, I would use both the Wilcoxon rank-sum test and DESeq2, the advantage of which was that I could keep DESeq2 running for large-sample data and leave my seat for some rest.

In the year 2022, a few days ago, a paper published in Genome Biology used relatively rigorous arguments to suggest that the simple Wilcoxon rank-sum test should replace DESeq2 and edgeR in large-sample RNA-seq differential expression analyses, even when computational time is not a concern.

Here is the paper link (https://doi.org/10.1186/s13059-022-02648-4). You may read the original paper directly or continue reading my article.

Li, Yumei, Xinzhou Ge, Fanglue Peng, Wei Li, and Jingyi Jessica Li. “Exaggerated False Positives by Popular Differential Expression Methods When Analyzing Human Population Samples.” Genome Biology 23, no. 1 (March 15, 2022): 79.

How many replicates are needed for RNA-seq experiments?

Since the advent of RNA-seq, the most basic but critical application has been the identification of DEGs by comparing two groups (and sometimes multiple groups): tumor and normal tissue samples, cell lines under different treatment conditions, etc.

In practice, we usually measure only 3 replicates of high-throughput transcriptome data, and the “replicates” here refer to biological replicates rather than technical replicates. Sometimes 2 replicates would do, or even 1 replicate can be handled by some software.

Why 3 replicates? The reason is that “the third time is the charm,” and that material and budget are limited. Years ago, several research papers discussed how many replicates would be sufficient; however, they did not reach a consensus: some said at least 6 replicates, some said 4 replicates would be sufficient, and some said that finding all DEGs would require at least 12 replicates (see screenshot below).

Screenshot from https://rnajournal.cshlp.org/content/22/6/839.short

In 2016, a highly cited review paper about RNA-seq analysis, which was also published in Genome Biology, gave the following table with sample size recommendations. In short, at least 3 replicates are needed.

Screenshot from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
Screenshot from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

In fact, most of the previous analyses on the number of replicates were conducted based on popular software such as DEseq2 and edgeR. Now it is interesting to see that DESeq2 and edgeR are not recommended for large-sample analysis.

DEseq2 and edgeR are not suitable for large samples

From the above discussion on sample size, we can see that the biggest challenge of transcriptome differential analysis has always been the small sample size: are three replicates the charm?

Facing this small-sample problem, statistical methods and software were developed based on various parametric distributional assumptions about RNA-seq data. The two most popular methods are DEseq2 and edgeR, whose core assumption is that every gene’s sequence read counts follow the negative binomial distribution under one condition. Based on this assumption, they normalize the raw read counts according to their own set of logic. Another method DEGseq assumes the Poisson distribution instead of the negative binomial distribution, and it may be used when there is only one replicate per condition.

Moreover, software such as DEseq2 and edgeR inherently assume that samples under the two conditions are largely the same: most genes are not differentially expressed. This brings up the first question: in population-level RNA-seq studies where the sample sizes are from tens to thousands, can we still believe that most genes are not differentially expressed?

To evaluate the ability of DESeq2 and edgeR to identify DEGs, Li et al. (the authors of this new Genome Biology paper) tested DESeq2 and edgeR on 13 population-level RNA-seq datasets with total sample sizes ranging from 100 to 1376. The analysis revealed that DESeq2 and edgeR identified vastly different DEGs on these datasets.

Supplementary Figure from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

From the results in the above figure, it can be seen that on the immunotherapy dataset only 8% of the DEGs identified by DESeq2 or edgeR were consistent (identified by both methods). On the other population-level RNA-seq datasets, DESeq2 identified a lot more DEGs than edgeR did.

When things go wrong, there must be demons. To see what the problem is, we need to conduct an in-depth investigation from the perspective of false discovery rate (FDR) control. Large samples have an inherent advantage for FDR evaluation: they can be permuted to generate negative-control samples, on which FDR can be estimated.

Hence, Li et al. permuted the two groups of samples, so that each new group contained samples from the two original groups. By repeating the random permutation 1000 times, Li et al. generated 1000 permuted datasets from the original dataset.

The results are thrilling.

First, DESeq2 and edgeR had 84.88% and 78.89% probabilities, respectively, to identify more DEGs from the permuted datasets than from the original dataset.

Second, of the 144 and 319 DEGs identified by DESeq2 and edgeR respectively from the original dataset, 22 (15.3%) and 194 (60.8%) were identified from at least 50% of the permuted datasets and thus spurious. That is, DESeq2 and edgeR had too many false positives.

Most importantly, Li et al. found that the genes with larger fold changes were more likely identified by DESeq2 and edgeR from the permuted datasets. In practice, we all tend to think that the genes with larger fold changes are more likely important, but the truth is that these genes are not necessarily differentially expressed.

One more thing here. If you are curious about what biological functions are enriched in those false-positive DEGs? Hmm… immune-related.

Figure 1 from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

Why are DEseq2 and edgeR not working?

Why did DESeq2 and edgeR find so many false-positive DEGs from this immunotherapy dataset? The most immediate guess is that gene counts in this dataset no longer fit the negative binomial distribution assumed by DESeq2 and edgeR.

To test this hypothesis, Li et al. chose two sets of genes. One set contained the genes identified as DEGs from ≥20% permuted datasets; the other set consisted of genes identified as DEGs from ≤0.1% permuted datasets. Evaluating the goodness-of-fit of the negative binomial model to each set of genes, Li et al. revealed that the model fit was indeed worse for the first set of genes, consistent with the fact that these genes were false positives.

Further, why did the negative binomial model fit poorly? The authors examined all genes that were misidentified as DEGs in at least 10% of the permuted datasets. All these genes had outlier measurements relative to the negative binomial models fitted by DESeq2 or edgeR.

Supplementary Figure from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

In parametric testing methods like edgeR and DESeq2, the null hypothesis is that a gene has the same mean expression under two conditions. Therefore, their analysis results would be seriously affected by the presence of outliers. In contrast, the Wilcoxon rank-sum test performs better in the presence of outliers because its null hypothesis is that a gene under one condition has an equal probability to have higher or lower expression than under another condition. That is, the Wilcoxon rank-sum test concerns more about the gene expression ranks than the actual gene expression levels.

Beyond DESeq2 and edgeR, on the immunotherapy dataset, Li et al. also compared several other representative methods, among which limma-voom is a parametric test like DESeq2 and edgeR; NOISeq and dearseq (a recently published method designed for large samples) are nonparametric tests like the Wilcoxon rank-sum test.

In terms of what methods found fewer false-positive DEGs, DESeq2 and edgeR clearly lost the competition, and the other methods performed well on this immunotherapy dataset. What if we compare these methods in terms of their power for finding true DEGs?

Li et al. generated 50 semi-synthetic datasets (with known true DEGs and non-DEGs) from each of the 12 GTEx and TCGA datasets. As shown in the figure below, only the Wilcoxon rank-sum test could consistently control the FDR in the FDR threshold range from 0.001% to 5%. Furthermore, in terms of power at the actual FDR, the Wilcoxon rank-sum test outperformed the other five methods. In the figure below, the blue line shows the Wilcoxon rank-sum test, and the yellow and purple dashed lines correspond to DEseq2 and edgeR.

Figure 2A from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

When to use the Wilcoxon rank-sum test?

Since the Wilcoxon rank-sum test is suitable for DEG analysis of large samples, the key question is what sample size can be considered large?

To investigate how sample size affects the performance of the six DEG identification methods, Li et al. down-sampled each dataset to obtain datasets with sample sizes ranging from 2 to 100 per condition.

From the figure below, at the 1% FDR threshold, we should not use the Wilcoxon rank-sum test when the sample size per condition is less than 8. Meanwhile, when the sample size for each condition exceeded 8, the Wilcoxon rank-sum test achieved comparable or better power performance compared to the three parametric testing methods (DESeq2, edgeR, and limma-voom) and the other two nonparametric testing methods.

Considering that the proportion of DEGs among all genes might have an impact on the results, Li et al. also generated datasets with 5 DEG proportions (1%, 3%, 5%, 9%, and 20%), and the evaluation results showed that the Wilcoxon rank-sum test remained to have good FDR control and power.

Therefore, from Li et al.’s analysis results, when the sample size of each condition is greater than 8, do not hesitate to use the Wilcoxon rank-sum test.

Figure 2B from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

How to use the Wilcoxon rank-sum test?

Everyone knows that the three parametric testing methods — DESeq2, edgeR, and limma — have long dominated transcriptome research. Essentially all large-scale studies have used them to find DEGs. However, all three methods were originally designed to solve the small-sample-size problem.

In population-level studies with larger sample sizes (at least a few dozens), parametric assumptions are often no longer necessary. At the same time, the larger the sample size, the more likely the presence of outliers, which would violate the parametric assumptions and invalidate the p-value calculation, making FDR control problematic.

The last question is how to use the Wilcoxon rank-sum test reasonably for differential expression analysis.

Unlike DESeq2, edgeR, and limma, the Wilcoxon rank-sum test is not a regression-based method and thus cannot adjust for possible confounding factors (such as differences in sequencing depths). Hence, before using the Wilcoxon rank-sum test, researchers are recommended to normalize RNA-seq samples to remove possible batch effects.

Regarding the data scale, I personally do not think it is a big problem whether we use the Relative Log Expression of DESeq2, the Trimmed Mean of M-values of edgeR, or the TPM. (If you are savvy, you will definitely be willing to try them all.)

Finally, Li et al. provided an R code example that used edgeR TMM + wilcox.test() for DEG analysis.

# read datareadCount <- read.table(file = "examples/examples.countMatrix.tsv", header = T, row.names = 1, stringsAsFactors = F, check.names = F)conditions <- read.table(file = "examples/examples.conditions.tsv", header = F)conditions <- factor(t(conditions))# edgeR TMM normalizey <- DGEList(counts = readCount, group = conditions)## Remove rows conssitently have zero or very low countskeep <- filterByExpr(y)y <- y[keep, keep.lib.sizes = FALSE]## Perform TMM normalization and convert to CPM (Counts Per Million)y <- calcNormFactors(y, method = "TMM")count_norm <- cpm(y)count_norm <- as.data.frame(count_norm)# Run the Wilcoxon rank-sum test for each genepvalues <- sapply(1:nrow(count_norm), function(i){data <- cbind.data.frame(gene = as.numeric(t(count_norm[i,])), conditions)p <- wilcox.test(gene~conditions, data)$p.valuereturn(p)})fdr <- p.adjust(pvalues, method = "fdr")# Calculate the fold-change for each geneconditionsLevel <- levels(conditions)dataCon1 <- count_norm[,c(which(conditions==conditionsLevel[1]))]dataCon2 <- count_norm[,c(which(conditions==conditionsLevel[2]))]foldChanges <- log2(rowMeans(dataCon2)/rowMeans(dataCon1))# Output results based on the FDR threshold 0.05outRst <- data.frame(log2foldChange = foldChanges, pValues = pvalues, FDR = fdr)rownames(outRst) <- rownames(count_norm)outRst <- na.omit(outRst)fdrThres <- 0.05write.table(outRst[outRst$FDR<fdrThres,], file = "examples/examples.WilcoxonTest.rst.tsv", sep="\t", quote = F, row.names = T, col.names = T)

Last words

After all, I feel that there is no need to have too much self-doubt about the analyses we have done before. But in the future, when doing differential analysis with more than 8 samples in each group, let’s not forget to also try the Wilcoxon rank-sum test. If someone asks why you don’t use the DESseq2 and edgeR specifically designed for differential analysis, just show this paper and, of course, this picture.

Image from https://kaopubear.top/blog/2022-03-20-donot-use-deseq2-edger-in-human-population-samples; modified by fei.zhao from the meme at https://knowyourmeme.com/memes/distracted-boyfriend

--

--

Professor of Statistics and Data Science, leading the Junction of Statistics and Biology Group at UCLA http://jsb.ucla.edu/