Is your differential expression result lying to you?

Decode Box
6 min read · Dec 3, 2022

About six months ago, I was working late at night on a large set of differential expression comparisons across multiple datasets for a project when I realized something was fishy with my results. Most of the hits across the different comparisons didn't seem to have any actual expression difference between the cohorts. At first I thought I had made a mistake while scaling or during some other processing step of the RNA sequencing datasets, but I traced back and found there was no issue with the processing. This bothered me so much that I decided to pull an all-nighter, and it brought some much-needed clarity about differential expression methods and data.

Figure: dummy data showcasing inflated false positives, based on the average difference in normalized expression versus logFC

That day I realized that when it comes to solving a biological question with computational methods, a sizeable portion of the community focuses mostly on the methods themselves. The usual perception is that a standard method, once applied, can handle every case. Reality is far from that: the method and the context together decide how close the outcome is to the true picture.

In the case of differential expression, the general perception is that relatively new methods (DESeq2, edgeR, and others) work well and identify a gene as differentially expressed (DE) with very high accuracy. However, accurate identification of DE genes depends on the data at least as much as on the method.

Background

Differential expression methods were broadly created to study differences between two biological states (tumour vs normal, cirrhotic vs non-cirrhotic liver, severe vs asymptomatic Covid-19 cases) based on the expression of genes. With microarrays, general-purpose tests such as the t-test and the Wilcoxon rank-sum test were used at first, and later dedicated methods such as limma evolved. These approaches rely on general hypothesis testing to identify genes of interest based on effect size (fold change) and statistical scores (p-values); the Wilcoxon test in particular is non-parametric and makes no assumption about how the data are distributed. Later, with RNA-Seq, parametric methods came into the picture, such as DESeq, DESeq2, edgeR, and others. The primary change in how these methods work is the assumption that read counts follow a Negative Binomial (NB) or Poisson distribution: Poisson for purely technical variation, and NB to accommodate the extra variability (overdispersion) seen across biological replicates. These models also lean on the observation that most genes have similar expression in two biologically different states, with only a few genes truly differing between them.
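The Poisson-vs-NB distinction is easy to see in a quick simulation. Below is a minimal numpy-only sketch (the mean/dispersion parameterization and all numbers are illustrative assumptions, not taken from any of the cited papers): Poisson counts have variance roughly equal to their mean, while NB counts show the extra variance that motivates the NB assumption for biological replicates.

```python
import numpy as np

def simulate_nb_counts(mu, alpha, size, rng):
    """Draw NB counts with mean mu and dispersion alpha.

    Sketch only: NB variance is mu + alpha * mu^2, reducing to
    Poisson (variance == mean) as alpha -> 0. numpy's sampler uses
    an (n, p) parameterization, so we convert.
    """
    n = 1.0 / alpha          # "number of successes" parameter
    p = n / (n + mu)         # yields mean mu, variance mu + alpha*mu^2
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(42)
mu, alpha = 100.0, 0.3
poisson_counts = rng.poisson(mu, size=50_000)
nb_counts = simulate_nb_counts(mu, alpha, 50_000, rng)

# Poisson: variance tracks the mean; NB: variance ~ mu + alpha*mu^2 (~3100 here)
print(f"Poisson mean={poisson_counts.mean():.0f} var={poisson_counts.var():.0f}")
print(f"NB      mean={nb_counts.mean():.0f} var={nb_counts.var():.0f}")
```

When real expression data stops looking like either of these distributions, the parametric machinery built on top of them starts to misbehave, which is exactly the situation discussed below.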

With this assumption, the belief took hold that methods such as DESeq2 and edgeR are advanced and work almost perfectly. But if that is the case, why did I see a discrepancy between the actual expression difference and the fold change? I searched the literature to understand differential expression in depth.

Reasons

Data distribution is, to a large extent, the factor behind inflated false positives in DE results. The primary assumption that gene expression counts follow an NB or Poisson distribution is not correct in all cases. There are several settings in which an expression distribution can lose its NB/Poisson character.

Large sample size:

Li et al. (Genome Biology, 2022) explored the reliability of the popular DE methods DESeq2 and edgeR and found large discrepancies between the results of the two methods, along with high false discovery rates (FDRs) when working with large population-scale data. Genes falsely identified as DE (using negative-control data with no true difference between the two conditions) did not fit the NB distribution well. Both methods produced exaggerated false positives on large population samples, where genes may not follow an NB distribution. The Wilcoxon rank-sum test, on the other hand, had much better FDR control, as it assumes a different null hypothesis: that a gene has equal chances of being expressed higher or lower between the two conditions.

The paper also observed that FDR control depends heavily on sample size. If the sample size is below about eight per condition, parametric methods may be more reliable at controlling the FDR; above that, non-parametric methods offer better power (the probability of correctly rejecting a false null hypothesis) and better FDR control.
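The non-parametric route is simple to run in practice. Here is a hedged sketch of a per-gene Wilcoxon rank-sum test via `scipy.stats.mannwhitneyu`, with a hand-rolled Benjamini-Hochberg adjustment; the matrix layout (genes x samples), cohort sizes, and simulated counts are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def wilcoxon_de(expr_a, expr_b):
    """Per-gene Wilcoxon rank-sum p-values plus BH-adjusted values.

    expr_a, expr_b: genes x samples count matrices for the two cohorts
    (an assumed layout for this sketch).
    """
    pvals = np.array([
        mannwhitneyu(a, b, alternative="two-sided").pvalue
        for a, b in zip(expr_a, expr_b)
    ])
    # Benjamini-Hochberg: sort p ascending, scale by m/rank,
    # then enforce monotonicity from the largest p downwards.
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    adj = np.empty(m)
    adj[order] = np.minimum.accumulate(scaled[::-1])[::-1]
    return pvals, np.clip(adj, 0.0, 1.0)

rng = np.random.default_rng(0)
n_genes, n_samples = 25, 30
expr_a = rng.poisson(50, size=(n_genes, n_samples))
expr_b = rng.poisson(50, size=(n_genes, n_samples))
expr_b[:5] = rng.poisson(120, size=(5, n_samples))   # 5 truly DE genes

pvals, padj = wilcoxon_de(expr_a, expr_b)
print("genes with padj < 0.05:", np.where(padj < 0.05)[0])
```

With 30 samples per cohort, this is exactly the regime where the paper found the rank-based test to be the safer default.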

The authors have also written a very succinct article describing their findings on Towards Data Science. That article helped me understand the reliability of differential expression methods in these specific cases.

True difference:

Another study, by Rocke et al. (bioRxiv, 2015; a preprint, not peer-reviewed), explored possible causes of excess FDR in DE results for different methods when there is no true difference between the compared groups, using simulated and public data. They found that in this case, when there is no true difference between the groups, the distribution does not follow a negative binomial fit, and hence the parametric methods (DESeq, DESeq2, edgeR) fail badly in terms of FDR control and power.

Avoiding False Positives

Thus, to trust that your DE results are telling you the truth, one needs to play the data game. A few important considerations can help you avoid false positives in your data.

  1. Always plot the distribution of the data before running DE: if the distribution shows an NB/Poisson character, use parametric methods such as DESeq2 or edgeR to compute DE results; if not, use non-parametric methods.
  2. Check the sample size: if the sample size is greater than eight per condition, use a non-parametric method.
  3. Use Principal Component Analysis (PCA): make a PCA plot and check how well the samples of the two cohorts separate. If they do not separate well, there is no strong true difference in their expression profiles, and if you still get a lot of DE hits, chances are most of them are false positives.
  4. Sanity check based on expression: do a quick calculation of the average expression difference between the two cohorts for your top DE genes. If you consistently observe only a small average difference for most of these genes, chances are your DE results are full of false positives.
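The last check above takes only a few lines to script. This is a minimal numpy sketch under stated assumptions: `expr_a`/`expr_b` are genes x samples normalized-expression matrices, `de_genes` holds the indices your DE tool reported, and the logFC threshold and pseudocount are illustrative choices, not fixed rules.

```python
import numpy as np

def flag_suspect_hits(expr_a, expr_b, de_genes, min_abs_logfc=0.5):
    """Flag reported DE genes whose cohort averages barely differ."""
    mean_a = expr_a[de_genes].mean(axis=1)
    mean_b = expr_b[de_genes].mean(axis=1)
    # log2 fold change of the cohort means, with a pseudocount of 1
    observed_logfc = np.log2((mean_b + 1.0) / (mean_a + 1.0))
    # hits below the threshold are candidate false positives
    return np.abs(observed_logfc) < min_abs_logfc

rng = np.random.default_rng(1)
expr_a = rng.poisson(100, size=(10, 50)).astype(float)
expr_b = rng.poisson(100, size=(10, 50)).astype(float)
expr_b[0] *= 3.0                      # only gene 0 truly differs

suspect = flag_suspect_hits(expr_a, expr_b, de_genes=np.arange(10))
print("suspect hits:", np.where(suspect)[0])   # genes 1..9, not gene 0
```

If most of your tool's top hits come back flagged by a check like this, that is a strong hint the result list is dominated by false positives rather than real biology.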

Ending Notes

"Complexities should not be added unless necessary." (Authors of the Towards Data Science article cited in this blog.)

As the authors of the cited studies point out, one should not assume that a simple method such as limma or the Wilcoxon rank-sum test is useless next to advanced differential expression methods. Nor does this blog exist to say that methods like DESeq2 and edgeR are of no use anymore. Simply, the choice of method should depend on the data and the context: instead of relying on the complexity of a method, let the expression data itself decide which DE method to use.

I would like to end the blog with a question.

Do you think there will be additional problems with DE of even higher-dimensional data such as single-cell RNA sequencing?

References

  1. Li, Yumei, et al. "Exaggerated false positives by popular differential expression methods when analyzing human population samples." Genome Biology 23.1 (2022): 1–13.
  2. https://towardsdatascience.com/deseq2-and-edger-should-no-longer-be-the-default-choice-for-large-sample-differential-gene-8fdf008deae9
  3. Rocke, David M., et al. "Controlling false positive rates in methods for differential gene expression analysis using RNA-seq data." bioRxiv (2015): 018739.
  4. Ritchie, Matthew E., et al. "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic Acids Research 43.7 (2015): e47.
  5. Love, Michael I., Wolfgang Huber, and Simon Anders. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biology 15.12 (2014): 1–21.
  6. Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26.1 (2010): 139–140.
