Why is there so little focus from advanced AI & Math scientists on Genomics?

Freedom Preetham
Published in Meta Multiomics
4 min read · Nov 26, 2022

A general observation I have made in oncology (cancer research) is that many labs still rely on outdated analysis methods.

For example, let’s assume you are conducting a knockout screen to implicate genes that drive breast cancer. Let’s say a genome-wide CRISPR knockout screen implicates 15 genes out of the 100 genes under study. This means that, in the current screen, 15 of the 100 candidate genes appear to be involved in breast cancer.

Enrichment analysis then checks the underlying gene pathways involved in breast cancer development in general (beyond the 15 that were implicated by the knockout screen).

A good majority of labs use a statistical technique called Fisher’s exact test (or sometimes the hypergeometric test), which looks up a gene ontology database of annotated genes and runs an over-representation test (classic GSEA uses a permutation test instead) to see how many genes of a specific class (say, DNA double-strand-break repair genes) are implicated relative to the 15 hits on the screen. This is called gene set enrichment analysis and pathway analysis.
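As a concrete illustration (not from the article), here is the calculation behind that over-representation test: the hypergeometric tail probability, which is equivalent to a one-sided Fisher’s exact test. All gene counts below are hypothetical, chosen to match the 15-of-100 screen example.

```python
# A minimal, self-contained sketch of the hypergeometric over-representation
# test used in gene set enrichment analysis. All counts are hypothetical.
from math import comb

def hypergeom_enrichment_p(hits_in_set, hits, set_size, universe):
    """P(X >= hits_in_set) when drawing `hits` genes from a universe of
    `universe` genes, of which `set_size` carry the pathway annotation.
    Equivalent to a one-sided Fisher's exact test on the 2x2 table."""
    total = comb(universe, hits)
    tail = sum(comb(set_size, k) * comb(universe - set_size, hits - k)
               for k in range(hits_in_set, min(hits, set_size) + 1))
    return tail / total

# Hypothetical numbers: 15 screen hits among 100 genes; the pathway
# annotation covers 40 of the 100 genes, and 10 of the 15 hits fall in it.
p = hypergeom_enrichment_p(hits_in_set=10, hits=15, set_size=40, universe=100)
print(f"enrichment p-value = {p:.4f}")
```

Under the null (hits drawn at random), you would expect only 15 × 0.4 = 6 hits in the pathway, so observing 10 yields a small tail probability — this is exactly the kind of number labs then threshold at 0.05.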

The problem with these tests is that they rely on p-values, which are a vague standard for hypothesis testing. A p-value is a mechanism for rejecting a null hypothesis, not for conclusively establishing anything. Pages have been written on why p-values are a poor statistical measure.
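One concrete failure mode, sketched here as an illustration (all numbers are synthetic, not from the article): under a true null hypothesis, p-values are uniformly distributed, so screening thousands of genes at p < 0.05 "implicates" roughly 5% of them by chance alone unless multiple-testing corrections are applied carefully.

```python
# Illustrative simulation: p-values under the null are Uniform(0, 1),
# so thresholding many tests at 0.05 produces ~5% false "hits".
import random

random.seed(0)
n_tests = 10_000
# Each null test's p-value is simply a uniform draw on [0, 1).
null_p_values = [random.random() for _ in range(n_tests)]
false_hits = sum(p < 0.05 for p in null_p_values)
print(f"{false_hits} of {n_tests} null tests look 'significant' at 0.05")
```

At genome scale (tens of thousands of genes), that baseline false-positive rate is exactly why naive thresholding is so fragile.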

You would get fired if you tried to get away with a Fisher’s test while building recommendation models or click-stream analyses in domains like search engines or online advertising. How is this still OK in Genomics?!

Here is a paper that surveys the “statistical techniques” used in pathway analysis: https://www.ncbi.nlm.nih.gov/books/NBK550334/

Most of the analysis consists of variations on p-values and statistical power, or negative binomial distribution modeling. Why are we still hand-stitching genomics together with classical statistics?

As an analogy: before machine translation models were invented for language-to-language translation, the entire industry hand-stitched translations with symbolic AI and rule-based grammar models built on linguistic theory. This changed entirely once sequence-to-sequence models with attention mechanisms (and later Transformers) learned translations automatically, “without” humans stitching the grammar together.

The Computational Genomics world suffers from statistical techniques akin to the early days of language translation. Imagine writing rules for self-driving cars purely from symbolic rules and statistical grammars!

Machine learning techniques exist primarily as research papers on arXiv or in university labs. They have not translated into large-scale industry practice because the analysis tools are not updated, and the incentives are not aligned to update them. The current tools get the job “done,” which seems good enough.

Very rarely do you see people in genomics research labs use dynamic programming (at minimum) for whole-genome analysis, or draw on the entire ontology database to implicate “all” possible annotations rather than specific hand-selected ones. DP is the bare minimum you can do. (A few advanced labs employ this technique, but they are few and far between.)
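The article does not name a specific algorithm, but the textbook instance of dynamic programming in genomics is Needleman–Wunsch global sequence alignment. A minimal sketch of the scoring recurrence, using the standard example parameters (match +1, mismatch −1, gap −1):

```python
# Needleman-Wunsch global alignment score: the classic dynamic-programming
# workhorse of sequence analysis. Shown as a generic illustration of DP in
# genomics, not as the article's specific proposal.
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    # dp[i][j] = best score aligning prefix a[:i] against prefix b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap  # align a[:i] against nothing: i gaps
    for j in range(1, n + 1):
        dp[0][j] = j * gap  # align nothing against b[:j]: j gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,              # substitute/match
                           dp[i - 1][j] + gap,  # gap in b
                           dp[i][j - 1] + gap)  # gap in a
    return dp[m][n]

print(nw_score("GATTACA", "GCATGCU"))
```

The same fill-a-table-of-optimal-subproblems pattern underlies Smith–Waterman local alignment and many HMM-based annotation methods; the point is that even this decades-old machinery is underused compared to a bare Fisher’s test.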

There are far more advanced techniques, based on Transformers, GANs, and DQNs, that you can apply across the entire ontology database during gene set enrichment analysis and pathway analysis of high-throughput screens, compared to the bare-minimum statistical techniques labs get away with today.

Do you wonder why the labs get away with statistical tests, though?!

I will tell you why: the greatest AI and math minds are employed by Big Tech to solve click-stream problems! Fat paychecks and comfortable lifestyles have attracted the brightest minds to solve for the “Internet industry.” And the internet industry runs on “media dollars” driven by advertising and marketing.

I have no issue with making the internet efficient and building better search engines (which is needed for advancing human knowledge). Still, not all of you should be concentrated on optimizing the semantic web.

We need the advanced AI, Math, and Physics folks to start focusing on computational genomics and lift this industry to greater heights. There should be more money going into Computational Genomics.

(Besides, there should be a way to tap into the “Media Dollar” mainstream to funnel some capital into Genomics.)

DeepMind is doing this in pockets (AlphaFold, for example). That is not enough. We need more DeepMinds in every area of Omics.

If you are in any advanced field of AI research or mathematical modeling, or working on multi-modal, multi-agent deep RL, then you should start focusing on Computational Genomics NOW.

#genomics #AI #machineintelligence #machinelearning #statistics #molecularbiology
