Opening a can of worms … words
Nowadays, we can obtain genome-wide, high-throughput data from a variety of biochemical assays and use them as a proxy to locate regulatory regions. However, the data (“big data”) obtained from such assays present integration and interpretation challenges. Luckily, machine learning (ML) techniques are well suited to address these issues, which makes the application of ML to unravel gene regulation a very exciting and active research field.
The application of ML in biological research is not new. A PubMed search (Note A) shows the earliest reference dating from 1990 (1), where ML was applied to the prediction of protein secondary structure. So far, ML techniques such as support vector machines (SVMs) (2-3), the Naïve Bayes classifier (NB) (4) and, more recently, neural networks (NNs) (5-7) have been used to integrate and interpret genomic data and to successfully pinpoint functional non-coding regions (likely regulatory ones) in the genome.
Even though ML is not a novelty, its application is still unfamiliar to the typical biologist. To alleviate this and to make the tools usable by non-computational biologists, many of them are designed to be used as a black box, either as standalone software (8)(Note B) or as web services (9).
The efforts described above are welcome but, in my opinion, there is a gap in providing solutions that allow the non-computational biologist to experiment with the intermediate layers inside the ML “black box”. Given that some of the published ML applications have code written in Python and R, it would be possible to provide notebooks (e.g., Jupyter notebooks, RStudio notebooks), which are great for learning (and playing) with toy datasets or even some real ones. Having these notebooks available might help build a better understanding of the capabilities and limitations of ML methods and, in addition, help achieve higher reproducibility.
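To give a flavor of what a cell in such a notebook might look like, here is a minimal sketch in plain Python. It is not taken from any of the cited tools. The sequences are made up, the labels ("regulatory-like" and "background") are hypothetical, and the choices of k = 2 and of a multinomial Naive Bayes with add-one smoothing are my own assumptions for illustration. The point is only that the intermediate pieces (the k-mer counts, the per-class probabilities, the scores) are all visible and can be poked at.

```python
from collections import Counter
from itertools import product
from math import log

K = 2  # k-mer length; an arbitrary choice for this toy example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 16 dinucleotides


def kmer_counts(seq, k=K):
    """Count overlapping k-mers in a DNA sequence (these are the features)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_naive_bayes(sequences, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on k-mer counts.

    Returns {label: (log_prior, {kmer: log_probability})}.
    alpha is the Laplace (add-one) smoothing constant.
    """
    totals = {}
    for seq, label in zip(sequences, labels):
        totals.setdefault(label, Counter()).update(kmer_counts(seq))

    model = {}
    for label, counts in totals.items():
        denom = sum(counts.values()) + alpha * len(KMERS)
        log_prior = log(labels.count(label) / len(labels))
        log_probs = {km: log((counts[km] + alpha) / denom) for km in KMERS}
        model[label] = (log_prior, log_probs)
    return model


def predict(model, seq):
    """Return (best_label, scores), where scores are unnormalized log posteriors."""
    counts = kmer_counts(seq)
    scores = {}
    for label, (log_prior, log_probs) in model.items():
        # k-mers containing letters other than A, C, G, T are simply skipped
        scores[label] = log_prior + sum(
            n * log_probs[km] for km, n in counts.items() if km in log_probs
        )
    return max(scores, key=scores.get), scores


# Made-up toy data: GC-rich sequences versus AT-rich sequences.
train_seqs = ["GCGCGGCGCC", "CCGCGCGGGC", "ATATTAATAT", "TTATAATATA"]
train_labels = ["regulatory-like", "regulatory-like", "background", "background"]

model = train_naive_bayes(train_seqs, train_labels)

label, scores = predict(model, "GCGGCGCG")
print(label)
print(scores)  # the intermediate layer: one log score per class
```

In a notebook, the reader could change K, swap in other sequences, or print `model` to see which k-mers drive each class, which is exactly the kind of tinkering a black-box tool does not allow.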
So, what am I doing besides complaining? Well, first and foremost, I am adding the wish to my annual letter to Baby Jesus (and to Santa, just to be covered in the northern hemisphere). Second, I want to give the notebook idea a try, and perhaps make some of the notebooks available, linked to the blog. Let’s attempt analyses similar to, or inspired by, some hot published papers. Let’s see how it works…
Thank you for the fishes
— K
On the unalienable right to express my opinions …
Note:
A. PubMed queries: “machine learning proteins”, “machine learning DNA”, “machine learning biology”, “machine learning genetics” and “machine learning biochemistry”
References:
1. King RD, Sternberg MJ. 1990. Machine learning approach for the prediction of protein secondary structure. J Mol Biol. 216(2):441–57.
2. Lee D, Karchin R, Beer MA. 2011. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 21(12):2167–80.
3. Liu MJ, Seddon AE, Tsai ZT, Major IT, Floer M, Howe GA, Shiu SH. 2015. Determinants of nucleosome positioning and their influence on plant gene expression. Genome Res. 25(8):1182–95.
4. Torkamani A, Schork NJ. 2008. Predicting functional regulatory polymorphisms. Bioinformatics. 24(16):1787–92.
5. Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 33(8):831–8.
6. Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 12(10):931–4.
7. Kelley DR, Snoek J, Rinn JL. 2016. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26(7):990–9.
8. Ritchie GR, Dunham I, Zeggini E, Flicek P. 2014. Functional annotation of noncoding sequence variants. Nat Methods. 11(3):294–6.
9. Fletez-Brant C, Lee D, McCallion AS, Beer MA. 2013. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic Acids Res. 41(Web Server issue):W544–56.