Resources to become a computational biologist outside of academia
A shortlist
To choose someone as your teacher is to bestow a great honor: you are giving them your time, which has a high opportunity cost, and your trust. I trust my teachers to tell me what they think is important given their knowledge and experience. Of course, that does not relieve me of responsibility. Doing everything your teacher says is foolish. In academia it is often said that your boss, aka the principal investigator (PI), is not the expert when it comes to your project. Instead, you, the PhD student or postdoc, are the expert because you hold the specific, relevant knowledge, while your PI holds much broader and more varied knowledge. This makes sense because your PI has to connect more disparate, far-reaching stories to direct you and others in their research team, while you focus on one (or two) “local” stories. Good working situations should also reflect this interpersonal chemistry.
I am writing this blog because I have spent the last few months going around like P.D. Eastman’s newly hatched bird asking “Are you my teacher?” and I needed an outlet for my findings. Below is a list of resources that I have brought to the surface after sifting through a much longer list. The longer list can be found in this Google Doc. The doc is open to edits, and your additions of resources that have helped you learn are both welcome and appreciated. To help you gauge whether my filtering is relevant for you, here is a little about me.
I graduated from UC Davis with a BS in Biochemistry and Molecular Biology, got my PhD from Imperial College London, and did a postdoc at Johns Hopkins. During my PhD and postdoc I collaborated with bioinformaticians and used Geneious software to visualize phylogenetic analyses (aka hierarchical clustering) that gave me fascinating insights into influenza virus evolution. When I started this transition, the most coding I had done was a short script for ImageJ and a few if-then statements in Excel, which basically means I did not know how to code!
Below I have outlined the resources that I think are worth looking into on my path to become a bioinformatician. Writing all this down helps me see that there is a finite amount of material that would get me to an intermediate level. (Hopefully, I will be employable before I get through all of it.) Furthermore, writing it down allows me to crowdsource input from all of you — please comment below to add things that I missed!
To become a bioinformatician, I figure that I need a five-pronged approach to learning. I plan to interweave my study time between these topics:
- (1) Coding: I need to learn to use Python and some of its libraries. When Python is not enough, I need to be able to use R, SQL, CSS, and HTML to process and present big data
- (2) Genetics and proteomics: I need to be able to design primers and barcodes with code. I need to understand the difference between de novo and template assembly of microarray and NGS data and be able to perform both. I want to understand and be able to perform protein docking predictions. I also want to understand and be able to produce protein network diagrams. Generally, I want to predict how molecular and cellular interactions affect health
- (3) Math and Machine Learning: I took calculus and basic statistics in college and I used some stats during my research, but I need to brush up on them. Plus I need to become strong in linear algebra and probability in order to build predictive models
- (4) Projects!!! I can’t emphasize this enough: projects are what push me to learn. They are what get me out of bed in the morning. The desire to know the answer gives me drive and keeps me going
- (5) Blog: once a week, to reflect on what I learned, what I liked, and what I didn’t, to note any new resources that I think might be worth using, and to ask the community questions
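To make goal (2) a little more concrete, here is a minimal sketch of what "designing primers and barcodes with code" could look like in Python. It is only an illustration under stated assumptions: the function names and example sequences are made up for this post, the melting temperature uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough estimate for short oligos, and real primer design tools take many more factors into account.

```python
from itertools import combinations


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C using the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C). This is a rule of thumb for short
    oligos, not a substitute for nearest-neighbor calculations.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_barcode_distance(barcodes: list[str]) -> int:
    """Smallest pairwise Hamming distance within a set of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical sequences, invented purely for illustration.
primer = "ATGCGTACGTTAGCCTAGGA"
barcodes = ["ACGTAC", "TGCATG", "GATCGA"]

print("GC content:", gc_content(primer))
print("Wallace Tm:", wallace_tm(primer))
print("Min barcode distance:", min_barcode_distance(barcodes))
```

A larger minimum distance between barcodes means more sequencing errors are needed before one barcode can be mistaken for another, which is why a check like this is a reasonable first filter.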
Coding resources
- Python tutorial: Codecademy [$20/mo]
- Python tutorial: DataQuest [$30/mo]
- Python tutorial: CodeWars [free]
- Python exercises: Project Euler [free]
- Python exercises: do the Programming Exercises from “Problem Solving with Algorithms and Data Structures Using Python, release 3.0”, by Brad Miller and David Ranum (Sep 2013), [HTML, pdf, both free]
- Python exercises: Learn Python the Hard Way, Python 3 edition, by Zed Shaw (June 2017), [Exercises are free, pdf & videos are $29.99]
- Python book: “Python for Data Analysis”, by Wes McKinney (Sept 2017) [Focuses on data wrangling with pandas and some scikit-learn, $24.99]
- Pandas documentation: Cookbook
- Python tutorials: PyVideo has links to PyCon conference tutorials and seminars
- R book: “R for Everyone: Advanced Analytics and Graphics, 2nd edition”, by Jared Lander (2017) [$28.79]
- WDL: workflow description language
- SQL: Mode Analytics SQL Tutorial
- SQL: SQLAlchemy tutorial
- SQL: freeCodeCamp SQL, 4-hour video
- SQL: DataCamp SQL for Data Science
- SQL: Udemy: The Complete SQL Bootcamp, by Jose Portilla
Genetics, proteomics, and systems-biology resources
- Genome Analysis Tool Kit (GATK) from the Broad Institute. Start by looking at their Best Practices pipelines for variant calling with genomic, somatic, and RNA-seq data
- SAMtools for processing genetic data
- Biostar Handbook + Online Course (no videos, just slides & homework) [Free forum; book and course cost $25 or $35]
- Rosalind, a platform for learning bioinformatics and programming through problem solving
- Applied Computational Genomics Course, University of Utah, by Aaron Quinlan. GitHub
- Python for Bioinformatics by Sebastian Bassi (Kindle $67)
- Evolutionary Genetics: Concepts, Analysis & Practice book ($50) and practicals/exercises
- Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing 1st Edition, by Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Textbook $42
- Systems Modeling in Cellular Biology: From Concepts to Nuts and Bolts by Zoltan Szallasi, Jorg Stelling, and Vipul Periwal, published in 2010. Kindle $19
- Twitch Bioinformatics Beyonce home page. Live stream on Mondays at 3 pm US Pacific time. Clips. YouTube.
- EdX: Introduction to Genomic Data Science, taught by Pavel Pevzner and Phillip Compeau, UCSD, 1 course
- EdX: Using Python for Research, taught by Jukka-Pekka (JP) Onnela of Harvard University, 1 course, skip to weeks 3–5 (case studies & scikit-learn)
- EdX: Quantitative Biology Workshop, taught by Eric Lander and five others from MIT, 1 course
- EdX: Calculus Applied, taught by John Wesley Cai, and two others, Harvard, 1 course
- EdX or Coursera: EdX Algorithms and Data Structures taught by Pavel Pevzner and five others, UCSD, 8 courses. Coursera Bioinformatics Specialization, Pavel Pevzner and two others, 7 courses
- Coursera: Genomic Data Science Specialization (Python), taught by Steven Salzberg and seven others, Johns Hopkins, 8 courses
- EdX: The Multi-scale brain, taught by Sean Hill and eleven others, EPFL, 1 course
- EdX: PH525x series (uses R), by Harvard: Data Analysis for Life Sciences, 4 courses & Genomics Data Analysis, 3 courses
- Coursera: Systems Biology and Biotechnology Specialization, Mt Sinai, taught by Avi Ma’ayan (Director of Bioinformatics center) and six others, 6 courses
- Coursera: Coding the Matrix: Linear Algebra Through Computer Science Applications, Brown University, 1 course
- Coursera: Data Science Specialization, by Roger Peng, Jeff Leek, and Brian Caffo, Johns Hopkins, 10 courses
Math and machine learning resources
- Fast.ai: Machine Learning, Jeremy Howard
- Fast.ai: Deep Learning Part 1: Practical Deep Learning for Coders, Jeremy Howard
- Fast.ai: Deep Learning Part 2: Cutting Edge Deep Learning for Coders, Jeremy Howard
- Fast.ai: Computational Linear Algebra: Online textbook and Videos, Rachel Thomas
- Khan Academy: AP calculus, linear algebra, statistics & probability
- Coursera: Machine Learning, taught by Andrew Ng, Stanford, 1 course
- Coursera: Mathematics for Machine Learning Specialization, Imperial College London, 3 courses
- Python book: Python Machine Learning, 2nd Edition, by Sebastian Raschka and Vahid Mirjalili (Sept 2017) [This book includes scikit-learn, TensorFlow, and Keras; see Raschka’s GitHub for free exercises; the book is $25.99]
- Stats book: OpenIntro Statistics, [focus on chapters 2 & 3, pdf free]
- Stats book: Think Stats, Version 2.0.35, by Allen Downey (2014) [free]
- Stats book: Think Bayes, Version 1.0.9, with Python 3 code, by Allen Downey (2013) [free]
- R book: “An Introduction to Statistical Learning: with applications in R”, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani [Originally from 2013, corrected 8th printing 2017 free pdf]
- Stats book: “Data Analysis Using Regression and Multilevel/Hierarchical Models,” by Andrew Gelman and Jennifer Hill (2007) [$46.03]
- Stats book: “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani, Jerome Friedman (year?) [free pdf]
- Stats books: Other Andrew Gelman books: “Bayesian Data Analysis” (2013) and “Handbook of Markov Chain Monte Carlo” (2011)
- R book: “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson (May 2013) [This book is very intuitive and will introduce you to the caret package; pdf]
- R book: “Deep Learning with R” by François Chollet with J. J. Allaire (Jan 2018) [I can’t find a digital edition! This book covers TensorFlow and Keras]
Bioinformatics Conferences
- Bioinformatics Open Source Conference (BOSC), international, yearly in July
- Precision Medicine World Conference, yearly in January, usually at the Santa Clara Convention Center
Projects
- Kaggle: search with keywords such as cancer, protein, seq, genome, gene expression, microarray, single cell, RNA, medicine, or virus
- Google Dataset Search: search with the same terms as above
Blog
- Medium: a good place to start
- GitHub Pages: I might switch to this when I want to show my code
Disclaimer: I don’t claim to be perfect. If you catch a mistake or if you have better links, summaries or important points about any of these items, please comment below.
Bootcamps: I would be remiss not to mention bootcamps. Bootcamps offer many free seminars, and I have taken some of those. I haven’t taken any bootcamp series longer than two days. From talking to people about their bootcamp experience, I gather that the trick is to sign up at the right time: you want to have some competencies and some questions. Bootcamps seem to provide a network of people who might connect you with a job, so the most important thing to ask the admissions office is where their graduates work, and the most important thing to ask yourself is whether you want to work at those companies. It is also important to hear that it can take up to 6 months after a bootcamp to find a job if you are in the 80% of students who came in with significant competencies. This was different three years ago, when you didn’t need to know any programming to enter a bootcamp and it was easy to find an entry-level job afterward, but we now have a lot more Computer Science majors to compete against. (If anyone wants to post a graph, or even ideas on how to gather data to support this point, I would very much appreciate it!)
Udemy: Apart from the SQL bootcamp above, I haven’t listed Udemy courses here because I haven’t heard any rave reviews. Zero to Hero Python 3 is the only course that stood out. I prefer to learn from PyCon videos. If you loved a Udemy course, please let the rest of us know what it was by leaving a comment.
Thanks for reading!