NGS101: How to process NGS data from raw reads to annotated variants? (0)

Zeo Choy
Published in Bioinformatics 101
5 min read · Mar 6, 2018

Preface — Intention, Intended Audience and Objective

This is the first article in a series of tutorials intended for biologists who know nothing about bioinformatics. As I wrote in the previous story, I taught myself online, but found the materials scattered across different websites or not very practical. That is why I started this tutorial: to share what I learned and hopefully help others who would like to learn it.

Before we begin, I have to say that the bioinformatics field is still evolving and there is no gold standard for its procedures. New packages, algorithms, software, suites and pipelines are released every day. I am also no expert in the field. This tutorial is meant to distill the essentials of performing an analysis. I hope that by the end of this series you will know how to (1) complete a very basic bioinformatics workflow (at least from FASTQ preparation through sequence alignment and variant detection to variant annotation), (2) identify an appropriate tool for your own project, and (3) use that tool.

Background on NGS

To be honest, I have never prepared a library or loaded samples onto a sequencer myself. Hence I am not going to talk about this, and I don’t think there is a need to: most of the time it will be handled by experienced personnel from your institution’s core facility or a service provider. I am also not going to discuss the working principles of NGS, because they are covered everywhere online. If you are not familiar with them yet, you can visit Wiki or EMBL-EBI for a quick overview.

To be concise, I’d like to point out that there are three mainstream sequencing platforms:

  1. Illumina
  2. Ion Torrent
  3. PacBio

The choice of platform usually depends on your project’s needs and on what you are going to sequence. Processing PacBio data is outside the scope of this tutorial, because I have mainly worked with data from the Illumina and Ion Torrent platforms.

General workflow

Here is a not-so-general flowchart of the data processing (what commercial companies call a bioinformatics service), adapted from my presentation.

In brief, it consists of three important steps, summarized in the table. I’ll explain how each step actually works in the next story.

Primers

Before you get started: you don’t need a powerful computer, but there are a few things you must know.

  1. Command Line

If you have never used it, now is the time to try. Otherwise, feel free to jump to the next part.

  • Open your command line interface (Command Prompt for MS Windows, Terminal for macOS/Linux).
  • Try ping -c 5 yahoo.com (in Windows Command Prompt the count option is -n, so use ping -n 5 yahoo.com).

Congratulations! You’ve just done a ping test to see whether you can reach yahoo.com. It’s a just-for-fun “hello, command line” test; you will seldom use ping for data processing.

Below are some particularly useful commands for processing text files:

  • grep and wc
$ wc -l $TXTFILEPATH

wc -l counts the number of lines in a plain text file.

$ grep "TP53" $VCF
$ grep -v "chrMT" $VCF

grep is a search utility. grep "TP53" prints all the lines containing TP53. The -v flag in grep -v "chrMT" inverts the match, so it prints all the lines that do not contain chrMT.

You can pipe (|) the output of grep into wc -l. It is a quick and dirty way to find out how many TP53 mutations are in the VCF (strictly speaking, how many lines mention TP53).

$ grep "TP53" $VCF | wc -l
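To make the counting commands concrete, here is a minimal sketch you can paste into a terminal. The VCF it builds is entirely made up (the positions and the GENE= annotation are hypothetical), and the variable names are my own choices for illustration. It also shows grep -c, which counts matching lines directly, and grep -v '^#', which skips the VCF header lines before counting.

```shell
# Build a tiny, made-up VCF in a temporary file (hypothetical records only).
VCF=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$VCF"
printf '##source=made_up_example\n' >> "$VCF"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$VCF"
printf 'chr17\t1000\t.\tC\tT\t50\tPASS\tGENE=TP53\n' >> "$VCF"
printf 'chr17\t2000\t.\tG\tA\t60\tPASS\tGENE=TP53\n' >> "$VCF"
printf 'chr13\t3000\t.\tG\tA\t45\tPASS\tGENE=BRCA2\n' >> "$VCF"
printf 'chrMT\t4000\t.\tA\tG\t70\tPASS\tGENE=MT-TL1\n' >> "$VCF"

# Count every line, header lines included.
# tr -d ' ' removes the padding that some versions of wc add.
total_lines=$(wc -l < "$VCF" | tr -d ' ')

# Count only the variant records: -v drops lines starting with '#',
# -c prints the number of remaining lines instead of the lines themselves.
record_lines=$(grep -vc '^#' "$VCF")

# Two equivalent ways to count the lines that mention TP53.
tp53_piped=$(grep 'TP53' "$VCF" | wc -l | tr -d ' ')
tp53_lines=$(grep -c 'TP53' "$VCF")

echo "all lines:       $total_lines"
echo "variant records: $record_lines"
echo "TP53 lines:      $tp53_lines (piped version: $tp53_piped)"
```

Keep in mind that grep matches the text anywhere on the line, so a header or an unrelated field that happens to contain TP53 would be counted as well. That is why this is only a quick and dirty estimate.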
  • sed

sed can replace one text pattern (“chrM”) with another (“chrMT”). It prints the result to the screen and leaves the original file unchanged.

$ sed 's/chrM/chrMT/g' $VCF
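One thing to watch for with a plain substitution is that it also hits lines that already say chrMT, turning them into chrMTT. The sketch below, on a made-up three-line file, compares the plain command with a more careful variant that only rewrites a first column that is exactly chrM, and saves the result to a new file. The file contents and variable names are hypothetical, and the anchored pattern is just one possible way to do it.

```shell
# A made-up, header-less snippet with both naming styles present.
VCF=$(mktemp)
printf 'chrM\t100\t.\tA\tG\n' > "$VCF"
printf 'chrMT\t200\t.\tC\tT\n' >> "$VCF"
printf 'chr1\t300\t.\tG\tA\n' >> "$VCF"

# Plain substitution: every "chrM" is replaced, so "chrMT" becomes "chrMTT".
plain_mtt=$(sed 's/chrM/chrMT/g' "$VCF" | grep -c 'chrMTT')

# Anchored substitution: ^ ties the match to the start of the line, and
# \([[:space:]]\) captures the whitespace right after "chrM" so that \1
# can put it back. Lines that already start with "chrMT" do not match.
RENAMED=$(mktemp)
sed 's/^chrM\([[:space:]]\)/chrMT\1/' "$VCF" > "$RENAMED"

# Count lines starting with chrMT in the new file and chrM in the original.
renamed_mt=$(grep -c '^chrMT[[:space:]]' "$RENAMED")
original_m=$(grep -c '^chrM[[:space:]]' "$VCF")

echo "lines with chrMTT after plain sed: $plain_mtt"
echo "chrMT lines in the renamed file:   $renamed_mt"
echo "chrM lines still in the original:  $original_m"
```

Redirecting with > into a new file keeps the original untouched, which is handy when you are not yet sure the pattern does what you want.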
  • cut

cut extracts the nth columns from a TSV (tab-separated values) file. The example below cuts the 1st, 2nd, 4th and 5th columns from a VCF, which is also tab-separated; they correspond to the CHROM, POS, REF and ALT fields.

$ cut -f1,2,4,5 $VCF
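A real VCF starts with meta-information lines beginning with ##, which contain no tabs, and cut prints such lines in full. A simple way around it, sketched below on a made-up file, is to drop those lines with grep -v first and then pipe the rest into cut. The records and variable names here are hypothetical.

```shell
# Build a tiny, made-up VCF with one meta line, the column header and two records.
VCF=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$VCF"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$VCF"
printf 'chr17\t1000\t.\tC\tT\t50\tPASS\tGENE=TP53\n' >> "$VCF"
printf 'chr13\t2000\t.\tG\tA\t45\tPASS\tGENE=BRCA2\n' >> "$VCF"

# Remove the '##' meta lines, then keep columns 1, 2, 4 and 5.
# The '#CHROM' line survives, so the result keeps its column names.
table=$(grep -v '^##' "$VCF" | cut -f1,2,4,5)

printf '%s\n' "$table"
```

If you also want to drop the #CHROM header line, change the grep pattern to '^#'.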

  2. R/Python

Apart from the command line utilities, choose one language (or both, of course) to stick with. I’d suggest R to those without a programming background, simply because the IDE (integrated development environment) for R, RStudio, is more intuitive and Bioconductor works well with R.

(If you are really eager to learn Python, take a look at Atom. I also tried Rodeo, but switched to Atom shortly afterwards. Also be aware of which version you use; you may need both Python 2.7 and Python 3.)

  • R*

Installation should be straightforward. In case you need a reference, please visit STHDA which provides a concise guide on installing R and setting up RStudio.

Layout of RStudio. From STHDA.
  • Python

Atom is a lightweight programming text editor with installable packages. Installing Atom may be less simple than installing RStudio, but following the official guide should be enough. After that, remember to install Hydrogen as well so that you can run code interactively in Atom.

Layout of Atom. Demo from hydrogen.

Conclusion

In this article we introduced the general bioinformatics workflow, tried out the command line, and installed R!

Recommended courses/ebooks

If you would rather learn from experts or want a deeper understanding, here are some suggestions for you.

  1. The Biostar Handbook: A Beginner’s Guide to Bioinformatics ***
    Please do consider buying this book; it is still being updated. There are several subscription options, with a student discount. The book teaches with working examples and scripts. It also covers more topics, in more depth, than my tutorials.
  2. DataCamp
    If you want proper, certified training in R/Python, this is an online platform for learning data science with R and Python hassle-free. You do not need any prior setup; all coding exercises are done in your web browser.
  3. PH525x — Biomedical Data Science
    It’s an online ebook for the edX course PH525x (closed). It is quite theoretical, so it suits you if you want deeper understanding more than hands-on experience.

Zeo Choy
Bioinformatics 101

PhD. Interests in Cancer Biology, Bioinformatics/NGS, Deep Learning.