Why Should Biologists Analyze NGS Data Themselves?

Published in

Bioinformatics 101

5 min read · Mar 4, 2018

When you send your samples out for sequencing, you will very likely get your data back with the bioinformatics already done, along with Excel/CSV files containing well-annotated mutation information for your samples. You could probably work on those files, digging all the way through to identify interesting targets. The files look neat and are easy to manipulate, and you are familiar with Excel. You also trust the results delivered by your commercial service provider or your institution’s core facility. Nearby labs may have told you that they follow up in the same way. So why should you bother to (re)analyze the data yourself, especially if you know nothing about bioinformatics?

Before answering the question, I would like to share my background and my experience with my first project involving NGS. I work in a small lab (no more than 5 people) which mainly does molecular cell biology work, e.g. tissue culture, molecular cloning, PCR, western blotting etc. Most of the members are biological science graduates. That particular project did not originally involve NGS. Sadly, as is not uncommon for PhD students and their projects, it came to a dead end. Somehow there was budget available to perform targeted sequencing. The sequencing was outsourced to a well-known commercial service provider, one of the largest. I still remember the excitement when we got our results back, but it did not last long. The results were not what we expected, and several rounds of communication with the provider’s in-house bioinformatician got us nowhere. Then I was asked to re-analyze the data if possible. It might sound a bit unusual to ask, or even trust, a student without prior experience or expertise. But I was the only person in the lab with an engineering background, and so the one who might be capable of doing it. This is how I started my journey into the field of bioinformatics. I learnt it the hard way, online. Thanks to Google, Stack Overflow, Biostars, Quora…, I managed to finish the task after struggling with my computer for a week or two. It then landed me another NGS project.

So, back to the question: why should you, as a biologist, learn bioinformatics at all? My short answer is that it is rewarding. Rewarding in the sense that learning bioinformatics equips a biologist more than any other laboratory skill, so you should invest your time and effort even though the learning curve is steep.

From my experience, many research labs, especially those in translational biomedical research, are spending their money on NGS. The analysis pipelines might seem identical at first sight, but they are not. There is no ‘one size fits all’ in this field, particularly for academic research projects. No doubt the tasks can easily be automated across samples within the same project, but it is unlikely that every script can be reused for other projects. For example, project A may involve whole genome sequencing, while project B employs targeted deep sequencing. Even with identical sequencing technologies, the analysis approach will differ depending on the availability of samples, the quality of the sequencing and the absence of controls. What’s more, it is a completely different story to work on cell lines versus patient samples, on cancer versus normal cells, or to look for simple nucleotide variations versus structural aberrations. Hence, it is essential to fine-tune the parameters. The default settings might be inadequate for your project’s needs. This variability, together with the growing use of NGS, implies a demand for in-house bioinformatics that vendors and service providers cannot meet. Therefore, the ability to perform basic bioinformatics analysis can certainly make your CV shine. It lets you stand out from other biologists and increases your chance of getting hired in academia.
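To make the point about default settings concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `Variant` class, the three example calls and both sets of thresholds are hypothetical and do not come from any real variant caller. It only shows how a filter that looks sensible for whole genome sequencing can throw away exactly the low-frequency variant that a targeted deep sequencing project is looking for.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the variant allele

    @property
    def allele_fraction(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0


def passes(variant, min_depth, min_alt_reads, min_af):
    """Return True if the variant meets all three thresholds."""
    return (
        variant.depth >= min_depth
        and variant.alt_reads >= min_alt_reads
        and variant.allele_fraction >= min_af
    )


def filter_calls(calls, **thresholds):
    """Names of the variants that survive the given thresholds."""
    return [v.name for v in calls if passes(v, **thresholds)]


# Hypothetical threshold sets, chosen only to illustrate the contrast.
WGS_LIKE = dict(min_depth=10, min_alt_reads=3, min_af=0.20)
DEEP_TARGETED = dict(min_depth=200, min_alt_reads=10, min_af=0.01)

# Hypothetical calls: a heterozygous germline variant at ~30x,
# a subclonal variant at 1500x, and a weakly supported low-coverage call.
calls = [
    Variant("germline_het", depth=30, alt_reads=14),
    Variant("subclonal", depth=1500, alt_reads=45),
    Variant("low_cov_noise", depth=12, alt_reads=3),
]

wgs_result = filter_calls(calls, **WGS_LIKE)
deep_result = filter_calls(calls, **DEEP_TARGETED)
print("WGS-like thresholds:     ", wgs_result)
print("Deep-targeted thresholds:", deep_result)
```

With the first set of thresholds the subclonal variant (3% allele fraction) is discarded, while with the second set it is the only call that remains. Neither setting is "correct" in general; which one suits you depends on your own project.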

Aside from the edge gained in employment, the techniques help not only in analyzing NGS data, but also in working with your daily experimental results. You will probably be able to write simple code (not necessarily efficient or proficient, though) in R/Python after your first journey into bioinformatics. Knowing R/Python is an additional bonus. Both languages have a vibrant community of biologists, which means there are many active developers who create and maintain libraries for biological data: flow cytometry, protein arrays, growth rates, drug combination synergy etc. (Have a look at Bioconductor to be amazed at the number of packages available.) Not to mention that both R and Python offer really nice plotting functionality that an office suite cannot match in any way: ggplot2, seaborn and plotly, to name a few. Generating publication-ready graphics, including heatmaps and circular visualizations, could not be any simpler. Your PowerPoint presentations will become more impressive, fancier and more professional after all.

Last but not least, the skill enables you to comprehend your results instead of simply adopting them. Can you imagine a PhD student who can perform PCR but knows nothing about the principle or rationale behind the method and the analysis procedure? Then why should that be the case for NGS? The problem with direct adoption is not hard to see: you lose the chance to spot any missteps in the analysis, and with it the opportunity to correct them and make a possibly profound discovery. Each NGS run can generate several GB of data, which cannot be comprehended as easily as traditional experimental data. You can quickly assess the quality and inspect the result of your western blot by simply glancing at the film, but you cannot open the FASTA file and magically realize that the sequencing went wrong (unless it is literally empty or flooded with non-ATCG letters). Even though the service provider will normally give you a QC report to assure the data quality, that does not guarantee the data has been analyzed suitably. If you have hands-on experience and spot something questionable, you can troubleshoot and rectify the procedure step by step, all by yourself. You can make the necessary corrections in time. I am not saying that contacting a bioinformatician is a bad idea, but it may not always be feasible. You can of course go back to the service provider, but they will probably be unwilling to redo the analysis. All in all, the technique equips you to deal with potential problems quickly. It may save you considerable time and, most importantly, safeguard your project.
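The "empty or flooded with non-ATCG letters" check is something you cannot do by eye, but it takes only a few lines of Python. The sketch below is a rough illustration, not a replacement for a proper QC tool. It assumes the raw reads arrive as FASTQ with exactly four lines per record and Phred+33 quality encoding; the function name `summarize_fastq` and the file name in the comment are my own placeholders.

```python
def summarize_fastq(lines):
    """Summarize reads from an iterable of FASTQ lines.

    Assumes four lines per record (header, sequence, '+', quality)
    and Phred+33 quality encoding.
    """
    n_reads = 0
    total_bases = 0
    non_acgt = 0
    qual_sum = 0
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        field = i % 4
        if field == 1:  # sequence line
            n_reads += 1
            total_bases += len(line)
            non_acgt += sum(1 for base in line.upper() if base not in "ACGT")
        elif field == 3:  # quality line
            qual_sum += sum(ord(ch) - 33 for ch in line)
    return {
        "reads": n_reads,
        "bases": total_bases,
        "non_acgt_fraction": non_acgt / total_bases if total_bases else 0.0,
        "mean_quality": qual_sum / total_bases if total_bases else 0.0,
    }


# Two tiny made-up reads, just to show the shape of the output.
example = ["@r1", "ACGTN", "+", "IIII#", "@r2", "ACGT", "+", "IIII"]
stats = summarize_fastq(example)
print(stats)

# On a real (uncompressed) file, with a placeholder file name:
# with open("sample.fastq") as handle:
#     print(summarize_fastq(handle))
```

A read count of zero, a large non-ACGT fraction or a very low mean quality are the kind of red flags the paragraph above describes. Whether a particular value is acceptable still depends on your platform and your project.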

As I have said, learning bioinformatics is rewarding. At least, I see it as rewarding for myself, my project and my future. The reason to learn may not be obvious at first sight, but it is a worthwhile long-term investment.


Written by Zeo Choy