This is an anecdotal overview of the Bruin in Genomics (B.I.G) Summer Research Program hosted by the UCLA Institute for Quantitative and Computational Biosciences. To learn more about their research opportunities, visit the institute's website.
Bioinformatics, a term enveloping countless distinct areas of research that ultimately simplifies to the computational interpretation of biological problems, was the epicenter of my summer experience this year. Living in a two-person apartment at UCLA and working variable hours, generally between 10 AM and 6 PM, I had time to reflect on how and why I had become fascinated with the intersection of computational analysis and biology. The Bruin in Genomics (B.I.G) Summer program not only provided a unique opportunity to work full-time on applied research, but also helped me set new goals for what I can pursue with my current educational background in computer science.
The growing prominence of bioinformatics in medical research followed the rise of big data analytics. Established tech firms such as the FAANG companies, along with growing startups, have aimed to use statistical techniques to analyze large volumes of user data. Whether curating relevant news stories or serving the right advertisement at the right time, data has become the major new economic resource of the 21st century. Even back in 2013, 90% of all existing data had been created in the previous two years alone. The industry is already shifting its focus toward data; according to the World Economic Forum, 85% of surveyed companies are likely to adopt data analytics. This new field of computational analysis has been coined "Big Data," and it aims to reinvent traditional methods to extract useful information and reach new conclusions. Hence, bioinformatics aims to solve the greatest Big Data challenge of them all: cracking the genome.
The Human Genome Project accelerated the development of DNA sequencing methods that put a wealth of high-quality genomic data into the hands of researchers. The cost of sequencing a human genome has dropped dramatically since the first attempt, from roughly $100M to nearly $1K. This affordability has led to the creation of accessible and comprehensive databases of reference genomic data. Much of the data that bioinformatics researchers use is readily available through the UCSC Genome Browser, which provides high-coverage genomic data for a range of species. As computers have also become far more capable in the past decade, DNA sequencing and computational power have now intersected at an optimal point. With hundreds of thousands of genomes sequenced and openly available, we now have enough data and power to tackle the problem.
“In the year 2020 you will be able to go into the drug store, have your DNA sequence read in an hour or so, and given back to you on a compact disc so you can analyse it” — Walter Gilbert, 1980
The question remains: what problem do we solve with all this data? While sequencing methods have improved, there is little consensus on which algorithmic tools should be used for analysis. I aimed to tackle this issue in my summer research, comprehensively evaluating structural variant callers across different types of genomic data. One example of the data we used for experimentation was the human reference data available in the UCSC Genome Browser.
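To make "evaluating structural variant callers" concrete, one common benchmarking criterion counts a predicted deletion as correct if it reciprocally overlaps a known truth-set deletion by at least 50%. The sketch below is purely illustrative: the function names, the threshold, and the toy intervals are my own choices, not those of any specific benchmark we used.

```python
def reciprocal_overlap(a, b):
    """Overlap of two (start, end) intervals as a fraction of the longer one.

    Dividing by the longer interval is equivalent to requiring that the
    overlap cover >= threshold of BOTH intervals (reciprocal overlap).
    """
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    return (end - start) / max(a[1] - a[0], b[1] - b[0])

def precision_recall(predicted, truth, threshold=0.5):
    """Score a caller's predicted intervals against a truth set."""
    # A truth interval is recovered if any prediction matches it.
    tp = sum(1 for t in truth
             if any(reciprocal_overlap(p, t) >= threshold for p in predicted))
    # A prediction is a false positive if it matches no truth interval.
    fp = sum(1 for p in predicted
             if all(reciprocal_overlap(p, t) < threshold for t in truth))
    precision = (len(predicted) - fp) / len(predicted)
    recall = tp / len(truth)
    return precision, recall

# Hypothetical truth set and caller output: one good call, one spurious call.
truth = [(100, 600), (2000, 2500)]
predicted = [(120, 610), (5000, 5200)]
print(precision_recall(predicted, truth))  # → (0.5, 0.5)
```

Real benchmarks add refinements (breakpoint tolerance, genotype matching), but this captures the core accounting behind the precision and recall numbers reported for SV callers.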
Structural variants (SVs) are large-scale changes in an individual's DNA. One example is a deletion, in which a segment of the DNA sequence is lost. These chromosomal alterations are a source of human diversity and disease susceptibility. Because we have a comprehensive reference genome against which genetic samples from other individuals can be compared, we can detect SVs with varying degrees of accuracy. Greater accuracy can lead to greater efficiency in pinpointing the exact DNA changes that correspond to a particular human disease. Current structural variant callers rely on several different types of algorithms to detect SVs; the figure below demonstrates the most common methods used to detect deletions:
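As a toy illustration of one of those methods, the read-pair approach flags a candidate deletion when a pair of reads maps farther apart on the reference than the sequencing library's typical insert size. This is a minimal sketch under simplified assumptions (synthetic positions and a z-score cutoff I chose for illustration), not one of the callers we actually evaluated.

```python
from statistics import mean, stdev

def call_deletions(read_pairs, z_threshold=3.0):
    """read_pairs: list of (left_start, right_end) mapped positions.

    If the sample is missing sequence that the reference contains, the two
    reads of a pair spanning that region map anomalously far apart, so we
    flag spans that are statistical outliers above the library mean.
    """
    spans = [end - start for start, end in read_pairs]
    mu, sigma = mean(spans), stdev(spans)
    return [(start, end)
            for (start, end), span in zip(read_pairs, spans)
            if (span - mu) / sigma > z_threshold]

# Hypothetical library with ~300 bp inserts plus small noise...
jitter = [0, 2, -3, 1, 0, -1, 2, 0, 1, -2, 0, 1, -1, 2, 0, -2, 1, 0, 2, -1]
pairs = [(i, i + 300 + d) for i, d in zip(range(0, 2000, 100), jitter)]
# ...and one discordant pair spanning a ~1.5 kb deletion in the sample.
pairs.append((2100, 2100 + 1800))

print(call_deletions(pairs))  # → [(2100, 3900)]
```

Production callers combine this signal with split reads, read depth, and assembly evidence, which is exactly why their results differ and benchmarking them matters.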
Precise detection of structural variants addresses the problem of identifying mutations, accelerating the discovery of genetic causes for a range of diseases. SV detection, however, is only one of the active research problems in bioinformatics. During the B.I.G research program, I was also exposed to error-correction methods, which aim to fix the mistakes that sequencing technologies introduce when generating reads. These errors are unavoidable: even the most accurate sequencing technologies still produce systematic error rates of approximately 0.1%. More accurate error-correction methods will make results more reproducible and dramatically improve the robustness of genetics experiments.
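Many error-correction methods build on the k-mer spectrum idea: k-mers that appear many times across reads are trusted, rare k-mers likely contain an error, and a base is corrected if changing it turns all of its overlapping rare k-mers into trusted ones. The sketch below illustrates that idea only; the k value, the "solid" threshold, and the toy reads are my own choices, not any specific tool's.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, solid=2):
    """Replace a base if doing so makes every overlapping k-mer 'solid'."""
    read = list(read)
    for i, base in enumerate(read):
        # k-mers that cover position i.
        lo = max(0, i - k + 1)
        window = range(lo, min(i + 1, len(read) - k + 1))
        if all(counts[''.join(read[j:j + k])] >= solid for j in window):
            continue  # position already supported by trusted k-mers
        for alt in 'ACGT':
            if alt == base:
                continue
            trial = read[:i] + [alt] + read[i + 1:]
            if all(counts[''.join(trial[j:j + k])] >= solid for j in window):
                read[i] = alt  # the substitution rescues every k-mer
                break
    return ''.join(read)

# Five clean copies of a toy read plus one copy with a single base error.
reads = ["ACGTACGTAC"] * 5 + ["ACGTGCGTAC"]
counts = kmer_counts(reads, k=4)
print(correct_read("ACGTGCGTAC", counts, k=4))  # → "ACGTACGTAC"
```

Real correctors handle insertions, deletions, and coverage-dependent thresholds, but this substitution-only version shows why deeper coverage makes correction easier: trusted k-mers stand out more clearly from erroneous ones.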
In response to active problems in genetics, data analytics and computational methods have ushered in a new era of cost-efficient and accurate solutions. As computing becomes more powerful and analytical methods such as machine learning continue to progress, we can reach a significantly more advanced understanding of the genome. B.I.G Summer provided a truly unique opportunity for me to explore the applications of computational methods in a developing field that will shape the future of healthcare and medical research. Modern computing does not have to be limited to the confined use cases of curating advertisements or news stories. The combined effort of medicine, biology, computer science, and data analytics in bioinformatics exemplifies the importance of interdisciplinary research, shaping not only academia but also the future of industry.
I would like to express my gratitude to everyone at B.I.G Summer, my ZarLab colleagues (the Serghang), my mentor Serghei Mangul, and Professor Eleazar Eskin for making this memorable summer experience possible.