The Dark Side of the Genome: Shedding Light on Human Disease-Causing Variants

Lillian Conger
The Eta Zeta Biology Journal
Jan 20, 2024
Photo by Sangharsh Lohakare on Unsplash

Background

Studies of the human genome have pinpointed regions that are associated with disease. A limitation of these genome-wide association studies, however, is that they cannot determine which specific genetic variants contribute to a disease. Many genes have been flagged as candidates, but genomic sequencing alone cannot establish which of them are truly relevant to human disease and which are false positives.

Regions of the genome called dark regions are difficult to decipher and assemble using short-read sequencing (a method that breaks DNA into small fragments that are amplified by PCR and then sequenced). These regions are either dark-by-depth (very few reads from the sequenced sample align to the region on the reference genome) or dark-by-alignment (the sequence is duplicated elsewhere, so reads map to multiple places on the reference genome). Approximately 84 to 145 megabases of the genome are dark regions that are difficult to map, and 748 to 2512 coding genes are partially dark. Between 70 and 450 of these dark genes are relevant to human disease, creating a possible challenge for geneticists trying to discover which mutations contribute to disease.
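The two definitions of darkness can be sketched in a few lines of Python. This is only an illustration of the idea: the function name and the three thresholds are assumptions chosen for the example and are not values given in this article. Each region is represented by the mapping qualities (MAPQ) of the reads aligned to it, where a low MAPQ means the aligner could not place the read uniquely.

```python
# Illustrative thresholds (assumptions, not values stated in the article).
MIN_DEPTH = 5             # this many reads or fewer counts as "dark-by-depth"
LOW_MAPQ = 10             # a read with mapping quality below this is ambiguous
AMBIGUOUS_FRACTION = 0.9  # share of ambiguous reads that makes a region dark


def classify_region(mapqs):
    """Classify one region from the mapping qualities of its aligned reads."""
    depth = len(mapqs)
    if depth <= MIN_DEPTH:
        # Too few reads align here to call variants with any confidence.
        return "dark-by-depth"
    ambiguous = sum(1 for q in mapqs if q < LOW_MAPQ)
    if ambiguous / depth >= AMBIGUOUS_FRACTION:
        # Plenty of reads, but almost none can be placed uniquely.
        return "dark-by-alignment"
    return "not dark"


print(classify_region([60, 60]))         # only two reads
print(classify_region([0] * 19 + [60]))  # 19 of 20 reads are ambiguous
print(classify_region([60] * 30))        # well covered, uniquely mapped
```

The depth check comes first because a region with almost no reads cannot be judged on mapping quality at all.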

Summary

To investigate whether dark regions are a barrier to identifying genetic variants that contribute to disease, the researchers focused on eight different diseases and traits: autism spectrum disorders, schizophrenia, body mass index, bipolar disorder, major depressive disorder, cholesterol, amyotrophic lateral sclerosis, and Crohn’s disease. For each trait, the researchers compared a list of dark regions and genes with annotated loci from genome-wide association studies (GWAS) in the FUMA public database. GWAS are studies on the genomes of many people to find genetic variations associated with a specific trait.

The researchers observed that 33–73% of the annotated GWAS loci (referred to as Genomic Risk Loci, or GRLs) contained dark regions and that 7–20% of the genes overlapped dark regions. Of those genes, 2.5% contained dark protein-coding regions. This percentage is small because only some of the genes at each GRL play a part in causing a disease.
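Comparing a list of dark regions against a list of loci comes down to an interval-overlap check. The sketch below shows one simple way to compute the fraction of loci that contain at least one dark region. It is not the researchers' pipeline: the function names are hypothetical, the coordinates are made up, and intervals are assumed to be half-open (start included, end excluded).

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def fraction_with_dark(loci, dark_regions):
    """Fraction of loci overlapping at least one dark region on the same chromosome."""
    if not loci:
        return 0.0
    hits = 0
    for chrom, start, end in loci:
        if any(
            chrom == d_chrom and overlaps(start, end, d_start, d_end)
            for d_chrom, d_start, d_end in dark_regions
        ):
            hits += 1
    return hits / len(loci)


# Made-up coordinates for illustration only.
loci = [("chr1", 100, 500), ("chr1", 1000, 1500), ("chr2", 100, 500)]
dark = [("chr1", 450, 600), ("chr2", 900, 950)]
print(fraction_with_dark(loci, dark))  # one of the three loci overlaps a dark region
```

For genome-scale lists, a sorted sweep or an interval tree would be faster than this all-pairs comparison, but the logic is the same.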

Next, the researchers investigated the dark regions' potential to affect expression of a trait using Gene Ontology (GO) term enrichment. With this technique, the eight sets of dark GWAS genes were grouped into classes to determine whether certain types of genes are overrepresented and may therefore be biologically relevant (associated with disease). The GRLs that returned significant false discovery rates (likely to be false positives) were determined to be non-dark. For schizophrenia, body mass index, and major depressive disorder, comparing the dark genes with the remaining non-dark GRLs allowed the researchers to refine the biological relevance of the GO terms. Overall, the researchers found that the dark genes were enriched for biologically relevant GO terms. These results suggest that there are genes associated with disease that are not accessible to short-read sequencing methods, making these disease variants difficult or impossible to map.
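Enrichment of a single GO term is commonly assessed with a hypergeometric test, which asks how surprising it is to see so many genes with that term in a gene set drawn from the whole genome. The sketch below is a generic version of that test, not necessarily the exact procedure used in the study. The function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of genes in the set being tested (e.g. dark GWAS genes)
    hits:       genes in the sample carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 with the term,
# and a 50-gene set in which 5 genes carry the term.
p = enrichment_p_value(20000, 200, 50, 5)
print(f"p = {p:.2e}")
```

Because many GO terms are tested at once, the resulting p-values would normally be adjusted with a multiple-testing correction such as the false discovery rate before any term is called enriched.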

The researchers also looked at whole exome sequencing (WES) studies, which sequence the protein-coding portion of the genome (the exome) to find variants related to a disease or trait. They examined the overlap of dark regions with protein-coding regions in genes reported by the Schizophrenia Exome Sequencing Meta-analysis (SCHEMA) and the Autism Sequencing Consortium (ASC). In SCHEMA, 222 of the 928 genes had partially dark regions and 22 had partially dark coding regions. Of the 102 genes associated with autism spectrum disorder (ASD), 4 have dark protein-coding regions, and two of those genes (SHANK3 and CORO1A) are more than 5% dark. These findings suggest that dark regions in disease-associated genes may not be accessible to short-read sequencing technologies, so rare disease variants may be missed.
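To put those counts on the same footing as the percentages reported for the GWAS loci, they can be converted to proportions. The short calculation below uses only the counts quoted above. The helper name is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(222, 928))  # SCHEMA genes with partially dark regions -> 23.9
print(percent(22, 928))   # SCHEMA genes with partially dark coding regions -> 2.4
print(percent(4, 102))    # ASD-associated genes with dark coding regions -> 3.9
```

Roughly a quarter of the SCHEMA genes are at least partially dark, while only a few percent of the genes in either study are dark within their coding sequence.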

To overcome the difficulties of detecting disease variants in dark regions, the researchers suggest using long-read sequencing technologies. This method produces reads that are thousands of bases long and has been shown to reduce the number of dark gene regions by up to 77%. In addition, re-analyzing WES data with methods that align ambiguous reads (such as repeats, insertions, and deletions) can allow those sequences to be mapped successfully, avoiding potential underestimates of how abundant sequences are in certain genes.

Conclusion

The number of dark regions, in both coding and noncoding sequence, varies depending on the technology available and the genome build (longer read lengths leave fewer dark regions than shorter read lengths). The researchers acknowledge that the study is limited by its use of publicly available GWAS data from FUMA. They propose that more recent, larger GWAS will identify more GRLs and greater numbers of dark regions that overlap with disease-associated genes. Despite these limitations, the observed overlap of dark regions with risk genes implies that dark genes are relevant to disease risk. Researchers using short-read sequencing (SRS) to discover disease variants need to be aware of these dark regions, because some variants may be inaccessible with fine-mapping technology alone and will be missed. Because approximately 10% of the genome is inaccessible to SRS, dark regions likely contribute to missing heritability. Alternative methods such as long-read sequencing can allow these dark regions to be investigated and may reveal previously unknown disease variants.
