Deep self-supervised learning for biosynthetic gene cluster detection and product classification

Abish Pius
Computational Biology Papers
4 min read · May 28, 2023
Figure: BiGCARP architecture with validation performance curves on the self-supervised dataset.

Full Article: Deep self-supervised learning for biosynthetic gene cluster detection and product classification | PLOS Computational Biology

Citation: Rios-Martinez, Carolina, et al. “Deep self-supervised learning for biosynthetic gene cluster detection and product classification.” PLOS Computational Biology 19.5 (2023): e1011162.

Overview

The researchers developed a self-supervised learning approach to identify and analyze biosynthetic gene clusters (BGCs) in microbial genomes. They represented BGCs as chains of functional protein domains and trained a masked language model on these domains. The results showed that their approach successfully detected BGCs, characterized their properties, and predicted BGC product classes. This study highlights the potential of self-supervised neural networks to improve BGC prediction and classification, which could aid in the discovery and understanding of natural products in the pharmaceutical industry.

Background

Natural products are an important source of pharmaceuticals, and new methods are needed to discover them. Biosynthetic gene clusters (BGCs) encode the machinery that produces natural products, and genome mining has become a powerful tool for exploring their chemical diversity. Even so, much of the potential in microbial BGCs remains unexplored. The paper introduces BiGCARP, a self-supervised approach that uses a masked language model to identify and characterize BGCs from genomic data. The model shows promising results in detecting BGCs and predicting their product classes, and it compares favorably with existing methods such as DeepBGC in BGC discovery pipelines. This approach has the potential to accelerate the discovery of novel natural products with therapeutic relevance.

Results

The paper reports how well BiGCARP discovers and characterizes biosynthetic gene clusters (BGCs) in genomic data. Each BGC is represented as a class token plus a sequence of Pfam domains, and a masked language model is trained to reconstruct the original class token and Pfam sequence from a masked copy of the input. The model was trained on approximately 127,000 BGC Pfam sequences extracted from the antiSMASH database.
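To make the masking step concrete, here is a minimal sketch of how one BGC token sequence could be corrupted for masked-language-model training. It is an illustration under stated assumptions, not the authors' code: the `<mask>` symbol, the `<NRPS>` class token placed first, the 50% masking rate, and the function name `mask_bgc_tokens` are all hypothetical, and the Pfam accessions are example values only.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_bgc_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens of one BGC sequence.

    Returns (masked_tokens, targets). targets[i] holds the original token
    at every masked position and None elsewhere, so a model would only be
    scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


# Hypothetical BGC: an assumed class token followed by Pfam accessions.
bgc = ["<NRPS>", "PF00501", "PF00668", "PF00550", "PF00975"]
masked, targets = mask_bgc_tokens(bgc, mask_prob=0.5, seed=1)

# Putting the targets back into the masked positions recovers the input.
restored = [t if t is not None else m for m, t in zip(masked, targets)]
```

A real training loop would map these tokens to integer ids and feed them to the network; the sketch only shows the data corruption that defines the self-supervised task.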

BiGCARP performed well across multiple tasks. It learned meaningful representations of Pfam domains: the embeddings form clear clusters of structurally related domains. In pretraining evaluation, BiGCARP achieved a low exponentiated cross entropy (ECE, also known as perplexity) on Pfam domains and was able to detect BGC start locations and identify domains within BGCs. It also outperformed DeepBGC on domain-level classification.
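For readers unfamiliar with the metric, exponentiated cross entropy is the exponential of the average negative log-probability that the model assigns to the correct tokens; lower is better, and 1.0 is a perfect score. The sketch below computes it from a list of probabilities. The function name and the toy probabilities are assumptions for illustration, not values from the paper.

```python
import math


def exponentiated_cross_entropy(true_token_probs):
    """exp(mean negative log-likelihood) over the evaluated positions.

    true_token_probs: probability the model gave to the correct token at
    each masked position (each value must be in (0, 1]).
    """
    if not true_token_probs:
        raise ValueError("need at least one probability")
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that guesses uniformly among 4 tokens has an ECE of about 4,
# so the value can be read as an "effective number of choices".
uniform_guess = exponentiated_cross_entropy([0.25, 0.25, 0.25])
confident = exponentiated_cross_entropy([0.9, 0.8, 0.95])
```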

BiGCARP also predicted BGC product classes and outperformed DeepBGC in average AUROC across product classes. Ensembling the predictions of the three versions of BiGCARP improved accuracy further. DeepBGC, however, achieved better precision and recall, likely because it was trained on the MIBiG dataset and produces calibrated predictions.
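The two ideas in this paragraph, AUROC and ensembling, can be sketched in a few lines. The example assumes the ensemble is a plain average of per-model scores for one product class, which is a common choice but not confirmed by this summary. The labels and scores are made-up toy values chosen so that averaging helps; they are not results from the paper.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. labels are 0/1; scores are floats.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_mean(per_model_scores):
    """Average the scores of several models, example by example."""
    return [sum(col) / len(col) for col in zip(*per_model_scores)]


# Toy one-vs-rest problem for a single product class (hypothetical data).
labels = [1, 1, 0, 0]
model_a = [0.9, 0.4, 0.6, 0.1]
model_b = [0.3, 0.8, 0.2, 0.7]
model_c = [0.5, 0.7, 0.6, 0.2]

individual = [auroc(labels, m) for m in (model_a, model_b, model_c)]
combined = auroc(labels, ensemble_mean([model_a, model_b, model_c]))
```

In this toy case each model misorders a different pair, so the averaged scores rank every positive above every negative. To report an average AUROC across product classes, the same computation would be repeated per class and the results averaged.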

In the task of identifying BGCs from unannotated microbial genomes, BiGCARP demonstrated its effectiveness by predicting the locations of clusters in bacterial genomes and identifying potential start locations for further investigation.
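One simple way to turn per-domain scores into predicted cluster locations is to threshold the scores and keep contiguous runs of high-scoring domains. The sketch below shows that post-processing step only. It is an assumption about how such a pipeline could work, not a description of the authors' procedure; the function name, threshold, minimum length, and scores are all hypothetical.

```python
def candidate_regions(scores, threshold=0.5, min_len=2):
    """Group consecutive domains scoring >= threshold into regions.

    scores: per-domain BGC scores in genome order.
    Returns a list of (start_index, end_index) pairs, inclusive, keeping
    only runs that span at least min_len domains.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the genome.
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores) - 1))
    return regions


# Hypothetical scores for nine consecutive Pfam domains.
domain_scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.1, 0.9, 0.95]
regions = candidate_regions(domain_scores)
```

With these toy scores the isolated high-scoring domain at index 5 is dropped because it is shorter than `min_len`, while the two longer runs are kept as candidate regions for further investigation.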

Overall, the results indicate that BiGCARP, with its self-supervised learning approach, shows promise in accelerating the discovery and characterization of biosynthetic gene clusters and their associated natural products.

Discussion

The researchers developed BiGCARP, a masked language model that learns representations of biosynthetic gene clusters (BGCs) from their Pfam domains. The model can detect BGCs and predict their product classes. The work is presented as the first to use Pfam domains as tokens in a masked language model. The model demonstrates strong BGC detection capabilities and achieves state-of-the-art accuracy in product class prediction. The study opens up opportunities for future method development and model refinement, such as exploring other protein sequence pretraining methods and evaluating different BGC representations. BiGCARP can be fine-tuned for downstream tasks such as predicting expression conditions or the chemical structures of BGC products. Without fine-tuning, it is useful for detecting unknown BGCs and predicting their product classes. Overall, this research showcases the potential of self-supervised deep learning for BGC discovery and characterization.
