Snowflake in Drug Discovery: Leveraging BioNeMo as an External LLM and Snowpark Container Services for the Protein Folding Problem

and Eda Johnson with contributions from Siddha Ganju from NVIDIA

This blog explores how Snowflake’s container services can be used in a scientific domain, taking as an example a protein folding algorithm hosted on external infrastructure managed by NVIDIA. It discusses the integration of Snowflake with these models, using OpenFold as an example, to demonstrate a streamlined process for protein folding, data management and data visualization. It also aims to emphasize the use of cloud-based platforms for enhancing research and discovery in molecular biology with the latest offering from Snowflake, Snowpark Container Services, which is in Private Preview at the time of writing.

Figure 1: A 3D structure of alpha synuclein, protein involved in Parkinson’s disease

Life sciences LLMs and protein folding

Life sciences has many problem statements in the realm of Gen AI, and most of them involve content summarization or content tagging, like asking ChatGPT to write an abstract. However, within drug discovery and the pharmaceutical research value chain, large language models have been leveraged for other purposes, including studying the protein folding problem. AlphaFold, from DeepMind, made headlines in 2020 when it participated in the Critical Assessment of Structure Prediction (CASP) competition and outperformed all other methods. Since its announcement, AlphaFold has revolutionized the way scientists understand and predict the three-dimensional structures of proteins, which is a crucial step toward understanding the underlying causes of disease and developing cures. It has predicted protein structures with remarkable accuracy, comparable to experimental methods. This achievement marked a significant milestone in the field of structural bioinformatics and paved the way for introducing Generative AI to a whole new domain.

Why are protein structures important and how do protein folding algorithms work?

Developing a new drug involves three key stages: understanding the biology, designing molecules or biomolecules, and finally measuring how the drug interacts with the proteins of the body. Proteins, made of amino acids, are the workhorses of biology, responsible for countless cellular functions. To understand how proteins function, scientists need to know their precise three-dimensional structures.

Traditional methods for determining protein structures, like X-ray crystallography and cryo-electron microscopy, are time-consuming and expensive. AlphaFold aims to address these challenges by combining deep learning algorithms with vast amounts of genomic and protein sequence data to predict the 3D structure of a protein. It does this by analyzing the protein’s amino acid sequence and learning the physical and chemical forces that govern how the protein folds into its final shape. The system’s neural networks are trained on a massive database of known protein structures that are publicly available. Since AlphaFold was introduced, there have been many improvements, and newer models that tackle this problem, like ESMFold and OpenFold, have been made available. OpenFold is used as the example in this blog for demonstration purposes.

In summary, protein folding models are of two types:

  1. MSA-based (multiple sequence alignment) models like AlphaFold2 and OpenFold, which predict the structure of a protein from its amino acid sequence together with an alignment of evolutionarily related sequences
  2. Embedding-based models like ESMFold, which is significantly faster because it uses ESM2 protein LLM embeddings for structure prediction

How does this relate to Snowflake?

In this blog we take an initial step to illustrate how you could use Snowflake’s container services to call an externally hosted protein folding model (BioNeMo from NVIDIA) and create a simple Streamlit app that allows quick interaction with the folded structure and exploration of data in a Snowflake-managed store, as shown in Figure 2 below.

Figure 2: Conceptual view of the protein folding within Snowflake

The next version of the blog will focus on model hosting and fine-tuning within container services, with an open-source model as an example in the realm of R&D.

Models offered by the NVIDIA BioNeMo ecosystem

Until recently, virtual screening has been one of the methods used to develop protein and small-molecule therapeutics, but it is still incredibly complex and computationally expensive. Generative AI has accelerated virtual screening, and the NVIDIA BioNeMo Service makes these state-of-the-art models for drug discovery available as a cloud service, providing instant and easy access to accelerate drug discovery pipelines.

There are nine state-of-the-art models currently available in the BioNeMo Service for protein generation, protein embedding, molecule generation, molecule embedding, protein folding and docking. Protein folding has already been described above. The other models include:

  1. Protein Generation model: ProtGPT2
    Used for ‘unconditional’ protein generation. It generates “de novo” protein sequences that follow the principles of natural ones. This model can be used to generate novel sequences, which can later be filtered based on their functional properties, manufacturability, etc.
  2. Protein Embedding models: ESM-1nv and ESM-2
    Used for obtaining learned representations of protein sequence space, as these LLMs have been trained on millions of known protein sequences. They produce predictive embeddings, which can be used to build downstream task models, such as predictors of protein properties like thermostability.
  3. Molecule Generation models: MegaMolBART and MoFlow
    These models allow molecule generation in a controlled manner for a specific purpose. This is very important in exploring the chemical space for molecule design, especially in the lead-optimization stage of drug discovery. For example, we may want to generate molecules that are similar to a seed molecule, fit into the active site of a protein, and have specific properties such as novelty, solubility, blood-brain barrier permeability, etc.
  4. Molecule Embedding model: MegaMolBART
    Similar to protein embeddings, these are learned representations of small-molecule chemical space. The molecular embeddings can be used for building predictive models for downstream tasks such as reaction prediction, molecular solubility, etc.
  5. Docking model: DiffDock
    Docking is a key step in the virtual screening pipeline. DiffDock, a diffusion-based model, takes 3D inputs of a small molecule and a protein and predicts the interaction geometries for the resulting protein-ligand complex. It has been shown to outperform traditional docking tools in terms of accuracy and speed.

Bringing it together: Solution architecture with Snowflake and BioNeMo

Typically, protein sequences for folding come in the form of amino acid sequences that can be managed in a file format like FASTA. In this example, we downloaded a few sequences from UniProt and stored them in an AWS S3 bucket, which was mounted as an internal stage in Snowflake. A snapshot of the FASTA file containing the amino acid sequence is shown below:

Figure 3: A sample FASTA format representing the protein sequence

The sequences are stored in a “variant” column along with metadata from the annotations, such as the UniProt ID and the full protein name, which can be retrieved from the header row as shown in Figure 4. This lays the groundwork for a future business metadata store which, when integrated with a Streamlit UI for search, would be the first step towards the creation of what we call a FAIR data product.

FAIR in life sciences refers to the ability to describe a fully defined data product as one which is findable, accessible, interoperable and hence reusable. More details about what FAIR means in life sciences can be found here.

As an alternate pattern, it is also fine to keep the sequences in an internal stage as FASTA files and just provide a scoped URL link to each file. The sequences can then be governed without having to ingest them, while still benefiting from all of Snowflake’s security.
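As an illustration of the header-to-metadata step described above, here is a minimal Python sketch that splits FASTA text into records and pulls the UniProt ID, entry name, protein name and organism out of a UniProt-style header line. The function names (`parse_fasta`, `_make_record`), the output field names and the sample record are assumptions made for this example; this is not the code that was used for this blog.

```python
import re

# UniProt-style header: >db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=... GN=...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<description>.*)$"
)


def _make_record(header, chunks):
    """Build one record dict from a header line and its sequence lines."""
    record = {"header": header[1:], "sequence": "".join(chunks)}
    match = HEADER_RE.match(header)
    if match:
        description = match.group("description")
        # The protein name is the text before the first " XX=" annotation field.
        protein_name = re.split(r"\s+[A-Z]{2}=", description, maxsplit=1)[0]
        organism = re.search(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)", description)
        record.update(
            uniprot_id=match.group("accession"),
            entry_name=match.group("entry_name"),
            protein_name=protein_name,
            organism=organism.group(1) if organism else None,
        )
    return record


def parse_fasta(text):
    """Split FASTA text into a list of record dicts (one per '>' header)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Hypothetical record for illustration only (not a real UniProt entry).
SAMPLE = """>sp|X00000|DEMO_HUMAN Demo protein OS=Homo sapiens OX=9606 GN=DEMO PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
"""

for rec in parse_fasta(SAMPLE):
    print(rec["uniprot_id"], rec["protein_name"], len(rec["sequence"]))
```

Each record is a plain dictionary, so it can be serialized to JSON and loaded into a semi-structured column. Headers that do not follow the UniProt layout are kept as raw text without the extra fields.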

Figure 4: A snapshot of the table in Snowflake storing the amino acid sequence and metadata

Next, we call the BioNeMo API for protein folding as can be seen in Figure 5. In this case, we leverage the BioNeMo service offered by NVIDIA which is hosted on their NVIDIA DGX Cloud.

Please note that BioNeMo is a proprietary model hosted by NVIDIA and needs a private API key to be invoked. To replicate this, you will need to establish a contract with NVIDIA and obtain the key separately.

Figure 5: A simple API call with the sequence string to BioNeMo allows the LLM to be executed
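For readers who cannot see the screenshot, the shape of such a call can be sketched as follows. This is a hedged illustration, not the actual BioNeMo client code: only the base URL comes from this blog (see the note at the end), while the endpoint path, the JSON field name `sequence`, the bearer-token header and the function name `build_fold_request` are assumptions that must be checked against NVIDIA’s API documentation. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Base URL as quoted in the note at the end of this blog.
BASE_URL = "https://api.bionemo.ngc.nvidia.com/v1"
# HYPOTHETICAL placeholder path: look up the real folding endpoint in NVIDIA's docs.
FOLD_PATH = "/protein-structure/openfold/predict"


def build_fold_request(sequence, api_key, path=FOLD_PATH):
    """Build (but do not send) a POST request carrying one amino acid sequence."""
    # Remove whitespace and line breaks left over from FASTA formatting.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned.isalpha():
        raise ValueError("sequence must contain only amino acid letters")
    # ASSUMPTION: the service accepts a JSON body with a "sequence" field.
    body = json.dumps({"sequence": cleaned}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={
            # ASSUMPTION: the private API key is passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_fold_request("MKTAYIAKQR\nQISFVKSHFS", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Once the path and payload are confirmed, the request could be sent with `urllib.request.urlopen(request)` and the returned structure saved as a PDB file. Keeping the API key in a Snowflake secret or an environment variable rather than in code is the safer choice.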

The LLM returns the 3D folded structure as a “pdb” file, as shown in Figure 5. This PDB file (BioNemo_OpenFold_generated1.pdb in that example) is then automatically stored back in a Snowflake-managed store, and the scoped URL to the file is added to a column in the Snowflake table. This allows the PDB to be retrieved for rendering in Streamlit as and when needed, based on metadata and keyword search.

Streamlit is a simple Python-based way to create apps and to interact with machine learning outcomes, and the app is also hosted in container services within Snowflake. It reads the PDB file and uses the py3Dmol Python library to render the 3D visual, as seen in the snapshot below.

Figure 6: Python code to visualize the 3D protein structure leveraging the py3Dmol library
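A minimal sketch of this rendering step is given below, assuming the third-party `py3Dmol` package is installed. The function name `render_pdb_html`, the default sizes and the cartoon/spectrum style are choices made for this example and are not necessarily what the app in Figure 6 uses. Note that `_make_html()` is an internal py3Dmol helper that is commonly used to embed the viewer, so it may change between versions.

```python
def render_pdb_html(pdb_text, width=600, height=400):
    """Return an HTML snippet that shows the given PDB text as a 3D cartoon."""
    # Basic sanity check before handing the text to the viewer.
    has_atoms = any(
        line.startswith(("ATOM", "HETATM")) for line in pdb_text.splitlines()
    )
    if not has_atoms:
        raise ValueError("no ATOM or HETATM records found in PDB text")

    # Imported here so the check above works even without py3Dmol installed.
    import py3Dmol  # third-party: pip install py3Dmol

    view = py3Dmol.view(width=width, height=height)
    view.addModel(pdb_text, "pdb")                      # load the structure
    view.setStyle({"cartoon": {"color": "spectrum"}})   # rainbow cartoon style
    view.zoomTo()                                       # fit structure in view
    # ASSUMPTION: relies on py3Dmol's internal HTML helper.
    return view._make_html()
```

Inside a Streamlit app, the returned HTML can be embedded with `streamlit.components.v1.html(html, height=400)`, where the PDB text is read from the file behind the stored scoped URL.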

The final Streamlit app is a simplified interface in which end users can interact with and visualize 3D protein structures from the internal protein data bank, all securely managed within Snowflake.

Figure 7: Streamlit application with the folded proteins and associated description retrieved from Uniprot

Creating this entire notebook and Streamlit experience takes less than 30 minutes, provided that you can manage all of your data and metadata within the security framework of Snowflake.

Why Snowflake and what next?

All of the data and analysis results reside within Snowflake, and the container service interaction with Snowflake data is single tenant. This ensures your downstream analysis is guarded by the same security and governance as any other workload within Snowflake, including building simple Streamlit apps and sharing them.

You can now extend this to other algorithms. For example, a similarity search with Biopython (using a module such as Bio.pairwise2) can show which proteins are similar to each other, and the resulting distance matrix can be written to a table as well.
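To make the distance-matrix idea concrete, here is a small self-contained sketch. It deliberately does not call Biopython: it uses plain Levenshtein edit distance, normalized by the length of the longer sequence, as a simple stand-in for an alignment-based score. The function names and the toy sequences are assumptions for illustration only. In practice you would likely swap in a Biopython aligner with a proper substitution matrix.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_rows(sequences):
    """Return (id_a, id_b, distance) rows, with distance scaled to [0, 1]."""
    rows = []
    for id_a, seq_a in sequences.items():
        for id_b, seq_b in sequences.items():
            longest = max(len(seq_a), len(seq_b)) or 1  # avoid division by zero
            rows.append((id_a, id_b, edit_distance(seq_a, seq_b) / longest))
    return rows


# Toy sequences for illustration only (not real proteins).
toy = {"prot_a": "MKTAYIAKQR", "prot_b": "MKTAYLAKQR", "prot_c": "GSSHHHHHHS"}
for row in distance_rows(toy):
    print(row)
```

The long format of one row per pair maps directly onto a three-column table, which makes it easy to filter for the nearest neighbors of a given protein with ordinary SQL.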

You can now consolidate all your scientific data (proteomics, screening results) to create a research data hub and take advantage of a centralized data storage and analytics system. As mentioned earlier, combining a business metadata store with multi-omics data and integrating it with a Streamlit UI for search would be the first step towards the creation of what we call a FAIR data product.

Downstream analysis like structure search can also be performed by leveraging native RDKit integrated within Snowflake. You can learn how Valo Health performs AI-based drug discovery using RDKit in this webinar.

Finally, like any other Snowflake story, we can allow for seamless data collaboration and exchange with partners that want to perform research together, similar to how the Francis Crick Institute set up their Trusted Research Ecosystem.

In the next part of this series, we will look at how an open-source model like Llama2 can be run inside Snowpark Container Services and fine-tuned to answer some life science questions. For a general example of fine-tuning Llama2 on Snowpark Container Services, please also refer to the associated article here.

Conclusion

This blog has explored a new pattern for Snowflake, showing how it can be leveraged to seamlessly integrate advanced MSA-based folding models into your research workflow. From the ingestion of protein sequences in formats like FASTA to the final rendering of 3D protein structures, this integration streamlines the process and enhances data security, all within the Snowflake ecosystem. The data, metadata and analysis results are stored within Snowflake, ensuring security and compliance.

As we wrap up this exploration, consider the following next steps on your journey:

  1. Leverage Open-Source Models: Explore the world of open-source models and containers within Snowflake. By fine-tuning these models, you can tailor them to answer specific life science questions and advance your research.
  2. Private LLMs like BioNeMo: You can also take advantage of an externally hosted private LLM suite like BioNeMo, whose models are optimized for performance compared to their open-source versions. Note that this requires a separate subscription and contract with the LLM vendor.
  3. Streamlined Data Enrichment: Develop a user-friendly Streamlit interface that allows you to search for proteins or other data assets of interest via keywords. This interactive tool can enrich your research outputs and improve how end users interact with your data.
  4. Stay Informed: Keep a close eye on advancements in AI and bioinformatics. The field is dynamic, and new breakthroughs are made regularly. Staying informed will ensure you remain at the forefront of innovation. For what’s new in Snowflake, you could follow the Snowflake handle on Medium, as we will continue to update it with the latest announcements and architectural patterns that you could leverage in this area.

For more information on how this would work, or to discuss your use cases, please feel free to talk to us.

Note: BioNeMo is a proprietary LLM hosted by NVIDIA on NVIDIA DGX Cloud, and a private API key is needed to invoke its models and functions. In this example we invoked the OpenFold model hosted at “https://api.bionemo.ngc.nvidia.com/v1”.

Additional References & Material recap

Curious to know more about what protein folding and AI are all about? Here is a short video that explains the impact of AI on biology and traditional research.

For Snowflake’s announcements on Gen AI and LLMs, please read them here.

For more general information about Container Services on Snowflake, please refer to the link here.

Learn how to build your own Streamlit apps here, along with a detailed quickstart.

Learn more about RDKit on Snowflake here.
