In Silico Drug Discovery using Streamlit and Snowflake Notebooks with Snowflake

Published in

Snowflake Engineering

8 min read · Jul 19, 2024

Introduction

In silico drug discovery is a method of developing new drugs using computer simulations and analyses. Advances in computing and algorithms have made it practical to analyze large datasets and to predict compound characteristics and interactions rapidly. Traditional drug discovery is costly and time-consuming. In silico drug discovery reduces the number of expensive early-stage experiments by narrowing down promising candidates through computer simulations, which is why it has attracted significant attention in recent years for its efficiency and cost-effectiveness.

Snowflake’s high-performance computing resources make it possible to conduct in silico drug discovery on Snowflake. In this article, I explain one part of that process, virtual screening, using Streamlit and Snowflake Notebooks.

About Streamlit and Snowflake Notebooks

First, let me introduce Streamlit and Snowflake Notebooks, the two tools used here for in silico drug discovery. Both are available natively within Snowflake.

Features of Streamlit

Streamlit is an open-source Python library for quickly and easily building interactive web applications that visualize data. No knowledge of web development technologies such as HTML, CSS, or JavaScript is required. Snowflake also offers a feature called Streamlit in Snowflake, which lets you build, deploy, and share Streamlit apps on Snowflake.

Features of Snowflake Notebooks

Snowflake Notebooks is an interactive, cell-based programming environment for Python and SQL within the Snowsight development interface. It supports interactive data visualization with libraries such as Streamlit. You can explore data already in Snowflake or load data into Snowflake from local files, external cloud storage, and other sources.

Virtual Screening Using Streamlit and Snowflake Notebooks

Now, let’s actually conduct in silico drug discovery with Snowflake. This time, we will introduce virtual screening based on compound similarity. When a drug candidate compound is disclosed in a patent or paper, you may want to search for similar compounds that are even more promising. Here, we screen ZINC DB data for compounds similar to “Laninamivir”, a neuraminidase inhibitor used as an influenza treatment, and pick out the 10 most similar compounds as ranked by the Tanimoto coefficient.

The ZINC (ZINC Is Not Commercial) Database is a freely available compound database mainly used for virtual screening and drug discovery research. This database is widely used in the fields of computational chemistry and bioinformatics. ZINC DB contains 3D structures of millions of commercially available compounds, many of which exhibit drug-like properties and are promising lead compounds. Additionally, compound data can be downloaded in various formats such as PDB, SDF, MOL2, SMILES.

In silico Drug Discovery: “Evaluation of Compound Similarities”

First, let’s introduce sample code for evaluating compound similarity. In the example below, we measure the similarity between Cc1ccccc1 (toluene) and Clc1ccccc1 (chlorobenzene) using the Tanimoto coefficient.

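import streamlit as st
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image
import io

# Build molecule objects from SMILES strings
mol1 = Chem.MolFromSmiles("Cc1ccccc1")   # toluene
mol2 = Chem.MolFromSmiles("Clc1ccccc1")  # chlorobenzene

# Morgan fingerprints; radius 2 is assumed here to match the screening cells later in this article
fp1 = AllChem.GetMorganFingerprint(mol1, 2)
fp2 = AllChem.GetMorganFingerprint(mol2, 2)

# Draw both molecules and display them with Streamlit
img1 = Draw.MolToImage(mol1)
img2 = Draw.MolToImage(mol2)

st.image(img1, caption="Molecule 1")
st.image(img2, caption="Molecule 2")

# Tanimoto coefficient between the two fingerprints
DataStructs.TanimotoSimilarity(fp1, fp2)

The main library used here is RDKit. RDKit is an open-source chemoinformatics library used in chemoinformatics, bioinformatics, and drug development for manipulating, simulating, analyzing, and drawing molecules. The modules used are as follows: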

Chem: Module for molecule creation and manipulation.
DataStructs: Module for calculating similarity between molecules.
AllChem: Module providing more advanced chemistry functions, including the Morgan fingerprint calculation used here.
Draw: Module for molecule drawing.
IPythonConsole: Module supporting molecule drawing in IPython.

We first create molecule objects from SMILES notation using the Chem.MolFromSmiles function. Next, we calculate the molecular fingerprints; here, we use Morgan fingerprints (ECFP). We then draw images of the molecular structures and display them using Streamlit. Finally, we evaluate the similarity between toluene and chlorobenzene using the Tanimoto coefficient, which resulted in “0.5384615384615384”.
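To make the Tanimoto coefficient concrete, here is a minimal sketch in plain Python, without RDKit. It treats each fingerprint as a set of feature IDs and divides the number of shared features by the number of distinct features overall. The function name `tanimoto` and the feature sets are hypothetical illustrations, not the actual Morgan fingerprint contents of the molecules above.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Two empty fingerprints have nothing to compare; define the score as 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical feature IDs, chosen only to illustrate the arithmetic.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}

print(tanimoto(fp_a, fp_b))  # 2 shared / 5 distinct = 0.4
```

A score of 1.0 means the two feature sets are identical, and 0.0 means they share nothing.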

In silico Drug Discovery: “Virtual Screening”

Now, let’s dive into virtual screening. The molecular weight of Laninamivir is approximately 350 and its ALogP is -2.92, so I select the subset of the ZINC database with molecular weights between 350 and 375 and LogP values below -1, which contains 3.4 million compounds. I download only the initial set for further analysis.

Please execute the following command in your terminal to download the file:

wget http://files.docking.org/2D/EA/EAED.smi

Once the download is complete, upload the file to your notebook.

Next, I will showcase the code snippets used for analysis.

# cell 1
spl = Chem.rdmolfiles.SmilesMolSupplier("EAED.smi")
len(spl)

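# cell 2
laninamivir = Chem.MolFromSmiles("CO[C@H]([C@H](O)[C@H](O)CO[C@H]1OC(=C[C@H](NC(=N)N)[C@H]1NC(=O)C)C(=O)O")
if laninamivir is None:
    # MolFromSmiles returns None when a SMILES string cannot be parsed
    raise ValueError("Could not parse the Laninamivir SMILES string")
laninamivir_fp = AllChem.GetMorganFingerprint(laninamivir, 2)

def calc_laninamivir_similarity(mol):
    # Tanimoto similarity between Laninamivir and the given molecule
    fp = AllChem.GetMorganFingerprint(mol, 2)
    sim = DataStructs.TanimotoSimilarity(laninamivir_fp, fp)
    return sim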

# cell 3
similar_mols = []
for mol in spl:
    if mol is None:
        # The supplier yields None for entries RDKit cannot parse; skip them
        continue
    sim = calc_laninamivir_similarity(mol)
    if sim > 0.2:
        similar_mols.append((mol, sim))

# cell 4
similar_mols.sort(key=lambda x: x[1], reverse=True)
mols = [x[0] for x in similar_mols[:10]]

# cell 5
def visualize_mols(mols, grid_size=(5, 2)):
    st.header("Visualized Molecules")

    img_per_row, img_per_col = grid_size
    mol_imgs = [Draw.MolToImage(mol) for mol in mols]

    width, height = mol_imgs[0].size
    canvas_width = width * img_per_row
    canvas_height = height * img_per_col

    canvas = Image.new('RGB', (canvas_width, canvas_height))

    for i, img in enumerate(mol_imgs):
        row = i // img_per_row
        col = i % img_per_row
        x = col * width
        y = row * height
        canvas.paste(img, (x, y))

    buf = io.BytesIO()
    canvas.save(buf, format='PNG')
    st.image(buf.getvalue(), use_column_width=True)

visualize_mols(mols)

Execution Results

By running the above code, you can screen the top 10 compounds similar to the neuraminidase inhibitor “Laninamivir” on Snowflake Notebooks and visualize the molecular structures using Streamlit.

Code Explanation

These notebook cells use RDKit and Streamlit to calculate the similarity of chemical molecules and visualize the most similar ones. Below are explanations for each cell.

Cell 1: Loading SMILES file

spl = Chem.rdmolfiles.SmilesMolSupplier("EAED.smi")
len(spl)

This cell loads molecules in SMILES format from a file named `EAED.smi`. `SmilesMolSupplier` returns a supplier object, stored in `spl`, that provides the molecules in the file. `len(spl)` returns the number of molecules.
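As a rough illustration of what the supplier reads, the sketch below parses `.smi`-style text in plain Python. It assumes a whitespace-delimited layout with the SMILES string in the first column, an identifier in the second, and one header line, which matches the defaults of `SmilesMolSupplier`. The helper name `read_smi` and the sample rows, including the identifiers, are hypothetical.

```python
import io


def read_smi(handle, title_line=True):
    """Yield (smiles, name) pairs from whitespace-delimited .smi text."""
    for line_no, line in enumerate(handle):
        if title_line and line_no == 0:
            continue  # skip the header row
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else ""
        yield smiles, name


# Hypothetical file contents; a real file would be opened with open("EAED.smi").
sample = io.StringIO("smiles zinc_id\nCc1ccccc1 ID_0001\nClc1ccccc1 ID_0002\n")
records = list(read_smi(sample))

print(len(records))  # 2
print(records[0])    # ('Cc1ccccc1', 'ID_0001')
```

Unlike this sketch, the RDKit supplier also converts each SMILES string into a molecule object.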

Cell 2: Calculating fingerprint of reference molecule

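laninamivir = Chem.MolFromSmiles("CO[C@H]([C@H](O)[C@H](O)CO[C@H]1OC(=C[C@H](NC(=N)N)[C@H]1NC(=O)C)C(=O)O")
if laninamivir is None:
    # MolFromSmiles returns None when a SMILES string cannot be parsed
    raise ValueError("Could not parse the Laninamivir SMILES string")
laninamivir_fp = AllChem.GetMorganFingerprint(laninamivir, 2)

def calc_laninamivir_similarity(mol):
    # Tanimoto similarity between Laninamivir and the given molecule
    fp = AllChem.GetMorganFingerprint(mol, 2)
    sim = DataStructs.TanimotoSimilarity(laninamivir_fp, fp)
    return sim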

This cell creates the reference molecule “Laninamivir” from a SMILES string and calculates its molecular fingerprint. It also defines a function, `calc_laninamivir_similarity`, that calculates the similarity between Laninamivir and another molecule. The function computes the fingerprint of the molecule passed as an argument and returns its Tanimoto similarity to the fingerprint of Laninamivir.
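Fingerprints that record how often each feature occurs can be compared with a count-based form of the Tanimoto coefficient. The sketch below is a plain-Python illustration under that assumption, using `collections.Counter`. The function name `count_tanimoto` and the feature counts are hypothetical, and RDKit's own implementation should be checked against its documentation.

```python
from collections import Counter


def count_tanimoto(counts_a, counts_b):
    """Sum of per-feature minimum counts divided by sum of per-feature maximum counts."""
    a, b = Counter(counts_a), Counter(counts_b)
    shared = sum((a & b).values())  # Counter & keeps the minimum count per feature
    total = sum((a | b).values())   # Counter | keeps the maximum count per feature
    return shared / total if total else 0.0


# Hypothetical feature counts for a reference and a query molecule.
ref = Counter({"f1": 2, "f2": 1, "f3": 1})
query = Counter({"f1": 1, "f2": 1, "f4": 2})

print(count_tanimoto(ref, query))  # shared = 1 + 1, total = 2 + 1 + 1 + 2, so 2 / 6
```

When every count is 0 or 1, this reduces to the set-based ratio of shared features to distinct features.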

Cell 3: Filtering similar molecules

similar_mols = []
for mol in spl:
    if mol is None:
        # The supplier yields None for entries RDKit cannot parse; skip them
        continue
    sim = calc_laninamivir_similarity(mol)
    if sim > 0.2:
        similar_mols.append((mol, sim))

This cell calculates the similarity to Laninamivir for every molecule provided by `spl`, skipping entries that could not be parsed. It adds molecules with a similarity greater than 0.2 to the `similar_mols` list.
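The same filtering step can also be written as a generator, so that hits are produced one at a time rather than collected inside the loop. This is only a sketch: the function name `filter_similar` is hypothetical, and strings with a lookup table stand in for real molecules and a real similarity function.

```python
def filter_similar(mols, similarity, threshold=0.2):
    """Yield (mol, score) for parsable molecules scoring above the threshold."""
    for mol in mols:
        if mol is None:
            continue  # entry that failed to parse
        score = similarity(mol)
        if score > threshold:
            yield mol, score


# Stand-ins for molecules and their similarity scores (hypothetical values).
fake_scores = {"mol_a": 0.35, "mol_b": 0.10, "mol_c": 0.20}
hits = list(filter_similar(["mol_a", None, "mol_b", "mol_c"], fake_scores.get))

print(hits)  # [('mol_a', 0.35)]
```

Note that the comparison is strict, so a molecule scoring exactly 0.2 is not kept.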

Cell 4: Sorting and selecting similar molecules

similar_mols.sort(key=lambda x: x[1], reverse=True)
mols = [x[0] for x in similar_mols[:10]]

This cell sorts the `similar_mols` list in descending order of similarity and selects the top 10 most similar molecules, which are stored in the `mols` list.
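When only the top few entries are needed, the standard library's `heapq.nlargest` is an alternative to sorting the whole list. The sketch below shows the idea on hypothetical (name, score) pairs; the helper name `top_k` is an illustration, not part of the original notebook.

```python
import heapq


def top_k(scored, k=10):
    """Return the k highest-scoring (item, score) pairs, best first."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical (name, similarity) pairs.
scored = [("a", 0.21), ("b", 0.48), ("c", 0.33), ("d", 0.25)]
best = top_k(scored, k=2)

print(best)  # [('b', 0.48), ('c', 0.33)]
```

For a list of this size the difference is negligible; the approach matters more when the number of candidates is large and k is small.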

Cell 5: Visualizing molecules

def visualize_mols(mols, grid_size=(5, 2)):
    st.header("Visualized Molecules")

    img_per_row, img_per_col = grid_size
    mol_imgs = [Draw.MolToImage(mol) for mol in mols]

    width, height = mol_imgs[0].size
    canvas_width = width * img_per_row
    canvas_height = height * img_per_col

    canvas = Image.new('RGB', (canvas_width, canvas_height))

    for i, img in enumerate(mol_imgs):
        row = i // img_per_row
        col = i % img_per_row
        x = col * width
        y = row * height
        canvas.paste(img, (x, y))

    buf = io.BytesIO()
    canvas.save(buf, format='PNG')
    st.image(buf.getvalue(), use_column_width=True)

visualize_mols(mols)

Finally, this cell defines and executes a function, `visualize_mols`, that displays the selected molecules in a grid. The function converts the given list of molecules into images and arranges them on a canvas according to the specified grid size (in this case, 5 images per row in 2 rows). The canvas image is saved to a buffer and displayed using Streamlit.
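The grid arithmetic can be checked on its own, without drawing anything. The sketch below computes the top-left paste position of each tile using the same integer division and remainder as the cell above. The function name `grid_positions` is hypothetical, and the 300 x 300 tile size is an assumption made for the example.

```python
def grid_positions(n_images, tile_size, images_per_row):
    """Top-left (x, y) pixel position of each tile, filling the grid row by row."""
    width, height = tile_size
    positions = []
    for i in range(n_images):
        row = i // images_per_row  # which row the tile falls in
        col = i % images_per_row   # which column within that row
        positions.append((col * width, row * height))
    return positions


# Ten tiles of an assumed 300 x 300 pixels, five per row.
positions = grid_positions(10, (300, 300), 5)

print(positions[0])  # (0, 0)
print(positions[5])  # (0, 300)
print(positions[9])  # (1200, 300)
```

The sixth image (index 5) starts the second row, which is why its x position returns to 0.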

Benefits of Conducting In Silico Drug Discovery with Snowflake

After trying virtual screening for in silico drug discovery using Snowflake, I noticed various benefits and would like to summarize them.

1. Utilization of Machine Learning Frameworks

Frameworks such as Snowpark ML remove the need for complex infrastructure setup and local environment configuration. Snowflake scales automatically, making it suitable for training complex machine learning models on large datasets. Users can expand resources as needed to maximize computational power.

2. Efficient Storage of Large Compound Datasets

Snowflake enables centralized management of large compound and biological datasets required in the drug discovery process. This enhances data visibility, allowing researchers to access necessary data promptly. Additionally, leveraging Snowflake’s powerful query engine, researchers can rapidly search and filter extensive compound datasets.

3. Seamless Analysis, Including Computations Requiring High-Performance Computing Resources

With Snowpark Container Services, you can perform analyses that require advanced computational resources, such as molecular dynamics (MD) simulations, docking simulations, 3D protein structure prediction, and visualization of molecular structures in PyMOL. Researchers can therefore run computationally intensive simulations as part of a single end-to-end analytical workflow.

PyMOL (ref: https://pymol.org/animate.html?)

In Conclusion

Streamlit in Snowflake and Snowflake Notebooks are valuable not only for data scientists but also for informaticians working in chemoinformatics and bioinformatics. In this example, I used SMILES format data uploaded to a notebook, but loading it into a table beforehand can make screening even faster. Additionally, setting up ChEMBL data from the Registry of Open Data on AWS as an external stage, as described in this article (https://medium.com/snowflake-engineering/variant-filtering-of-genome-vcf-files-with-snowflake-utilizing-the-registry-of-open-data-on-aws-670dd433429e), enables efficient use of public data. I aim to create an even better environment for in silico drug discovery.