Accelerating Drug Discovery Research with AI/ML Building Blocks and Streamlit

Many life sciences organizations struggle to analyze their scientific data, and that struggle may be hampering innovation, limiting their ability to leverage AI and ML tools, and ultimately slowing progress in drug discovery.

Scientific data is incredibly challenging to work with, as any researcher will tell you:

  • The data is siloed and not easily accessible. Whether it’s hidden in your own systems or archives, with a third-party partner, or at a research institution, it’s difficult to discover and bring together. Curating and organizing archived data from different sources for cleansing and normalization is equally challenging and time-consuming.
  • Adhering to regulatory requirements when collaborating on sensitive data sets is an issue.
  • Querying scientific data requires scaling in several directions at once: scaling across (handling multiple competing workloads at the same time), scaling up (handling larger data volumes and more complex queries), and scaling out (spreading concurrent queries across more clusters). Each adds another layer of complexity.
  • Data variety in the field of drug discovery ranges from omics files and imaging data to chemical notations. As a result, researchers are often forced to analyze each data set separately, relying on isolated applications or environments.

New Powerful Tools to Accelerate Data Analysis and AI/ML Solutions

Streamlit and Snowpark for Python, two powerful application development tools, now combine with the core capabilities of Snowflake’s Healthcare and Life Sciences Data Cloud to give researchers a solution that breaks down data silos, scales seamlessly for any workload, streamlines data management and analysis, and keeps collaboration secure and compliant. By leveraging these capabilities, early drug discovery research can be accelerated, overcoming data challenges and fostering a data-driven culture that unlocks the full potential of AI/ML within the life sciences industry. Integrating AI/ML is not just advantageous but critical, as it is already revolutionizing many aspects of the drug discovery process.

Examples of AI/ML Use Cases for Drug Discovery:

  • Virtual Screening and Drug Design: AI/ML algorithms rapidly screen large compound libraries, prioritizing potential drug candidates based on chemical properties and target interactions, accelerating the drug discovery process
  • Predictive Modeling for Drug Activity and Toxicity: AI/ML models analyze experimental data and molecular descriptors to predict drug activity and toxicity, reducing reliance on costly and time-consuming assays and aiding in compound selection
  • Repurposing Existing Drugs: AI/ML techniques identify new therapeutic indications for existing drugs by analyzing comprehensive data sets, accelerating drug development by repurposing compounds with established safety profiles

Source: “Artificial intelligence in drug discovery and development,” National Library of Medicine

Example Streamlit App for Drug Discovery

The diagram below is a high-level overview of how we built a drug discovery Streamlit application powered by Snowflake. Each page of the app focuses on a different phase of the drug discovery process, unlocking AI/ML and innovation for our fictitious pharma company, Icicle Therapeutics.

The drug discovery process starts by prioritizing diseases and increasing confidence around the associated targets. Once a target is chosen, Icicle Therapeutics focuses on learning more about it, for example a protein, to design effective drugs that interact with that target. Once potential drug leads are generated, Icicle Therapeutics enlists a contract research organization (CRO) to test the compounds. The CRO conducts the screening and shares the high-content screening results back with Icicle Therapeutics using Snowflake Collaboration and unstructured file (image) processing capabilities.

For our analysis, in the context of small-molecule drug discovery, we are going to leverage three data sources that are publicly available to researchers: CHEMBL, OpenTargets, and ClinicalTrials.gov. The clinical trials data set is also available on the Snowflake Data Marketplace, which means you can skip the ELT process. Below is a depiction of how combining these data sets is critical in drug discovery research (a minimal join sketch follows the list).

  • The intersection of CHEMBL and OpenTargets represents potential therapeutic targets that have associated small molecules in CHEMBL and genetic evidence in OpenTargets
  • The intersection of OpenTargets and ClinicalTrials.gov represents potential therapeutic targets supported by genetic data and the associated diseases that are being investigated in clinical trials
  • The intersection of CHEMBL and ClinicalTrials.gov represents the potential use of specific small molecules and their targets from CHEMBL in clinical trials
  • Finally, the center where all three circles overlap represents the convergence of chemical data, genetic evidence, and clinical trials, providing a comprehensive perspective for drug discovery research
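
To make that combination concrete, here is a minimal Snowpark for Python sketch of how the three sources might be joined once they are loaded into Snowflake. The table and column names are hypothetical placeholders for our ELT output, not the schemas of the public downloads:

```python
from snowflake.snowpark import Session

# Reuses an existing connection configuration; in Streamlit in Snowflake a
# session is already available.
session = Session.builder.getOrCreate()

chembl = session.table("CHEMBL_MOLECULE_TARGETS")        # molecule-to-target pairs
opentargets = session.table("OPENTARGETS_ASSOCIATIONS")  # target-to-disease genetic evidence
trials = session.table("CLINICALTRIALS_CONDITIONS")      # trial-to-condition records

# The center of the Venn diagram: targets with known chemistry (CHEMBL),
# genetic evidence (OpenTargets), and diseases already in registered trials.
converged = (
    chembl.join(opentargets, chembl["TARGET_ID"] == opentargets["TARGET_ID"])
          .join(trials, opentargets["DISEASE_NAME"] == trials["CONDITION"])
          .select(chembl["MOLECULE_CHEMBL_ID"],
                  opentargets["TARGET_ID"],
                  opentargets["DISEASE_NAME"],
                  trials["NCT_ID"])
)
converged.show()
```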

With Snowflake’s support for all types of data, we can easily load the Parquet and JSON extracts with simple COPY commands from a cloud storage stage configured through a Snowflake storage integration (a sketch of the loads follows below). Additionally, we have loaded our proprietary data set of drug candidates, replicated from an external compound registry application, which we want to analyze against the CHEMBL data. With all of the data centralized, we can now build the first two pages of our Streamlit app to harmonize these data sets.
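
Before moving to the app pages, here is a rough illustration of those loads issued from Snowpark. The stage, paths, and target tables are placeholders, and the external stage is assumed to already point at our cloud bucket through a storage integration:

```python
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

# Parquet extracts load column-by-name into a pre-created table.
session.sql("""
    COPY INTO OPENTARGETS_ASSOCIATIONS
    FROM @research_stage/opentargets/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""").collect()

# ClinicalTrials.gov JSON lands in a single VARIANT column for later flattening.
session.sql("""
    COPY INTO CLINICALTRIALS_RAW
    FROM @research_stage/clinicaltrials/
    FILE_FORMAT = (TYPE = JSON)
""").collect()
```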

Streamlit Page 1: Disease and Target Explorer

This page helps you identify potential targets for drug development. You can search for diseases of interest and view related targets. Targets are specific proteins or molecules that play a role in disease processes. You can filter targets by how strongly they are associated with the diseases and view related clinical trials and drugs.

The Streamlit code for this page is less than 300 lines, and all of the compute is pushed down to Snowflake, so data volumes and query performance are no longer a challenge. A key value of Streamlit is the ability to quickly build and iterate on an app that is specific to your research without requiring app development skills.
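
The snippet below is a minimal sketch of what such a page can look like, not the actual app code; the DISEASE_TARGET_ASSOCIATIONS table and its columns are illustrative stand-ins for the harmonized data model:

```python
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, upper

# In Streamlit in Snowflake the session is provided for us.
session = get_active_session()

st.title("Disease and Target Explorer")

disease = st.text_input("Search for a disease", "asthma")
min_score = st.slider("Minimum association score", 0.0, 1.0, 0.5)

# Filtering and sorting are pushed down to Snowflake; only matching rows return.
targets = (
    session.table("DISEASE_TARGET_ASSOCIATIONS")
           .filter(upper(col("DISEASE_NAME")).like(f"%{disease.upper()}%"))
           .filter(col("SCORE") >= min_score)
           .select("TARGET_ID", "TARGET_SYMBOL", "SCORE")
           .sort(col("SCORE"), ascending=False)
)
st.dataframe(targets.to_pandas())
```

Because the Snowpark DataFrame is lazily evaluated, the filter and sort run inside Snowflake and only the matching rows are pulled back for display.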

For the next stage of our research, we want to move along the drug discovery process and begin designing chemical compounds against our chosen disease target.

Streamlit Page 2: Compound Design and Analysis

This page can help you design and assess the properties of potential drug candidates. Start by entering a single protein target name to retrieve its amino acid sequences from CHEMBL. We can then pass the sequence of interest to a Snowpark Python UDF that references the Biopython library (available through the Anaconda channel for Snowpark) to calculate additional information, including sequence length, molecular weight, and amino acid percentages. You can then view a list of related compounds from CHEMBL and their associated SMILES codes to see how known compounds are designed against the target. Finally, using the RDKit library in another Snowpark Python UDF, you can calculate molecular descriptors and Tanimoto scores for a novel compound to assess its drug-like properties and its similarity to a known compound repository. For example code for that UDF, see the cheminformatics blog post Cheminformatics in Snowflake: Using Rdkit & Snowpark to Analyze Molecular Data.
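
A hedged sketch of the Biopython UDF is shown below; the UDF name, the returned fields, and the CHEMBL_TARGET_SEQUENCES table are our own illustrative choices:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import StringType, VariantType

session = Session.builder.getOrCreate()

def protein_properties(sequence: str) -> dict:
    # Biopython is resolved from the Snowflake Anaconda channel at registration.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis
    analysis = ProteinAnalysis(sequence)
    return {
        "sequence_length": len(sequence),
        "molecular_weight": analysis.molecular_weight(),
        "amino_acid_percent": analysis.get_amino_acids_percent(),
    }

protein_properties_udf = session.udf.register(
    protein_properties,
    name="protein_properties",
    return_type=VariantType(),
    input_types=[StringType()],
    packages=["biopython"],
    replace=True,
)

# Apply the UDF to sequences pulled from CHEMBL (table name is a placeholder).
props = session.table("CHEMBL_TARGET_SEQUENCES").select(
    col("TARGET_ID"),
    protein_properties_udf(col("SEQUENCE")).alias("PROPERTIES"),
)
props.show()
```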

Streamlit Page 3: Sharing High-Content Screening Results

This page will help you evaluate the biological effects of potential drug candidates. You can review HCS image data of cells that were exposed to the lead compounds. Using a Snowpark UDF to dynamically access unstructured files and extract features with Python libraries such as scikit-image and OpenCV, you can count the number of cells in each image and assess the effects of the compounds on cell viability and proliferation. Check our docs for more details on how Snowflake supports unstructured files.
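
Below is a simplified sketch of such a UDF, assuming the images sit on a stage with a directory table enabled. The Otsu-threshold-plus-labeling approach stands in for a real nuclei-segmentation workflow, and the stage name is a placeholder:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType, StringType

session = Session.builder.getOrCreate()

def count_cells(scoped_file_url: str) -> int:
    # Read the staged image via a scoped file URL, then count connected
    # bright regions as a rough proxy for the number of cells.
    import io
    from snowflake.snowpark.files import SnowflakeFile
    from skimage import io as skio, filters, measure, morphology

    with SnowflakeFile.open(scoped_file_url, "rb") as f:
        image = skio.imread(io.BytesIO(f.read()), as_gray=True)

    mask = image > filters.threshold_otsu(image)           # simple global threshold
    mask = morphology.remove_small_objects(mask, min_size=50)
    return int(measure.label(mask).max())                  # number of labeled objects

count_cells_udf = session.udf.register(
    count_cells,
    name="count_cells",
    return_type=IntegerType(),
    input_types=[StringType()],
    packages=["snowflake-snowpark-python", "scikit-image"],
    replace=True,
)

# Example call over the stage's directory table (stage name is a placeholder):
# SELECT relative_path,
#        COUNT_CELLS(BUILD_SCOPED_FILE_URL(@hcs_stage, relative_path)) AS cell_count
# FROM DIRECTORY(@hcs_stage);
```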

The images and the extracted features are stored in the CRO-owned Snowflake account and are exposed to the Icicle Therapeutics account through Snowflake’s secure data sharing capabilities. Follow the simple steps listed in our docs to get started with unstructured data sharing!
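
As a rough sketch of the provider-side setup in the CRO account, a secure view over the extracted features and scoped file URLs can be added to a share. All object and account names below are placeholders; see the docs linked above for the full steps:

```python
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()

# Run in the CRO (provider) account: a secure view exposes the extracted
# features plus scoped file URLs from the stage, and the view is shared.
statements = [
    """CREATE OR REPLACE SECURE VIEW CRO_DB.SCREENING.HCS_RESULTS_V AS
       SELECT r.compound_id,
              r.cell_count,
              BUILD_SCOPED_FILE_URL(@hcs_stage, r.relative_path) AS image_url
       FROM CRO_DB.SCREENING.HCS_FEATURES r""",
    "CREATE SHARE IF NOT EXISTS HCS_RESULTS_SHARE",
    "GRANT USAGE ON DATABASE CRO_DB TO SHARE HCS_RESULTS_SHARE",
    "GRANT USAGE ON SCHEMA CRO_DB.SCREENING TO SHARE HCS_RESULTS_SHARE",
    "GRANT SELECT ON VIEW CRO_DB.SCREENING.HCS_RESULTS_V TO SHARE HCS_RESULTS_SHARE",
    "ALTER SHARE HCS_RESULTS_SHARE ADD ACCOUNTS = ICICLE_THERAPEUTICS_ACCOUNT",
]
for stmt in statements:
    session.sql(stmt).collect()
```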

We hope this helps you envision how you can easily build your research data hub on Snowflake, along with data applications that empower your scientific teams. With the continued development of Streamlit in Snowflake, you can bring custom Python-based analyses and AI/ML functions to scientists, enabling them to make more informed decisions in the pursuit of new treatments and therapies.

To see these capabilities in action, please consider visiting the Healthcare & Life Sciences team at Summit June 26–29 in Las Vegas. Please stay tuned for a deeper dive into AI/ML capabilities within life sciences in our second installment of this blog series. We will expand on the foundation that we laid above to train, deploy, and apply an ML model in Snowflake to predict the activity of novel chemical compounds.

For more information about how Snowflake can help your life sciences organization, visit Snowflake’s Healthcare & Life Sciences.
