5 Steps to Leverage Snowpark ML in Drug Discovery Research

This is the second installment of the Snowflake for Drug Discovery series, demonstrating Snowflake's machine learning (ML) capabilities for early drug discovery research. If you missed the first installment, check out Accelerating Drug Discovery Research with AI/ML Building Blocks and Streamlit.

Here we present a technical walkthrough on how to leverage Snowflake to train, deploy, and apply an ML model to predict the activity of novel chemical compounds for early drug discovery. You can take the associated code and public data set and try it yourself with a free trial of Snowflake. Reach out to a Snowflake representative for the full copy of the code. (Disclaimer: The model itself is simplistic and not meant for production use cases, but more to highlight new capabilities of Snowflake for Data Science.)

Intro to Snowpark and Snowpark ML Modeling API

Snowpark is a set of libraries and runtimes in Snowflake that securely deploy and process non-SQL code, including Python, Java, and Scala. Snowflake recently announced Snowpark ML, a companion library to Snowpark for Python built for data science use cases. Leveraging ML to test the efficacy of different drug molecules against different disease models is just one of the many ways to apply Snowpark ML in your drug discovery research. To build a chemical activity prediction model, we will leverage Snowpark and Snowpark ML in a Jupyter notebook to run our feature engineering, model training, and model registration.

The finished data product is an example Streamlit in Snowflake app (illustrated below) that lets users predict the activity of a new compound by leveraging the different models we make available through the new Snowflake model registry (currently in private preview). Snowpark ML integrated with Streamlit empowers researchers to swiftly gather feedback on models, fostering the rapid iteration that is crucial to expediting the drug discovery process.

Streamlit app with filters to help scientists find the drug activity model of interest and apply it to a novel chemical compound.
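As a preview, here is a minimal sketch of what the core of such an app might look like. It assumes the model registry, registered model, and warehouse deployment created in steps 2 through 5 below; the featurization of the entered compound (step 2) is elided for brevity, and the widget labels are illustrative.

# Minimal Streamlit in Snowflake sketch (assumes the registry, model, and deployment built below)
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.ml.registry import model_registry

session = get_active_session()
registry = model_registry.ModelRegistry(session=session, database_name="ML_CHEMBL29_DB")

# Show registered models with their tags so scientists can filter by disease,
# classifier type, targets, or fingerprint bit value
models = registry.list_models().select("NAME", "VERSION", "TAGS").to_pandas()
st.dataframe(models)

model_name = st.selectbox("Model", models["NAME"].unique())
smiles = st.text_input("SMILES of the novel compound")
if st.button("Predict activity") and smiles:
    model = model_registry.ModelReference(registry=registry,
                                          model_name=model_name, model_version="1.0")
    # featurize the compound as in step 2 (fingerprint UDF + Lipinski descriptors), then:
    # result_df = model.predict("AD1", feature_df)
    st.write(f"Run prediction for {smiles} with model {model_name}.")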

1. Load ChEMBL Data Into Snowflake

For our drug activity prediction model we will use one of the data sources from the previous post: the ChEMBL database. CHEMBL29 is available through the Registry of Open Data on AWS via S3, and you can easily load this data into a Snowflake stage by referencing the public S3 URL location.

CREATE OR REPLACE STAGE ML_CHEMBL29_DATA URL='s3://aws-roda-hcls-datalake/chembl_29/';
--The data is in Parquet files, so we also need to create a file format
--in Snowflake to parse it:
CREATE FILE FORMAT IF NOT EXISTS ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.PARQUETFORMAT
TYPE = 'PARQUET';

You can leverage Snowflake’s infer schema function to create the target tables associated with the different files, and then load the files from stage using the COPY command.
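For example, here is a sketch of how one of the Parquet data sets could be loaded this way; the subdirectory and table name below are illustrative and should be adjusted to match the files in your stage.

--Create the target table from the inferred Parquet schema (illustrative path and table name)
CREATE TABLE IF NOT EXISTS ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.COMPOUND_STRUCTURES
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@ML_CHEMBL29_DATA/compound_structures/',
    FILE_FORMAT => 'ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.PARQUETFORMAT')));
--Load the files from the stage into the new table
COPY INTO ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.COMPOUND_STRUCTURES
FROM '@ML_CHEMBL29_DATA/compound_structures/'
FILE_FORMAT = (FORMAT_NAME = 'ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.PARQUETFORMAT')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;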

2. Data Preparation and Feature Engineering

The Snowpark ML modeling package (snowflake.ml.modeling) provides APIs for data preprocessing, feature engineering, and model training. With familiar dataframe-like syntax, and built on common data science libraries like scikit-learn and XGBoost, Snowpark ML is an easy-to-use solution for data scientists to adopt in their research workflows.

The features we will curate from the CHEMBL29 data set include the SMILES structures of the compounds and the molecule descriptors that make up Lipinski's Rule of Five: number of hydrogen bond acceptors, number of hydrogen bond donors, octanol-water partition coefficient (LogP), and molecular weight. CHEMBL also stores a field that aggregates the number of times a compound violates Lipinski's Rule of Five, which we will use as an additional feature in the model.

For our first model, we filtered for compounds associated with well-researched targets for Alzheimer's disease. You can adapt the filter to suit your research needs; for example, to select only one biological target. We are using the PCHEMBL_VALUE field in the CHEMBL data set as a measure of each compound's activity. The pChEMBL value is a useful way to represent the potency or affinity of compounds on a negative logarithmic scale, allowing you to compare different measurements (IC50, XC50, EC50, AC50, Ki, Kd, or Potency) in a standardized way. We can leverage one of Snowpark ML's preprocessing classes, Binarizer, to binarize the PCHEMBL_VALUE field so that values greater than 6 are classified as 1 (Active), and values of 6 and below as 0 (Inactive).

Below is a screenshot of how this code looks in the recently announced Snowflake Notebooks UI (private preview).

Snowflake Notebook UI private preview: Cell 1 and Cell 2 import libraries and query Snowflake to create the chembl_features dataframe from a view built on the CHEMBL29 data
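Since the screenshot itself isn't reproduced here, below is a rough sketch of what those two cells might contain; the view name and column list are illustrative and depend on how you shaped the CHEMBL29 tables in step 1.

# Cell 1: imports (inside a Snowflake Notebook the active session is already available)
from snowflake.snowpark.context import get_active_session
import snowflake.snowpark.functions as F
import snowflake.ml.modeling.preprocessing as snowml

session = get_active_session()

# Cell 2: build the feature dataframe from a view over the CHEMBL29 data
# (the view name and columns here are illustrative)
chembl_features = session.table("ML_CHEMBL29_DB.ML_CHEMBL29_SCHEMA.CHEMBL29_ACTIVITY_FEATURES_VW").select(
    "SMILES", "HBA", "HBD", "LOGP", "MWT", "NUM_LIPINSKI_VIOLATIONS", "PCHEMBL_VALUE")
chembl_features.limit(10).show()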
# create a binary classification label for each compound based on an activity threshold
# The pChEMBL field in the ChEMBL database is a useful way to represent the potency or affinity
# of compounds on a negative logarithmic scale. It allows you to compare different measurements
# (IC50, XC50, EC50, AC50, Ki, Kd, or Potency) in a standardized way.
# Let's use the Snowpark ML preprocessing Binarizer:
# https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.preprocessing.Binarizer
snowml_bin = snowml.Binarizer(threshold=6.0, input_cols=["PCHEMBL_VALUE"], output_cols=["ACTIVE"])
snowml_bin.fit(chembl_features)
normalized_chembl_df = snowml_bin.transform(chembl_features)
normalized_chembl_df.limit(10).show()

# count of how many rows were classified as active: 5719 active
print(normalized_chembl_df.filter(F.col("ACTIVE") == 1).count())
“Active” column output of Snowpark ML binarizer function

Snowpark also allows you to run native Python functions on your data, so if you have custom functions, either in pure Python or leveraging libraries available through the Anaconda channel, they can be deployed on Snowflake as Snowpark user-defined functions (UDFs) or stored procedures. For this example, we will add the molecular fingerprint of each molecule to our dataframe. You can calculate the Morgan fingerprint using the RDKit library, which is available through the Snowflake Anaconda channel. Notice that the UDF also takes a variable for the number of bits in the fingerprint, so we can try different fingerprint lengths and compare the efficacy of each in our models.

#register the Morgan fingerprint UDF (update the stage location for your environment)
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

@F.udf(input_types=[T.StringType(), T.IntegerType()], return_type=T.ArrayType(),
       stage_location="snowpark_functions", is_permanent=True, name="MORGAN_FGP_BIT_UDF",
       replace=True, packages=["rdkit", "numpy"])
def udf(smiles, bit):
    from rdkit import Chem
    from rdkit.Chem import AllChem
    import numpy as np

    mol = Chem.MolFromSmiles(smiles)
    # GetMorganFingerprintAsBitVect() creates the fingerprint as a bit vector,
    # meaning the resulting vector is composed of 0s and 1s. A 1 represents the
    # presence of a certain molecular substructure, while a 0 represents its absence.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=bit)
    array = np.array(fp)
    return array

Test that the function was registered and can be called:

bit=512
df = session.sql("select morgan_fgp_bit_udf('CC(=O)Nc1ccc(cc1)S(=O)(=O)N', {});".format(bit))
print(df.collect())
Output of calling Morgan_fgp_bit_udf function

After executing the UDF, we can use a few Snowpark functions to flatten and pivot the output array into individual columns, resulting in a final dataframe we can use to train our model.

fingerprint = normalized_chembl_df.select(
    F.call_udf("morgan_fgp_bit_udf", F.col("smiles"), bit).alias("fingerprint"), "smiles",
    normalized_chembl_df.col("HBA").cast("Float").alias("HBA"), normalized_chembl_df.col("HBD").cast("Float").alias("HBD"),
    normalized_chembl_df.col("LOGP").cast("Float").alias("LOGP"), normalized_chembl_df.col("MWT").cast("Float").alias("MWT"),
    normalized_chembl_df.col("NUM_LIPINSKI_VIOLATIONS").cast("Float").alias("NUM_LIPINSKI_VIOLATIONS"), "ACTIVE").dropDuplicates()
#use the flatten function to extract the fingerprint value at each index
flattened = fingerprint.select("smiles", "HBA", "HBD", "LOGP", "NUM_LIPINSKI_VIOLATIONS", "MWT", "ACTIVE",
    F.flatten(fingerprint["fingerprint"], outer=True))
array_vals = flattened.select("smiles", "HBA", "HBD", "LOGP", "NUM_LIPINSKI_VIOLATIONS", "MWT", "ACTIVE",
    flattened["value"].as_("fingerprint_number"), F.concat(F.lit("INDEX_"), flattened["index"]).as_("index"))
array_vals.show(n=20)
Output after calling the UDF and flatten function to separate the array elements in the UDF results

Pivot each index into its own column based on the index position, and drop the SMILES column:

fdf = array_vals.pivot("index", ["INDEX_{}".format(i) for i in range(bit)]).sum(
    "fingerprint_number").sort(array_vals["smiles"]).drop("SMILES")
cols_dict = {fdf["'INDEX_{}'".format(idx)]: "index_{}".format(idx) for idx in range(bit)}
finaldf = fdf.rename(cols_dict)
Output after pivoting the data set: one row for each compound

3. Data Metrics: Feature Correlation

Now that we've transformed our features, let's calculate the correlation between each pair of features using Snowpark ML's correlation() function to better understand their relationships. The Snowpark ML library also includes a set of metrics classes that can be used as part of your feature engineering.

from snowflake.ml.modeling.metrics.correlation import correlation

corr_features_df = correlation(df=finaldf)
corr_features_df  # This is a Pandas DataFrame

# Isolating bivariates that are highly correlated (absolute correlation above 0.7)
melted_corr = corr_features_df.abs().unstack()
melted_corr.sort_values(ascending=True, inplace=True)
print("Bivariates that are highly correlated are as follows:")
highly_correlated = [(col1, col2) for col1, col2 in melted_corr.index
                     if (melted_corr[col1, col2] > 0.7 and col1 != col2)]
results = list(set(map(tuple, map(sorted, highly_correlated))))
for x in results:
    print(x)
Most correlated features of the 517 total features

4. Train: Random Forest Classification

For this prediction model we are using random forest classification. Snowpark ML supports this through snowflake.ml.modeling.ensemble.RandomForestClassifier, which is based on the scikit-learn library, so its use will feel very similar, with the benefit of leveraging Snowflake's elastic compute to run the model training without moving your data into a separate environment.

First we split our data set into training and testing sets and then define our input, label, and output parameters for the model.

#Split into training and testing sets
features_train_df, features_test_df = finaldf.random_split(weights=[0.80, 0.20], seed=0)

#define model parameters
INPUT = features_train_df.drop("ACTIVE").columns
LABEL = ["ACTIVE"]
OUTPUT = ["PREDICTED_ACTIVITY"]

We then run RandomForestClassifier on our training data set and save the model score.

from snowflake.ml.modeling.ensemble import RandomForestClassifier

clf = RandomForestClassifier(input_cols=INPUT, label_cols=LABEL, output_cols=OUTPUT)
clf.fit(features_train_df)

#store the model score
score = clf.score(features_train_df)
score
0.9355277064774271

Then we run predictions on the test data set:

# Run prediction on the test data
prediction_results = clf.predict(features_test_df[INPUT])

#view the results
prediction_results.select("PREDICTED_ACTIVITY").show(n=10)
Output of Predictions from test data set

5. Register Model in Model Registry

We are now ready to use Snowpark ML's model registry to publish our model. The code below uses the registry class in the Snowpark ML library to create a model registry in the database we reference. We can then point to the new registry and register our model with a name and version, as well as custom tags that describe the model further. Here we create tags for the classifier type, the disease name, the list of targets used, and the bit value used in the molecule fingerprinting.

from snowflake.ml.registry import model_registry

# Create a new model registry. This is a no-op if the registry already exists.
create_result = model_registry.create_model_registry(session=session, database_name="ML_CHEMBL29_DB")
registry = model_registry.ModelRegistry(session=session, database_name="ML_CHEMBL29_DB")

# A name and model tags can be added to the model at registration time.
model_name_input = "predict_activity_ad"
model_version_input = "1.0"
disease_input = "Alzheimer's Disease"
targets_input = "CHEMBL2487,CHEMBL220,CHEMBL1914,CHEMBL2015,CHEMBL1904,CHEMBL2094124,CHEMBL4036,CHEMBL1972"

model_id = registry.log_model(
    model=clf, model_name=model_name_input, model_version=model_version_input,
    tags={"stage": "testing", "disease": disease_input, "targets": targets_input,
          "classifier_type": "random_forest_classifier", "fingerprint_bit": bit},
    sample_input_data=features_train_df[INPUT])

We can set the metric property of our newly registered model to track its score.

model = model_registry.ModelReference(registry=registry, model_name=model_name_input, model_version=model_version_input)
model.set_metric("model_score", score)

Finally, to make the model available to other Snowflake users with access to this registry, we need to deploy it either to a warehouse or to Snowpark Container Services. Since in this use case we want to call the model's predict function from a Streamlit app in Snowflake, we will deploy it to a Snowflake warehouse.

model.deploy(deployment_name="AD1",
             target_method="predict",  # the name of the model's method, usually predict
             permanent=True, options={"relax_version": True, "replace_udf": True})

#call the deployed model's predict function
result_dataframe = model.predict("AD1", features_test_df[INPUT])

Now that we have a model available, we can clone this notebook and update the targets for a different disease, or build a model using more or fewer features by changing the fingerprint bit value. As we publish new versions of the model to the registry, they can be queried using the list_models() function. Check out the documentation for all the available functions related to the model registry, such as deleting models or listing all the deployments of a model.

model_list = registry.list_models()
model_list.select("ID", "NAME", "VERSION", "CREATION_TIME", "TAGS").order_by(
    "CREATION_TIME", ascending=False).show()

Hopefully you can see how simply and seamlessly Snowpark ML integrates into your machine learning workflows. Combined with Streamlit in Snowflake, you can create actionable data apps, putting complex ML models into the hands of researchers and scientists.
