Cheminformatics in Snowflake: Using Rdkit & Snowpark to Analyze Molecular Data


Introduction:

As part of their cheminformatics workflows, many scientists have to perform intensive computations on molecular compounds they are screening. For example, scientists may want to know the molecular similarity of two compounds as part of their drug-discovery process, or want to extract certain physical properties of the compounds they are working with. To accomplish these tasks, open source libraries like RDKit are frequently used to analyze molecular data for insights.

This post will walk you through how to get started with RDKit and use it to analyze molecular data you have within your Snowflake account. We will make use of Snowpark for Python, which enables you to create user-defined functions (UDFs) that can process molecular data at scale.

To follow along with the examples provided below, please refer to the source code and data at the following links:

  1. GitHub Code
  2. ChEMBL Database
  3. Molecular Pairs Data

For a deeper introduction to Snowpark for Python, please see my colleague Caleb Baechtold’s excellent article Operationalizing Snowpark Python to get started.

Environment Set Up:

To start making use of RDKit, we have to first set up our local development environment with the prerequisite libraries needed for our analysis.

To make Python package dependency management easier with Snowflake, we have partnered with Anaconda to provide customers a convenient way to develop their code locally and ensure that those libraries resolve within their Snowflake account. For details on the Python packages that are supported, please refer to the Snowflake Conda Channel. When queries that call Python UDFs are executed within a Snowflake warehouse, Anaconda packages are installed and cached on the virtual warehouse on your behalf. For more details on using third-party packages within Snowflake, please refer to the following documentation.

We can specify the versions of the packages we need in a YAML file, as shown below. If a specific version of a package is not pinned, Snowflake will use the latest version available in the Snowflake conda channel when the package is called.

name: rdkit_snowpark
channels:
- https://repo.anaconda.com/pkgs/snowflake
dependencies:
- python=3.8
- snowflake-snowpark-python
- ipykernel
- pyarrow
- numpy==1.23.5
- pandas==1.4.3
- cachetools==4.2.2
- notebook
- rdkit==2022.09.4

Once defined, we can create our environment with the following command: conda env create -f environment.yml. The resulting environment will enable us to make use of RDKit within the server-side objects we define.

Writing UDFs to Analyze Molecular Data:

As Caleb’s article above discusses, Snowpark for Python is a developer framework that enables users to author data transformation pipelines in Python. This includes deploying native Python code in server-side objects (UDFs, UDTFs, and stored procedures) to process the data that resides within your Snowflake account.

The data we will use to create our server-side objects is provided by the ChEMBL Database, which contains more than 2 million records on bioactive molecules with drug-like properties. After downloading the data, we can set up our database context, preprocess our input data, and start creating UDFs.

#Environment Set-Up: Fill in your creds.json file with your corresponding account credentials:
import json

import pandas as pd
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T

with open('<PATH TO YOUR CREDS.JSON FILE>') as f:
    data = json.load(f)
    USERNAME = data['user']
    PASSWORD = data['password']
    SF_ACCOUNT = data['account']

connection_parameters = {
    "account": SF_ACCOUNT,
    "user": USERNAME,
    "password": PASSWORD,
    "role": "ACCOUNTADMIN"
}

session = Session.builder.configs(connection_parameters).create()

#Set up the database, warehouse, and stage to upload data to
session.sql('CREATE OR REPLACE DATABASE RDKIT').collect()
session.sql('CREATE OR REPLACE WAREHOUSE RDKIT_WH').collect()
session.sql('CREATE OR REPLACE STAGE TMP').collect()

#Load and preprocess the ChEMBL input data:
df = pd.read_csv('../smiles.csv', delimiter=';')
cols = ['ChEMBL ID', 'Molecular Formula', 'Smiles']
df = df[cols]
df.dropna(inplace=True)

#Write the dataframe to a Snowflake table:
session.write_pandas(df, table_name='CHEMBL_DATABASE', auto_create_table=True, overwrite=True)

Calculating Molecular Descriptors:

The ChEMBL database encodes molecular compounds using the SMILES string format, which represents a molecule’s structure as an ASCII string. As a first step, we will write a UDF to extract the molecular properties of an input compound and return them to us:

def udf_compute_molecular_properties(smile_string_col: str) -> dict:
    """Computes molecular properties of an input smile string;
    returns a dictionary mapping each descriptor name to its value"""

    #Load the libraries:
    from rdkit import Chem
    from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

    #Descriptors to calculate from the smiles strings:
    descriptors = ['ExactMolWt', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3',
                   'FractionCSP3', 'HallKierAlpha', 'HeavyAtomCount', 'HeavyAtomMolWt', 'Ipc', 'Kappa1',
                   'Kappa2', 'Kappa3', 'LabuteASA', 'MaxAbsEStateIndex', 'MaxAbsPartialCharge', 'MaxEStateIndex',
                   'MaxPartialCharge', 'MinAbsEStateIndex', 'MinAbsPartialCharge', 'MinEStateIndex', 'MinPartialCharge',
                   'MolLogP', 'MolMR', 'MolWt']

    #Compute the molecular representation
    mol = Chem.MolFromSmiles(smile_string_col)
    descriptor_calculator = MolecularDescriptorCalculator(descriptors)
    mol_property_vals = descriptor_calculator.CalcDescriptors(mol)
    return dict(zip(descriptors, mol_property_vals))

#Register the UDF:
udf_compute_molecular_properties = session.udf.register(func=udf_compute_molecular_properties,
                                                        name='udf_compute_molecular_properties',
                                                        stage_location='@TMP',
                                                        replace=True,
                                                        is_permanent=True,
                                                        packages=['cachetools==4.2.2', 'numpy==1.23.5', 'pandas==1.4.3', 'rdkit==2022.09.4'],
                                                        session=session)

#Invoke the UDF against the table created earlier:
smiles_df = session.table('CHEMBL_DATABASE')
smiles_df.with_column('MOLECULAR_PROPERTIES',
                      udf_compute_molecular_properties(F.col('"Smiles"'))).limit(10).show()

In terms of performance, when we invoke this UDF we pass only a single row of data at a time; running it to completion against all 2.3 million molecules takes roughly 46 minutes on an Extra-Small warehouse. While this approach may be acceptable for small to medium datasets, we can also use vectorized UDFs in combination with larger warehouse sizes to speed up query execution. Consider the following example:

import cachetools

@cachetools.cached(cache={})
def get_molecular_descriptor_calculator():
    """Cache the call that builds the molecular descriptor calculator"""
    from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

    descriptor_vals = ['ExactMolWt', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3',
                       'FractionCSP3', 'HallKierAlpha', 'HeavyAtomCount', 'HeavyAtomMolWt', 'Ipc', 'Kappa1',
                       'Kappa2', 'Kappa3', 'LabuteASA', 'MaxAbsEStateIndex', 'MaxAbsPartialCharge', 'MaxEStateIndex',
                       'MaxPartialCharge', 'MinAbsEStateIndex', 'MinAbsPartialCharge', 'MinEStateIndex', 'MinPartialCharge',
                       'MolLogP', 'MolMR', 'MolWt']

    return MolecularDescriptorCalculator(descriptor_vals), descriptor_vals


def vudf_compute_molecular_properties(smile_string_df: T.PandasDataFrame[str]) -> T.PandasSeries[dict]:
    """Vectorized implementation of compute molecular properties, for more efficient batch processing"""

    #Load the libraries:
    from rdkit import Chem

    smile_string_df.columns = ['SMILES']
    descriptor_calculator, descriptors = get_molecular_descriptor_calculator()

    def smiles_to_descriptors(smile):
        """Helper function applied to each row in the batch"""
        mol = Chem.MolFromSmiles(smile)
        mol_property_vals = descriptor_calculator.CalcDescriptors(mol)
        return dict(zip(descriptors, mol_property_vals))

    return smile_string_df.SMILES.apply(smiles_to_descriptors)

#Register the vectorized UDF:
vudf_compute_molecular_properties = session.udf.register(func=vudf_compute_molecular_properties,
                                                         name='vudf_compute_molecular_properties',
                                                         stage_location='@TMP',
                                                         replace=True,
                                                         is_permanent=True,
                                                         packages=['cachetools==4.2.2', 'numpy==1.23.5', 'pandas==1.4.3', 'rdkit==2022.09.4'],
                                                         session=session,
                                                         max_batch_size=500)

#Invoke the vectorized UDF:
smiles_df.with_column('MOLECULAR_PROPERTIES',
                      vudf_compute_molecular_properties(F.col('"Smiles"'))).limit(10).show()

After rewriting the UDF to its vectorized/batch equivalent and making use of a Medium-sized warehouse, the query takes 12.5 minutes to complete. As a rule of thumb, as your dataset size grows, consider using vectorized UDFs to process your input data more efficiently, in addition to increasing your warehouse size to allocate more resources to the query.
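The intuition behind this rule of thumb can be made concrete with a toy cost model (purely illustrative, with made-up numbers; this is not how Snowflake accounts for cost): a scalar UDF pays fixed per-call overhead once per row, while a vectorized UDF amortizes it over each batch of up to max_batch_size rows.

```python
import math

def total_cost(n_rows, batch_size, per_call_overhead, per_row_cost):
    """Toy cost model: each call pays a fixed overhead plus a per-row cost."""
    n_calls = math.ceil(n_rows / batch_size)
    return n_calls * per_call_overhead + n_rows * per_row_cost

# Scalar UDF: one call per row; vectorized UDF: one call per 500-row batch.
scalar = total_cost(2_300_000, batch_size=1, per_call_overhead=1.0, per_row_cost=1.0)
vectorized = total_cost(2_300_000, batch_size=500, per_call_overhead=1.0, per_row_cost=1.0)
print(scalar, vectorized)  # the fixed-overhead term shrinks by a factor of ~500
```

In practice the gain depends on how heavy the per-row RDKit work is relative to the call overhead, which is why the observed improvement (46 minutes down to 12.5, with the larger warehouse also contributing) is far more modest than the overhead ratio alone suggests.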

Molecular Similarity:

In addition to extracting molecular properties, we often want to understand the extent to which two molecules we are studying are similar to one another. Building on the assumption that molecules with similar structures exhibit similar behaviors, being able to screen large sets of molecules for similarity lets us potentially identify new bioactive molecules.

In the example below, we will define a UDF that computes the similarity between two molecules by first computing each molecule’s MACCS fingerprint, then using the Tanimoto coefficient between the pair of fingerprints to score their similarity. We will leverage the molecule pairs found here, as covered in Greg Landrum’s blog post.
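Before wiring this into a UDF, it is worth recalling what the Tanimoto coefficient computes on binary fingerprints: the number of "on" bits the two fingerprints share, divided by the number of "on" bits in either, i.e. |A ∩ B| / |A ∪ B|. A minimal pure-Python sketch over sets of bit indices (RDKit's DataStructs.TanimotoSimilarity computes the same quantity natively on its own bit-vector types):

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| over sets of set-bit indices."""
    union = bits_a | bits_b
    if not union:  # two empty fingerprints: treat similarity as 0
        return 0.0
    return len(bits_a & bits_b) / len(union)

# Two hypothetical fingerprints sharing 2 of their 4 distinct bits:
print(tanimoto({1, 5, 9}, {1, 9, 12}))  # 0.5
```

A coefficient of 1.0 means identical fingerprints, 0.0 means no shared bits; values in between are what we will compute at scale below.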

#Load the data:
pairs_df = pd.read_table('<YOUR_FILE_PATH>/chembl16_25K.pairs.txt', delimiter=' ',
                         names=['pair1_id', 'smile_string1', 'pair2_id', 'smile_string2'])

session.write_pandas(pairs_df, table_name="CHEMBL16_PAIRS", auto_create_table=True, overwrite=True)

#Define the UDF:
def udf_compute_maccs_tanimoto_similarity(smile_string_col1: str, smile_string_col2: str) -> float:
    """Computes the Tanimoto similarity of two smile string compounds using the MACCS fingerprint"""

    #Load libraries
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from rdkit import DataStructs

    #Compute the molecular representations
    mol1 = Chem.MolFromSmiles(smile_string_col1)
    mol2 = Chem.MolFromSmiles(smile_string_col2)

    #Generate the MACCS keys for each smile string
    maccs1 = MACCSkeys.GenMACCSKeys(mol1)
    maccs2 = MACCSkeys.GenMACCSKeys(mol2)

    return DataStructs.TanimotoSimilarity(maccs1, maccs2)

#Register the UDF:
udf_maccs_tanimoto_similarity = session.udf.register(func=udf_compute_maccs_tanimoto_similarity,
                                                     name='udf_compute_maccs_tanimoto_similarity',
                                                     stage_location='@TMP',
                                                     replace=True,
                                                     is_permanent=True,
                                                     packages=['cachetools==4.2.2', 'numpy==1.23.5', 'pandas==1.4.3', 'rdkit==2022.09.4'],
                                                     session=session)

#Call the UDF:
pairs_sdf = session.table('CHEMBL16_PAIRS')
pairs_sdf.with_column('MACCS_Tanimoto_similarity',
                      udf_maccs_tanimoto_similarity(F.col('"smile_string1"'),
                                                    F.col('"smile_string2"')))\
         .limit(100).show()

As a final example, we may want to know whether a specific substructure of interest exists within a compound, for screening purposes. In the UDF defined below, we pass in a specific pattern to search against and determine whether it exists within each molecule in our database:

def udf_has_substructure_match(smile_string_col: str, pattern: str) -> bool:
    """Given a molecule represented as a smile string and a user-defined pattern,
    returns True if the pattern is found within the input smile string"""

    #Load libraries
    from rdkit import Chem

    #Compute the molecular representation
    mol = Chem.MolFromSmiles(smile_string_col)
    patt = Chem.MolFromSmarts(pattern)

    return mol.HasSubstructMatch(patt)

#Register the UDF:
udf_has_substructure_match = session.udf.register(func=udf_has_substructure_match,
                                                  name='udf_has_substructure_match',
                                                  stage_location='@TMP',
                                                  replace=True,
                                                  is_permanent=True,
                                                  packages=['cachetools==4.2.2', 'numpy==1.23.5', 'pandas==1.4.3', 'rdkit==2022.09.4'],
                                                  session=session)

#Call the UDF:
smiles_df.with_column('has_substructure',
                      udf_has_substructure_match(F.col('"Smiles"'), F.lit('ccO'))).limit(10).show()

Wrap Up:

In this article, we covered how to install RDKit in your local development environment and use it to analyze molecular data that resides within your Snowflake account. In the UDFs we defined, we were able to bring in RDKit as a prerequisite library, write our core logic in pure Python, and register it in Snowflake, all without having to worry about the underlying infrastructure or dependency management.
