Drug Discovery “Lipophilicity” using the OpenVINO toolkit

Abhishek Nandy
19 min read · Nov 19, 2023

What is this article all about?

This article presents a tool for bioinformatics, drug discovery, and protein analysis built with Intel’s OpenVINO toolkit.

What are we going to do?

We will predict the lipophilicity of peptides, proteins, and molecules.

Lipophilicity of Peptides and Proteins

Definition

- Lipophilicity refers to the tendency of a compound, such as a peptide or protein, to dissolve in fats, oils, and lipids, as opposed to water. It’s a measure of a molecule’s affinity for a lipophilic (fat-loving) environment versus a hydrophilic (water-loving) one.
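Lipophilicity is usually quantified as a partition coefficient between octanol and water. The sketch below is only an illustration of that definition: the function name and the concentrations are made up, and the article does not describe how the values in its dataset were measured.

```python
import math

def log_partition(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the octanol/water concentration ratio.

    Positive values mean the compound prefers the lipid-like phase,
    negative values mean it prefers water.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# Hypothetical concentrations (same units in both phases)
print(log_partition(100.0, 1.0))   # lipophilic compound
print(log_partition(1.0, 100.0))   # hydrophilic compound
```

When the measurement is taken at a fixed pH and counts both ionized and neutral forms, the same ratio is reported as log D instead of log P.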

Significance in Peptides and Proteins

- Peptides and proteins are chains of amino acids, and their lipophilicity is determined by the nature and sequence of these amino acids. The side chains of some amino acids are hydrophobic and tend to increase the lipophilicity of the molecule, while others are hydrophilic and decrease it.
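To make the role of individual side chains concrete, here is a minimal sketch of a sequence-based hydropathy score (GRAVY) using the Kyte-Doolittle scale. It is a crude proxy for illustration only, not the model built in this article, and the example sequences are invented.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # A KeyError here means the sequence contains a non-standard residue code
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)

print(gravy("LLVVAI"))  # hydrophobic residues give a positive score
print(gravy("KKDDEE"))  # charged residues give a negative score
```

A score like this ignores sequence order and 3D structure, which is one reason a learned model can be more useful than a lookup table.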

What the Lipophilicity Value Signifies

- Balance Between Hydrophobic and Hydrophilic Interactions: The lipophilicity value of a peptide or protein gives insight into how it will interact with other molecules. Proteins with higher lipophilicity tend to interact more with lipid membranes and less with aqueous environments.

- Solubility and Stability: Lipophilicity affects a molecule’s solubility, stability, and overall conformation in different environments. This is crucial in understanding how a protein or peptide behaves in biological systems.

- Transport and Bioavailability: For drugs, lipophilicity is a key factor in determining how well they are absorbed, distributed, and reach their target site in the body. Lipophilic drugs tend to cross cell membranes more easily.

Role in Drug Discovery

- Target Interaction: Drugs are often designed to interact with specific proteins. Understanding the lipophilicity of these target proteins helps in designing drugs that have the right balance of lipophilic and hydrophilic properties for optimal interaction.

- Improving Drug Efficacy and Safety: Drugs with appropriate lipophilicity are more likely to reach their target, bind effectively, and exert their therapeutic effect while minimizing off-target effects and toxicity.

- Formulation and Delivery: Knowledge of lipophilicity aids in drug formulation, ensuring that the drug remains stable, soluble, and effective throughout its shelf life and inside the body.

- Predicting Drug Behavior: Lipophilicity is used in pharmacokinetic modeling to predict how a drug will be absorbed, distributed, metabolized, and excreted — crucial for understanding its overall behavior in the body.

Updates

- Integral in Rational Drug Design: Lipophilicity is a crucial parameter in the rational design of peptides and proteins as therapeutic agents. It influences how well a drug can reach its target, its interactions with the target, and its overall pharmacokinetic and pharmacodynamic profiles.

- Balancing Act: In drug discovery, achieving the right balance of lipophilicity is key to developing effective and safe drugs. Too lipophilic, and the drug may be poorly soluble and have high toxicity; too hydrophilic, and it may fail to cross cell membranes to reach its target.

What we are trying to achieve

Applying PyTorch for modeling and then using the Intel OpenVINO toolkit for inference is a feasible approach for a wide range of machine learning tasks, including those related to bioinformatics and molecular biology. However, the process of modeling lipophilicity of proteins and using OpenVINO for inference involves several steps:

1. Data Preparation and Feature Extraction: You need a dataset that represents proteins and their lipophilic properties. This could include molecular descriptors, 3D structures, and known lipophilicity measurements. Feature extraction in the context of proteins might involve calculating various physicochemical properties or using techniques like molecular fingerprints.

2. Model Development in PyTorch: With the prepared dataset, you can develop a machine learning model using PyTorch. This might involve a regression model if you’re predicting a continuous measure of lipophilicity, or a classification model for categorizing proteins based on their lipophilicity. Deep learning approaches, especially those leveraging convolutional neural networks (CNNs), can be effective if you are working with 3D structural data of proteins.

3. Training and Validation: Train your model on a portion of your data and validate its performance using a separate validation set. It’s crucial to ensure that the model generalizes well and isn’t just memorizing the training data.

4. Model Conversion for OpenVINO: Once you have a trained PyTorch model, you’ll need to convert it into a format compatible with OpenVINO. OpenVINO can read ONNX (Open Neural Network Exchange) files directly and also has its own Intermediate Representation (IR) format. You can export the PyTorch model to ONNX and then either load it as is or convert it to IR. Older releases used the Model Optimizer tool for this conversion; releases from 2023.1 also provide `ov.convert_model` and `ov.save_model` in the Python API.

5. Inference with OpenVINO: With the model converted to OpenVINO’s format, you can now run inference efficiently, especially on Intel hardware. OpenVINO is designed to optimize performance by utilizing Intel CPUs, GPUs, and other hardware accelerators.

6. Analysis and Interpretation: Finally, analyze the output from the inference process. In the context of lipophilicity, this might involve interpreting how the model’s predictions correlate with known lipophilic properties and understanding the biological significance of these predictions.
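Steps 2 and 3 are not shown in this article, which picks up after training. As a rough sketch of what they could look like, the block below trains a small regressor on synthetic data. Everything in it is an assumption made for illustration: the layer sizes, the learning rate, the random "fingerprints", and the made-up target. Only the class name `Net` and the 2048-bit input width are taken from the code later in the article.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data: 256 "molecules", 2048-bit fingerprints,
# and a made-up target that depends linearly on the first 16 bits.
X = torch.randint(0, 2, (256, 2048)).float()
y = X[:, :16].sum(dim=1, keepdim=True) * 0.25

X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

class Net(nn.Module):
    """Small fully connected regressor: fingerprint in, one value out."""
    def __init__(self, n_bits: int = 2048):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_bits, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.layers(x)

model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

losses = []
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Evaluate on data the model has not seen during training
model.eval()
with torch.no_grad():
    val_loss = criterion(model(X_val), y_val).item()
print(f"train MSE: {losses[0]:.3f} -> {losses[-1]:.3f}, val MSE: {val_loss:.3f}")
```

Comparing the training and validation errors is the simplest check that the model generalizes rather than memorizes.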

Let’s analyze the notebook

%pip install -q "openvino>=2023.1.0"

By running this command in a Jupyter notebook, you instruct the notebook environment to install the specified version (or newer) of the OpenVINO toolkit, while keeping the output minimal. It’s a common way to set up the necessary software environment within a notebook before running code that depends on those packages.

%pip install -q rdkit-pypi

Running this command installs RDKit into the notebook’s Python environment (newer RDKit releases are published on PyPI as `rdkit`; `rdkit-pypi` is the older package name). RDKit then becomes available for cheminformatics tasks such as molecule manipulation, reading and writing chemical files, calculating molecular descriptors, and computing molecular similarity. This is particularly useful for scientists and developers working in computational chemistry, drug discovery, and materials science.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
from rdkit import RDConfig
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import os
import pandas as pd

The imports come from several Python libraries used in cheminformatics and machine learning. Here’s an explanation of each import:

1. `from rdkit import Chem`:

- Imports the `Chem` module from the RDKit library. RDKit is a widely used toolkit for cheminformatics. The `Chem` module provides classes and functions for handling and manipulating chemical structures.

2. `from rdkit.Chem import AllChem`:

- Imports the `AllChem` module from RDKit’s `Chem` package. `AllChem` provides a wide range of functions for chemical informatics, including molecule conversion, substructure searching, and molecular descriptor calculation.

3. `from rdkit.Chem import PandasTools`:

- Imports the `PandasTools` module from RDKit, which provides functions to integrate RDKit with pandas DataFrames. It is useful for managing chemical data within pandas, including rendering molecule structures within the DataFrame.

4. `from rdkit import RDConfig`:

- Imports the `RDConfig` module from RDKit. `RDConfig` contains configuration variables for RDKit, such as directory paths to data files and environmental settings.

5. `import torch`:

- Imports PyTorch, a popular deep learning library. PyTorch provides a flexible platform for building and training neural networks, with strong GPU acceleration support.

6. `import torch.nn as nn`:

- Imports the `nn` module from PyTorch as `nn`. This module provides the building blocks for creating neural networks, like layers, activation functions, and loss functions.

7. `import torch.optim as optim`:

- Imports the `optim` module from PyTorch as `optim`. This module includes optimization algorithms like SGD, Adam, etc., used for training neural networks.

8. `import numpy as np`:

- Imports the NumPy library as `np`. NumPy is fundamental for scientific computing in Python, offering powerful data structures for efficient computation with arrays and matrices.

9. `from sklearn.model_selection import train_test_split`:

- Imports the `train_test_split` function from scikit-learn’s `model_selection` module. This function is used to easily split datasets into training and test sets.

10. `from sklearn.metrics import mean_squared_error`:

- Imports the `mean_squared_error` function from scikit-learn’s `metrics` module. This function is used to calculate the mean squared error (MSE) between actual and predicted values, a common metric for regression models.

11. `import os`:

- Imports Python’s built-in `os` module, which provides functions for interacting with the operating system, like file path manipulation, directory management, and environment variable access.

12. `import pandas as pd`:

- Imports the pandas library as `pd`. Pandas is an essential data analysis and manipulation library, offering powerful data structures like DataFrames for handling tabular data.

Together, these imports cover the two halves of the project: RDKit handles the chemical data, while PyTorch and scikit-learn handle model development and evaluation.

# Path to the tab-separated dataset; replace with the actual file path
file_path = 'logd74.tsv'

# Load the dataset
df = pd.read_csv(file_path, delimiter='\t')

This snippet does two things:

1. Defining a File Path:

- `file_path = 'logd74.tsv'`: This line sets the variable `file_path` to the string `'logd74.tsv'`, the name of the file that contains the dataset. Replace it with the actual path to the file you want to load.

2. Loading a Dataset with pandas:

- `df = pd.read_csv(file_path, delimiter='\t')`: This line uses the pandas library (imported earlier as `pd`) to load a dataset from the file located at `file_path`.

- `pd.read_csv()`: This function is a common way to read data into a pandas DataFrame from a CSV (Comma-Separated Values) file. Despite its name, it can read files with various delimiters, not just commas.

- `delimiter='\t'`: This parameter specifies that the delimiter in the file is a tab character (`\t`). This is typical of TSV (Tab-Separated Values) files, which is consistent with the file extension `.tsv` in `logd74.tsv`.

- The resulting DataFrame, `df`, will contain the data from `logd74.tsv`, with each row corresponding to a line in the file, and columns determined based on the tab delimiter.

In summary, this code reads a dataset from a tab-separated file named `logd74.tsv` into a pandas DataFrame for further analysis or processing.
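The same call can be tried without the real file. The sketch below reads a tiny in-memory TSV; the column names and values are made up for illustration, and the actual columns in `logd74.tsv` may differ.

```python
import io
import pandas as pd

# Hypothetical three-row TSV; the real column names in logd74.tsv may differ
tsv_text = "smiles\tlogD\nCCO\t-0.3\nc1ccccc1\t2.1\nCC(=O)O\t-0.2\n"

df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(df.shape)         # rows and columns
print(df.dtypes)        # inferred column types
print(df.isna().sum())  # missing values per column
```

Checking the shape, types, and missing values right after loading catches most delimiter mistakes, for example a whole row ending up in a single column.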

Moving to the part after training

Here we convert the PyTorch model to the ONNX format.

import torch.onnx

dummy_input = torch.randn(1, 2048) # Adjust the size according to your model input
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

The provided code is using PyTorch to export a trained neural network model to the ONNX (Open Neural Network Exchange) format. Here’s a breakdown of each step:

Importing `torch.onnx`

- `import torch.onnx`:

- This line imports the `onnx` module from PyTorch, which provides necessary functions to export PyTorch models to the ONNX format. ONNX is a popular open-format used to represent deep learning models and allows for model interchange between various deep learning frameworks.

Preparing a Dummy Input

- `dummy_input = torch.randn(1, 2048)`:

- This creates a dummy input tensor using PyTorch’s `randn` function, which generates a tensor with random numbers drawn from a standard normal distribution.

- The size of the tensor `(1, 2048)` should match the input size that the model expects. In this case, it seems to be a single input (hence `1`) with 2048 features. This dummy input is used during the export process to trace the model’s operations.

Exporting the Model to ONNX Format

- `torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)`:

- `torch.onnx.export(…)` is the function used to export the PyTorch model to ONNX format.

- `model` is the PyTorch model that you want to export.

- `dummy_input` is the input tensor that will be passed through the model. This is necessary because the exporter traces the operations the model performs on it.

- `"model.onnx"` is the filename where the ONNX model will be saved. The `.onnx` file extension is standard for models in this format.

- `opset_version=11` specifies the version of the ONNX operator set to use. Different versions of the operator set might have different capabilities. It’s important to choose a version supported by the frameworks and tools you plan to use with the ONNX model.

This code exports a PyTorch neural network model to the ONNX format, so the model can be used in other frameworks and runtimes that support ONNX. This is useful for deployment or for running inference in environments where PyTorch is not the preferred framework. The dummy input lets the exporter record the operations and layer connections within the model.
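Before exporting, it can help to confirm that the dummy input has the shape the model expects. The sketch below uses a stand-in network, since the article does not show the trained architecture; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the trained model; the article's actual architecture is not shown
model = nn.Sequential(nn.Linear(2048, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()  # switch off training-only behaviour (dropout, batch-norm updates)

dummy_input = torch.randn(1, 2048)
with torch.no_grad():
    output = model(dummy_input)

# If the shapes are wrong, the forward pass fails here, before export
print(tuple(dummy_input.shape), "->", tuple(output.shape))
```

If the check passes, the same `dummy_input` can be handed to `torch.onnx.export`. Its `input_names` and `dynamic_axes` arguments can additionally name the input and mark the batch dimension as variable.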

The OpenVINO part

import openvino as ov

# Create OpenVINO Core object instance
core = ov.Core()

# Read the ONNX model
ov_model = core.read_model("model.onnx")

# Compile the model for the target device (CPU here)
compiled_model = core.compile_model(ov_model, "CPU")

The provided code is using the OpenVINO toolkit, a library developed by Intel for optimizing and deploying deep learning models, especially for inference. Here’s a breakdown of each step in the code:

1. Creating an OpenVINO Core Object Instance

- `core = ov.Core()`:

- This line creates an instance of the OpenVINO Core object. The Core object is a central entity in the OpenVINO runtime that allows you to work with models and perform various operations, like reading, loading, and compiling models for inference.

2. Reading the ONNX Model

- `ov_model = core.read_model("model.onnx")`:

- Here, the `read_model` method of the Core object is used to read the ONNX model saved previously in the file `"model.onnx"`.

- This method loads the model into an OpenVINO model object (`ov_model`). The model is now in a format that OpenVINO can work with, but it is not yet compiled for a specific device.

3. Compiling the Model

- `compiled_model = core.compile_model(ov_model, "CPU")`:

- This line compiles the loaded model for a specific hardware target, in this case, a CPU.

- The `compile_model` method optimizes the model for the specified hardware, potentially improving performance during inference. This step is crucial for leveraging hardware-specific optimizations that OpenVINO offers, particularly for Intel CPUs, GPUs, and other accelerators.

- The resulting `compiled_model` is an optimized, executable representation of the original deep learning model. It can be used to run inference efficiently on the specified hardware (CPU in this instance).

This code demonstrates how to use OpenVINO to load, optimize, and compile a deep learning model (originally in ONNX format) for efficient inference on a CPU. This process is integral to deploying deep learning models in production, especially in scenarios where inference speed and efficiency are critical. OpenVINO is particularly effective when used with Intel hardware, offering significant performance improvements.

MODEL_DIR = '/content/sample_data'  # Specify your directory as a string
MODEL_NAME = "lipophilicity_openvino"

# Ensure that `ov_model` is the original OpenVINO model object
# Save the OpenVINO model to disk
ov.save_model(ov_model, MODEL_DIR + "/" + f"{MODEL_NAME}.xml")

This snippet saves the OpenVINO model to disk. Here’s a breakdown of what each part of the code is doing:

1. Setting Directory and Model Name

- `MODEL_DIR = '/content/sample_data'`:

- This line sets a variable `MODEL_DIR` to the string `'/content/sample_data'`. This string represents a file path where the model will be saved. The path appears to be structured for a Google Colab environment, as indicated by the `/content` prefix.

- `MODEL_NAME = "lipophilicity_openvino"`:

- Here, `MODEL_NAME` is set to `"lipophilicity_openvino"`. This is the name that will be given to the saved model file.

2. Saving the OpenVINO Model

- `ov.save_model(ov_model, MODEL_DIR + "/" + f"{MODEL_NAME}.xml")`:

- `ov.save_model(…)`: This function is used to save an OpenVINO model to disk. The model must be an OpenVINO model object, which is indicated by the comment in the code.

- `ov_model`: This is the OpenVINO model object that you want to save. It should already be loaded or created in a previous step of your code.

- `MODEL_DIR + "/" + f"{MODEL_NAME}.xml"`: This is the path where the model will be saved. It concatenates the directory path (`MODEL_DIR`), a forward slash (acting as a directory separator), and the model name with the `.xml` extension. The `.xml` file describes the network topology; OpenVINO writes the weights to a companion `.bin` file with the same base name next to it.

- In the final form, it creates a path like `/content/sample_data/lipophilicity_openvino.xml`.

This code saves the OpenVINO model to a specified directory with a specified name, in OpenVINO’s IR format (an `.xml` file plus a `.bin` file). This is useful for persisting trained models, sharing them, or deploying them in different environments where OpenVINO is used for inference. The model can later be loaded from the `.xml` file for performing inference tasks.
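Because the IR format is a pair of files, it is easy to copy the `.xml` and forget the `.bin`. A small sketch of building both paths with `pathlib` follows; the helper name `ir_paths` is hypothetical and is not part of OpenVINO.

```python
from pathlib import Path

def ir_paths(model_dir: str, model_name: str) -> tuple[Path, Path]:
    """Return the expected .xml and .bin paths of a saved IR model."""
    xml_path = Path(model_dir) / f"{model_name}.xml"
    bin_path = xml_path.with_suffix(".bin")
    return xml_path, bin_path

xml_path, bin_path = ir_paths("/content/sample_data", "lipophilicity_openvino")
print(xml_path)
print(bin_path)
```

After `ov.save_model` has run, checking `xml_path.exists()` and `bin_path.exists()` confirms that both halves were written. Note also that `ov.save_model` compresses weights to FP16 by default (`compress_to_fp16=True`), which can slightly change regression outputs.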

# Load OpenVINO model on device
compiled_model = core.compile_model(ov_model, device.value)
compiled_model

This snippet compiles the OpenVINO model for a specific hardware device. Here’s a breakdown of what it’s doing:

Loading and Compiling the OpenVINO Model for a Specific Device

1. `compiled_model = core.compile_model(ov_model, device.value)`:

- `core.compile_model(…)`: This method is called on an OpenVINO Core object (`core`) that you would have previously instantiated. The method compiles the model for optimized inference on a specified hardware device.

- `ov_model`: This represents the OpenVINO model object you want to compile. This model would have been previously loaded or converted into OpenVINO’s format.

- `device.value`: This specifies the hardware device you want to compile the model for. The `device` here is expected to be an object defined earlier in the notebook (it is not shown in this article) that holds information about the target device for model compilation. The `.value` attribute is expected to contain the device identifier string (e.g., `'CPU'`, `'GPU'`, or `'AUTO'` to let OpenVINO choose). The actual hardware device used will depend on what is available and compatible in your system and what `device.value` is set to.

2. `compiled_model`:

- After the model is compiled, it is stored in the variable `compiled_model`. This compiled model is an optimized version of your original model, tailored for efficient inference on the specified hardware device.

This code is part of a workflow in OpenVINO to optimize and compile a deep learning model for efficient inference on a specific hardware device. The compiled model (`compiled_model`) can then be used to perform inference tasks with improved performance, taking advantage of hardware-specific optimizations offered by OpenVINO. The actual device for compilation is determined by `device.value`, which should be set to the identifier of the desired inference hardware.
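Since `device` is not defined in the code shown here, one option is to choose the device name yourself. The sketch below is a plain-Python helper; the name `pick_device` and the preference order are assumptions for illustration.

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Return the first preferred device that is available, else "AUTO".

    `available` is a list of device names such as ["CPU", "GPU.0"].
    """
    for name in preferred:
        for candidate in available:
            # "GPU.0" and "GPU.1" both count as a match for "GPU"
            if candidate == name or candidate.startswith(name + "."):
                return candidate
    return "AUTO"

print(pick_device(["CPU"]))            # CPU
print(pick_device(["CPU", "GPU.0"]))   # GPU.0
print(pick_device([]))                 # AUTO
```

In the notebook, the list could come from `core.available_devices`, and the returned string could be passed to `core.compile_model(ov_model, ...)` in place of `device.value`.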

# Convert a SMILES string to a Morgan fingerprint bit vector
def smiles_to_fp(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        raise ValueError(f"Invalid SMILES string: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)

# Example SMILES string
smiles = "C[C@H](N)C(=O)O" # Replace with your SMILES string

# Prepare input tensor
fp = smiles_to_fp(smiles)
input_tensor = torch.tensor(fp, dtype=torch.float32).unsqueeze(0) # Adding batch dimension

# Convert PyTorch tensor to NumPy array
input_numpy = input_tensor.numpy()

# Create OpenVINO tensor from NumPy array
ov_input_tensor = ov.Tensor(input_numpy)

# Run model inference
result = compiled_model([ov_input_tensor])[0]

# Postprocess and display the result
predicted_lipophilicity = result[0] # Assuming the model outputs a single value
print(f"Predicted Lipophilicity: {predicted_lipophilicity}")

The provided code snippet is executing a series of steps to perform model inference using OpenVINO, based on input derived from a chemical structure represented as a SMILES string. Here’s a detailed breakdown:

1. Defining a Function to Convert SMILES to Fingerprints

- `def smiles_to_fp(smiles, n_bits=2048)`:

- This function, `smiles_to_fp`, is defined to convert a SMILES (Simplified Molecular Input Line Entry System) string to a molecular fingerprint.

- `mol = Chem.MolFromSmiles(smiles)`: Converts the SMILES string into an RDKit molecule object.

- `fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)`: Generates a Morgan fingerprint for the molecule, which is a type of circular fingerprint used in cheminformatics.

- `return np.array(fp)`: Converts the fingerprint to a NumPy array and returns it.

2. Preparing the Input for the Model

- `smiles = "C[C@H](N)C(=O)O"`:

- This line defines a SMILES string representing a specific molecule.

- `fp = smiles_to_fp(smiles)`:

- Converts the SMILES string to its fingerprint representation using the defined function.

- `input_tensor = torch.tensor(fp, dtype=torch.float32).unsqueeze(0)`:

- Converts the fingerprint array to a PyTorch tensor with the appropriate data type (`float32`).

- `.unsqueeze(0)` adds an additional dimension to the tensor, effectively creating a batch dimension. This is necessary because models typically expect inputs in batch format, even if there’s only one item in the batch.

3. Preparing the Input for OpenVINO Inference

- `input_numpy = input_tensor.numpy()`:

- Converts the PyTorch tensor to a NumPy array. This is necessary for compatibility with OpenVINO, which accepts NumPy arrays as input.

- `ov_input_tensor = ov.Tensor(input_numpy)`:

- Creates an OpenVINO tensor from the NumPy array. This tensor is compatible with the OpenVINO runtime and can be used for inference.

4. Running Model Inference

- `result = compiled_model([ov_input_tensor])[0]`:

- Runs inference on the `compiled_model` using the prepared OpenVINO tensor.

- The `[0]` at the end extracts the first (and presumably only) output from the result.

5. Postprocessing and Displaying the Result

- `predicted_lipophilicity = result[0]`:

- Extracts the predicted lipophilicity value from the result. This assumes that the model outputs a single value representing the lipophilicity.

- `print(f"Predicted Lipophilicity: {predicted_lipophilicity}")`:

- Prints the predicted lipophilicity, displaying the result of the model inference.

This code is a complete workflow for taking a chemical structure in SMILES format, converting it into a format suitable for a neural network model (a molecular fingerprint), and then using that fingerprint to perform inference with an OpenVINO-compiled model to predict a property of the molecule (in this case, lipophilicity). The process includes data conversion steps necessary to interface between different libraries and frameworks (RDKit, PyTorch, and OpenVINO).
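Depending on the model’s output shape, `result[0]` may itself be a one-element array rather than a number, so the printed value can appear in brackets. A minimal sketch of turning such an output into a plain float follows; the helper name `to_scalar` and the stand-in array are assumptions, not part of the notebook.

```python
import numpy as np

def to_scalar(output) -> float:
    """Convert a single-element model output (any shape) to a Python float."""
    arr = np.asarray(output)
    if arr.size != 1:
        raise ValueError(f"expected one value, got shape {arr.shape}")
    return float(arr.reshape(-1)[0])

# Stand-in for a model output with a batch dimension, shape (1, 1)
fake_result = np.array([[1.23]], dtype=np.float32)
print(fake_result[0])          # still an array
print(to_scalar(fake_result))  # plain float
```

A plain float is also easier to format, for example with `f"{value:.2f}"`.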

from rdkit import Chem
from rdkit.Chem import Draw
import numpy as np
import torch
import openvino as ov  # from release 2023.1, Core and Tensor are available at the top level

# Load your PyTorch model (assuming it's already trained and saved)
model = Net() # Replace with your model class
model.load_state_dict(torch.load('lipophilicity_model.pth'))
model.eval()

# OpenVINO setup (assuming you have already converted your model)
core = ov.Core()
ov_model = core.read_model('/content/sample_data/lipophilicity_openvino.xml')
compiled_model = core.compile_model(ov_model, "CPU")

def predict_and_visualize(smiles):
    # Convert SMILES to fingerprint
    fp = smiles_to_fp(smiles)
    input_tensor = torch.tensor(fp, dtype=torch.float32).unsqueeze(0)
    input_numpy = input_tensor.numpy()

    # Create OpenVINO tensor
    ov_input_tensor = ov.Tensor(input_numpy)

    # Run model inference
    result = compiled_model([ov_input_tensor])[0]
    predicted_lipophilicity = result[0]

    # Visualize molecule
    mol = Chem.MolFromSmiles(smiles)
    img = Draw.MolToImage(mol)

    return predicted_lipophilicity, img

# Example usage
smiles_list = ["C[C@H](N)C(=O)O", "CCO", "CCN(CC)CC"] # Replace with your SMILES strings
for smiles in smiles_list:
    lipophilicity, img = predict_and_visualize(smiles)
    print(f"SMILES: {smiles}, Predicted Lipophilicity: {lipophilicity}")
    display(img)

This snippet sets up a complete workflow for predicting a chemical property (lipophilicity) of molecules represented by SMILES (Simplified Molecular Input Line Entry System) strings. It loads both the PyTorch model and the OpenVINO model, runs inference with the OpenVINO one, and visualizes the molecules. Here’s a breakdown of what each part of the code is doing:

1. Importing Necessary Libraries

- The code imports necessary modules from RDKit (a cheminformatics software), NumPy, PyTorch, and OpenVINO.

2. Loading and Setting Up the PyTorch Model

- The PyTorch model (`Net`) is loaded with its trained state and set to evaluation mode. This model is assumed to be trained to predict the lipophilicity of molecules based on their fingerprints.

3. Setting Up the OpenVINO Model

- An OpenVINO Core object is created, and the previously saved OpenVINO IR model (`lipophilicity_openvino.xml`) is read and compiled for inference on a CPU. This step optimizes the model for efficient execution on the specified hardware.

4. Defining the Prediction and Visualization Function

- `def predict_and_visualize(smiles)`:

- This function takes a SMILES string as input.

- The SMILES string is converted to a molecular fingerprint, which is then converted into a tensor compatible with both PyTorch and OpenVINO.

- Inference is run on the compiled OpenVINO model using the fingerprint.

- The molecular structure represented by the SMILES string is visualized using RDKit’s drawing tools.

- The function returns the predicted lipophilicity and the image of the molecule.

5. Running Predictions on a List of SMILES Strings

- The code iterates over a list of SMILES strings (`smiles_list`), predicts the lipophilicity for each molecule using the `predict_and_visualize` function, and prints the results. It also displays the visual representation of each molecule.

This code integrates cheminformatics (RDKit), machine learning (PyTorch), and model optimization and inference (OpenVINO) to predict a chemical property (lipophilicity) from molecular structures (SMILES). It demonstrates a sophisticated use case involving the intersection of chemistry and artificial intelligence, showcasing the capabilities of these libraries for tasks in computational chemistry and drug discovery.

Let’s create the web app

import streamlit as st
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from PIL import Image
import numpy as np
import torch
import openvino as ov  # from release 2023.1, Core and Tensor are available at the top level

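# Define the function to convert SMILES to fingerprints
def smiles_to_fp(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        raise ValueError(f"Invalid SMILES string: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)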

# Load the OpenVINO model (update the path as needed)
model_path = 'lipophilicity_openvino.xml' # Update this path
core = ov.Core()
compiled_model = core.compile_model(model_path, "CPU")

# Define the prediction function
def predict_lipophilicity(smiles):
    fp = smiles_to_fp(smiles)
    input_tensor = torch.tensor(fp, dtype=torch.float32).unsqueeze(0)
    input_numpy = input_tensor.numpy()

    # Create OpenVINO tensor from NumPy array
    ov_input_tensor = ov.Tensor(input_numpy)

    # Run model inference
    result = compiled_model([ov_input_tensor])[0]
    return result[0]

# Streamlit User Interface
st.title('Lipophilicity Prediction App')
st.write('Select a SMILES string to predict its lipophilicity and visualize the molecule.')

# Example SMILES strings
# smiles_options = ["C[C@H](N)C(=O)O", "CCO", "CCN(CC)CC", ...] # Add your SMILES strings here
smiles_options = [
"C[C@H](N)C(=O)O", "CCO", "CCN(CC)CC", "CC(=O)O", "C1=CC=C(C=C1)C(=O)O",
"C1CCC(CC1)N", "CC(C(=O)O)N", "C1CCCCC1", "C1=CC=CC=C1", "C1=CN=C(N=C1)N",
"C1CC1", "C1=CC=C(C=C1)O", "C1=CN=CN1", "C1=CC=C(C=C1)N", "C1=CC=CC=C1N",
"C1CCC(CC1)O", "C1=CC=C(C=C1)Cl", "C1=CN=C(N=C1)N", "C1CCNCC1", "C1=CC=C(C=C1)F"
]


# Dropdown for SMILES selection
selected_smiles = st.selectbox("Select a SMILES String", smiles_options)

# Button to make a prediction
if st.button('Predict Lipophilicity'):
    predicted_lipophilicity = predict_lipophilicity(selected_smiles)
    st.write(f"Predicted Lipophilicity: {predicted_lipophilicity}")

    # Visualize the molecule
    mol = Chem.MolFromSmiles(selected_smiles)
    mol_image = Draw.MolToImage(mol)
    st.image(mol_image, caption='Molecular Structure')

This Streamlit web application allows users to select a chemical compound represented by a SMILES string, predicts its lipophilicity using a pre-trained machine learning model (optimized with OpenVINO for CPU-based inference), and visualizes the molecular structure of the compound. The app demonstrates an interesting use case of combining cheminformatics, machine learning, and web development for a scientific application.

What is the web app doing?

The web application predicts the lipophilicity of chemical compounds and visualizes their molecular structures. It uses a combination of cheminformatics, machine learning, and web development technologies. Here’s a summary of its functionality and components:

Core Functionalities of the Web App

1. Predicting Lipophilicity:

- Users choose a chemical compound in the form of a SMILES (Simplified Molecular Input Line Entry System) string. SMILES is a notation that encodes the structure of chemical compounds as text strings.

- The application then predicts the lipophilicity of the chosen compound. Lipophilicity, which is the ability of a chemical compound to dissolve in fats, oils, and non-polar solvents (as opposed to water), is an important property in pharmacology and chemistry.

2. Visualizing Molecular Structures:

- The app also provides a visual representation of the molecular structure of the input compound. This is helpful for users to understand and verify the chemical structure of the compound they are analyzing.

Technical Components

1. Streamlit for Web Interface:

- The app is built using Streamlit, a Python library that simplifies the process of creating and deploying web applications. Streamlit is particularly popular in data science for quickly turning data scripts into shareable web apps.

2. RDKit for Cheminformatics:

- RDKit, a collection of cheminformatics and machine learning tools, is used for processing chemical information. In this app, it converts SMILES strings to molecular fingerprints (numerical representations) and generates images of the molecular structures.

3. OpenVINO for Model Inference:

- Intel’s OpenVINO toolkit is employed for running the machine learning model that predicts lipophilicity. The model is pre-trained and compiled for efficient performance on CPUs, which is a typical use case for OpenVINO.

4. PyTorch and NumPy for Data Handling:

- PyTorch is used for handling tensor operations, and NumPy for numerical operations. These libraries are integral in transforming the molecular data into a format suitable for the machine learning model.

User Interaction

- Users interact with the app through a simple interface. They select a compound from a dropdown menu of predefined SMILES strings and click the prediction button. The app then displays the predicted lipophilicity and a visual of the compound’s structure.

Conclusion

This web application demonstrates an innovative intersection of chemistry, machine learning, and web technology. It serves as a practical tool for chemists, pharmacologists, or anyone interested in the study of chemical compounds, allowing them to quickly predict and visualize important properties of molecules.

Video Link for the Web app

Full Code

Abhishek Nandy

Chief Data Scientist, PrediQt | Intel Certified oneAPI Instructor | Thinker | Innovator