Generating Molecules Using a Char-RNN in PyTorch

Sunita Choudhary
5 min read · Jul 8, 2019


Before you dig into the details of recurrent neural networks: if you are a beginner, I suggest you first read an introduction to RNNs.

Note: To follow this article, you should have a basic knowledge of neural networks and of how PyTorch (a deep learning library) works. There are plenty of introductory articles that cover these concepts.

In this post, I am implementing an RNN model with PyTorch to generate SMILES.

In this post we will learn:

· Why/what are recurrent neural networks?

· The character-level RNN model

· RNNs for molecule (SMILES) generation

· Generating SMILES using RNNs

Recurrent Neural Networks - The idea behind RNNs is to make use of sequential information. RNNs are used for sequential data, such as audio or sentences, where the order of the data plays an important role.

What makes recurrent networks so special? The core reason recurrent nets are so exciting is that they allow us to operate over sequences of vectors:

Figure 1. A recurrent neural network.

Character-Level RNN Model:

Okay, so we have an idea of what RNNs are, why they are super exciting, and how they work. We'll now ground this in a fun application: we'll train a character-level RNN model. That is, we'll give the RNN a huge chunk of data (SMILES representations of molecules) and ask it to model the probability distribution of the next character in the sequence, given the sequence of previous characters. This will then allow us to generate new SMILES one character at a time. Together with this post I am also releasing code on GitHub (https://github.com/bayeslabs/genmol/tree/Sunita/genmol) that allows you to train a char-RNN model based on multi-layer LSTMs.

RNN for Molecules (SMILES) Generation-

In this post, we want to show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model.

To connect chemistry with language, it is important to understand how molecules are represented. Usually, they are modeled by molecular graphs, also called Lewis structures in chemistry. In molecular graphs, atoms are labeled nodes. The edges are the bonds between atoms, which are labeled with the bond order (e.g., single, double, or triple).

However, in models for natural language processing, the input and output of the model are usually sequences of single letters, strings, or words. We therefore employ the SMILES (Simplified Molecular Input Line Entry System) format, a type of chemical notation that represents molecules in a form that is easy for computers to process. It is a simple string representation of molecules, which encodes molecular graphs compactly as human-readable strings. SMILES is a formal grammar that describes molecules with an alphabet of characters, for example c and C for aromatic and aliphatic carbon atoms, O for oxygen, and −, =, and # for single, double, and triple bonds (see Figure 2). To indicate rings, a number is introduced at the two atoms where the ring is closed. For example, benzene in aromatic SMILES notation is c1ccccc1.

Figure 2. Examples of a molecule and its SMILES representation. To correctly create SMILES, the model has to learn long-term dependencies, for example, to close rings (indicated by numbers) and brackets.

Generating SMILES using RNNs:

I'll be showing you how I implemented my recurrent neural network in PyTorch. I trained it using the ChEMBL SMILES dataset, a manually curated database of bioactive, drug-like molecules that contains 2M SMILES.

Part 1: Importing libraries and data preprocessing -

First, we import torch, the deep learning library we'll be using, along with nn (PyTorch's neural network module) and torch.nn.functional, which includes non-linear functions like ReLU and sigmoid.
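
A minimal version of these imports might look like this (NumPy is assumed as well, since we'll use it for the encoding and batching steps):

```python
import numpy as np

import torch
from torch import nn
import torch.nn.functional as F
```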

Let's load the data file and name it text.
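
For example (the file name here is a placeholder; point it at your own copy of the ChEMBL SMILES file):

```python
# 'chembl_smiles.txt' is a placeholder path for the ChEMBL SMILES dataset,
# with one molecule per line.
with open('chembl_smiles.txt', 'r') as f:
    text = f.read()
```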

Then we’ll create a dictionary out of all the characters and map them to an integer. This will allow us to convert our input characters to their respective integers (char2int) and vice versa (int2char).
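
A sketch of this encoding step:

```python
# Vocabulary: every distinct character that occurs in the dataset.
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# Encode the whole text as an array of integers.
encoded = np.array([char2int[ch] for ch in text])
```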

Finally, we’re going to convert all the integers into one-hot vectors.
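
One straightforward way to do this:

```python
def one_hot_encode(arr, n_labels):
    # One row per character, one column per possible label.
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    # Put a 1 at the column given by each character's integer code.
    one_hot[np.arange(arr.size), arr.flatten()] = 1.0
    # Restore the original shape, with the label dimension appended.
    return one_hot.reshape((*arr.shape, n_labels))
```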

We will usually want to feed training data in batches to speed up the training process, so we define a method to make mini-batches for training.
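
A sketch of such a batching generator, which yields input windows together with targets shifted one character to the left:

```python
def get_batches(arr, batch_size, seq_length):
    """Yield (input, target) mini-batches of shape (batch_size, seq_length)."""
    chars_per_batch = batch_size * seq_length
    n_batches = len(arr) // chars_per_batch

    # Keep only enough characters to fill complete batches.
    arr = arr[:n_batches * chars_per_batch].reshape((batch_size, -1))

    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n + seq_length]
        # Targets are the inputs shifted by one; the last column wraps around.
        y = np.zeros_like(x)
        y[:, :-1] = x[:, 1:]
        y[:, -1] = arr[:, (n + seq_length) % arr.shape[1]]
        yield x, y
```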

Part 2: Building the Model

First, we're going to check whether we can train using the GPU, which will make the training process much quicker. If you don't have a GPU, be forewarned that training will take much longer. Check out Google Colaboratory or other cloud computing services!
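
The check itself is just a couple of lines:

```python
# Use the GPU when available, otherwise fall back to the CPU.
train_on_gpu = torch.cuda.is_available()
device = torch.device('cuda' if train_on_gpu else 'cpu')
```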

Now, we define our Char-RNN model! We will implement dropout for regularization, and rather than having the input sequence be words, we're going to look at individual letters/characters instead.

For our forward function, we'll propagate the input and memory values through the LSTM layer to get the output and next memory values. After performing dropout, we'll reshape the output to the proper dimensions for the fully connected layer.

Finally, we initialize the hidden state for the correct batch size. This method generates the first hidden state of zeros, which we will use in the forward pass, and creates it on the device we specified earlier. A sketch of the whole model, combining these pieces, follows below; for the full code, visit the GitHub link.
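
Putting these three pieces together, the model could look like this (the layer sizes here are illustrative defaults, not necessarily the exact values from the repository):

```python
class CharRNN(nn.Module):
    def __init__(self, tokens, n_hidden=256, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.n_hidden = n_hidden
        self.n_layers = n_layers

        # Keep the vocabulary and lookup tables with the model.
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}

        # Multi-layer LSTM over one-hot character vectors, with dropout.
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        # Fully connected layer maps hidden states back to character scores.
        self.fc = nn.Linear(n_hidden, len(self.chars))

    def forward(self, x, hidden):
        # Propagate the input and memory values through the LSTM layer.
        r_output, hidden = self.lstm(x, hidden)
        out = self.dropout(r_output)
        # Reshape to (batch * seq_length, n_hidden) for the linear layer.
        out = out.contiguous().view(-1, self.n_hidden)
        return self.fc(out), hidden

    def init_hidden(self, batch_size):
        # First hidden state: two zero tensors (hidden and cell state),
        # created on the same device as the model's weights.
        weight = next(self.parameters()).data
        return (weight.new_zeros(self.n_layers, batch_size, self.n_hidden),
                weight.new_zeros(self.n_layers, batch_size, self.n_hidden))
```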

Part 3: Training the Model

We'll declare a training function, where we define an optimizer (Adam) and a loss (cross-entropy loss). We then create the training and validation data and initialize the hidden state of the RNN. We loop over the training set, each time encoding the data into one-hot vectors, performing forward and backward propagation, and updating the weights. For the full code, please visit our GitHub profile: https://github.com/bayeslabs/genmol/tree/Sunita/genmol
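
A condensed sketch of that training function, building on the snippets above (validation-loss reporting is trimmed for brevity; the repository has the full version):

```python
def train(net, data, epochs=10, batch_size=128, seq_length=100,
          lr=0.001, clip=5, val_frac=0.1, print_every=50):
    net.to(device)
    net.train()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Hold out the tail of the data for validation.
    val_idx = int(len(data) * (1 - val_frac))
    data, val_data = data[:val_idx], data[val_idx:]

    counter = 0
    for e in range(epochs):
        h = net.init_hidden(batch_size)
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            inputs = torch.from_numpy(one_hot_encode(x, len(net.chars))).to(device)
            targets = torch.from_numpy(y).to(device)

            # Detach the hidden state so gradients don't flow across batches.
            h = tuple(t.detach() for t in h)

            net.zero_grad()
            output, h = net(inputs, h)
            loss = criterion(output, targets.view(batch_size * seq_length).long())
            loss.backward()
            # Clip gradients to tame the exploding-gradient problem in RNNs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()

            if counter % print_every == 0:
                print(f"Epoch {e + 1}/{epochs}... step {counter}... "
                      f"loss {loss.item():.4f}")
```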

We'll also have the method print some loss statistics (training loss and validation loss) to let us know whether the model is training correctly. Now, we'll just declare the hyperparameters for our model, create an instance of it, and train it!
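
For example (these hyperparameter values are illustrative, not the exact ones behind the results shown later):

```python
# Illustrative hyperparameters; tune for your own hardware and dataset.
net = CharRNN(chars, n_hidden=512, n_layers=2)
train(net, encoded, epochs=20, batch_size=128, seq_length=100, lr=0.001)
```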

Part 4: The prediction task

The input to the model is a sequence of characters (SMILES), and we train the model to predict the output: since RNNs maintain an internal state that depends on the previously seen elements, the question is, given all the characters seen up to this moment, what is the next character? After training, we'll create a function that performs one forward pass through the trained RNN: given a character, it predicts the next character and returns the predicted character and the hidden state.
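
A sketch of that prediction function:

```python
def predict(net, char, h=None, top_k=None):
    """Given a character, predict the next character.
    Returns the predicted character and the hidden state."""
    if h is None:
        h = net.init_hidden(1)

    x = np.array([[net.char2int[char]]])
    inputs = torch.from_numpy(one_hot_encode(x, len(net.chars))).to(device)

    with torch.no_grad():
        h = tuple(t.detach() for t in h)
        out, h = net(inputs, h)

    # Softmax turns the raw scores into a probability distribution.
    p = F.softmax(out, dim=1).cpu()

    # Restrict sampling to the k most likely next characters.
    if top_k is None:
        top_ch = np.arange(len(net.chars))
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()

    # Sample the next character from the (renormalized) probabilities.
    p = p.numpy().squeeze()
    char_idx = np.random.choice(top_ch, p=p / p.sum())
    return net.int2char[int(char_idx)], h
```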

Then, we'll define a sampling method that uses the previous function to generate an entire SMILES string: it first feeds in the priming characters (prime) and then loops, generating each next character with top-k sampling, which samples from the k characters most likely to come next.
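
And a matching sampling function, following the logic just described:

```python
def sample(net, size, prime='A', top_k=None):
    net.to(device)
    net.eval()  # turn off dropout while generating

    # Run the priming characters through the network to warm up the state.
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)
    chars.append(char)

    # Generate the remaining characters one at a time.
    for _ in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)
```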

Finally, we just call the method, define the size you want (I chose 120 characters) and the prime (I chose ‘A’), and get the result!
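
For instance (top_k=5 here is an illustrative choice):

```python
print(sample(net, 120, prime='A', top_k=5))
```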

Final Output-

In the next post, I will go into the technical models for generating molecules in more detail and show how to implement and train a model to generate molecules using generative models. For professional inquiries, please contact me at https://www.linkedin.com/in/sunita-c-b25b3187/
