BioAI - Medium

Graph Convolutional Networks

Chunkyun — Thu, 23 Jan 2020 00:25:02 GMT

Graph Neural Networks (GNNs) have contributed a great deal to raising the level of AI for drug discovery. Until now, Extended-Connectivity Fingerprints (ECFPs) have been one of the most widely used ways to represent a drug as features that a model can understand. According to Kearnes et al. (2016), however, representing a drug as a graph generally, though not always, performs better than ECFPs.

Fig 1. CVPR 2019 Top 25 keywords

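As the figure above shows, “graph” ranked 15th among the top keywords at CVPR 2019 (up from 55th at CVPR 2018). This reflects how much research is going into GNNs, and we expect that applying them to drug discovery will bring considerable progress. In this post we look at Graph Convolutional Networks (GCN), one of the representative GNN models.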

What Is a Graph?

Fig 2. Graph(left), Adjacency matrix(right)

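Here is a brief explanation of what a graph means in the context of GNNs. A graph consists of the following two components.

Node (Vertex): the circles labeled a, b, c, d, e, and f on the left of Fig 2 are nodes.
Edge: a line connecting two vertices.

For a drug, the nodes are atoms and the edges are bonds (single, double, triple, aromatic, and so on).

A graph can also be represented in a form that is relatively easy for a computer to handle by using an adjacency matrix, as on the right of Fig 2. The rows and columns are indexed by the nodes in the same order, and each entry is 1 if the two nodes are connected by an edge and 0 otherwise. Usually the diagonal elements of an adjacency matrix are 0, because a node has no edge to itself. In a GCN, however, they are set to 1 so that each node can use its own information.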
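
The construction of the adjacency matrix can be sketched in a few lines of plain Python. This is a minimal illustration, not the post's own code: the node names follow Fig 2, but the edge list is an assumption made up for the example, and `adjacency_matrix` is a hypothetical helper.

```python
nodes = ["a", "b", "c", "d", "e", "f"]
# Hypothetical edge list chosen for illustration only.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e"), ("e", "f")]


def adjacency_matrix(nodes, edges, self_loops=False):
    """Build a symmetric 0/1 adjacency matrix as a list of lists."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        A[i][j] = 1
        A[j][i] = 1  # undirected graph, so the matrix is symmetric
    if self_loops:
        for i in range(n):
            A[i][i] = 1  # GCN convention: each node also sees itself
    return A


A_plain = adjacency_matrix(nodes, edges)                  # diagonal is 0
A_gcn = adjacency_matrix(nodes, edges, self_loops=True)   # diagonal is 1
```

With `self_loops=True`, the row sum of a node counts the node itself in addition to its neighbors.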

Graphs also apply in settings such as the following.

Relationship networks on social networking services (SNS)
Citation networks in academic research
3D meshes

Graph Convolutional Networks

Fig 3. An example of Graph Convolutional Networks. Image taken from Thomas Kipf’s blog post

The basic idea of Graph Convolutional Networks (GCN) is to apply to graphs the convolution operation that Convolutional Neural Networks (CNN) apply to pixels.

Input

Fig 4. Input matrices of Graph Convolutional Networks

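A GCN takes the following two matrices as input.

A: the adjacency matrix of the graph
X: an N×D feature matrix (N = number of nodes, D = dimension of the node features)

For example, if the graph is a network of friendships on an SNS, each node is a person and each edge is the degree of friendship between two people. The feature matrix X is then built from the features of each node (age, height, weight, marital status, smoking status, and so on).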

Output

A GCN can produce either node-level or graph-level output, depending on the task at hand. For example, node-level output is appropriate for classifying individual people in an SNS relationship network, and graph-level output is appropriate for classifying drugs.

Node-level output Z: an N×F feature matrix (N = number of nodes, F = dimension of the output node features)
Graph-level output: obtained by applying a pooling operation
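
As a rough illustration of how pooling turns node-level output into graph-level output, the sketch below collapses an N×F matrix into a single F-dimensional vector. Mean and sum pooling are two common choices and are assumptions here, since the post does not say which pooling it uses; the matrix `Z` holds made-up values.

```python
def mean_pool(Z):
    """Average the node features column-wise: (N x F) -> (F,)."""
    n = len(Z)
    return [sum(column) / n for column in zip(*Z)]


def sum_pool(Z):
    """Sum the node features column-wise: (N x F) -> (F,)."""
    return [sum(column) for column in zip(*Z)]


# Made-up node-level output for a graph with N = 3 nodes and F = 2 features.
Z = [[1.0, 0.0],
     [0.0, 2.0],
     [2.0, 4.0]]

graph_vector_mean = mean_pool(Z)
graph_vector_sum = sum_pool(Z)
```

Both functions give the same result regardless of the order of the rows, which matters because the node ordering of a graph is arbitrary.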

How to update node feature

Fig 5. Information needed to update feature of node b(left), node a(right)

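To update a node's feature, a GCN uses the node's own information together with the information of its adjacent nodes. For example, updating node b uses the information of nodes a, b, c, and d (Fig 5, left), while updating node a uses only the information of nodes a and b (Fig 5, right).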

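This can be written as the following equation.

$$H_i^{(l+1)} = \sigma\Big( \sum_{j \in N(i)} H_j^{(l)} W^{(l)} + b^{(l)} \Big)$$

where $N(i)$ is the set containing node $i$ and the nodes adjacent to it, $H_j^{(l)}$ is the feature of node $j$ at layer $l$, $W^{(l)}$ and $b^{(l)}$ are the weight and bias of layer $l$, and $\sigma$ is the activation function.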

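In other words, the feature of node i at layer l+1 is obtained by multiplying the features of the nodes in its neighborhood (node i and the nodes adjacent to it) by the weight, adding the bias, and applying the activation function.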

For all nodes at once, this can be expressed as a single matrix equation.

$$H^{(l+1)} = \sigma\big( A H^{(l)} W^{(l)} + b^{(l)} \big)$$

Note that the diagonal elements of the adjacency matrix A must all be 1; with that, it is easy to show that each node also uses its own information.
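
The following sketch checks, on a tiny example, that the per-node update and the matrix form give the same result. It is an illustration under assumptions: the graph is just node b with its neighbors a, c, and d as in Fig 5, the feature values, weights, and bias are made up, ReLU stands in for the activation function, and `gcn_layer` and `update_node` are hypothetical helper names.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(A, H, W, b):
    """Matrix form: ReLU(A H W + b), with self-loops already in A."""
    AHW = matmul(matmul(A, H), W)
    return [[max(0.0, x + bias) for x, bias in zip(row, b)] for row in AHW]


def update_node(i, A, H, W, b):
    """Per-node form: sum H_j W over node i and its neighbors, add b, apply ReLU."""
    out = list(b)
    for j, connected in enumerate(A[i]):
        if connected:
            for k in range(len(W[0])):
                out[k] += sum(H[j][m] * W[m][k] for m in range(len(W)))
    return [max(0.0, x) for x in out]


# Nodes a, b, c, d; b is connected to a, c, d; the diagonal is 1 (self-loops).
A = [[1, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # made-up features
W = [[1.0, -1.0], [0.5, 0.5]]                          # made-up weights
b = [0.0, 0.1]                                         # made-up bias

H_next = gcn_layer(A, H, W, b)
H_next_per_node = [update_node(i, A, H, W, b) for i in range(len(A))]
```

Setting `A[i][i]` to 0 in this example removes node i's own feature from its update, which is the reason the diagonal is filled with 1.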

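However, using the equation above directly means A is not normalized, so the scale of the feature vectors changes completely during the computation. We therefore normalize the adjacency matrix A as follows.

$$\hat{A} = D^{-1/2} A D^{-1/2}$$

where D is the diagonal node degree matrix,

$$D_{ii} = \sum_j A_{ij}$$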

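D records how many nodes each node is connected to, including itself, and is easily obtained by summing the elements of each row of A. Taking the inverse of D, then its square root, and multiplying the result on both sides of A gives the normalized adjacency matrix.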
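
A minimal sketch of this normalization in plain Python follows. Because D is diagonal, multiplying by the inverse square root of D on both sides amounts to dividing entry (i, j) by the square root of degree i times degree j. The example matrix is an assumption (node b connected to a, c, and d, with self-loops), and `normalize_adjacency` is a hypothetical helper name.

```python
import math


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D_ii is the sum of row i of A."""
    degree = [sum(row) for row in A]            # includes the self-loop
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


# Nodes a, b, c, d with self-loops; b is connected to a, c, d.
A = [[1, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1]]

A_hat = normalize_adjacency(A)
```

The self-loops guarantee that every degree is at least 1, so the division is always defined.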

In practice, the node features are updated by passing through about three graph convolution layers, after which classification or regression is performed according to the task.
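
One way to read the effect of stacking layers is that each additional graph convolution lets a node draw on information one hop further away. The sketch below tracks, for a made-up five-node path graph, which input nodes can influence each node after a given number of layers; `reachable_after` is a hypothetical helper and ignores the actual feature values.

```python
def reachable_after(A, layers):
    """For each node, the set of input nodes that can influence it after `layers` convolutions."""
    n = len(A)
    reach = [{i} for i in range(n)]
    for _ in range(layers):
        # One layer merges what each neighbor (and the node itself) already sees.
        reach = [set().union(*(reach[j] for j in range(n) if A[i][j]))
                 for i in range(n)]
    return reach


# Made-up path graph 0-1-2-3-4 with self-loops on the diagonal.
A = [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1]]

one_layer = reachable_after(A, 1)
three_layers = reachable_after(A, 3)
```

In this example node 0 sees only itself and node 1 after one layer, and nodes 0 through 3 after three layers.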

Conclusion

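Graph Neural Networks are powerful, and even more so in drug discovery, where the data is naturally represented as graphs. This post covered Graph Convolutional Networks, one of the representative GNN models. Starting from here, if we study state-of-the-art graph models and apply them to drug discovery, we expect to see AI-designed drugs on the market before long.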

References

[1] Thomas Kipf’s blog post

[2] Thomas N. Kipf and Max Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” ICLR 2017

[3] Slide by DonghyeonKim

Graph Convolutional Networks was originally published in BioAI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Drug Discovery and Development

Hery Chung — Sat, 14 Dec 2019 15:18:23 GMT

The discovery and development of new medicines is a long and complex process. The development of one new medicine takes about 10–15 years from the time it is discovered to when it is available for treating patients. The average cost of research and development for a single successful drug is estimated to be $800 million to $1 billion. For every 5,000–10,000 compounds that enter the research and development (R&D) pipeline, ultimately only one receives approval.

Figure 1. Drug discovery and Development pipeline. (Pharma, 2007)

The overall process to develop a new drug can be divided into a discovery phase and development phase:

I. The discovery process includes all early research to identify a new drug candidate and test it in the laboratory. This process takes approximately 3–6 years, and by the end researchers hope to have a promising candidate drug to test in people.

II. The development process involves a series of clinical trials, each with its own specific goals and requirements. A candidate drug must prove to be safe and effective before it can be approved. This process takes an average of 6–7 years.

In the Discovery process, first a target related to a specific disease needs to be identified. The next step is hit identification, where compounds are identified from molecular libraries. Structure–activity and in silico studies in combination with cellular functional tests are used in an iterative cycle to improve the functional properties of newly synthesized drug candidates. Subsequently, in vivo studies such as pharmacokinetic investigations and toxicity tests are performed in animal models.

In the Development phase, the drug candidate, which has now passed preclinical tests, can be administered to humans in clinical trials. The potential drug must pass through three phases in sequence: Phase I, drug safety testing with a small number of human subjects; Phase II, drug efficacy testing with a small number of people affected by the targeted disease; and Phase III, efficacy studies with a larger number of patients.

Figure 2. Summary of the overall Drug discovery and Development process. (Pharma, 2007)

Research is mostly focused on the Discovery phase in order to identify a promising compound for development. This phase can be broadly divided into five major stages:

Pre-discovery: Choose and understand the disease
Target Identification: Choose a molecule to target with a drug
Find a “Lead Compound”: Find a promising molecule that could become a drug
Lead Optimization: Alter the structure of lead candidates to improve properties. (Structure-Activity relationships, improve target interactions and pharmacokinetic properties)
Preclinical Testing: Laboratory and animal testing to determine if the drug is safe enough for human testing.

Many computational tools are now available to aid the drug discovery process, particularly in the rational design of new drugs. Important computer-aided drug design strategies are summarized in Figure 3.

Figure 3. Computer-aided drug design. Adapted from Molecules. 2019 May; 24(9): 1693.

Finally, advances in understanding human biology and disease open up new possibilities for breakthrough medicines. These possibilities will grow as our scientific knowledge expands and becomes increasingly complex.

H. C. S. Chan, H. Shan, T. Dahoun, H. Vogel, and S. Yuan, “Advancing Drug Discovery via Artificial Intelligence,” Trends Pharmacol. Sci., vol. 40, no. 8, pp. 592–604, Aug. 2019.

“Drug Discovery and Development, Understanding the R&D Process,” Pharmaceutical Research and Manufacturers of America, 2007, innovation.org.

M. Aminpour, C. Montemagno, and J. A. Tuszynski, “An Overview of Molecular Modeling for Drug Discovery with Specific Illustrative Examples of Applications,” Molecules, vol. 24, no. 9, p. 1693, Apr. 2019.


Deep learning for Drug-Target Interaction

Chunkyun — Tue, 12 Nov 2019 12:44:07 GMT

Interpretable Drug Target Prediction Using Deep Neural Representation

This post is based on the paper: Gao, Kyle Yingkai, et al. “Interpretable Drug Target Prediction Using Deep Neural Representation.” IJCAI. 2018.

Introduction

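Predicting drug-target interactions (DTIs) plays an important role in new drug development. Here, a drug is a chemical compound and a target is a protein. Existing approaches to predicting DTIs include 1) molecular docking and 2) machine learning.

1) Molecular docking

DTIs are predicted through 3D simulation by finding the state that forms a stable complex (the one where the score function is highest). For a detailed explanation, please refer here.

2) Machine learning

DTIs are predicted by applying machine learning methods to features selected using domain knowledge.

Approach 1) is costly in time and money, and approach 2) requires a high level of domain knowledge. To address these problems, recent work has tried to predict DTIs with deep learning.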

Problem Formulation

Model Architecture

Figure 1: Overall data flow and neural network architecture.

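The model structure is shown in Figure 1. The inputs are the target (protein) and the drug, in the form given in the problem formulation above. The model has four core network components (highlighted in orange in Figure 1), which work as follows.

1) LSTM

After the protein is represented as a sequence of amino acid embeddings, the LSTM models the sequential relationships in the input. It is nearly identical to the familiar LSTM; the version used in the paper is shown below.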

LSTM structures

2) Graph Convolutional Neural Networks

Graph Neural Networks (GNNs) are currently the most widely used deep learning models in pharmaceutical applications. A drug can be described by its molecular structure, and a molecular structure can be represented as a graph, which is why GNNs are considered a good fit. To feed a drug to the model, it only needs to be expressed as an adjacency matrix, which makes it easier to use than the previously common Extended-Connectivity Fingerprints (ECFPs).

The paper uses Graph Convolutional Neural Networks (graph CNNs), the most widely used of the representative GNN models. Passing a drug's graph through the graph CNN yields a dense vector, which can then be used for classification. The nodes of the graph are atoms, and the edges are the bonds between atoms. GNNs are inseparable from deep learning for pharmaceuticals and deserve a deep understanding; we stop here for now and will cover GNNs in detail in a later post.

3) Attentive Pooling Networks

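Rather than only predicting DTIs, the model uses an attention mechanism that can also provide an interpretation of the result. Specifically, it uses an attentive pooling network, a two-way attention that allows pairwise inference. Its notation is similar to that of a standard attention mechanism and is as follows.

For interpretation, an importance score is computed for each input unit, and the attention weights are computed from these scores.

Finally, the scores are normalized with a softmax, and the result is used as the weights.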
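
The softmax step can be sketched as follows. This is a generic illustration rather than the paper's exact formulation: the importance scores are taken as given inputs (how the paper computes them is not reproduced here), the unit vectors are made up, and `attentive_pool` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Normalize raw importance scores into weights that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(units, scores):
    """Weighted sum of the input unit vectors, using softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(units[0])
    pooled = [sum(w * u[k] for w, u in zip(weights, units)) for k in range(dim)]
    return pooled, weights


# Made-up example: two input units (for instance, two residues or two atoms).
units = [[1.0, 0.0], [3.0, 2.0]]
pooled_equal, weights_equal = attentive_pool(units, scores=[0.0, 0.0])
pooled_skewed, weights_skewed = attentive_pool(units, scores=[0.0, 5.0])
```

The returned weights are what makes the result interpretable: a unit with a larger weight contributed more to the pooled vector.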

4) Siamese network

The vectors obtained from the protein and drug attention are used to compute a similarity (the probability that an interaction exists), and a threshold is applied to predict whether the interaction is present.
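
A rough sketch of this last step is shown below. Cosine similarity and the threshold value of 0.5 are assumptions chosen for illustration, since the post does not state which similarity function or threshold the paper uses; `predict_interaction` is a hypothetical helper and the vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_interaction(protein_vec, drug_vec, threshold=0.5):
    """Return the similarity score and whether it reaches the threshold."""
    score = cosine_similarity(protein_vec, drug_vec)
    return score, score >= threshold


# Made-up pooled vectors for one protein and two candidate drugs.
protein = [1.0, 2.0]
score_a, interacts_a = predict_interaction(protein, [1.0, 2.0])
score_b, interacts_b = predict_interaction(protein, [2.0, -1.0])
```

Both branches share the same comparison step, so the protein and drug vectors must have the same dimension.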

Loss Function

It is not possible to label every combination of drug and protein. The paper therefore uses a pairwise ranking loss function that, for a given protein, maximizes the margin between drugs that interact with it and drugs that do not. The pairwise ranking loss function is as follows.
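
To make the idea of a margin concrete, here is a generic hinge-style pairwise ranking loss. It is a common textbook form and an assumption here, not necessarily the exact loss used in the paper; the margin value and the scores are made up.

```python
def pairwise_ranking_loss(pos_score, neg_score, margin=0.5):
    """Zero when the interacting drug outscores the non-interacting one by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# For one protein: score of a drug known to interact vs. a drug assumed not to.
loss_well_separated = pairwise_ranking_loss(pos_score=0.9, neg_score=0.1)
loss_wrong_order = pairwise_ranking_loss(pos_score=0.2, neg_score=0.6)
```

The loss only needs pairs of drugs to be ordered correctly for a protein, which is why it suits data where most drug-protein combinations have no label.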

Experiment Dataset

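After splitting the data into training and test sets, performance can vary depending on how many of the drugs and proteins in the training set also appear in the test set. For a fair comparison, the models were therefore compared under the following four settings.

1) Both the drug and the protein appear in the training data

2) The protein appears in the training data but the drug does not

3) The drug appears in the training data but the protein does not

4) Neither the drug nor the protein appears in the training data

Accuracy will be higher for cases observed in the training data than for unobserved ones. But since high accuracy is also needed on unobserved cases, the results are compared separately for each setting.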
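
The four settings can be assigned to test pairs with a small helper like the one below. The function name and the identifiers in the example are hypothetical; only the four-way rule comes from the list above.

```python
def split_setting(pair, train_drugs, train_proteins):
    """Return 1-4 according to whether the pair's drug and protein were seen in training."""
    drug, protein = pair
    seen_drug = drug in train_drugs
    seen_protein = protein in train_proteins
    if seen_drug and seen_protein:
        return 1  # both seen
    if seen_protein:
        return 2  # protein seen, drug unseen
    if seen_drug:
        return 3  # drug seen, protein unseen
    return 4      # neither seen


# Hypothetical identifiers for illustration.
train_drugs = {"drug_1", "drug_2"}
train_proteins = {"protein_1"}
test_pairs = [("drug_1", "protein_1"), ("drug_9", "protein_1"),
              ("drug_2", "protein_9"), ("drug_9", "protein_9")]
settings = [split_setting(p, train_drugs, train_proteins) for p in test_pairs]
```

Reporting metrics per setting keeps easy pairs from setting 1 from masking poor performance on unseen drugs or proteins.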

Results

The models proposed in the paper are E2E/GO (the variant without GO annotations) and E2E. In settings 1) and 2) they did not outperform existing methods on accuracy and other metrics, but in settings 3) and 4) they did.

Interpretability

The attention weights from the attentive pooling network make it possible to find which parts of the protein drive the interaction with the drug (GLU 160, PHE 159). Conversely, from the drug's point of view, they show which parts drive the interaction with the protein (marked in saturated red).

These results were similar to those of molecular docking, and because a deep learning model reduces the time and cost this process requires, it is expected to help new drug development.


Molecular Docking

BioAI — Mon, 28 Oct 2019 16:50:15 GMT

As technology advances, new tools are available to study and understand the interaction between ligands and proteins. As a result, many methods to predict drug-target interaction have been developed and are used for the new drug discovery process.


Molecular Docking is a well-established in silico method widely used in structure-based drug design. It predicts the ligand-protein complex structure by modeling ligand and protein interactions at the atomic level. Through this process, we can gain useful insight into the ligand-binding modes and binding process, which can be applied to identify novel compounds of therapeutic interest.

In general terms, molecular docking aids in:

Ligand and protein interaction modeling
Prediction of ligand-protein binding structure
Prediction of ligand activity on the target protein binding site

Fig 1. Protein-ligand complex formation process

Ligand: Drug (Candidate)

Protein: Drug target (e.g. Enzyme, Receptor, Ion Channel, Transporter, etc)

The molecular docking method predicts the preferred relative orientation of one molecule (the key) when bound in the active site of another molecule (the lock) to form a stable complex, such that the free energy of the overall system is minimized.

Fig 2. Ligand binds to the active site of the target protein according to the key and lock model to form the protein-ligand complex.

The molecular docking process involves two main steps:

1) Sampling different ligand conformations, positions, and orientations within a particular binding site of the target.

2) Evaluating the resulting docking poses with a scoring function in terms of binding affinity.

The protein docking problem can be defined as an optimization problem where the objective is to minimize the binding energy of the complex.

The following figure (Fig. 3) illustrates an example of a drug and drug-target complex. A Selective Serotonin Reuptake Inhibitor (SSRI) bound to the 5-HT re-uptake transporter (5-HTT or SERT) inhibits the re-uptake of serotonin into the pre-synaptic neurons. This in turn helps to maintain normal concentrations of serotonin in the synapses, an effect used in the treatment of depression.

Fig 3. Example of drug-target binding, SSRI bound to 5-HT re-uptake transporter.

Drug targets vary, but they are mostly proteins such as enzymes, ion channels, and transporters. Therapeutically popular drug targets can change over time.

Fig 4. Example of drug-target

The variety of docking software available can be distinguished based on two basic components: the sampling algorithm and the scoring function.

Sampling Algorithm

The number of possible ways a ligand can bind to a protein active site is too large to enumerate, so it is impossible to generate and compare every ligand conformation. Three sampling algorithms are therefore commonly used.

1. Pharmacophore-based Algorithm

A pharmacophore is a description of molecular features that are essential for molecular recognition of a ligand by a specific target and can be responsible for a particular biological or pharmacological action.

This is a type of matching algorithm based on molecular shape where the protein and the ligand are represented as pharmacophores. Ligands are mapped into the active site of a protein in terms of shape features and chemical information.

Fig 5. Pharmacophore based Algorithm

2. Fragment-based Algorithm

This algorithm divides the ligand into several fragments that are docked separately. The largest fragment is the most likely to have an important functional role, so it is docked first. The remaining fragments are added incrementally afterward.

Fig 6. Fragment-based Algorithm

3. Stochastic Search Algorithm

These methods search the conformational space by randomly changing the ligand conformation. Ligand poses are generated through bond rotation, rigid-body translation, or rotation. Poses are then selected based on an energy criterion. If a pose passes the criterion, it is used to generate the next conformation. This iteration is repeated until the pre-defined number of conformations is reached.

Fig 7. Stochastic Search Algorithm
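
The loop described above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the pose is reduced to a single torsion angle, `toy_energy` is a made-up energy surface, and the Metropolis rule is used as the energy criterion (one common choice; the text only says that an energy criterion is applied).

```python
import math
import random


def toy_energy(angle):
    """Made-up one-dimensional energy surface over a torsion angle in radians."""
    return math.cos(3 * angle) + 0.5 * math.cos(angle)


def stochastic_search(energy, start, n_conformations=50, step=0.3,
                      temperature=0.5, seed=0, max_trials=100000):
    """Randomly perturb the pose; keep candidates that pass the energy criterion."""
    rng = random.Random(seed)
    current = start
    accepted = [current]
    trials = 0
    while len(accepted) < n_conformations and trials < max_trials:
        trials += 1
        candidate = current + rng.uniform(-step, step)  # random change of the pose
        delta = energy(candidate) - energy(current)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate          # the accepted pose seeds the next one
            accepted.append(current)
    best = min(accepted, key=energy)
    return best, accepted


best_angle, conformations = stochastic_search(toy_energy, start=0.0)
```

Accepting some uphill moves lets the search leave a local minimum, at the cost of spending trials on poses that are later discarded.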

Scoring function

The scoring function is used to evaluate the conformations generated by a sampling algorithm and to discriminate binders from inactive compounds. Scoring functions estimate the binding affinity between the protein and the ligand and can be divided into force-field-based, empirical, and knowledge-based scoring functions.

1. Force-field-based scoring functions

These functions assess the binding energy by summing the non-bonded (electrostatic and van der Waals) interactions. The electrostatic terms are calculated by a Coulombic formulation and the van der Waals terms are described by a Lennard-Jones potential function. Cut-off distances are used to treat non-bonded interactions.
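
A simplified sketch of such a score is given below. It is an illustration under assumptions: a single Lennard-Jones `epsilon` and `sigma` are used for every atom pair (real force fields use per-atom-type parameters), the conversion factor of about 332 gives energies in kcal/mol for charges in units of e and distances in Å, and the function names, cutoff, and coordinates are made up.

```python
import math

COULOMB_CONSTANT = 332.06  # approx. kcal*Å/(mol*e^2)


def pair_energy(r, q_i, q_j, epsilon, sigma):
    """Coulomb plus 12-6 Lennard-Jones energy for one atom pair at distance r."""
    coulomb = COULOMB_CONSTANT * q_i * q_j / r
    sr6 = (sigma / r) ** 6
    lennard_jones = 4.0 * epsilon * (sr6 * sr6 - sr6)
    return coulomb + lennard_jones


def force_field_score(ligand_atoms, protein_atoms, cutoff=8.0,
                      epsilon=0.1, sigma=3.4):
    """Sum pair energies over ligand-protein atom pairs closer than `cutoff`."""
    total = 0.0
    for pos_l, q_l in ligand_atoms:
        for pos_p, q_p in protein_atoms:
            r = math.dist(pos_l, pos_p)
            if r <= cutoff:  # pairs beyond the cut-off are ignored
                total += pair_energy(r, q_l, q_p, epsilon, sigma)
    return total


# Made-up atoms: (position in Å, partial charge in e).
ligand = [((0.0, 0.0, 0.0), 0.3)]
protein = [((4.0, 0.0, 0.0), -0.4), ((20.0, 0.0, 0.0), 0.5)]
score = force_field_score(ligand, protein)
```

In this example only the first protein atom contributes, because the second lies beyond the cut-off distance.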

2. Empirical-based scoring functions

Binding energy is decomposed into several energy components, such as hydrogen bond, ionic interaction, hydrophobic effect, and binding entropy. Each component is multiplied by a coefficient and then summed up to give a final score. Coefficients are obtained from regression analysis fitted to a test set of ligand-protein complexes with known binding affinities.
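
The weighted-sum form can be sketched as follows. The term names, term values, and coefficients are all hypothetical numbers chosen for illustration; in practice the coefficients would come from the regression fit described above, which is not reproduced here.

```python
def empirical_score(terms, coefficients, intercept=0.0):
    """Weighted sum of energy components: intercept + sum(coefficient * term)."""
    return intercept + sum(coefficients[name] * value
                           for name, value in terms.items())


# Hypothetical per-pose term values (e.g. counts or areas) and fitted coefficients.
terms = {"hydrogen_bond": 2.0, "ionic": 1.0, "hydrophobic": 3.0, "entropy": 4.0}
coefficients = {"hydrogen_bond": -1.0, "ionic": -2.0,
                "hydrophobic": -0.5, "entropy": 0.25}

score = empirical_score(terms, coefficients)
```

Because the score is a plain linear combination, each component's contribution to the final value can be read off directly.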

3. Knowledge-based scoring functions

These functions use statistical analysis of the crystal structures of ligand-protein complexes to obtain the interatomic contact frequencies and/or distances between the ligand and protein. These frequency distributions are then converted into pairwise atom-type potentials. The score is calculated by favoring preferred contacts and penalizing repulsive interactions between each atom in the ligand and protein within a given cutoff.
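
One common way to convert contact frequencies into potentials is the inverse Boltzmann relation, sketched below. Its use here is an assumption, since the text does not name the conversion; the counts, the atom-type labels, and the value of kT (about 0.593 kcal/mol near room temperature) are illustrative.

```python
import math


def pair_potential(observed_count, expected_count, kT=0.593):
    """Inverse Boltzmann: contacts seen more often than expected get a negative (favorable) energy."""
    return -kT * math.log(observed_count / expected_count)


def knowledge_based_score(contacts, potentials):
    """Sum the pairwise potentials over the atom-type contacts found in a pose."""
    return sum(potentials[pair_type] for pair_type in contacts)


# Hypothetical statistics: (observed, expected) contact counts per atom-type pair.
statistics = {("N", "O"): (200, 100), ("C", "O"): (50, 100), ("C", "C"): (100, 100)}
potentials = {pair: pair_potential(obs, exp) for pair, (obs, exp) in statistics.items()}

# Contacts within the cutoff for one made-up pose.
pose_contacts = [("N", "O"), ("N", "O"), ("C", "O")]
score = knowledge_based_score(pose_contacts, potentials)
```

Pair types observed exactly as often as expected contribute zero, so they neither favor nor penalize a pose.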

References

[1] V. Khanna and N. Petrovsky, “Rational Structure-Based Drug Design,” Encycl. Bioinforma. Comput. Biol., pp. 585–600, Jan. 2019.

[2] X.-Y. Meng, H.-X. Zhang, M. Mezei, and M. Cui, “Molecular docking: a powerful approach for structure-based drug discovery,” Curr. Comput. Aided. Drug Des., vol. 7, no. 2, pp. 146–57, Jun. 2011.

[3] S. Zarbafian et al., “Protein docking refinement by convex underestimation in the low-dimensional subspace of encounter complexes,” Sci. Rep., vol. 8, no. 1, p. 5896, Dec. 2018.

[4] L. Pinzi and G. Rastelli, “Molecular Docking: Shifting Paradigms in Drug Discovery,” Int. J. Mol. Sci., vol. 20, no. 18, p. 4331, Sep. 2019.

[5] A. Sethi, K. Joshi, K. Sasikala, and M. Alvala, “Molecular Docking in Modern Drug Discovery: Principles and Recent Applications,” in Drug Discovery and Development — New Advances [Working Title], IntechOpen, 2019.
