PlayMolecule® AceProfiler: Looking for homologs in the Protein Data Bank [TUTORIAL]

Alejandro Varela
Published in PlayMolecule
May 31, 2022

The Protein Data Bank is a database containing more than 190,000 structures of proteins and other biomolecules. This is a great resource for structure-guided drug discovery; however, it can be challenging to navigate such an enormous amount of data. To tackle this problem, Acellera has developed AceProfiler, a new application in PlayMolecule that looks for homologs of a provided structure or sequence, aligns them, and returns a convenient set of .pdb files with the hits found.

Since the hits are all aligned to a common reference, you can quickly identify structural changes among them (e.g. active and inactive states, apo and holo conformations), use them to guide the creation of a homology model, or create a pharmacophore from the structures with bound ligands.

We hope you find this app useful and, if you are missing any features, let us know in the comments! Let’s see how it works with a couple of examples!

Let’s go to PlayMolecule and click on the AceProfiler app. This app accepts a raw sequence, PDB codes or PDB files as input. Let’s start with the sequence of Cathepsin S, a target which has been the focus of several drug discovery campaigns for some years. You can copy-paste its sequence from below:

ILPDSVDWREKGCVTEVKYQGSCGASWAFSAVGALEAQLKLKTGKLVSLSAQNLVDCSTEKYGNKGCNGGFMTTAFQYIIDNKGIDSDASYPYKAMDQKCQYDSKYRAATCSKYTELPYGREDVLKEAVANKGPVSVGVDARHPSFFLYRSGVYYEPSCTQNVNHGVLVVGYGDLNGKEYWLVKNSWGHNFGEEGYIRMARNKGNHCGIASFPSYPEILQGGG

Simply copy-paste your protein sequence into the text area.
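If you prepare sequences outside the browser, it can help to check them before pasting. The snippet below is a minimal sketch, not part of AceProfiler: the helper name clean_sequence is hypothetical, and it assumes you only want the 20 standard one-letter amino acid codes (whether AceProfiler accepts other letters is not covered here).

```python
# Hypothetical helper, not part of AceProfiler: tidy a sequence before pasting it.
# Assumption: only the 20 standard one-letter amino acid codes are wanted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Drop FASTA header lines and whitespace, upper-case, and reject unknown letters."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(">"):  # FASTA header, not part of the sequence
            continue
        kept.append(line)
    sequence = "".join("".join(kept).split()).upper()
    unknown = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Non-standard residue letters: {unknown}")
    return sequence


if __name__ == "__main__":
    # First residues of the Cathepsin S sequence above, with stray formatting.
    print(clean_sequence(">example\nilpd svdw\nREKG\n"))
```

Running the same function on the full Cathepsin S sequence returns it unchanged, since it only contains standard letters.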

After a couple of minutes, you will obtain several structures aligned to the same reference. Some of them might contain bound ligands, although the results can vary from run to run.

Results for the Cathepsin S sequence example

In our case, we can see several structures with bound ligands. Most of them share a conserved interaction, in which a hydrophobic ring system is stacked between two Phe residues. From these results, one could launch a docking or virtual screening campaign with a hydrophobic restraint at that position (learn how to do it with AceDock).

Please note that the free online version of AceProfiler is limited to 10 hits, while the private version returns all the hits found and automatically prepares the proteins and their bound ligands. Contact us to get a license for a private instance!

Next, let’s try starting the search from a PDB code: 3PTB, which contains the trypsin-benzamidine complex. Simply type this PDB ID into the first field in the right column and the complex will appear on the screen. Provide a name for the job (optional) and click the submit button. If the structure for that PDB ID contains several segments, you must specify which one you want to use. The sequence of that segment is the one that will be used to search for homologs.
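If you want to see in advance which sequence a given part of a structure would contribute, you can read it from a local copy of the PDB file. The following is a rough sketch under stated assumptions: chain_sequences is a hypothetical helper, it groups residues by the chain ID column of the standard fixed-width PDB format, and AceProfiler's notion of a "segment" may be defined differently.

```python
# Hypothetical helper, not part of AceProfiler.
# Assumption: residues are grouped by the PDB chain ID column (column 22),
# which may not match how AceProfiler defines a "segment".
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines):
    """Return {chain_id: one-letter sequence}, one letter per residue with a CA atom."""
    sequences = {}
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":  # atom name, columns 13-16
            continue
        chain_id = line[21]  # column 22
        residue_key = (chain_id, line[22:27])  # residue number + insertion code
        if residue_key in seen:  # skip alternate locations of the same residue
            continue
        seen.add(residue_key)
        residue_name = line[17:20].strip()  # columns 18-20
        letter = THREE_TO_ONE.get(residue_name, "X")  # X marks non-standard residues
        sequences[chain_id] = sequences.get(chain_id, "") + letter
    return sequences


# Usage, assuming a local file named 3ptb.pdb:
#     with open("3ptb.pdb") as handle:
#         for chain_id, sequence in chain_sequences(handle).items():
#             print(chain_id, len(sequence), sequence)
```

Only residues with resolved coordinates appear in the result, so the sequence can be shorter than the one deposited for the entry.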

Input view.

Our results look like this:

Results for 3PTB

Again, you can see the conserved binding mode. Also note that, for each structure, we provide the align score (which roughly indicates how good the overlap is between the reference structure and the found hits) and the BLAST score.

Finally, we can input a PDB file. This can be particularly convenient as we can provide the chunk of the protein we are interested in, and align the hits to this particular chunk. Let’s see an example from a GPCR. You can find the input PDB file in the examples tab.

Visit the Examples tab to get the PDB file.

Upload that PDB file to AceProfiler and submit!

Input view after uploading the PDB file

After submission, several structures are found and aligned, obtaining a view like this:

You can also check the nice overlap between our input structure and some of the hits:

The green helix is the structure in our input PDB file. The pink structure is the aligned hit found by AceProfiler.

Notice our input structure in green, and how well it overlaps with the hit (pink). Keep in mind that you need to provide chunks of meaningful size (more than 10 residues is a good rule of thumb), because otherwise there will be too many nonspecific hits.

Finally, if you download the results, you will find a few files:

Inside the raw_pdbs folder, you will find one PDB file for each hit and one for the reference structure to which the others were aligned. The align_blast*.csv files contain the results of the BLAST search. The MSA_like.csv file contains the aligned sequences of all the hits, so you can quickly identify conserved residues. Note, though, that this is not the result of a multiple sequence alignment per se, but it is good enough to gain a quick understanding of the variability in the sequences. From that alignment, we generate the MSA_Bfactor.pdb file, which has the degree of residue conservation mapped to the B-factor column of the PDB, so you can easily inspect the degree of conservation of each region visually. The query.fasta file simply contains the sequence that was passed to BLAST during the execution of AceProfiler.
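Besides coloring by B-factor in a molecular viewer, you can read the per-residue values from MSA_Bfactor.pdb with a few lines of Python. This is a minimal sketch, assuming the file follows the standard fixed-width PDB layout (B-factor in columns 61-66); the helper names are hypothetical, and the scale and direction of the conservation values are not documented here, so check them against residues you already know to be conserved.

```python
# Hypothetical helpers, not part of AceProfiler.
# Assumption: standard fixed-width PDB columns, with the conservation value
# stored in the B-factor field (columns 61-66) of every atom of a residue.


def residue_bfactors(pdb_lines):
    """Return {(chain_id, residue_label): B-factor}, using the first atom of each residue."""
    values = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        key = (line[21], line[22:27].strip())  # chain ID, residue number + insertion code
        if key not in values:
            values[key] = float(line[60:66])
    return values


def extreme_residues(values, count=5, highest=True):
    """Return the `count` residues with the highest (or lowest) stored value."""
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=highest)
    return ranked[:count]


# Usage, assuming the downloaded results were unpacked in the current folder:
#     with open("MSA_Bfactor.pdb") as handle:
#         values = residue_bfactors(handle)
#     for (chain_id, residue), value in extreme_residues(values):
#         print(chain_id, residue, value)
```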

That’s all! Let us know your thoughts in the comments!
