protPy — a Python package for protein physicochemical, biochemical and structural descriptors

AJ McKenna
8 min read · Nov 27, 2023

Introduction

protPy is a Python software package for generating a variety of physicochemical, biochemical and structural descriptors for protein sequences. All of these descriptors are calculated from sequence-derived or physicochemical features of the amino acids that make up the proteins [1, 2]. To date, protPy offers 12 primary descriptors. Each is abbreviated for simplicity and explained in more detail in the Background section below. The supported descriptors are: Amino Acid Composition (AAComp), Dipeptide Composition (DPComp), Tripeptide Composition (TPComp), Pseudo Amino Acid Composition (PAAComp), Amphiphilic Amino Acid Composition (APAAComp), MoreauBroto Autocorrelation (MBAuto), Moran Autocorrelation (MAuto), Geary Autocorrelation (GAuto), Conjoint Triad (CTriad), Composition, Transition, Distribution (CTD), Sequence Order Coupling Number (SOCN) and Quasi Sequence Order (QSO).

This software is aimed at any researcher or developer working with protein sequence or structural data. I mainly created it for use in my own project, pySAR, which uses Machine Learning (ML) to map protein sequence data to a desired characteristic, fitness or activity, a mapping known as a Sequence Activity Relationship (SAR) [3]. protPy was developed in Python 3.10.

The source code for protPy is open source and available on GitHub [1], and a demo covering use cases and various usage examples is also available.

Background

Proteins are an essential part of organic life, being involved in virtually every known biological and cellular process. They take part in replicating and transcribing DNA, controlling cell division, metabolism, enzyme activity, cell signalling and ligand binding, among many other processes [4]. Extracting useful properties or descriptors from proteins is of great interest to researchers in proteomics, bioinformatics and protein engineering. Many sequence-derived structural and physicochemical descriptors have been assembled in the literature. These descriptors have been widely studied and used in a variety of applications, including directed evolution [5], protein structure and functional class prediction [6], protein subcellular location prediction [7], drug-target interactions [8] and sequence activity relationships [3].

Each of the aforementioned descriptors falls into one of the following categories: composition, autocorrelation, conjoint triad, CTD or sequence order. Each is explained in more detail below:

  • AAComp: the proportion of each amino acid type within a protein sequence [9].
  • DPComp: the proportion of each dipeptide type within a protein sequence [9].
  • TPComp: the proportion of each tripeptide type within a protein sequence [9].
  • PAAComp: combines the vanilla AAComp descriptor with additional local features, such as the correlation between residues a certain distance apart, because AAComp does not take sequence order into account. The pseudo components of the descriptor are a series of rank-different correlation factors. The first 20 components are a weighted sum of the amino acid composition and the next 30 are physicochemical square correlations, as dictated by the lambda and properties parameters [10, 11, 12].
  • APAAComp: has the same form as the AAComp descriptor, but contains much more information that is related to the sequence order of a protein and the distribution of the hydrophobic and hydrophilic amino acids along its chain. The first 20 numbers in the descriptor are the components of the conventional amino acid composition; the next 2*λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain [10, 11].
  • MBAuto: autocorrelation descriptors are a class of topological descriptors, also known as molecular connectivity indices, that describe the level of correlation between two objects (protein or peptide sequences) in terms of their specific structural or physicochemical properties, which are defined based on the distribution of amino acid properties along the sequence. MBAuto uses the property values themselves as the basis for measurement. Each autocorrelation descriptor generates a number of features equal to the lag multiplied by the number of properties supplied [2, 13].
  • MAuto: (autocorrelation description as above). MAuto utilises property deviations from the average values [2, 13].
  • GAuto: (autocorrelation description as above). GAuto utilises the square-difference of property values instead of vector-products (of property values or deviations) [2, 13].
  • CTriad: mainly considers neighbour relationships in protein sequences by encoding each sequence as the frequency distribution of triads (three contiguous amino acids) over a 7-letter reduced alphabet. This descriptor calculates 343 different features (7 x 7 x 7), so the output for a sequence is of shape 1 x 343 [11].
  • CTD: Composition is the number of amino acids of a particular property class divided by the total number of amino acids; it generates 3 features per sequence, so the output is of shape 1 x 3. Transition is the number of transitions from one property class to a different one, divided by (total number of amino acids − 1); it also generates 3 features per sequence, giving an output of shape 1 x 3. Distribution describes the positions along the chain at which the first, 25%, 50%, 75% and 100% of the amino acids of each property class are located, which gives 5 values per class and 15 per property [2, 9].
  • SOCN: computes the dissimilarity between amino acid pairs. The two residues in a pair are separated by a distance d, which ranges from 1 to lag. For each d, the descriptor sums the dissimilarities of all amino acid pairs separated by d. The number of output features is N, where N = lag; the default lag of 30 gives an output of shape 1 x 30 [12, 14, 15, 16].
  • QSO: derived from the distance matrix between the 20 amino acids. By default, the Schneider-Wrede physicochemical distance matrix is used; the Grantham chemical distance matrix is also available. Both of these matrices are used by Grantham et al. in the calculation of the descriptor. N + 20 values are calculated per sequence, where N is the lag; with the default lag of 30, the output is of shape 1 x 50. A weighting factor (weight, 0.1 by default) can also be set to control the weight given to the sequence-order terms [12, 14, 15, 16].
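
The output sizes quoted above follow from simple arithmetic, which can be handy when checking the shape of a feature matrix. The helper below is only a sketch: descriptor_dimensions is a hypothetical name that is not part of protPy, the defaults (a lag of 30, 8 properties) are taken from the descriptions and examples in this article, and the PAAComp size of 20 + lambda is my assumption based on the description above.

```python
def descriptor_dimensions(lag=30, n_properties=8, lamda=30):
    """Expected number of features per sequence (hypothetical helper, not protPy API)."""
    autocorrelation = lag * n_properties  # one value per lag per property
    return {
        "AAComp": 20,            # one value per amino acid
        "DPComp": 20 ** 2,       # every ordered pair of amino acids
        "TPComp": 20 ** 3,       # every ordered triple of amino acids
        "PAAComp": 20 + lamda,   # assumption: 20 composition terms + lambda correlation terms
        "APAAComp": 20 + 2 * lamda,
        "MBAuto": autocorrelation,
        "MAuto": autocorrelation,
        "GAuto": autocorrelation,
        "CTriad": 7 ** 3,        # triads over a 7-letter reduced alphabet
        "SOCN": lag,
        "QSO": 20 + lag,
    }


dims = descriptor_dimensions()
print(dims["CTriad"], dims["QSO"], dims["MBAuto"])  # 343 50 240
```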

Installation

protPy can be installed using the pip package manager, and requires the external Python libraries numpy, pandas and varname:

pip install protpy

Usage examples

Each protein descriptor function accepts a single protein sequence string as input. Protein sequence data is most commonly stored in the FASTA format; to read a protein sequence from a FASTA file (here using Biopython's SeqIO module):

from Bio import SeqIO

with open("test_fasta.fasta") as pro:
    # take the first record in the file and convert its sequence to a plain string
    protein_seq = str(next(SeqIO.parse(pro, 'fasta')).seq)
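
If Biopython is not installed, a small pure-Python reader is enough for simple FASTA files. This is a minimal sketch rather than a replacement for SeqIO: read_first_fasta_sequence is a hypothetical helper of my own, and it assumes a well-formed file where each record starts with a ">" header line.

```python
import io


def read_first_fasta_sequence(handle):
    """Return the first sequence in a FASTA-formatted file handle as one string."""
    seq_lines = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record has started, so stop reading
            seen_header = True
            continue
        if seen_header:
            seq_lines.append(line)  # sequences may be wrapped over several lines
    return "".join(seq_lines).upper()


# in-memory example; with a real file use: with open("test_fasta.fasta") as pro: ...
example = io.StringIO(">seq1 example protein\nMKTAYIAK\nQRQISFVK\n>seq2\nGGGG\n")
protein_seq = read_first_fasta_sequence(example)
print(protein_seq)  # MKTAYIAKQRQISFVK
```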

Calculate Amino Acid Composition:

import protpy as protpy

amino_acid_composition = protpy.amino_acid_composition(protein_seq)
# A C D E F ...
# 6.693 3.108 5.817 3.347 6.614 ...
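
To make the descriptor concrete, here is a from-scratch sketch of amino acid composition. It is not protPy's implementation: aa_composition is a hypothetical function, and I am assuming the values are percentages rounded to 3 decimal places, which is what the sample output above suggests.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aa_composition(sequence):
    """Percentage of each standard amino acid in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return {aa: round(100 * seq.count(aa) / n, 3) for aa in AMINO_ACIDS}


comp = aa_composition("ACDAAC")
print(comp["A"], comp["C"], comp["D"])  # 50.0 33.333 16.667
```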

Calculate Dipeptide Composition:

import protpy as protpy

dipeptide_composition = protpy.dipeptide_composition(protein_seq)
# AA AC AD AE AF ...
# 0.72 0.16 0.48 0.4 0.24 ...

Calculate Tripeptide Composition:

import protpy as protpy

tripeptide_composition = protpy.tripeptide_composition(protein_seq)
# AAA AAC AAD AAE AAF ...
# 1 0 0 2 0 ...
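
Dipeptide and tripeptide composition are both k-mer counts over a sliding window, with k = 2 and k = 3. The sketch below shows the idea with hypothetical helpers (kmer_counts and kmer_percentages are not protPy functions); dividing by the number of overlapping windows is my assumption about the normalisation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k):
    """Count overlapping k-mers; the result has one key for each of the 20**k k-mers."""
    seq = sequence.upper()
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing non-standard residues are skipped
            counts[kmer] += 1
    return counts


def kmer_percentages(sequence, k):
    """k-mer counts as a percentage of the number of overlapping windows."""
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = kmer_counts(sequence, k)
    return {kmer: round(100 * c / total, 2) for kmer, c in counts.items()}


di = kmer_counts("AAAC", 2)
print(len(di), di["AA"], di["AC"])  # 400 2 1
```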

Calculate MoreauBroto Autocorrelation:

import protpy as protpy

#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq)
# MBAuto_CIDH920105_1 MBAuto_CIDH920105_2 MBAuto_CIDH920105_3 MBAuto_CIDH920105_4 MBAuto_CIDH920105_5 ...
# -0.052 -0.104 -0.156 -0.208 0.246 ...

Calculate Moran Autocorrelation:

import protpy as protpy

#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
moran_autocorrelation = protpy.moran_autocorrelation(protein_seq)
# MAuto_CIDH920105_1 MAuto_CIDH920105_2 MAuto_CIDH920105_3 MAuto_CIDH920105_4 MAuto_CIDH920105_5 ...
# -0.07786 -0.07879 -0.07906 -0.08001 0.14911 ...

Calculate Geary Autocorrelation:

import protpy as protpy

#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
geary_autocorrelation = protpy.geary_autocorrelation(protein_seq)
# GAuto_CIDH920105_1 GAuto_CIDH920105_2 GAuto_CIDH920105_3 GAuto_CIDH920105_4 GAuto_CIDH920105_5 ...
# 1.057 1.077 1.04 1.02 1.013 ...
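
The three autocorrelation descriptors differ only in how they combine the property values of residues that are d positions apart. The sketch below uses the textbook formulas with a single property and is not protPy's code: the function names are hypothetical, the Kyte-Doolittle hydropathy scale is a stand-in for the AAindex properties listed above, and the property values are not standardised first, which real implementations often do.

```python
# Kyte-Doolittle hydropathy values, used only as an illustrative property scale
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def _values(sequence, lag, scale):
    p = [scale[aa] for aa in sequence.upper()]
    if not 0 < lag < len(p):
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    return p


def moreau_broto(sequence, lag, scale=KD):
    """Mean product of property values at distance d, for d = 1..lag."""
    p = _values(sequence, lag, scale)
    n = len(p)
    return [sum(p[i] * p[i + d] for i in range(n - d)) / (n - d)
            for d in range(1, lag + 1)]


def moran(sequence, lag, scale=KD):
    """Mean product of deviations from the average, divided by the variance."""
    p = _values(sequence, lag, scale)
    n = len(p)
    mean = sum(p) / n
    var = sum((x - mean) ** 2 for x in p) / n
    return [sum((p[i] - mean) * (p[i + d] - mean) for i in range(n - d)) / (n - d) / var
            for d in range(1, lag + 1)]


def geary(sequence, lag, scale=KD):
    """Mean squared difference of property values, divided by the sample variance."""
    p = _values(sequence, lag, scale)
    n = len(p)
    mean = sum(p) / n
    var = sum((x - mean) ** 2 for x in p) / (n - 1)
    return [sum((p[i] - p[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d)) / var
            for d in range(1, lag + 1)]


# an alternating sequence is anti-correlated at lag 1 and correlated at lag 2
print([round(v, 3) for v in moran("AIAIAI", 2)])  # [-1.0, 1.0]
```

With several properties, each function would be called once per property and the results concatenated, which is where the lag x number-of-properties feature count comes from.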

Calculate Conjoint Triad:

import protpy as protpy 

conjoint_triad = protpy.conjoint_triad(protein_seq)
# 111 112 113 114 115 ...
# 7 17 11 3 6 ...
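
As an illustration of how the 343 triad features arise, here is a small sketch. It is not protPy's implementation: conjoint_triad_counts is a hypothetical function, and the seven residue classes are the grouping I recall from Shen et al. [11], so treat them as an assumption and check them against the paper.

```python
from itertools import product

# assumed 7-class reduced alphabet (grouping by dipole and side-chain volume)
CLASSES = {"1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
           "5": "RK", "6": "DE", "7": "C"}
AA_TO_CLASS = {aa: label for label, members in CLASSES.items() for aa in members}


def conjoint_triad_counts(sequence):
    """Count each of the 7**3 = 343 class triads over a sliding window of 3 residues."""
    # a non-standard residue raises KeyError here
    encoded = [AA_TO_CLASS[aa] for aa in sequence.upper()]
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts["".join(encoded[i:i + 3])] += 1
    return counts


triads = conjoint_triad_counts("AGVC")  # encodes to 1, 1, 1, 7
print(len(triads), triads["111"], triads["117"])  # 343 1 1
```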

Calculate CTD:

import protpy as protpy

#using default parameters: property="hydrophobicity", all_ctd=True
ctd = protpy.ctd(protein_seq)
# hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 hydrophobicity_CTD_C_03 normalized_vdwv_CTD_C_01 ...
# 0.279 0.386 0.335 0.389 ...
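
The Composition and Transition parts of CTD are simple to compute once each residue is mapped to a property class. The sketch below does this for hydrophobicity. The function names are hypothetical and this is not protPy's code; the three-class grouping (polar, neutral, hydrophobic) is the one commonly used in the CTD literature as I recall it, so treat it as an assumption.

```python
# assumed hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_CLASS = {aa: label for label, members in HYDROPHOBICITY_CLASSES.items()
               for aa in members}


def _encode(sequence):
    return [AA_TO_CLASS[aa] for aa in sequence.upper()]


def ctd_composition(sequence):
    """Fraction of residues in each class (3 features)."""
    enc = _encode(sequence)
    return {label: enc.count(label) / len(enc) for label in "123"}


def ctd_transition(sequence):
    """Class changes between neighbouring residues, divided by (length - 1) (3 features)."""
    enc = _encode(sequence)
    pairs = {"12": 0, "13": 0, "23": 0}
    for a, b in zip(enc, enc[1:]):
        if a != b:
            pairs["".join(sorted(a + b))] += 1  # 1->2 and 2->1 count as the same transition
    return {pair: count / (len(enc) - 1) for pair, count in pairs.items()}


print(ctd_composition("RGCR"))  # {'1': 0.5, '2': 0.25, '3': 0.25}
```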

Calculate Sequence Order Coupling Numbers per distance matrix:

import protpy as protpy

#using default parameters: lag=30, distance_matrix="schneider-wrede"
socn_all = protpy.sequence_order_coupling_number(protein_seq)
# SOCN_SW1 SOCN_SW2 SOCN_SW3 SOCN_SW4 SOCN_SW5 ...
# 401.387 409.243 376.946 393.042 396.196 ...

#using custom parameters: lag=10, distance_matrix="grantham"
socn_all = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham")
# SOCN_Grant1 SOCN_Grant2 SOCN_Grant3 SOCN_Grant4 SOCN_Grant5 ...
# 399.125 402.153 387.820 393.111 409.096 ...

Calculate Quasi Sequence Order:

import protpy as protpy

#using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
qso = protpy.quasi_sequence_order(protein_seq)
# QSO_SW1 QSO_SW2 QSO_SW3 QSO_SW4 QSO_SW5 ...
# 0.005692 0.002643 0.004947 0.002846 0.005625 ...

#using custom parameters: lag=10, weight=0.2, distance_matrix="grantham"
qso = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham")
# QSO_Grant1 QSO_Grant2 QSO_Grant3 QSO_Grant4 QSO_Grant5 ...
# 0.123287 0.079967 0.04332 0.039983 0.013332 ...
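
To show how SOCN and QSO fit together, here is a sketch using the standard formulas: each coupling number sums squared distances between residues d positions apart, and QSO combines the amino acid frequencies with the weighted coupling numbers. It is not protPy's implementation. The function names are hypothetical, and toy_distance is a placeholder (0 for identical residues, 1 otherwise) standing in for the Schneider-Wrede or Grantham matrices, so the numbers are not comparable with the outputs above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_distance(a, b):
    """Placeholder distance, NOT a real physicochemical distance matrix."""
    return 0.0 if a == b else 1.0


def socn(sequence, lag, distance=toy_distance):
    """Sequence order coupling numbers for d = 1..lag (lag features)."""
    seq = sequence.upper()
    n = len(seq)
    return [sum(distance(seq[i], seq[i + d]) ** 2 for i in range(n - d))
            for d in range(1, lag + 1)]


def qso(sequence, lag, weight=0.1, distance=toy_distance):
    """Quasi sequence order: 20 composition terms followed by lag coupling terms."""
    seq = sequence.upper()
    taus = socn(seq, lag, distance)
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    denom = sum(freqs) + weight * sum(taus)  # shared normalising term
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


values = qso("AAC", lag=2)
print(socn("AAC", 2), len(values))  # [1.0, 1.0] 22
```

Because every term shares the same denominator, the 20 + lag values sum to 1, and a larger weight shifts more of that total towards the sequence-order terms.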

Conclusion

protPy is an ideal tool to use for any researcher or developer working with protein data at the sequence level. The package offers an abundance of highly studied protein descriptors that allow you to extrapolate useful and important information hidden within a protein sequence.

References

[1]: https://github.com/amckenna41/protpy

[2]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics 8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300

[3]: McKenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128, 104016. https://doi.org/10.1016/j.jbi.2022.104016

[4]: B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular biology of the cell, 4th ed., CRC Press, Boca Raton, FL, 2002.

[5]: T. Shafee, Evolvability of a viral protease: experimental evolution of catalysis, robustness and specificity. Apollo, University of Cambridge Repository, 04-Feb-2014.

[6]: C.Z. Cai, L.Y. Han, Z.L. Ji, X. Chen, Y.Z. Chen, SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence, Nucleic Acids Res. 31 (13) (2003) 3692–3697

[7]: J. Guo, Y. Lin, X. Liu, GNBSL: a new integrative system to predict the subcellular location for Gram-negative bacteria proteins, Proteomics 6 (19) (2006) 5099–5105.

[8]: P. Wang, X. Huang, W. Qiu, X. Xiao, Identifying GPCR-drug interaction based on wordbook learning from sequences, BMC Bioinf. 21 (1) (2020) 150.

[9]: Gromiha, M. M. (2010). Protein Sequence Analysis. In M. M. Gromiha (Ed.), Protein Bioinformatics (pp. 29–62). Elsevier.

[10]: Kuo-Chen Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics, Volume 21, Issue 1, January 2005, Pages 10–19, https://doi.org/10.1093/bioinformatics/bth466

[11]: J. Shen et al., “Predicting protein-protein interactions based only on sequences information,” Proc. Natl. Acad. Sci. U. S. A., vol. 104, no. 11, pp. 4337–4341, 2007.

[12]: Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Subcellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236–1239.

[13]: B. Hollas, “An analysis of the autocorrelation descriptor for molecules,” J. Math. Chem., vol. 33, no. 2, pp. 91–101, 2003.

[14]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477–483.

[15]: Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artificial Neural Networks and Simulated Molecular Evolution: De Novo Design of an Idealized Leader Peptidase Cleavage Site. Biophysical Journal, 1994, 66, 335–344.

[16]: Grantham, R. (1974-09-06). “Amino acid difference formula to help explain protein evolution”. Science. 185 (4154): 862–864. doi:10.1126/science.185.4154.862. ISSN 0036-8075. PMID 4843792.
