Protein Data Science: A Perspective from Dr. Christine Orengo

Sayane Shome
PLOS Comp Biol Field Reports Blog
Mar 3, 2017
Fig 1: Overview of the central dogma of molecular biology (Image credit: Daniel Horspool, Wikimedia Commons)

Proteins are an integral part of biological systems because they carry out the actual functional processes: facilitating metabolism, transporting molecules from one location to another, replicating DNA (the genetic material that passes biological information from one generation to the next), and transmitting signals in cells. A protein’s role is determined by its structural features (such as domains and motifs), which in turn are determined by its amino acid composition and the way it folds (Figure 2).

Fig 2: Amino acid composition at the protein sequence level determines protein structure and function (Image credit: Sayane Shome)

In today’s era of biological data, researchers seek to identify and annotate the structural features of proteins. These annotations help to determine which regions are crucial for biological processes and how anomalies (mutations) in these regions disrupt normal biological processes, causing disease and other biological phenomena of interest.

Dr. Christine Orengo (Image credit: Wikipedia)

In view of this, database repositories containing protein structure and sequence-based information are useful resources for computational biologists as well as experimental biologists.

Recently, I had the privilege of interviewing Dr. Christine Orengo, Professor of Bioinformatics at University College London (UCL). Dr. Orengo is well known for her extensive work on protein structures and databases, especially the CATH database, which is widely used to access structure-based information and annotations for proteins. She is a member of EMBO, vice-president of the ISCB, and currently serves as an Associate Editor for PLOS Computational Biology. In this interview, Dr. Orengo gives thoughtful insights into how and why her research group developed the database repositories Genome3D and CATH-Gene3D.

(Note: From here onwards, the interview questions are in bold and my input is in italics.)

1. Genome3D and CATH-Gene3D are well-known software resources developed by your lab group, in collaboration with other research groups, for structural prediction and annotation of protein sequences whose structures are unknown. How do their functionalities and applications differ?

Genome3D is a collaborative project between seven UK structure-based groups, which provides structural feature predictions for protein sequences from ten model genomes. By combining annotations from multiple resources, Genome3D builds a more informative consensus for the user.

CATH-Gene3D is a part of Genome3D and provides structural domain information for the sequences from the ten model genomes. The CATH resource identifies protein domains from 3D protein structures in the Protein Data Bank and classifies them into the CATH structural hierarchy. The latest official release of CATH (v4.1) assigns 308,999 protein domains to 2,737 homologous superfamilies. A daily update, CATH-B, is provided for users who want access to the very latest domain assignments. This contains around 417,000 protein domain assignments. Because structures are not known for every protein, and a protein without a known structure cannot be classified by CATH directly, Gene3D uses CATH structural domain HMMs (hidden Markov models) to predict protein domains in UniProtKB and Ensembl sequences.

The CATH-Gene3D and SCOP databases are at the centre of the methods used by the resources in Genome3D. Both of these databases provide information on the hierarchical classification of protein domain structures.

Based on CATH and/or SCOP domains, members of Genome3D provide:

a) Structural annotations, to map regions of protein sequence to structural domains,
b) Structural models, to map regions of protein sequence to 3D models, and
c) Consensus superfamilies, where structural domains have been mapped between CATH and SCOP to identify common superfamilies.
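The idea of combining domain annotations from several resources into a consensus can be illustrated with a small sketch. This is not Genome3D's actual procedure, which the interview does not describe in detail; it is a simple per-residue vote. The resource names, the `consensus_domains` function and the `min_support` parameter are hypothetical, and the superfamily identifiers are used only as example labels.

```python
from collections import Counter


def consensus_domains(annotations, seq_length, min_support=2):
    """Return (start, end, superfamily) regions backed by >= min_support resources.

    annotations maps a resource name to a list of (start, end, superfamily)
    tuples using 1-based, inclusive residue positions. Each resource is
    assumed to report non-overlapping domains.
    """
    # One vote counter per residue position (index 0 is unused).
    votes = [Counter() for _ in range(seq_length + 1)]
    for domains in annotations.values():
        for start, end, superfamily in domains:
            for pos in range(max(start, 1), min(end, seq_length) + 1):
                votes[pos][superfamily] += 1

    regions = []
    current = None  # (region_start, superfamily) of the region being extended
    for pos in range(1, seq_length + 1):
        label = None
        if votes[pos]:
            superfamily, count = votes[pos].most_common(1)[0]
            if count >= min_support:
                label = superfamily
        # Close the open region when the consensus label changes or disappears.
        if current is not None and label != current[1]:
            regions.append((current[0], pos - 1, current[1]))
            current = None
        if label is not None and current is None:
            current = (pos, label)
    if current is not None:
        regions.append((current[0], seq_length, current[1]))
    return regions


if __name__ == "__main__":
    # Hypothetical predictions from three resources for one 250-residue protein.
    example = {
        "resource_a": [(10, 100, "3.40.50.300")],
        "resource_b": [(20, 110, "3.40.50.300")],
        "resource_c": [(150, 200, "1.10.10.10")],
    }
    print(consensus_domains(example, seq_length=250))
```

In this toy input only the region where two resources overlap on the same superfamily is reported; the third prediction is dropped because no other resource supports it.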

In the past decade, whole-genome sequencing initiatives carried out for various organisms, including the Human Genome Project, revealed that the majority of genetic composition is similar across species. In humans, around 99% of the genome (the entire DNA composition of an organism) is exactly the same across the entire population. The characteristic differences in physical features and traits from individual to individual arise from variations in the remaining 1%. These variations are either single-nucleotide changes (single nucleotide polymorphisms, SNPs) or insertions and deletions (INDELS) of nucleotide patterns in the DNA of an individual. Variations at the genomic level are carried through to the protein level, following the central dogma of molecular biology (Figure 1). Hence, studying these variations in the genome and their effect at the protein level helps to explain individual-specific features (for example, an increased risk of cancer, obesity, or blue-colored eyes). Recently, researchers have been working to determine these specific genetic patterns and their implications at the protein level in the organism.

2. Could you share with the readers how Genome3D helps biologists to determine whether genetic variations such as non-synonymous SNPs (single nucleotide polymorphisms) are deleterious for protein structure and function?

Protein structure data is very important for understanding a protein’s function. For example, looking at 3D structural data can illustrate how highly conserved residues involved in catalysis cluster at an enzyme active site. The structure data and the 3D models provided by Genome3D can also be used to explore the impacts of genetic variants identified by next-generation sequencing projects. For example, non-synonymous single nucleotide polymorphisms (nsSNPs) and alternative splice variants can affect protein structure and function. Therefore, if we check whether these mutations occur at, or close to, structurally conserved regions (that is, regions where fewer variations are observed) and/or regions of functional importance such as enzyme active sites, we can explore the effects of these mutations and find out how damaging they are.
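The check described above, whether a variant lies at or close to a functionally important site in a 3D structure or model, can be sketched in a few lines. This is a minimal illustration, not Genome3D's method. It assumes one alpha-carbon coordinate per residue has already been extracted from a structure or model; the `variants_near_sites` function, the 8 angstrom cutoff and the coordinates in the usage example are all assumptions made for the example.

```python
import math


def variants_near_sites(variant_positions, site_positions, ca_coords, cutoff=8.0):
    """Map each variant residue to the functional-site residues within `cutoff`.

    ca_coords maps a residue number to an (x, y, z) alpha-carbon coordinate
    in angstroms. Residues missing from ca_coords (for example, regions not
    covered by the model) are skipped.
    """
    hits = {}
    for variant in variant_positions:
        if variant not in ca_coords:
            continue
        close_sites = [
            site
            for site in site_positions
            if site in ca_coords
            and math.dist(ca_coords[variant], ca_coords[site]) <= cutoff
        ]
        if close_sites:
            hits[variant] = sorted(close_sites)
    return hits


if __name__ == "__main__":
    # Made-up coordinates: residue 10 is far from residue 50 in sequence
    # but close to it in space, which a sequence-only check would miss.
    coords = {
        10: (0.0, 0.0, 0.0),
        50: (5.0, 0.0, 0.0),
        90: (30.0, 0.0, 0.0),
    }
    print(variants_near_sites([10, 90, 120], [50], coords))
```

Measuring distance in 3D rather than along the sequence reflects the point made in the answer: residues that are distant in sequence can cluster together at an active site once the protein is folded.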

3. Can the consequences of other genomic variants such as INDELS (Insertions and Deletions) and repeats in sequences be predicted by the Genome3D software?

In Genome3D, partners are able to submit two types of structural prediction: domain assignments (i.e. the location of a structural domain) and full 3D models (i.e. what the predicted structure actually looks like). The 3D models from methods such as PHYRE2, THREADER and FUGUE are based on alignments to known structures and therefore take into account evolutionary mechanisms such as insertions and deletions (INDELS).

4. Could you detail the new additions in CATH-Gene3D software and the future prospects of expansion for CATH-Gene3D?

● FunFams (functional families)

● Latest version of CATH recently released

● CATH-B

● New web pages — home page has more information for users, sequence search updated

● 3D models in Gene3D

The latest release of CATH (v4.1) comprises over 300,000 domain structures and 53 million protein domains classified into 2,737 homologous superfamilies. In addition, CATH-B, which is updated daily, provides our very latest domain assignment data.

The expansion of CATH superfamilies with increasing amounts of sequence data and functional data has revealed that the universal and most highly populated protein superfamilies can incorporate a large amount of structural and functional diversity. This prompted the development of automated protocols to sub-classify the superfamilies into FunFams (functional families), which aim to group together relatives that are likely to share the same function. Therefore, if a region of protein sequence provides a highly significant match to a particular CATH FunFam, then there is a good chance that it shares the function of that family. The FunFams also allow for comparison of functional sites between relatives across a superfamily, and give insights into the evolutionary mechanisms underlying shifts in function.

Our function prediction pipeline based on these FunFams was ranked highly for accuracy of function prediction in the recent international function prediction competition (CAFA 2). The conserved residues in FunFam alignments are also significantly enriched in known functionally important residues. The CATH ‘search by sequence’ web server provides fast, domain-based functional annotations for sequences based on FunFam assignments. The CATH webpages also show the highly conserved residues in each FunFam highlighted on a representative 3D structure for that FunFam. This comprehensive functional classification of domain sequences, linked to structural data, is of great importance for identifying functional sites in clinically or industrially important proteins, and for understanding the effect of mutations in disease-causing proteins, such as those involved in cancer and heart disease.
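To make the notion of "conserved residues in an alignment" concrete, here is a minimal sketch that scores each column of a multiple sequence alignment. It uses Shannon entropy, which is one common conservation measure; the interview does not say which measure is used for FunFams, so this should not be read as the CATH method. The `column_conservation` function and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0 (variable) to 1 (fully conserved).

    alignment is a list of equal-length aligned sequences in which '-'
    marks a gap. Gaps are ignored; an all-gap column scores 0.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # entropy of 20 equally likely amino acids
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    # Toy alignment: the first column is invariant, the last is fully variable.
    toy_alignment = ["GKST", "GKTA", "GR-C"]
    for index, score in enumerate(column_conservation(toy_alignment), start=1):
        print(index, round(score, 3))
```

Columns with high scores are candidates for functionally important positions, which could then be highlighted on a representative structure in the way the CATH webpages are described as doing.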

5. Please share your thoughts on how research into protein structure and function annotation will be useful for practical applications in future.

Knowledge of the protein repertoire is expanding rapidly as the international genomics initiatives continue. We now know the sequences, and in some cases the structures, of proteins from many important model organisms. However, less than 1% of these proteins have been experimentally characterized (UniProtKB, 2016). In addition, metagenomics initiatives are identifying millions of bacterial proteins from communities living in human hosts, such as the gut microbiome, to understand their importance for human health. Clinical data is also accumulating, which links genetic variations in human proteins with disease. To make sense of all this data, we need to predict the structures and functions of the proteins identified and determine the location of the residue sites essential for these functions. The site data will be particularly valuable in a clinical context, where residue mutations are being analyzed to understand their impact on particular diseases. Thus, research towards improving the structural coverage of proteins by building more, and better quality, 3D models and improving the accuracy of protein function predictions will be essential for annotating the ever-increasing wealth of biological data.

The study of proteins at the structural and sequence level is essential, as these macromolecules are responsible for the physical composition of living organisms and the related biological processes. Any change at the amino acid sequence (protein) level, whether due to mutations or nucleotide variations at the genetic level, can affect the protein at a structural level, which in turn affects its biological function. Hence, as biological data is generated via sequencing, crystallography, and NMR techniques, it becomes equally important to annotate this data and determine its importance for the sustenance of living organisms. For this, the development of web repositories and tools is essential, and it facilitates our voyage to explore and understand biological systems.

Disclaimer: The author would like to thank Dr. Orengo for her input and PLOS Computational Biology for facilitating the interview process. Any views expressed are those of the author, not necessarily those of PLOS.

Sayane Shome

Graduate student, BCB program, Iowa State University; RSG Committee Chair and Executive team member, ISCB-SC.