CombFunc - Help

CombFunc

Protein Function Prediction Server

Help

This page explains how to use CombFunc with details on the submission options and how the results page should be interpreted. If you would like further details about the methods used see the About page. Multiple examples of output from CombFunc can be viewed on the Examples page.

Submission Options

All that is required to run CombFunc is a protein sequence to use as a query, which is entered in the text area on the submission page. To run the features that are not sequence based, the UniProt accession of the query must also be supplied; without it, the Gene Co-expression and Protein-Protein interaction analyses are not performed. The remaining options allow the user to specify an email address to which results can be forwarded and to assign a description to the submission.

The sequence should be submitted in FASTA format:
>Description
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQ
YMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIP
YIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS

or just the amino acid code:
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQ
YMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIP
YIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS
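
For users preparing input programmatically, the two accepted forms can be reduced to a single clean sequence before submission. The following is a minimal sketch, not CombFunc's own validation: the function name parse_query and the restriction to the 20 standard amino acid letters are assumptions made for illustration.

```python
# Assumption: only the 20 standard amino acid letters are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_query(text):
    """Return (description, sequence) from FASTA or bare-sequence input."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    description = ""
    if lines and lines[0].startswith(">"):
        # A FASTA header line: everything after '>' is the description.
        description = lines[0][1:].strip()
        lines = lines[1:]
    # Join wrapped lines and drop any whitespace inside a line.
    sequence = "".join("".join(ln.split()) for ln in lines).upper()
    if not sequence:
        raise ValueError("no sequence found")
    unknown = set(sequence) - VALID_RESIDUES
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return description, sequence


description, sequence = parse_query(">Description\nMTEYK LVVVG\nAGGVG")
print(description, sequence)
```

Both input forms yield the same sequence string; only the description differs (it is empty when no header line is given).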

Submission Progress

Each submission runs multiple processes, so it may take up to an hour to complete. It is therefore advisable, although not essential, to enter an email address when making a submission. While the job is running, the progress is updated so that the user can see how many of the processes have completed. The progress table is shown to the right.

The key at the top explains the colouring: completed processes are coloured blue and running processes are coloured purple. If a process is offline, it is coloured red. If a UniProt accession has not been submitted with the sequence, the processes that are not run are coloured grey.

Combined prediction refers to the final CombFunc process of making a combined prediction using data from each of the other processes.

Interpreting CombFunc Predictions

Overall Function Predictions

After the multiple processes that form CombFunc have run, their results are combined to give an overall function prediction. Further details of how the data from the different processes are combined can be found in About. In CombFunc the predictions are split into the two Gene Ontology (GO) categories used: Molecular Function (which describes the biochemical function of the protein) and Biological Process (which describes the larger scale processes that the protein is part of). The predictions for these two categories are displayed separately in tables, and also as graphs that allow the user to explore how the predicted functions are related within the Gene Ontology graph. Each of these displays is explained below. The results displayed on this page can be viewed in full here.

Results Table

The results tables display the Gene Ontology functions that have been predicted for the query sequence. The first column displays the GO term that has been predicted, and the description of the GO term is displayed in the Description field. Passing the mouse over a row displays the definition of the GO term to the right of the table. Additionally, the blue GO symbol next to the term and the description links to the Gene Ontology page for the function at geneontology.org. The third column displays the number of SVMs (out of 10) that predicted the term to be an annotation of the sequence. The average probability score from the SVMs is displayed in the fourth column and can be used as an indicator of the confidence of the predicted term. The SVM probability is colour coded to indicate the level of confidence: red predictions have the highest confidence and yellow the lowest. This colour coding is used in the other displays for the combined predictions and also for each of the individual sets of data (see below).
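
The third and fourth columns can be thought of as a vote count and a mean. The following sketch illustrates that idea only; it is not CombFunc's implementation. The 0.5 decision threshold, the choice to average over all ten SVMs rather than only those that voted for the term, and the probability values are assumptions.

```python
from statistics import mean


def summarise_svm_votes(probabilities, threshold=0.5):
    """Return (number of SVMs predicting the term, average probability).

    Assumption: an SVM "predicts" the term when its probability is at
    least `threshold`, and the average is taken over all SVMs.
    """
    votes = sum(1 for p in probabilities if p >= threshold)
    return votes, mean(probabilities)


# Hypothetical probabilities from the 10 SVMs for a single GO term.
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.95, 0.85, 0.2]
votes, average = summarise_svm_votes(probs)
print(f"{votes}/10 SVMs, average probability {average:.3f}")
```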

The image view displays the predicted functions as a sub-graph of the Gene Ontology. This enables the user to see how the different predicted terms are related. These images can become very large, so it is possible to zoom in on different areas of the graph, either using the mouse or using the controls in the bottom right corner, which set the zoom level and move the image within the display. Again, all of the predicted terms are coloured according to the confidence of their prediction. For clarity, parent terms are not coloured and descendant terms of the predicted terms are not displayed.

The list view displays similar information to the graph view but in a more compact way. Each of the predicted terms is displayed (coloured according to confidence of prediction). For each predicted term it is then possible to expand the list and view the parent terms of the prediction. The buttons at the top enable the complete list to be expanded using the "Expand All" button or collapsed using the "Collapse All" button.

The image on the right shows the list partially expanded to display the parent terms of GTPase activity and GTP binding.

The Gene Ontology pages for each function can be viewed by clicking on the blue GO next to the function description.
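
Expanding a predicted term in the list view amounts to collecting its parent terms in the ontology graph. The sketch below shows that traversal on a toy hierarchy; the term names are placeholders rather than real GO terms, and the dictionary layout is an assumption made for illustration.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy hierarchy (placeholder names, not real GO terms). A term may have
# more than one parent, which is why the ontology is a graph, not a tree.
toy_parents = {
    "predicted_term": ["mid_a", "mid_b"],
    "mid_a": ["root"],
    "mid_b": ["root"],
}
print(sorted(ancestors("predicted_term", toy_parents)))
```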

Individual Analyses

Data from each of the individual analyses are displayed below the overall predictions. This enables the user to explore the data that were used to make the prediction. As with the overall predictions, the scores associated with the different functions identified are colour coded to give an indication of the strength of the data in support of the function. Some of the analyses generate considerable data, so the data from each analysis are hidden by default and can be displayed by clicking the link adjacent to its heading. The data displayed for each analysis are explained below.

ConFunc Analysis
ConFunc is a sequence based method developed in-house [1]. The functions identified by ConFunc are displayed with the score calculated by ConFunc. The score columns are explained below.

Z Score - For each function a Z score is calculated to obtain the significance of the result for that function.

Z score ratio - The Z scores for each function are compared to the maximum Z score obtained for the functions present for the sequence. In this example the "protein binding" function had the highest Z score and so the Z score ratio is calculated by dividing the other Z scores by the "protein binding" Z score.

NumberSequences - The number of sequences homologous to the query that are identified by ConFunc and used for making the prediction.
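
The Z score ratio described above is a simple normalisation by the largest Z score. A minimal sketch follows; the Z score values are hypothetical and the function name is chosen for illustration.

```python
def z_score_ratios(z_scores):
    """Divide each function's Z score by the maximum Z score observed."""
    top = max(z_scores.values())
    if top <= 0:
        raise ValueError("the maximum Z score must be positive")
    return {term: z / top for term, z in z_scores.items()}


# Hypothetical Z scores; "protein binding" has the highest value, so its
# ratio is 1.0 and the others are expressed relative to it.
z_scores = {"protein binding": 12.0, "GTP binding": 9.0, "GTPase activity": 6.0}
ratios = z_score_ratios(z_scores)
print(ratios)
```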

BLAST Analysis
BLAST [2] is a widely used program for identifying sequences homologous to a query sequence. BLAST is often used to make a very simple annotation transfer, in which the annotation of the top hit is transferred to the query sequence. This can work well when the query and the hit are very similar (e.g. greater than 85-90% sequence identity) [3].

The BLAST analysis results display the top 3 GO annotated sequences for both the GO Molecular Function and Biological Process categories. The details of the columns are described below:

Hit Acc - The UniProt accession of the BLAST hit. This is also a link to the UniProt page for this protein.

e-value - The BLAST e-value for the hit. The lower the e-value the more significant the match between the two sequences is.

%seq id - The sequence identity between the query and the hit sequence.

Query coverage - The percentage of the query sequence that is aligned with the hit sequence.

hit coverage - The percentage of the hit sequence that is aligned with the query sequence.

The final two columns list the Molecular Function and Biological process annotations of the hit respectively.
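
The simple top-hit annotation transfer mentioned above can be sketched as follows. This is an illustration only, not what CombFunc does with the BLAST data: the BlastHit record, the accessions, the numeric values and the 85% cut-off (taken from the lower end of the range quoted above) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BlastHit:
    accession: str         # Hit Acc
    e_value: float         # e-value (lower is more significant)
    seq_identity: float    # %seq id
    query_coverage: float  # Query coverage (%)
    hit_coverage: float    # hit coverage (%)
    go_terms: list = field(default_factory=list)


def transfer_from_top_hit(hits, min_identity=85.0):
    """Return the GO terms of the most significant hit, or [] if the hit
    is not similar enough for a simple transfer to be trusted."""
    if not hits:
        return []
    top = min(hits, key=lambda h: h.e_value)
    if top.seq_identity > min_identity:
        return list(top.go_terms)
    return []


# Hypothetical hits; accessions are placeholders.
hits = [
    BlastHit("HIT_B", 1e-40, 61.0, 95.0, 88.0, ["GO:0005525"]),
    BlastHit("HIT_A", 1e-90, 92.0, 100.0, 100.0, ["GO:0003924", "GO:0005525"]),
]
print(transfer_from_top_hit(hits))
```

Raising min_identity makes the transfer more conservative; with a cut-off of 95 the example above returns no annotations.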

InterPro Analysis
InterPro [4] is a resource for identifying the protein domains/families that a protein sequence belongs to. For this analysis the results are displayed both in tabular form and as an image. The table lists the InterPro domain hits for the query sequence. A mapping exists from InterPro hits to Gene Ontology functions. Where mappings exist for the InterPro hits, the functions are displayed. The columns are explained below:

Database - The source database for the hits (InterPro contains multiple different databases).

Accession - The accession for the domain (this accession refers to the source database listed in the first column).

start/finish - The start and finish positions on the query sequence for the domain hit.

e-value - The e-value for the domain hit to the query sequence. The lower the e-value, the more confident the hit. This is colour coded for ease of interpretation.

Interpro - If the domain has been mapped to a specific InterPro domain, its id is displayed here. The id is a link to the InterPro page for the domain, which has further details about the domain.

Desc - Description of the InterPro domain.

GO Terms - GO terms mapped to the InterPro domain (each links to Gene Ontology, where further information about the terms is available).

GO Function - Description of the GO term function.

Graphical View - Each of the InterPro hits is displayed along the length of the sequence to give an indication of where on the query sequence the domain hits occur. The hits are coloured according to their e-value.

Pfam Domain Combinations Analysis
A method [5] is available to predict GO functions based on the combination of Pfam domains present in the query sequence. The output of this method is displayed in this section. Predictions are not associated with a confidence score.

Phyre2 Fold Library Search
Phyre2 [6] is our in-house protein structure prediction server. In CombFunc we perform a search of the Phyre2 fold library, which is a set of all of the different protein folds present in the Protein Data Bank (PDB). This is done to identify whether there are solved structures homologous to the query sequence. As the fold library is much smaller than the full UniProt sequence database, it is possible to use a more sensitive sequence search program (HHsearch [7]). Hits to the fold library are displayed in a tabular format as shown on the right. The results table displays up to 100 hits. The columns are explained below.

Pdb - The PDB identifier for the structure. This is a link to the PDB page for the structure, where more information about the structure can be obtained.

chain - The chain in the PDB structure that is homologous to the query sequence.

Probability - The probability calculated by HHsearch that the 2 proteins are homologous (range 0-100). The column is colour coded to indicate confidence.

e-value - Calculated by HHsearch to indicate the significance of the match between the 2 sequences. The column is colour coded to indicate confidence.

Query range - The region of the query sequence that is aligned to the structure sequence.

Template range - The region of the structure sequence that is aligned to the query sequence.

Type - There are two types: PDB indicates a hit to a structure that is in the PDB, and SCOP indicates that the structure in the PDB has also been classified in the Structural Classification of Proteins [8]. A hit of type SCOP may be to only a single domain within a protein chain.

Family/Superfamily/Fold/Class - If the Type is SCOP, the details of the classification of the domain.

GO Annotation - GO annotation of the protein structure obtained from the GOA for PDB.

Protein-Protein interaction analysis
Proteins that interact with the query sequence are extracted from MINT [9] and IntAct [10] if the UniProt accession of the query is submitted. The analysis identifies how frequently GO functions occur in the directly interacting proteins. Indirect interactions are also identified (i.e. the proteins that interact directly with the direct interactors of the query protein) [11]. The columns in the table are described below.

Count direct - The number of direct interacting proteins that are annotated with the given GO function.

%direct - The percentage of the direct interacting proteins that are annotated with the GO function.

Count indirect - The number of indirect interacting proteins that are annotated with the GO function.

%indirect - The percentage of the indirect interacting proteins that are annotated with the GO function.
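
The four columns of the protein-protein interaction table can be derived from an interaction map and a set of GO annotations, as in the sketch below. This is an illustration under stated assumptions, not CombFunc's code: the data layout, the protein names and the choice to count a protein that is both a direct and an indirect interactor as direct only are all assumptions.

```python
def go_term_frequencies(query, interactions, annotations):
    """Return {GO term: (count direct, %direct, count indirect, %indirect)}."""
    direct = set(interactions.get(query, ())) - {query}
    indirect = set()
    for partner in direct:
        indirect |= set(interactions.get(partner, ()))
    # Assumption: direct interactors and the query itself are excluded
    # from the indirect set.
    indirect -= direct | {query}

    def tally(proteins):
        counts = {}
        for protein in proteins:
            for term in annotations.get(protein, ()):
                counts[term] = counts.get(term, 0) + 1
        return counts

    def pct(count, total):
        return 100.0 * count / total if total else 0.0

    direct_counts, indirect_counts = tally(direct), tally(indirect)
    rows = {}
    for term in sorted(set(direct_counts) | set(indirect_counts)):
        d = direct_counts.get(term, 0)
        i = indirect_counts.get(term, 0)
        rows[term] = (d, pct(d, len(direct)), i, pct(i, len(indirect)))
    return rows


# Hypothetical interaction map: Q is the query protein.
interactions = {
    "Q": {"A", "B"},
    "A": {"Q", "C"},
    "B": {"Q", "C", "D"},
}
annotations = {
    "A": {"GO:0005525"},
    "B": {"GO:0005525", "GO:0003924"},
    "C": {"GO:0003924"},
    "D": set(),
}
rows = go_term_frequencies("Q", interactions, annotations)
for term, row in rows.items():
    print(term, row)
```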

Gene Co-expression Analysis
If a UniProt accession is provided during submission, Gene Co-expression data are extracted from COXPRESdb [12] for the query protein. The columns in the table are explained below.

#CoExp - Number of co-expressed genes that are annotated with the function.

Av MR - Average Mutual Rank (see below) for the co-expressed genes.

min MR - Minimum Mutual Rank for the co-expressed genes.

max MR - Maximum Mutual Rank for the co-expressed genes.

Mutual Rank is a measure implemented in COXPRESdb to measure co-expression. The lower the value, the stronger the co-expression between the two genes.
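
As background, COXPRESdb defines the Mutual Rank of two genes as the geometric mean of the rank of each gene in the other's co-expression list. The sketch below computes that value and the per-function summary columns; the function names and the example values are hypothetical, and this is not CombFunc's own code.

```python
import math
from statistics import mean


def mutual_rank(rank_a_in_b, rank_b_in_a):
    """Geometric mean of the two directional co-expression ranks."""
    return math.sqrt(rank_a_in_b * rank_b_in_a)


def summarise_function(mr_values):
    """Build the table columns for one GO function from the Mutual Ranks
    of the co-expressed genes annotated with it."""
    return {
        "#CoExp": len(mr_values),
        "Av MR": mean(mr_values),
        "min MR": min(mr_values),
        "max MR": max(mr_values),
    }


# Hypothetical example: gene A is ranked 4th for gene B, and B 9th for A.
example_mr = mutual_rank(4, 9)
summary = summarise_function([2.0, example_mr, 10.0])
print(example_mr, summary)
```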

3DLigandSite Submission
Each CombFunc submission is also submitted to 3DLigandSite [13], our in-house ligand binding site prediction server. Users are therefore able to use the results from 3DLigandSite to consider possible binding sites on their protein. A link is provided to the 3DLigandSite results page of the submission.

References

1. Wass MN, Sternberg MJE (2008) ConFunc--functional annotation in the twilight zone. Bioinformatics 24:798–806.
2. Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
3. Devos D, Valencia A (2001) Intrinsic errors in genome annotation. Trends Genet 17:429–431.
4. Hunter S et al. (2012) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40:D306–D312.
5. Forslund K, Sonnhammer ELL (2008) Predicting protein function from domain content. Bioinformatics 24:1681–1687.
6. Kelley LA, Sternberg MJ (2009) Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 4:363–371.
7. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960.
8. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540.
9. Ceol A et al. (2010) MINT, the molecular interaction database: 2009 update. Nucl. Acids Res. 38:D532–539.
10. Kerrien S et al. (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Res. 40:D841–6.
11. Chua HN, Sung WK, Wong L (2006) Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 22:1623–1630.
12. Obayashi T, Kinoshita K (2011) COXPRESdb: a database to compare gene coexpression in seven model animals. Nucleic Acids Res. 39:D1016–22.
13. Wass MN, Kelley LA, Sternberg MJE (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res. 38:W469–W473.

Mark Wass