Wasslab - Computational Biology

Wass Group - Computational Biology
School of Biosciences
University of Kent
Kent, UK

Research

Research in the Wass group considers two main elements:

1) Development of computational methods to analyse and model biological data

2) Using computational biology to address important biological questions

1. Development of computational biology methods

Method development in the group has a basis in structural bioinformatics, often combined with machine learning. The advent of high-throughput technologies such as next-generation sequencing has resulted in large volumes of data that are not characterised. For example, the UniProt protein sequence database currently contains 148 million protein sequences, but for most of these proteins the structure and function are unknown. Similarly, with the sequencing of many individuals we now have extensive knowledge of the genetic variants that occur in people, but for most of these variants we do not know whether they have a functional effect.

Methods developed in the Wass group seek to address these problems: to enhance our understanding of protein structure and function, and to identify genetic variants that have functional effects and are linked to phenotypes (such as disease). The main projects in the group are listed below.

Modelling small molecule binding sites in proteins

Knowledge of the location of ligand binding sites (such as active sites or cofactor binding sites) is important to aid our understanding of proteins. We have developed 3DLigandSite to address this problem - Wass et al., 2010, Nucleic Acids Res, 38, W469–W473. Users can submit either a protein sequence or a structure to 3DLigandSite. Where a sequence is submitted, the first step is to model the protein structure using Phyre2. 3DLigandSite then identifies structures in the Protein Data Bank that are homologous to the query protein and have bound ligands. These ligands are superimposed onto the structural model of the query protein and used to predict the binding site. The method was developed based on our successful predictions in CASP8 (Critical Assessment of protein Structure Prediction) - Wass and Sternberg (2009) Proteins, 77 Suppl 9:147-51.
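
The final step described above, going from superimposed ligands to a predicted binding site, can be illustrated with a minimal Python sketch. It is not the 3DLigandSite implementation: it assumes the ligands have already been superimposed onto the model, and the function name, the 4.0 Å distance cutoff and the minimum number of contacting ligands are hypothetical choices made only for illustration.

```python
from math import dist


def predict_binding_site(residues, ligands, cutoff=4.0, min_ligands=2):
    """Return residue ids that lie close to several superimposed ligands.

    residues: dict mapping residue id -> list of (x, y, z) atom coordinates
              taken from the structural model of the query protein.
    ligands:  list of ligands, each a list of (x, y, z) atom coordinates,
              assumed to be already superimposed onto the model.
    cutoff and min_ligands are illustrative values, not published parameters.
    """
    predicted = []
    for res_id, res_atoms in residues.items():
        # Count how many ligands have at least one atom within the cutoff.
        n_contacting = sum(
            1
            for ligand in ligands
            if any(dist(a, l) <= cutoff for a in res_atoms for l in ligand)
        )
        if n_contacting >= min_ligands:
            predicted.append(res_id)
    return sorted(predicted)


# Toy coordinates (made up): residues 10 and 11 sit next to two ligands,
# residue 50 is contacted by only one.
toy_residues = {10: [(0, 0, 0)], 11: [(3, 0, 0)], 50: [(20, 0, 0)]}
toy_ligands = [[(1, 0, 0)], [(2, 0, 0)], [(19, 0, 0)]]
print(predict_binding_site(toy_residues, toy_ligands))
```

Requiring contact with more than one ligand is one simple way to favour positions supported by several homologous structures over a single, possibly spurious, hit.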

3DLigandSite binding site predictions are being incorporated into the PDBe as part of the BBSRC funded FunPDBe project.

Inferring protein function

Less than 1% of the 148M proteins in UniProt have experimentally characterised functions recorded in the Gene Ontology. We have therefore developed computational methods to infer protein function. CombFunc is a machine learning approach that combines features from multiple data sources to infer protein function; it includes ConFunc, our original conservation-based method for inferring protein function. Both methods have performed well in the international Critical Assessment of Functional Annotation (CAFA), with ConFunc ranked 4th in CAFA1 for prediction of eukaryotic protein function and CombFunc ranked in the top 10 methods (see the CAFA2 assessment paper).

Modelling protein structure

Protein structures have been solved for even fewer proteins than those with annotated functions. Modelling protein structure is therefore an important task and we have been involved in the development of the Phyre2 webserver.

Predicting the effect of single nucleotide variants

The 1000 Genomes project identified that each of us has between 4 and 5 million genetic variants that differ from the reference genome. It is now important to identify those that have a functional effect and result in a phenotype, especially those associated with disease. To address this we developed VarMod, a machine learning-based method for predicting whether non-synonymous single nucleotide variants (SNVs) are functional. VarMod uses structural modelling and analysis of protein-protein interfaces and protein-ligand binding sites to identify SNVs that have functional effects. This builds upon our research demonstrating that disease-associated SNVs frequently occur at protein-protein interfaces - David et al., 2012, Human Mutation, 33, 359–363

2. Using computational biology to address important biological questions

Identifying molecular determinants of virus pathogenicity

Over the past few years we have been interested in identifying the molecular determinants of Ebola virus pathogenicity. This work was driven by the 2013-16 Ebola virus outbreak in West Africa, which resulted in more than 28,000 cases and 11,000 deaths. The virus was also widely sequenced during this outbreak, making our research possible. We have focussed on comparison of Reston virus, the only species of Ebolavirus that does not cause disease in humans, with the four species that are known to cause disease. Our work has identified a small set of amino acid differences between these species that we propose are responsible for the difference in pathogenicity. Our main hypothesis is that differences in the protein VP24 are critical to determining host-specific pathogenicity. Our original study (Pappalardo et al., 2016) used the 196 genome sequences that were available at the time. Our findings were also supported by molecular dynamics simulations of VP24 (Pappalardo et al., 2017). We have recently updated (Martell, Masterson et al., 2019) our analysis using more than 1,400 genome sequences and found that our results were reproduced with this much larger dataset, providing confidence that our approach is robust even with the small number of sequences originally used.
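
The kind of comparison described above, finding amino acid positions that separate one group of sequences from another, can be sketched in a few lines of Python. This is a simplified illustration and not the published analysis: it assumes the sequences are already aligned, the function name is hypothetical, and the short sequences are invented toy data rather than real VP24 sequences.

```python
def group_specific_positions(group_a, group_b):
    """Find alignment columns conserved within each group but different between them.

    group_a, group_b: lists of aligned sequences (equal length, '-' for gaps).
    Returns a list of (position, residue_in_a, residue_in_b), positions 1-based.
    """
    length = len(group_a[0])
    if any(len(seq) != length for seq in group_a + group_b):
        raise ValueError("all sequences must be aligned to the same length")

    differences = []
    for i in range(length):
        residues_a = {seq[i] for seq in group_a}
        residues_b = {seq[i] for seq in group_b}
        # Keep the column only if each group has a single residue,
        # the two groups disagree, and neither residue is a gap.
        if (
            len(residues_a) == 1
            and len(residues_b) == 1
            and residues_a != residues_b
            and "-" not in residues_a | residues_b
        ):
            differences.append((i + 1, residues_a.pop(), residues_b.pop()))
    return differences


# Invented toy alignment: position 3 separates the two groups,
# position 5 varies within the second group and is therefore ignored.
non_pathogenic = ["MAKTL", "MAKTL"]
pathogenic = ["MARTL", "MARTI", "MARTL"]
print(group_specific_positions(non_pathogenic, pathogenic))
```

Requiring strict conservation within each group is a deliberately conservative choice for the sketch; with many genomes, a tolerance for rare variants within a group may be more practical.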

A new species of Ebolavirus, Bombali virus, was identified in August 2018. We have investigated whether this species causes disease in humans (Martell, Masterson et al., 2019). Important positions in VP24 carry the same amino acids as Reston virus, so it is possible that Bombali virus does not cause disease in humans. We have also investigated the mutations that occur during rodent adaptation studies of Ebola virus. Ebola virus does not normally cause disease in rodents, but pathogenicity can be induced through serial passaging of the virus in rodents. Our analysis identified that only a small number of mutations are likely required to induce pathogenicity of Ebola virus in a new host (Pappalardo et al., 2016). Further, important mutations occur in VP24, linking with our original work comparing Reston virus with the other Ebolaviruses. If only a small number of mutations are required to make Reston virus pathogenic in humans, then this could pose a significant public health risk, given that Reston virus circulates in pigs in Asia.

The application of our approach to Ebolaviruses has proved successful and we are therefore beginning to apply it to other virus families in which different species exhibit different phenotypes.

Studying cancer cell evolution to understand acquired drug resistance

Drug resistance is a common problem during cancer treatment. Often a tumour initially responds to a drug, only for the tumour cells to evolve over time and become resistant to further treatment with the same drug. In this work we collaborate extensively with Prof Martin Michaelis using the Resistant Cancer Cell Line (RCCL) collection, a collection of more than 1,500 cancer cell lines that have been adapted to anti-cancer drugs, which we use as a model to study the mechanisms of acquired drug resistance and to identify biomarkers of drug sensitivity and resistance. We combine omics data (including exome sequencing and transcriptomics) with screening of the cell lines against a panel of drugs to compare the parental cell lines with their drug-resistant sub-lines.
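
As a rough illustration of comparing a parental cell line with a drug-resistant sub-line, the Python sketch below splits exome variant calls into those gained, lost and shared. It is only a sketch under assumptions: the function name and the representation of a variant as a (chromosome, position, reference, alternate) tuple are hypothetical, and the variant calls are invented.

```python
def compare_variants(parental, resistant):
    """Split variant calls into gained, lost and shared sets.

    Each variant is a (chromosome, position, ref, alt) tuple.
    'gained' variants are seen only in the resistant sub-line,
    'lost' variants only in the parental line.
    """
    parental, resistant = set(parental), set(resistant)
    return {
        "gained": sorted(resistant - parental),
        "lost": sorted(parental - resistant),
        "shared": sorted(parental & resistant),
    }


# Invented variant calls for illustration only.
parental_calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
resistant_calls = [("chr1", 1000, "A", "G"), ("chr7", 42, "G", "A")]
result = compare_variants(parental_calls, resistant_calls)
print(result["gained"])
```

Variants in the gained set are natural first candidates when looking for changes acquired during drug adaptation, although real data would also need filtering for sequencing depth and call quality.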

Protein evolution - Adaptation of myosin with increasing body size

Analysis of genetic variation

Martell et al., 2017 (the cystinuria paper)

Identifying the functions present in the minimal bacterial genome

Identifying molecular determinants - David et al., 2012, Human Mutation, 33, 359–363

Analysis of disease-causing non-synonymous SNPs

In recent years genome-wide association studies have identified many SNPs that may be associated with disease. These studies tend to identify regions of the genome that are linked to disease, and it is therefore possible that many SNPs in such regions are associated with disease. With collaborators at Imperial College London we focussed on identifying the functional effects of disease-associated SNPs identified in genome-wide association studies, using structural modelling and the prediction of ligand binding sites. Our interests are now in using genomic features to assess the functional effect of SNPs, and in applying this to the understanding of disease and its treatment. (Refs Chambers et al., Nature Genetics, 2009;2009;2010;2011).