Significance of Research
This research would help to dissect the relationship between disease, protein-protein interactions, and structural alterations, with special emphasis on evolutionary linkage. As a result, functions could be assigned to newly discovered proteins whose roles are unknown. As structural genomics gains importance, analysis of protein structures with unknown functions could provide vital clues to function and could reduce the search space that docking algorithms must explore to predict the structures of complexes. The work could also aid the detection of novel protein-protein binding sites, which could become promising candidates for drug discovery.
Diseases associated with mutations could be better understood by using Bayesian networks based on structural or evolutionary data (Chris et al., 2006). Such networks would also give insight into the association between cellular pathways and protein functional sites. The association between gene polymorphisms and disease may likewise be clarified by analysing single nucleotide polymorphisms (SNPs) with bioinformatic tools (Matthew et al., 2007). The proposed work would also provide information on how proteins in plants are altered in response to various environmental stimuli.
Review of the Background of the Study
Proteomics has received widespread research attention since the completion of the Human Genome Mapping Project. Understanding protein function was previously a challenging task because reliable databases were lacking. Structural genomics projects, together with automated predictors, have made the task easier by generating useful data on protein structures with unknown functions (James & David, 2005). By combining a support vector machine (SVM) with surface patch analysis, researchers were able to predict protein-protein binding sites (James & David, 2005). Although this approach has helped bioinformaticians to study the interface between two interacting proteins, further studies are required.
A Bayesian approach has been regarded as a reliable platform for determining the structural similarity of functional sites in proteins of unknown function by comparison with a database of proteins of known function (Vysaul, Kanti & David, 2005). Because structural similarity was treated as a theoretical problem, algorithms such as expectation-maximisation (EM) and Markov chain Monte Carlo (MCMC) were reported to be used in this approach (Vysaul, Kanti & David, 2005). Predicting the functional sites of proteins from their 3D structures may therefore require the support of bioinformatic algorithms.
Advances in computational biology have contributed to the emergence of Bayesian networks, which aim to integrate biological data and to infer cellular networks and pathways (Chris et al., 2006). These networks have improved the understanding of functional proteomics, as shown by a case study on the Mog1p family in which a binding site involved in a signal transduction pathway was detected (James et al., 2006). Novel binding sites were similarly identified on two structural genomics targets and on a protein involved in the photosystem II complex of higher plants. Using a second Bayesian network, the researchers were also able to distinguish between the two types of binding site, obligate and non-obligate, within the available dataset (James et al., 2006). This report indicates that the Bayesian network method could aid both the prediction of previously uncharacterized binding sites and the location of non-obligate binding sites. Bayesian networks have also made it easier to predict the functional consequences of missense mutations (Chris et al., 2006). This success could be attributed solely to the availability of structural or evolutionary information, which provides the predictor variables; of the two, structural information was considered the more significant. The classification of protein sequences is another achievement: the TMB-Hunt program classifies sequences with a modified k-nearest neighbor (k-NN) algorithm (Andrew, Alison & David, 2005).
TMB-Hunt distinguishes transmembrane β-barrel (TMB) proteins from non-TMB proteins on the basis of whole-sequence amino acid composition. The TMB-Hunt web server allows up to 10,000 sequences to be screened in a single query and presents results and key statistics in a simple color-coded format (Andrew, Alison & David, 2005). However, the utility of TMB-Hunt requires further confirmation, because TMB proteins are the least well characterized although they have multiple evolutionary origins (Andrew, Alison & David, 2005).
Protein function has also been reported to be altered by single nucleotide polymorphisms (SNPs) (Matthew et al., 2007). Such alterations are considered deleterious when they impair a vital biological function and lead to genetic disease. Using various data sets, earlier workers have predicted which human SNPs are deleterious, and these predictions could guide mutagenesis studies (Matthew et al., 2007). SNPs that cause amino acid substitutions, known as non-synonymous SNPs (nsSNPs), are therefore important to study (Matthew et al., 2007).
Bioinformatic tools have facilitated the identification of GATA-binding proteins in the plant Arabidopsis thaliana during studies of light-responsive promoters (Iain et al., 2007). This has enabled sequences matching the GATAAGG motif to be identified as the binding site for the fungal GATA-binding proteins AreA and Nit2 and for the vertebrate GATA transcription factors (Iain et al., 2007). The finding helped to determine the function of genes that are developmentally light-regulated, are expressed predominantly in photosynthetic tissue, and whose transcription peaks before sunrise (Iain et al., 2007).
The Arabidopsis Co-Expression Tool (ACT) was previously described as enabling the identification of genes whose expression patterns are correlated across selected experiments or across the complete data set (Chih-Hung Jen et al., 2006).
In addition, CLIQUE FINDER has proved helpful for determining the genes most likely to be regulated in a similar manner. These tools were reported to provide valuable information on genes encoding functionally related proteins, for example the 80S ribosomal proteins in Arabidopsis (Chih-Hung Jen et al., 2006).
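The idea of ranking genes by correlated expression can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Pearson correlation coefficient as the similarity measure (the text above does not say which measure ACT uses), and the gene names, expression values, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(query, profiles):
    """Rank every other gene by its correlation with the query, strongest first."""
    scores = [
        (gene, pearson(profiles[query], values))
        for gene, values in profiles.items()
        if gene != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical expression values across five experiments (illustrative only).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.0, 3.5, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.5, 3.2],
}

ranking = rank_coexpressed("geneA", profiles)
for gene, r in ranking:
    print(f"{gene}\t{r:+.3f}")
```

Genes at the top of such a ranking are candidates for shared regulation with the query gene. A strongly negative coefficient, as for the hypothetical geneC, indicates an opposing expression pattern.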
Bioinformatic analysis of Arabidopsis would therefore furnish better insights into its proteome.
The identification of important functional domains in protein structures has become feasible with bioinformatic models known as "profile models" (John et al., 2007). These models record the frequency of the amino acids observed at each position in multiple alignments of protein sequences taken from a range of organisms (John et al., 2007). They are mainly used for metabolic reconstruction and the analysis of parasite genomes.
Protein motifs involved in cell wall lysis have been identified through genetic analysis using computational tools (Mariam Taib et al., 2005). In that work, the researchers identified sub-families of three and eight genes encoding proteins closely related to ChiA1 (chitinase) or ChiB1 in the Aspergillus fumigatus genome (Mariam Taib et al., 2005).
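The position-specific frequencies that underlie a profile model can be illustrated with a minimal sketch. It assumes a toy, pre-computed alignment and uses raw frequencies without pseudocounts or log-odds scaling, which practical profile methods normally add. The sequences and function names are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Relative frequency of each residue in every alignment column.

    Gap characters ('-') are ignored when counting.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def match_score(profile, sequence):
    """Product of the per-position frequencies (a crude, unsmoothed score)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal the profile length")
    score = 1.0
    for freqs, aa in zip(profile, sequence):
        score *= freqs.get(aa, 0.0)
    return score


# Hypothetical four-sequence alignment (illustrative only).
alignment = ["MKVL", "MRVL", "MKIL", "MK-L"]
profile = column_frequencies(alignment)

for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

A candidate sequence can then be compared with the profile, for example `match_score(profile, "MKVL")`. Without pseudocounts, any residue never seen at a position drives the score to zero, which is one reason practical profile models smooth their frequencies.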
Specific Aims
Building on this background, the proposed research has two aims: 1) to explore the bioinformatic tools and databases (structural or genetic) currently available on the World Wide Web (WWW), and 2) to predict the function of protein domains or sites using information from these databases.
Research Methods
The Protein Data Bank (PDB) will be accessed to select protein complexes, using NCBI or MEDLINE to support the selection. Proteins that share more than 20% sequence identity with the most recently determined structure of the same complex will be excluded from the study.
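The exclusion rule can be sketched as follows. This is a minimal illustration, assuming that each candidate has already been aligned pairwise to the reference structure's sequence and that identity is counted over aligned, non-gap columns. The chain names and sequences are hypothetical, and only the 20% threshold comes from the text.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def keep_non_redundant(aligned_pairs, threshold=20.0):
    """Return the names of candidates whose identity to the reference is <= threshold."""
    return [
        name
        for name, (reference, candidate) in aligned_pairs.items()
        if percent_identity(reference, candidate) <= threshold
    ]


# Hypothetical pairwise alignments: name -> (reference, candidate).
aligned_pairs = {
    "chainX": ("MKTAYIAKQR", "MKTAYIAKQR"),    # identical to the reference
    "chainY": ("MKTAYIAKQR", "MLSGWPDEHN"),    # only the first residue matches
    "chainZ": ("MKTA-YIAKQR", "MKSAWYLPEHN"),  # one gap column is skipped
}

kept = keep_non_redundant(aligned_pairs)
print(kept)
```

In practice the alignments would come from a dedicated alignment program, and the definition of identity (for example, which length is used as the denominator) should be fixed before the threshold is applied.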
The solvent-excluded surface (SES) of each individual protein and of each complex will be calculated with the MSMS surface generation program. To allow efficient inference and learning, the network will be constructed as a directed acyclic graph (DAG) using the Bayesian Network Toolbox for Matlab (BNT).
The Expectation-Maximisation (EM) algorithm will be used together with the joint probability distribution of the naïve Bayesian network. Patch information reflecting the interacting and non-interacting parts of the protein surface will be used to obtain the number of successful predictions expected by chance (James et al., 2006). The surface vertices within a given patch will be used to generate support vector machine (SVM) attributes, which will then be passed to SVM software for prediction. The final selection of patches will be based on the confidence value that the SVM assigns to each patch (James & David, 2005).
To obtain information on mutation-induced protein alterations, a total of fourteen variables will be used to predict whether or not a missense mutation affects protein function. Mutations will be classified as "effect" or "no effect" according to the degree of loss of function. Both naïve and learned Bayesian network structures will be used for this purpose (Chris et al., 2006). A concomitant SNP analysis will also be carried out to predict deleterious SNPs in human genes encoding protein motifs, as described previously (Matthew et al., 2007).
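The classification step with a naïve Bayesian network can be illustrated with a minimal sketch. In a naïve network the joint probability factorises as P(class) multiplied by the product of P(variable | class) over all variables, and the class with the larger joint probability is predicted. The sketch below assumes complete, categorical data with add-one smoothing, so it does not need EM. The two variables and all training examples are hypothetical stand-ins for the fourteen variables of the proposal.

```python
from collections import Counter
from math import log


class NaiveBayes:
    """Categorical naive Bayes classifier with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # value_counts[label][i][value] = times value was seen for feature i
        self.value_counts = {
            label: [Counter() for _ in range(n_features)]
            for label in self.class_counts
        }
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[label][i][value] += 1
                self.values[i].add(value)
        return self

    def log_joint(self, row, label):
        """log P(label) + sum over features of log P(value | label)."""
        total = sum(self.class_counts.values())
        result = log(self.class_counts[label] / total)
        for i, value in enumerate(row):
            count = self.value_counts[label][i][value]
            denominator = self.class_counts[label] + len(self.values[i])
            result += log((count + 1) / denominator)
        return result

    def predict(self, row):
        """Return the class with the highest joint probability."""
        return max(self.class_counts, key=lambda label: self.log_joint(row, label))


# Hypothetical training data: (burial, conservation) -> class (illustrative only).
rows = [
    ("buried", "conserved"), ("buried", "conserved"),
    ("buried", "variable"), ("exposed", "conserved"),
    ("exposed", "variable"), ("exposed", "variable"),
    ("exposed", "variable"), ("buried", "variable"),
]
labels = ["effect"] * 4 + ["no effect"] * 4

model = NaiveBayes().fit(rows, labels)
print(model.predict(("buried", "conserved")))
print(model.predict(("exposed", "variable")))
```

A learned network structure would relax the independence assumption by adding edges between variables, which this sketch does not attempt.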
In a separate experiment, genes encoding plant GATA transcription factors will be analyzed. For this purpose, Arabidopsis plants will be grown in a compost:sand:perlite medium containing the insecticide Intercept in a glasshouse at 22°C without supplementary light. Seedlings will be exposed to different conditions, such as light:dark cycles and circadian regimes, and will be harvested into liquid nitrogen at 4-h intervals. Total RNA will be purified and quantified by UV absorbance spectrophotometry. For quantitative polymerase chain reaction (PCR), RNA will be converted to cDNA using an oligo(dT) primer and Moloney murine leukemia virus reverse transcriptase, and the cDNA concentration will be measured. PCR products will be ligated into the pGEM-T Easy cloning vector for further amplification.
Regions of GATA genes with similarity to other regions of the genome will be identified by BLAST analysis with an E-value of 1.0, and PCR will then be performed. DNA sequence comparisons will likewise be performed using BLAST programs, and the sets of genes showing the strongest co-expression with each GATA factor will be obtained using ACT (www.arabidopsis.Leeds.ac.uk/ACT). Finally, the probe set identifiers for the GATA factors and for the genes co-expressed with each GATA factor will be pasted into the Genevestigator MetaAnalyzer tool in order to identify the tissues in which each gene is expressed (Iain et al., 2007).
References
Andrew G Garrow, Alison Agnew, David R Westhead. “TMB-Hunt: a web server to screen sequence sets for transmembrane β-barrel proteins.” Nucleic Acids Research 33 (2005): W188–W192.
Chih-Hung Jen, Iain W Manfield, Ioannis Michalopoulos, John W Pinney, William GT Willats, Philip M Gilmartin, David R Westhead. “The Arabidopsis co-expression tool (ACT): a WWW-based tool and database for microarray-based gene expression analysis.” The Plant Journal 46 (2006): 336–348.
Chris J Needham, James R Bradford, Andrew J Bulpitt, David R Westhead. “Inference in Bayesian networks.” Nature Biotechnology 24.1 (2006): 51–54.
Chris J Needham, James R Bradford, Andrew J Bulpitt, Matthew A Care, David R Westhead. “Predicting the effect of missense mutations on protein function: analysis with Bayesian networks.” BMC Bioinformatics 7 (2006): 405.
Iain W Manfield, Paul F Devlin, Chih-Hung Jen, David R Westhead, Philip M Gilmartin. “Conservation, Convergence, and Divergence of Light-Responsive, Circadian-Regulated, and Tissue-Specific Expression Patterns during Evolution of the Arabidopsis GATA Gene Family.” Plant Physiology 143 (2007): 941–958.
James R Bradford, David R Westhead. “Improved prediction of protein–protein binding sites using a support vector machines approach.” Bioinformatics 21.8 (2005): 1487–1494.
James R Bradford, Chris J Needham, Andrew J Bulpitt, David R Westhead. “Insights into protein-protein interfaces using a Bayesian Network Prediction Method.” J Mol Biol 362 (2006): 365–386.
John W Pinney, Balazs Papp, Christopher Hyland, Lillian Wambua, David R Westhead, Glenn A McConkey. “Metabolic reconstruction and analysis for parasite genomes.” Trends in Parasitology 23.11 (2007): 548–554.
Mariam Taib, John W Pinney, David R Westhead, Kenneth J McDowell, David J Adams. “Differential expression and extent of fungal/plant and fungal/bacterial chitinases of Aspergillus fumigatus.” Arch Microbiol 184 (2005): 78–81.
Matthew A Care, Chris J Needham, Andrew J Bulpitt, David R Westhead. “Deleterious SNP prediction: be mindful of your training data!” Bioinformatics 23.6 (2007): 664–672.
Vysaul Nyirongo, Kanti V Mardia, David R Westhead. “EM algorithm, Bayesian and distance approaches to matching functional sites.” (2005): 155–156. Web.