Catalogue Search | MBRL

Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods

by Dessimoz, Christophe; Altenhoff, Adrian M. in Animals, Computational biology, Computational Biology/Comparative Sequence Analysis

2009

Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically only assessed in terms of function conservation, despite the phylogeny-based original definition of Fitch. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At lower functional specificity but higher coverage level, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest to build broad functional grouping, but the method is not specific enough for phylogenetic or detailed function analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. And third, it sets performance standards for current and future approaches.
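
The bidirectional best-hit baseline named in this abstract can be illustrated with a minimal sketch. It assumes pairwise similarity scores are already available as dictionaries keyed by (query, target); the function names, gene identifiers, and scores below are hypothetical and are not taken from the study.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring target."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy scores between genome A (a1..a3) and genome B (b1, b2).
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 198.0, ("b2", "a2"): 175.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data, a2 and a3 both point to b2, but b2 points back only to a3, so a2 is left without a predicted ortholog. This illustrates why the method is simple and specific but can miss genes with lineage-specific duplications.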

Journal Article

Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes

by Mackey, Aaron J.; Chen, Feng; Vermunt, Jeroen K. in Accuracy, Algorithms, Apis mellifera

2007

Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genomic-scale 'gold standard' orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to reasonably infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity >80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than the (manually curated) KOG database, and the homolog clustering algorithm TribeMCL as well. By way of using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guides for method selection, tuning and development for different applications. Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.

Journal Article

A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

by Eddy, Sean R. in Algorithms, Base Sequence, Chromosome Mapping - methods

2008

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
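
As a rough illustration of how the conjectured distributions translate into E-values, the sketch below evaluates a Gumbel tail and an exponential tail with lambda fixed at log 2. The location parameters `mu` and `tau`, the database size `n_targets`, and all function names are assumptions made for this example; the abstract does not say how they are obtained.

```python
import math

LAMBDA = math.log(2)  # the constant stated in the abstract, for bit scores


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) for a Gumbel distribution with location mu (assumed known)."""
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exp_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) for an exponential tail anchored at tau (assumed known)."""
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of chance hits in a search of n_targets sequences."""
    return pvalue * n_targets


# Hypothetical values: tail anchored at 20 bits, 1,000,000 target sequences.
p = exp_tail_pvalue(30.0, tau=20.0)
print(p, evalue(p, 1_000_000))
```

With lambda = log 2, each additional bit of score halves the tail probability, so a score 10 bits above `tau` has a tail probability of 2^-10.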

Journal Article

Detection of Large Numbers of Novel Sequences in the Metatranscriptomes of Complex Marine Microbial Communities

by Joint, Ian; Li, Weizhong; Gilbert, Jack A. in Analysis, Animals, Base pairs

2008

Sequencing the expressed genetic information of an ecosystem (metatranscriptome) can provide information about the response of organisms to varying environmental conditions. Until recently, metatranscriptomics has been limited to microarray technology and random cloning methodologies. The application of high-throughput sequencing technology is now enabling access to both known and previously unknown transcripts in natural communities. We present a study of a complex marine metatranscriptome obtained from random whole-community mRNA using the GS-FLX Pyrosequencing technology. Eight samples, four DNA and four mRNA, were processed from two time points in a controlled coastal ocean mesocosm study (Bergen, Norway) involving an induced phytoplankton bloom producing a total of 323,161,989 base pairs. Our study confirms the finding of the first published metatranscriptomic studies of marine and soil environments that metatranscriptomics targets highly expressed sequences which are frequently novel. Our alternative methodology increases the range of experimental options available for conducting such studies and is characterized by an exceptional enrichment of mRNA (99.92%) versus ribosomal RNA. Analysis of corresponding metagenomes confirms much higher levels of assembly in the metatranscriptomic samples and a far higher yield of large gene families with >100 members, approximately 91% of which were novel. This study provides further evidence that metatranscriptomic studies of natural microbial communities are not only feasible, but when paired with metagenomic data sets, offer an unprecedented opportunity to explore both structure and function of microbial communities--if we can overcome the challenges of elucidating the functions of so many never-seen-before gene families.

Journal Article

Evolutionary Descent of Prion Genes from the ZIP Family of Metal Ion Transporters

by Schmitt-Ulms, Gerold; Watts, Joel C.; Ehsani, Sepehr in Amino acid sequence, Amino acids, Animals

2009

In the more than twenty years since its discovery, both the phylogenetic origin and cellular function of the prion protein (PrP) have remained enigmatic. Insights into a possible function of PrP may be obtained through the characterization of its molecular neighborhood in cells. Quantitative interactome data demonstrated the spatial proximity of two metal ion transporters of the ZIP family, ZIP6 and ZIP10, to mammalian prion proteins in vivo. A subsequent bioinformatic analysis revealed the unexpected presence of a PrP-like amino acid sequence within the N-terminal, extracellular domain of a distinct sub-branch of the ZIP protein family that includes ZIP5, ZIP6 and ZIP10. Additional structural threading and orthologous sequence alignment analyses argued that the prion gene family is phylogenetically derived from a ZIP-like ancestral molecule. The level of sequence homology and the presence of prion protein genes in most chordate species place the split from the ZIP-like ancestor gene at the base of the chordate lineage. This relationship explains structural and functional features found within mammalian prion proteins as elements of an ancient involvement in the transmembrane transport of divalent cations. The phylogenetic and spatial connection to ZIP proteins is expected to open new avenues of research to elucidate the biology of the prion protein in health and disease.

Journal Article

The Compartmentalized Bacteria of the Planctomycetes-Verrucomicrobia-Chlamydiae Superphylum Have Membrane Coat-Like Proteins

by Gorjanacz, Matyas; Bauer, Ulrike; Franke, Josef in Bacteria, Bacteria - classification, Bacteria - cytology

2010

The development of the endomembrane system was a major step in eukaryotic evolution. Membrane coats, which exhibit a unique arrangement of beta-propeller and alpha-helical repeat domains, play key roles in shaping eukaryotic membranes. Such proteins are likely to have been present in the ancestral eukaryote but cannot be detected in prokaryotes using sequence-only searches. We have used a structure-based detection protocol to search all proteomes for proteins with this domain architecture. Apart from the eukaryotes, we identified this protein architecture only in the Planctomycetes-Verrucomicrobia-Chlamydiae (PVC) bacterial superphylum, many members of which share a compartmentalized cell plan. We determined that one such protein is partly localized at the membranes of vesicles formed inside the cells in the planctomycete Gemmata obscuriglobus. Our results demonstrate similarities between bacterial and eukaryotic compartmentalization machinery, suggesting that the bacterial PVC superphylum contributed significantly to eukaryogenesis.

Journal Article

Comparative Genomic Evidence for a Complete Nuclear Pore Complex in the Last Eukaryotic Common Ancestor

by Lundin, Daniel; Neumann, Nadja; Poole, Anthony M. in Anchoring, Ascomycetes, Bacteria

2010

The Nuclear Pore Complex (NPC) facilitates molecular trafficking between nucleus and cytoplasm and is an integral feature of the eukaryote cell. It exhibits eight-fold rotational symmetry and is comprised of approximately 30 nucleoporins (Nups) in different stoichiometries. Nups are broadly conserved between yeast, vertebrates and plants, but few have been identified among other major eukaryotic groups. We screened for Nups across 60 eukaryote genomes and report that 19 Nups (spanning all major protein subcomplexes) are found in all eukaryote supergroups represented in our study (Opisthokonts, Amoebozoa, Viridiplantae, Chromalveolates and Excavates). Based on parsimony, between 23 and 26 of 31 Nups can be placed in LECA. Notably, they include central components of the anchoring system (Ndc1 and Gp210) indicating that the anchoring system did not evolve by convergence, as has previously been suggested. These results significantly extend earlier results and, importantly, unambiguously place a fully-fledged NPC in LECA. We also test the proposal that transmembrane Pom proteins in vertebrates and yeasts may account for their variant forms of mitosis (open mitoses in vertebrates, closed among yeasts). The distribution of homologues of vertebrate Pom121 and yeast Pom152 is not consistent with this suggestion, but the distribution of fungal Pom34 fits a scenario wherein it was integral to the evolution of closed mitosis in ascomycetes. We also report an updated screen for vesicle coating complexes, which share a common evolutionary origin with Nups, and can be traced back to LECA. Surprisingly, we find only three supergroup-level differences (one gain and two losses) between the constituents of COPI, COPII and Clathrin complexes. Our results indicate that all major protein subcomplexes in the Nuclear Pore Complex are traceable to the Last Eukaryotic Common Ancestor (LECA). In contrast to previous screens, we demonstrate that our conclusions hold regardless of the position of the root of the eukaryote tree.

Journal Article

Exploration of Uncharted Regions of the Protein Universe

by Krishna, S. Sri; Wilson, Ian A.; Li, Zhanwen in Animals, Biochemistry/Bioinformatics, Biochemistry/Molecular Evolution

2009

The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, as most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results infer that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Notwithstanding, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.

Journal Article

Networks of High Mutual Information Define the Structural Proximity of Catalytic Sites: Implications for Catalytic Residue Identification

by Delfino, José María; Di Doménico, Tomas; Marino Buslje, Cristina in Amino acids, Area Under Curve, Biochemistry

2010

Identification of catalytic residues (CR) is essential for the characterization of enzyme function. CR are, in general, conserved and located in the functional site of a protein in order to attain their function. However, many non-catalytic residues are highly conserved and not all CR are conserved throughout a given protein family making identification of CR a challenging task. Here, we put forward the hypothesis that CR carry a particular signature defined by networks of close proximity residues with high mutual information (MI), and that this signature can be applied to distinguish functional from other non-functional conserved residues. Using a data set of 434 Pfam families included in the catalytic site atlas (CSA) database, we tested this hypothesis and demonstrated that MI can complement amino acid conservation scores to detect CR. The Kullback-Leibler (KL) conservation measurement was shown to significantly outperform both the Shannon entropy and maximal frequency measurements. Residues in the proximity of catalytic sites were shown to be rich in shared MI. A structural proximity MI average score (termed pMI) was demonstrated to be a strong predictor for CR, thus confirming the proposed hypothesis. A structural proximity conservation average score (termed pC) was also calculated and demonstrated to carry distinct information from pMI. A catalytic likeliness score (Cls), combining the KL, pC and pMI measures, was shown to lead to significantly improved prediction accuracy. At a specificity of 0.90, the Cls method was found to have a sensitivity of 0.816. In summary, we demonstrate that networks of residues with high MI provide a distinct signature on CR and propose that such a signature should be present in other classes of functional residues where the requirement to maintain a particular function places limitations on the diversification of the structural environment along the course of evolution.
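
The two quantities this abstract builds on, mutual information between alignment columns and Kullback-Leibler conservation of a single column, can be sketched as follows. This is a bare textbook computation on toy columns, assuming no gaps, no sequence weighting, and none of the corrections a real pipeline would likely apply. The function names, the toy columns, and the uniform background are hypothetical and do not reproduce the authors' scoring.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equally long alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


def kl_conservation(col, background):
    """KL divergence (in bits) of a column's residue frequencies from a background."""
    n = len(col)
    return sum((count / n) * math.log2((count / n) / background[residue])
               for residue, count in Counter(col).items())


# Toy columns: the first pair covaries perfectly, the second pair is independent.
print(mutual_information("AAGG", "CCTT"))
print(mutual_information("AAGG", "CTCT"))

# A fully conserved column scored against a hypothetical uniform 4-letter background.
print(kl_conservation("AAAA", {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```

A proximity-averaged score in the spirit of pMI would then average such pairwise values over residues near a position in the structure. That step needs a distance threshold and structural coordinates, which this sketch leaves out.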

Journal Article

Mapping Protein Interactions between Dengue Virus and Its Human and Insect Hosts

by Gomez, Shawn M.; Doolittle, Janet M. in Aedes - virology, Animals, Biochemistry/Structural Genomics

2011

Dengue fever is an increasingly significant arthropod-borne viral disease, with at least 50 million cases per year worldwide. As with other viral pathogens, dengue virus is dependent on its host to perform the bulk of functions necessary for viral survival and replication. To be successful, dengue must manipulate host cell biological processes towards its own ends, while avoiding elimination by the immune system. Protein-protein interactions between the virus and its host are one avenue through which dengue can connect and exploit these host cellular pathways and processes. We implemented a computational approach to predict interactions between Dengue virus (DENV) and both of its hosts, Homo sapiens and the insect vector Aedes aegypti. Our approach is based on structural similarity between DENV and host proteins and incorporates knowledge from the literature to further support a subset of the predictions. We predict over 4,000 interactions between DENV and humans, as well as 176 interactions between DENV and A. aegypti. Additional filtering based on shared Gene Ontology cellular component annotation reduced the number of predictions to approximately 2,000 for humans and 18 for A. aegypti. Of 19 experimentally validated interactions between DENV and humans extracted from the literature, this method was able to predict nearly half (9). Additional predictions suggest specific interactions between virus and host proteins relevant to interferon signaling, transcriptional regulation, stress, and the unfolded protein response. Dengue virus manipulates cellular processes to its advantage through specific interactions with the host's protein interaction network. The interaction networks presented here provide a set of hypotheses for further experimental investigation into the DENV life cycle as well as potential therapeutic targets.

Journal Article
