Catalogue Search | MBRL

by Steel, Mike , McMahon, Michelle M. , Sanderson, Michael J. in Algorithms , Animals , Arthropods - classification

2011

A key step in assembling the tree of life is the construction of species-rich phylogenies from multilocus, but often incomplete, sequence data sets. We describe previously unknown structure in the landscape of solutions to the tree reconstruction problem, comprising sometimes vast "terraces" of trees with identical quality, arranged on islands of phylogenetically similar trees. Phylogenetic ambiguity within a terrace can be characterized efficiently and then ameliorated by new algorithms for obtaining a terrace's maximum-agreement subtree or by identifying the smallest set of new targets for additional sequencing. Algorithms to find optimal trees or estimate Bayesian posterior tree distributions may need to navigate strategically in the neighborhood of large terraces in tree space.

Journal Article


Phylogenomics with incomplete taxon coverage: the limits to inference

by Steel, Mike , McMahon, Michelle M , Sanderson, Michael J in Animal Systematics/Taxonomy/Biogeography , Biomedical and Life Sciences , Computational Biology - methods

2010

Background: Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using an explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa. Results: We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences. Conclusion: Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree.

Journal Article


Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti

by Charboneau, Joseph L. M. , Wojciechowski, Martin F. , Childs, Kevin L. in Base Sequence , Biological evolution , Biological Sciences

2017

Few clades of plants have proven as difficult to classify as cacti. One explanation may be an unusually high level of convergent and parallel evolution (homoplasy). To evaluate support for this phylogenetic hypothesis at the molecular level, we sequenced the genomes of four cacti in the especially problematic tribe Pachycereeae, which contains most of the large columnar cacti of Mexico and adjacent areas, including the iconic saguaro cactus (Carnegiea gigantea) of the Sonoran Desert. We assembled a high-coverage draft genome for saguaro and lower coverage genomes for three other genera of tribe Pachycereeae (Pachycereus, Lophocereus, and Stenocereus) and a more distant outgroup cactus, Pereskia. We used these to construct 4,436 orthologous gene alignments. Species tree inference consistently returned the same phylogeny, but gene tree discordance was high: 37% of gene trees having at least 90% bootstrap support conflicted with the species tree. Evidently, discordance is a product of long generation times and moderately large effective population sizes, leading to extensive incomplete lineage sorting (ILS). In the best supported gene trees, 58% of apparent homoplasy at amino sites in the species tree is due to gene tree-species tree discordance rather than parallel substitutions in the gene trees themselves, a phenomenon termed “hemiplasy.” The high rate of genomic hemiplasy may contribute to apparent parallelisms in phenotypic traits, which could confound understanding of species relationships and character evolution in cacti.

Journal Article


Phylogenetic Supermatrix Analysis of GenBank Sequences from 2228 Papilionoid Legumes

by Savolainen, Vincent , McMahon, Michelle M. , Sanderson, Michael J. in Algorithms , Alignment , Base Sequence

2006

A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a “sparse” matrix based on the primary clusters alone (1794 taxa × 53,977 characters), and a somewhat more “dense” matrix based on the secondary clusters (2228 taxa × 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A “reduced consensus” bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a “backbone” phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. 
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.

Journal Article


Impacts of Terraces on Phylogenetic Inference

by McMahon, Michelle M. , Sanderson, Michael J. , Stamatakis, Alexandros in Algorithms , Bayesian analysis , Classification - methods

2015

Terraces are sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The potentially large set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces can add ambiguity and complexity to phylogenetic inference, particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this article we present five new findings about terraces and their impacts on phylogenetic inference. First, we clarify assumptions about partitioning scheme model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terrace size on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade, as it is usually calculated, can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges "attract" each other in Bayesian inference. Fifth, we describe how assuming relationships between edge lengths of different loci, as an attempt to avoid terraces, can also be problematic when taxon coverage is partial, specifically when heterotachy is present. Finally, we discuss strategies for remediation of some of these problems. One promising approach finds a minimal set of taxa which, when deleted from the data matrix, reduces the size of a terrace to a single tree.

Journal Article


STBase: One Million Species Trees for Comparative Biology

by McMahon, Michelle M. , Sanderson, Michael J. , Deepak, Akshay in Algorithms , Analysis , Assembling

2015

Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. 
It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.

Journal Article


Diversity and evolution of a trait mediating ant-plant interactions: insights from extrafloral nectaries in Senna (Leguminosae)

by Marazzi, Brigitte , Sanderson, Michael J , Bronstein, Judith L in anatomy & histology , Animals , Ants

2013

Background and Aims: Plants display a wide range of traits that allow them to use animals for vital tasks. To attract and reward aggressive ants that protect developing leaves and flowers from consumers, many plants bear extrafloral nectaries (EFNs). EFNs are exceptionally diverse in morphology and locations on a plant. In this study the evolution of EFN diversity is explored by focusing on the legume genus Senna, in which EFNs underwent remarkable morphological diversification and occur in over 80% of the approx. 350 species. Methods: EFN diversity in location, morphology and plant ontogeny was characterized in wild and cultivated plants, using scanning electron microscopy and microtome sectioning. From these data EFN evolution was reconstructed in a phylogenetic framework comprising 83 Senna species. Key Results: Two distinct kinds of EFNs exist in two unrelated clades within Senna. ‘Individualized’ EFNs (iEFNs), located on the compound leaves and sometimes at the base of pedicels, display a conspicuous, gland-like nectary structure, are highly diverse in shape and characterize the species-rich EFN clade. Previously overlooked ‘non-individualized’ EFNs (non-iEFNs) embedded within stipules, bracts, and sepals are cryptic and may represent a new synapomorphy for clade II. Leaves bear EFNs consistently throughout plant ontogeny. In one species, however, early seedlings develop iEFNs between the first pair of leaflets, but later leaves produce them at the leaf base. This ontogenetic shift reflects our inferred diversification history of iEFN location: ancestral leaves bore EFNs between the first pair of leaflets, while leaves derived from them bore EFNs either between multiple pairs of leaflets or at the leaf base. Conclusions: EFNs are more diverse than previously thought. EFN-bearing plant parts provide different opportunities for EFN presentation (i.e. location) and individualization (i.e. morphology), with implications for EFN morphological evolution, EFN–ant protective mutualisms and the evolutionary role of EFNs in plant diversification.

Journal Article


Prospects for Building the Tree of Life from Large Sequence Databases

by Burleigh, J. Gordon , McMahon, Michelle M. , Sanderson, Michael J. in Algorithms , Animals , Anopheles - classification

2004

We assess the phylogenetic potential of ~300,000 protein sequences sampled from Swiss-Prot and GenBank. Although only a small subset of these data was potentially phylogenetically informative, this subset retained a substantial fraction of the original taxonomic diversity. Sampling biases in the databases necessitate building phylogenetic data sets that have large numbers of missing entries. However, an analysis of two "supermatrices" suggests that even data sets with as much as 92% missing data can provide insights into broad sections of the tree of life.

Journal Article


EvoMiner: frequent subtree mining in phylogenetic databases

by Tirthapura, Srikanta , McMahon, Michelle M. , Sanderson, Michael J. in Agreements , Algorithmics. Computability. Computer arithmetics , Algorithms

2014

The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like levelwise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest-common-ancestor-based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speedups of up to 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees when compared to maximum agreement subtrees and majority-rule trees, two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.

Journal Article


Phylogeny of Amorpheae (Fabaceae: Papilionoideae)

by McMahon, Michelle , Hufford, Larry in Amorpheae , Apoplanesia , Biological taxonomies

2004

The legume tribe Amorpheae comprises eight genera and 240 species with variable floral form. In this study, we inferred a phylogeny for Amorpheae using DNA sequence data from the plastid trnK intron, including matK, and the nuclear ribosomal ITS1, 5.8S, and ITS2. Our data resulted in a well-resolved phylogeny in which the tribe is divided into the daleoids (Dalea, Marina, and Psorothamnus), characterized by generally papilionaceous corollas, and the amorphoids (Amorpha, Apoplanesia, Errazurizia, Eysenhardtia, and Parryella), characterized by non-papilionaceous flowers. We found evidence for the paraphyly of Psorothamnus and for the monophyly of Dalea once D. filiciformis is transferred to monophyletic Marina. Errazurizia rotundata is more closely related to Amorpha than to the other errazurizias, and Eysenhardtia is supported to be monophyletic. The monotypic Parryella and Apoplanesia are placed within the amorphoids. Among Papilionoideae, trnK/matK sequence data provide strong evidence for the monophyly of Amorpheae and place Amorpheae as sister to the recently discovered dalbergioid clade.

Journal Article

