Catalogue Search | MBRL

The Effects of Partitioning on Phylogenetic Inference

by Kainer, David; Lanfear, Robert in Best practice, Datasets, Inference

2015

Partitioning is a commonly used method in phylogenetics that aims to accommodate variation in substitution patterns among sites. Despite its popularity, there have been few systematic studies of its effects on phylogenetic inference, and there have been no studies that compare the effects of different approaches to partitioning across many empirical data sets. In this study, we applied four commonly used approaches to partitioning to each of 34 empirical data sets, and then compared the resulting tree topologies, branch-lengths, and bootstrap support estimated using each approach. We find that the choice of partitioning scheme often affects tree topology, particularly when partitioning is omitted. Most notably, we find occasional instances where the use of a suboptimal partitioning scheme produces highly supported but incorrect nodes in the tree. Branch-lengths and bootstrap support are also affected by the choice of partitioning scheme, sometimes dramatically so. We discuss the reasons for these effects and make some suggestions for best practice.

Journal Article

Selecting optimal partitioning schemes for phylogenomic datasets

by Stamatakis, Alexandros; Lanfear, Robert; Calcott, Brett in Algorithms, Analysis, Animal Systematics/Taxonomy/Biogeography

2014

Background: Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics. Methods: We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere. Results: We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores. Conclusions: These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.
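As a rough illustration of how candidate partitioning schemes can be compared by AICc and BIC, the sketch below applies the standard textbook formulas to three made-up schemes. The scheme names, log-likelihoods, parameter counts, and alignment length are all hypothetical, and this is not PartitionFinder's implementation.

```python
import math

def information_criteria(log_likelihood, n_params, n_sites):
    """Return (AIC, AICc, BIC) for one fitted partitioning scheme."""
    aic = 2 * n_params - 2 * log_likelihood
    # Small-sample correction; requires n_sites > n_params + 1.
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)
    bic = n_params * math.log(n_sites) - 2 * log_likelihood
    return aic, aicc, bic

# Hypothetical schemes: name -> (log-likelihood, number of free parameters)
schemes = {
    "unpartitioned": (-12500.0, 30),
    "by_gene": (-12350.0, 60),
    "by_codon_position": (-12300.0, 120),
}
n_sites = 3000  # hypothetical alignment length

scores = {
    name: information_criteria(lnl, k, n_sites)
    for name, (lnl, k) in schemes.items()
}
# Lower scores are better for both criteria.
best_by_aicc = min(scores, key=lambda name: scores[name][1])
best_by_bic = min(scores, key=lambda name: scores[name][2])
print(best_by_aicc, best_by_bic)
```

In this toy case the most heavily partitioned scheme has the highest likelihood, but its extra parameters are penalised, so both criteria prefer the intermediate scheme.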

Journal Article

Accelerating Climate Resilient Plant Breeding by Applying Next-Generation Artificial Intelligence

by Harfouche, Antoine H.; Scarascia Mugnozza, Giuseppe; Moshelion, Menachem in Accuracy, Adaptability, Agricultural production

2019

Breeding crops for high yield and superior adaptability to new and variable climates is imperative to ensure continued food security, biomass production, and ecosystem services. Advances in genomics and phenomics are delivering insights into the complex biological mechanisms that underlie plant functions in response to environmental perturbations. However, linking genotype to phenotype remains a huge challenge and is hampering the optimal application of high-throughput genomics and phenomics to advanced breeding. Critical to success is the need to assimilate large amounts of data into biologically meaningful interpretations. Here, we present the current state of genomics and field phenomics, explore emerging approaches and challenges for multiomics big data integration by means of next-generation (Next-Gen) artificial intelligence (AI), and propose a workable path to improvement. The integration of genomics and phenomics will speed the development of climate resilient crops; however, these omics technologies are generating large, heterogeneous, and complex data much faster than they can currently be analyzed. First-generation AI is being used in surveying and classifying omics data; however, it is designed to solve well-defined tasks of single-omics datasets that do not require integration of data across multiple modalities. Next-generation AI can change the dynamics of how experiments are planned, thus enabling better data integration, analysis, and interpretation. There is a critical need to develop means by which to open the black boxes prevalent in many current AI approaches so that they can be interpreted meaningfully from a complex biological perspective. AI decisions and outputs can be explained by breeders and researchers via human–computer interaction.

Journal Article

Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

by Wang, Weiwen; Lanfear, Robert; Kainer, David in Accuracy, Acids, Algae

2018

Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet, long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long- (Oxford Nanopore) and short- (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly. Results: Hybrid assemblies combining at least 20x coverage of both long-reads and short-reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single copy, short single copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function. Conclusions: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using at least 20x coverage each of long- and short-reads, provided that the long-reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.

Journal Article

Dysregulation of heterochromatin caused by genomic structural variants may be central to autism spectrum disorder

by Garvin, Michael R.; Kainer, David in autism, chromatin, heterochromatin

2025

Autism spectrum disorder (ASD) is a highly heritable and heterogeneous neuropsychiatric condition whose cause is still unknown. A common function of proteins encoded by reported risk-genes for ASD is chromatin modification, but how this biological process relates to neurodevelopment and autism is unknown. We recently reported frequent genomic variants displaying Non-Mendelian inheritance (NMI) patterns in family trios in two cohorts of individuals with autism. These loci represent putative structural variants (SV) and the genes that carry them participate in neurodevelopment, glutamate signaling, and chromatin modification, confirming previous reports and providing greater detail for involvement of these processes in ASD. The majority of these loci were found in non-coding regions of the genome and were enriched for expression quantitative trait loci suggesting that gene dysregulation results from these genomic disruptions rather than alteration of proteins. Here, we intersected these putative ASD-associated SVs from our earlier work with diverse genome-wide gene regulatory and epigenetic multi-omic layers to identify statistically significant enrichments to understand how they may function to produce autism. We find that these loci are enriched in dense heterochromatin and in transcription factor binding sites for SATB1, SRSF9, and NUP98-HOXA9. A model based on our results indicates that the core of ASD may reside in the dysregulation of a process analogous to RNA-induced Initiation of Transcriptional gene silencing that is meant to maintain heterochromatin. This produces SVs in the genes within these chromosomal regions, which also happen to be enriched for those involved in brain development and immune response. This study mechanistically links previously reported ASD-risk genes involved in chromatin remodeling with neurodevelopment and may explain the role of mutations in ASD. 
Our results suggest that a large portion of the heritable component of autism is the result of changes in genes that control critical epigenetic processes.

Journal Article

Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal and explainable AI models

by Pavicic, Mirko; Shah, Manesh B.; Cliff, Ashley in Adaptation, Biological, Adaptive mutation, algorithms

2020

Background: A mechanistic understanding of the spread of SARS-CoV-2 and diligent tracking of ongoing mutagenesis are of key importance to plan robust strategies for confining its transmission. Large numbers of available sequences and their dates of transmission provide an unprecedented opportunity to analyze evolutionary adaptation in novel ways. Addition of high-resolution structural information can reveal the functional basis of these processes at the molecular level. Integrated systems biology-directed analyses of these data layers afford valuable insights to build a global understanding of the COVID-19 pandemic. Results: Here we identify globally distributed haplotypes from 15,789 SARS-CoV-2 genomes and model their success based on their duration, dispersal, and frequency in the host population. Our models identify mutations that are likely compensatory adaptive changes that allowed for rapid expansion of the virus. Functional predictions from structural analyses indicate that, contrary to previous reports, the Asp 614 Gly mutation in the spike glycoprotein (S) likely reduced transmission and the subsequent Pro 323 Leu mutation in the RNA-dependent RNA polymerase led to the precipitous spread of the virus. Our model also suggests that two mutations in the nsp13 helicase allowed for the adaptation of the virus to the Pacific Northwest of the USA. Finally, our explainable artificial intelligence algorithm identified a mutational hotspot in the sequence of S that also displays a signature of positive selection and may have implications for tissue or cell-specific expression of the virus. Conclusions: These results provide valuable insights for the development of drugs and surveillance strategies to combat the current and future pandemics.

Journal Article

A phylogenomic approach reveals a low somatic mutation rate in a long-lived plant

by Hsieh, Ji-Fan; Bromham, Lindell; Cartwright, Reed A. in Biochemistry, Molecular Biology, Evolution, Genomics

2020

Somatic mutations can have important effects on the life history, ecology, and evolution of plants, but the rate at which they accumulate is poorly understood and difficult to measure directly. Here, we develop a method to measure somatic mutations in individual plants and use it to estimate the somatic mutation rate in a large, long-lived, phenotypically mosaic Eucalyptus melliodora tree. Despite being 100 times larger than Arabidopsis, this tree has a per-generation mutation rate only ten times greater, which suggests that this species may have evolved mechanisms to reduce the mutation rate per unit of growth. This adds to a growing body of evidence that illuminates the correlated evolutionary shifts in mutation rate and life history in plants.

Journal Article

Genome-Wide Association Study of Wood Anatomical and Morphological Traits in Populus trichocarpa

by Furches, Anna; Tschaplinski, Timothy J.; Kainer, David in BASIC BIOLOGICAL SCIENCES, Biodiesel fuels, Biofuels

2020

To understand the genetic mechanisms underlying wood anatomical and morphological traits in Populus trichocarpa, we used 869 unrelated genotypes from a common garden in Clatskanie, Oregon that were previously collected from across the distribution range in western North America. Using GEMMA mixed model analysis, we tested for the association of 25 phenotypic traits and nine multitrait combinations with 6.741 million SNPs covering the entire genome. Broad-sense trait heritabilities ranged from 0.117 to 0.477. Most traits were significantly correlated with geoclimatic variables, suggesting a role of climate and geography in shaping the variation of this species. Fifty-seven SNPs from single trait GWAS and 11 SNPs from multitrait GWAS passed an FDR threshold of 0.05, leading to the identification of eight and seven nearby candidate genes, respectively. The percentage of phenotypic variance explained (PVE) by the significant SNPs for both single and multitrait GWAS ranged from 0.01% to 6.18%. To further evaluate the potential roles of candidate genes, we used a multi-omic network containing five additional data sets, including leaf and wood metabolite GWAS layers and coexpression and comethylation networks. We also performed a functional enrichment analysis on coexpression nearest neighbors for each gene model identified by the wood anatomical and morphological trait GWAS analyses. Genes affecting cell wall composition and transport related genes were enriched in wood anatomy and stomatal density trait networks. Signaling and metabolism related genes were also common in networks for stomatal density. For leaf morphology traits (leaf dry and wet weight) the networks were significantly enriched for GO terms related to photosynthetic processes as well as cellular homeostasis.
The identified genes provide further insights into the genetic control of these traits, which are important determinants of the suitability and sustainability of improved genotypes for lignocellulosic biofuel production.
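For readers unfamiliar with FDR thresholds, the sketch below shows one common way to apply a 0.05 cutoff to a list of association p-values. The abstract does not say which FDR procedure was used; the Benjamini-Hochberg step-up procedure is assumed here, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of tests passing the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for ten SNP-trait tests (not data from the study)
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, alpha=0.05)
print(significant)
```

Note that several of the toy p-values are below 0.05 yet fail the rank-scaled threshold, which is the point of controlling the false discovery rate rather than testing each SNP in isolation.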

Journal Article

The effectiveness of large language models with RAG for auto-annotating trait and phenotype descriptions

by Kainer, David in Artificial Intelligence in Biology and Bioinformatics, Embedding, Large language models

2025

Ontologies are highly prevalent in biology and medicine and are always evolving. Annotating biological text, such as observed phenotype descriptions, with ontology terms is a challenging and tedious task. The process of annotation requires a contextual understanding of the input text and of the ontological terms available. While text-mining tools are available to assist, they are largely based on directly matching words and phrases and so lack understanding of the meaning of the query item and of the ontology term labels. Large Language Models (LLMs), however, excel at tasks that require semantic understanding of input text and therefore may provide an improvement for the auto-annotation of text with ontological terms. Here we describe a series of workflows incorporating OpenAI GPT’s capabilities to annotate Arabidopsis thaliana and forest tree phenotypic observations with ontology terms, aiming for results that resemble manually curated annotations. These workflows make use of an LLM to intelligently parse phenotypes into short concepts, followed by finding appropriate ontology terms via embedding vector similarity or via Retrieval-Augmented Generation (RAG). RAG is a state-of-the-art approach that augments conversational prompts to the LLM with context-specific data to empower it beyond its pre-trained parameter space. We show that RAG produces the most accurate automated annotations, which are often highly similar or identical to expert-curated annotations.
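The embedding-similarity step mentioned above can be pictured as ranking candidate ontology term labels by cosine similarity to a parsed concept. The sketch below is a minimal illustration under stated assumptions: the term labels are made up, the 3-dimensional vectors are toy values standing in for real embeddings from an embedding model, and cosine similarity is assumed as the metric (the abstract does not name one).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_terms(concept_vec, term_vecs, top_k=2):
    """Return the top_k term labels most similar to the concept vector."""
    scored = [
        (cosine_similarity(concept_vec, vec), label)
        for label, vec in term_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:top_k]]

# Hypothetical term labels with toy embedding vectors
term_vecs = {
    "leaf size": [0.9, 0.1, 0.0],
    "flowering time": [0.1, 0.9, 0.1],
    "root length": [0.0, 0.2, 0.9],
}
concept_vec = [0.8, 0.2, 0.1]  # toy embedding of a parsed phenotype concept

top_terms = rank_terms(concept_vec, term_vecs)
print(top_terms)
```

In a RAG-style variant, the top-ranked labels would be passed to the LLM as context rather than accepted directly as the annotation.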

Journal Article

Accuracy of Genomic Prediction for Foliar Terpene Traits in Eucalyptus polybractea

by Padovan, Amanda; Foley, William J; Stone, Eric A in Agricultural production

2018

Unlike agricultural crops, most forest species have not had millennia of improvement through phenotypic selection, but can contribute energy and material resources and possibly help alleviate climate change. Yield gains similar to those achieved in agricultural crops over millennia could be made in forestry species with the use of genomic methods in a much shorter time frame. Here we compare various methods of genomic prediction for eight traits related to foliar terpene yield in Eucalyptus polybractea, a tree grown predominantly for the production of eucalyptus oil. The genomic markers used in this study are derived from shallow whole genome sequencing of a population of 480 trees. We compare the traditional pedigree-based additive best linear unbiased predictors (ABLUP), genomic BLUP (GBLUP), the BayesB genomic prediction model, and a form of GBLUP based on weighting markers according to their influence on traits (BLUP|GA). Predictive ability is assessed under varying marker densities of 10,000, 100,000 and 500,000 SNPs. Our results show that BayesB and BLUP|GA perform best across the eight traits. Predictive ability was higher for individual terpene traits, such as foliar α-pinene and 1,8-cineole concentration (0.59 and 0.73, respectively), than for aggregate traits such as total foliar oil concentration (0.38). This is likely a function of the trait architecture and markers used. BLUP|GA was the best model for the two biomass-related traits, height and 1-year change in height (0.25 and 0.19, respectively). Predictive ability increased with marker density for most traits, but with diminishing returns. The results of this study are a solid foundation for yield improvement of essential oil producing eucalypts. New markets such as biopolymers and terpene-derived biofuels could benefit from rapid yield increases in undomesticated oil-producing species.

Journal Article
