Catalogue Search | MBRL

A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

by Gregg W C Thomas , Allison J Shultz , Erik Enbody in Availability , Comparative analysis , Data visualization

2024

Abstract The increasing availability of genomic resequencing data sets and high-quality reference genomes across the tree of life presents exciting opportunities for comparative population genomic studies. However, substantial challenges prevent the simple reuse of data across different studies and species, arising from variability in variant calling pipelines, data quality, and the need for computationally intensive reanalysis. Here, we present snpArcher, a flexible and highly efficient workflow designed for the analysis of genomic resequencing data in nonmodel organisms. snpArcher provides a standardized variant calling pipeline and includes modules for variant quality control, data visualization, variant filtering, and other downstream analyses. Implemented in Snakemake, snpArcher is user-friendly, reproducible, and designed to be compatible with high-performance computing clusters and cloud environments. To demonstrate the flexibility of this pipeline, we applied snpArcher to 26 public resequencing data sets from nonmammalian vertebrates. These variant data sets are hosted publicly to enable future comparative population genomic analyses. With its extensibility and the availability of public data sets, snpArcher will contribute to a broader understanding of genetic variation across species by facilitating the rapid use and reuse of large genomic data sets.

Journal Article

The core-pivotal index: a geometry-first approach to academic impact

by Corbett-Detig, Russ

2026

Journal Article

A Phylogenetic Method Identifies Candidate Drivers of the Evolution of the SARS-CoV-2 Mutation Spectrum

by Corbett-Detig, Russ in Algorithms , Amino Acid Substitution , Brief Communications

2025

The molecular processes that generate new mutations evolve, but the causal mechanisms are largely unknown. In particular, the relative rates of mutation types (e.g. C > T), the mutation spectrum, sometimes vary among closely related species and populations. I present an algorithm for subdividing a phylogeny into distinct mutation spectra. By applying this approach to a SARS-CoV-2 phylogeny comprising approximately 8 million genome sequences, I identify ten shifts in the mutation spectrum. I find strong enrichment consistent with candidate causal amino-acid substitutions in the SARS-CoV-2 polymerase, and strikingly three appearances of the same homoplasious substitution are each associated with decreased C > T relative mutation rates. With rapidly growing genomic datasets, this approach and future extensions promise new insights into the mechanisms of the evolution of mutational processes. Keywords: Mutation Spectrum; Phylogenetic Analysis; SARS-CoV-2 Evolution

Journal Article

Efficient Estimation of Nucleotide Diversity and Divergence Using Callable Loci (and More)

by Mirchandani, Cade , Corbett-Detig, Russ , Sackton, Timothy B in Animals , Computer applications , Datasets

2025

Abstract The increasing scale of population genomic datasets presents computational challenges in estimating summary statistics such as nucleotide diversity (π) and divergence (dxy). Accurate estimates of diversity require knowledge of missing data, and existing tools require all-site VCFs. However, generating these files is computationally expensive for large datasets. Here, we introduce Callable Loci And More (clam), a tool that leverages callable loci—determined from depth information—to estimate population genetic statistics using a variant-only VCF. This approach offers improvements in storage footprint and computational performance compared to contemporary methods. We validate clam's accuracy using simulated data, demonstrating that it produces estimates of π, dxy, and fixation index (FST) identical to those from all-site VCF approaches. We then benchmark clam using a large muskox dataset and demonstrate that it produces accurate estimates of π while substantially reducing runtime requirements compared to current best-practice methods. clam provides an efficient and scalable alternative for population genomic analyses, facilitating the study of increasingly large and diverse datasets. clam is available as a standalone program and integrated into snpArcher for efficient reproducible population genomic analysis.

Journal Article

A phylogenetic method identifies candidate drivers of the evolution of the SARS-CoV-2 mutation spectrum

by Corbett-Detig, Russ in Bioinformatics

2025

The molecular processes that generate new mutations evolve, but the causal mechanisms are largely unknown. In particular, the relative rates of mutation types (e.g., C>T), the mutation spectrum, sometimes vary among closely related species and populations. I present an algorithm for subdividing a phylogeny into distinct mutation spectra. By applying this approach to a SARS-CoV-2 phylogeny comprising approximately eight million genome sequences, I identify 10 shifts in the mutation spectrum. I find strong enrichment consistent with candidate causal amino-acid substitutions in the SARS-CoV-2 polymerase, and strikingly three appearances of the same homoplasious substitution are each associated with decreased C>T relative mutation rates. With rapidly growing genomic datasets, this approach and future extensions promise new insights into the mechanisms of evolution of mutational processes.

Journal Article

Universal signatures of transposable element compartmentalization across eukaryotic genomes

by Hartl, Daniel L , Gozashti, Landen , Corbett-Detig, Russell in Genomics

2024

The evolutionary mechanisms that drive the emergence of genome architecture remain poorly understood but can now be assessed with unprecedented power due to the massive accumulation of genome assemblies spanning phylogenetic diversity1,2. Transposable elements (TEs) are a rich source of large-effect mutations since they directly and indirectly drive genomic structural variation and changes in gene expression3. Here, we demonstrate universal patterns of TE compartmentalization across eukaryotic genomes spanning ~1.7 billion years of evolution, in which TEs colocalize with gene families under strong predicted selective pressure for dynamic evolution and involved in specific functions. For non-pathogenic species these genes represent families involved in defense, sensory perception and environmental interaction, whereas for pathogenic species, TE-compartmentalized genes are highly enriched for pathogenic functions. Many TE-compartmentalized gene families display signatures of positive selection at the molecular level. Furthermore, TE-compartmentalized genes exhibit an excess of high-frequency alleles for polymorphic TE insertions in fruit fly populations. We postulate that these patterns reflect selection for adaptive TE insertions as well as TE-associated structural variants. This process may drive the emergence of a shared TE-compartmentalized genome architecture across diverse eukaryotic lineages.

Journal Article

SARS-CoV-2 lineage assignments using phylogenetic placement/UShER are superior to pangoLEARN machine-learning method

by Turakhia, Yatish , Hinrichs, Angie S , Scher, Emily in Phylogenetics , Severe acute respiratory syndrome coronavirus 2

2024

Abstract With the rapid spread and evolution of SARS-CoV-2, the ability to monitor its transmission and distinguish among viral lineages is critical for pandemic response efforts. The most commonly used software for the lineage assignment of newly isolated SARS-CoV-2 genomes is pangolin, which offers two methods of assignment, pangoLEARN and pUShER. PangoLEARN rapidly assigns lineages using a machine-learning algorithm, while pUShER performs a phylogenetic placement to identify the lineage corresponding to a newly sequenced genome. In a preliminary study, we observed that pangoLEARN (decision tree model), while substantially faster than pUShER, offered less consistency across different versions of pangolin v3. Here, we expand upon this analysis to include v3 and v4 of pangolin, which moved the default algorithm for lineage assignment from pangoLEARN in v3 to pUShER in v4, and perform a thorough analysis confirming that pUShER is not only more stable across versions but also more accurate. Our findings suggest that future lineage assignment algorithms for various pathogens should consider the value of phylogenetic placement.

Journal Article

Online Phylogenetics using Parsimony Produces Slightly Better Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Approaches

by Turakhia, Yatish , Lanfear, Robert , Thornlow, Bryan in Bioinformatics , Computer applications , Contact tracing

2022

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 10 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, and ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. Since MP is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo, we therefore propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.

Journal Article

Fine-scale position effects shape the distribution of inversion breakpoints in Drosophila melanogaster

by Mcbroome, Jakob , Corbett-Detig, Russ , Liang, David in Breakpoints , Chromatin , Evolutionary Biology

2019

Chromosomal inversions are among the primary drivers of genome structure evolution in a wide range of natural populations. While there is an impressive array of theory and empirical analyses that has identified conditions under which inversions can be positively selected, comparatively little data is available on the fitness impacts of these genome structural rearrangements themselves. Because inversion breakpoints can interrupt functional elements and alter chromatin domains, each rearrangement may in itself have strong effects on fitness. Here, we compared the fine-scale distribution of low frequency inversion breakpoints with those of high frequency inversions and inversions that have fixed between Drosophila species. We identified important differences that may influence inversion fitness. In particular, proximity to insulator elements, large tandem duplications adjacent to the breakpoints, and minimal impacts on gene coding spans are more prevalent in high frequency and fixed inversions than in rare inversions. The data suggest that natural selection acts to preserve both genes and larger cis-regulatory networks in the occurrence and spread of rearrangements. These factors may act to limit the availability of high fitness arrangements when suppressed recombination is favorable.

Paper

Deep data mining reveals variable abundance and distribution of microbial reproductive manipulators within and among diverse host species

by Russell, Shelbi , Corbett-Detig, Russ , Medina, Paloma in Arthropods , Biological control , Data mining

2019, 2020

Bacterial symbionts that manipulate the reproduction of their hosts to increase their successful transmission are important factors in invertebrate ecology and evolution. In light of their use as a biological control agent, studying the genomic and phenotypic diversity of reproductive manipulators can improve efforts to control infectious diseases and contribute to our understanding of host-symbiont evolution. Despite the vast genomic and phenotypic diversity of reproductive manipulators, only a handful of Wolbachia strains are used as biological control agents because little is known about the broad scale infection frequencies of these bacteria in nature. Here we develop a data mining approach to quantify the number of arthropod and nematode host species available on the Sequence Read Archive (SRA) that are infected with Wolbachia and other reproductive manipulators such as Rickettsia and Spiroplasma. Across the entire database, we found reproductive manipulators infected 1733 arthropod and 103 nematode samples, representing 121 and 10 species, respectively. We estimated that Wolbachia infects approximately 24% of all arthropod species and 20% of all nematode species. In contrast, we estimated other reproductive manipulators infect 0-8% of arthropod and nematode species. We show that relative Wolbachia density within hosts, titer, is significantly lower than the titer of the other reproductive manipulators. Considering the fitness costs of high titers, low titer may contribute to enabling Wolbachia's high prevalence across host species and mitigate impacts on host biology compared with other reproductive manipulator taxa. Our study demonstrates that data mining is a powerful tool for understanding host-symbiont co-evolution and opens an array of previously inaccessible questions for further analysis.

Paper
