Catalogue Search | MBRL

Structural variant calling: the long and the short of it

by Sedlazeck, Fritz J.; Cruz-Dávalos, Diana Ivette; Mahmoud, Medhat in Animal Genetics and Genomics, Animals, Bioinformatics

2019

Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution—giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.

Journal Article

Inferring Horizontal Gene Transfer

by Dessimoz, Christophe; Ravenhall, Matt; Škunca, Nives in Base Composition, Computational Biology, Computer Simulation

2015

Horizontal or Lateral Gene Transfer (HGT or LGT) is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate the investigations of evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages. Computational identification of HGT events relies upon the investigation of sequence composition or evolutionary history of genes. Sequence composition-based ("parametric") methods search for deviations from the genomic average, whereas evolutionary history-based ("phylogenetic") approaches identify genes whose evolutionary history significantly differs from that of the host species. The evaluation and benchmarking of HGT inference methods typically rely upon simulated genomes, for which the true history is known. On real data, different methods tend to infer different HGT events, and as a result it can be difficult to ascertain all but simple and clear-cut HGT events.
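The composition-based ("parametric") idea in the abstract above, searching for deviations from the genomic average, can be illustrated with a minimal sketch. It is not the method of the cited article: the choice of GC content as the composition signal, the window size, the z-score cutoff, the function names and the toy genome are all assumptions made for illustration.

```python
import statistics


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(genome, window=100, step=100, z_cutoff=2.0):
    """Return (start, gc) for windows whose GC content deviates from the
    average over all windows by more than z_cutoff standard deviations.
    Window size, step and cutoff are illustrative assumptions."""
    starts = range(0, len(genome) - window + 1, step)
    gcs = [gc_fraction(genome[s:s + window]) for s in starts]
    mean = statistics.fmean(gcs)
    sd = statistics.pstdev(gcs)
    if sd == 0:
        return []  # perfectly uniform composition: nothing stands out
    return [(s, gc) for s, gc in zip(starts, gcs)
            if abs(gc - mean) / sd > z_cutoff]


# Hypothetical toy genome: an AT-rich host with a GC-rich segment inserted.
host = "ATGCATAT" * 125      # 1000 bp, about 25% GC
foreign = "GCGCGCAT" * 25    # 200 bp, about 75% GC
genome = host + foreign + host

flagged = atypical_windows(genome)
for start, gc in flagged:
    print(f"window at {start}: GC = {gc:.2f}")
```

On this toy input only the windows covering the inserted segment are flagged. Real genomes are far noisier, and a composition signal alone says nothing about the evolutionary history that phylogenetic approaches examine.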

Journal Article

Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree

by Dylus, David; Sedlazeck, Fritz J.; Altenhoff, Adrian in 631/114/2785, 631/114/739, 631/1647/2217/748

2024

Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale. Phylogenetic trees are generated from sequencing reads without genome assembly or annotation.

Journal Article

Survey of Branch Support Methods Demonstrates Accuracy, Power, and Robustness of Fast Likelihood-based Approximation Schemes

by Dessimoz, Christophe; Gascuel, Olivier; Anisimova, Maria in Accuracy, Amino Acids - genetics, Animals

2011

Phylogenetic inference and evaluating support for inferred relationships is at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch supports (approximate likelihood ratio test [aLRT] and Shimodaira-Hasegawa [SH]-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a Bayesian-like transformation of aLRT (aBayes). Considering both probabilistic and frequentist frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap. Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations aLRT, aBayes and Bayesian posteriors can produce elevated false-positive rates. With data sets for which such violation can be detected, we recommend using SH-aLRT, the nonparametric version of aLRT based on a procedure similar to the Shimodaira-Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based methods.

Journal Article

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

by Bertone, Paul; Birney, Ewan; Chen, Siyuan in 631/114/552, 639/301/54/992, 639/705/258

2013

An efficient and scalable strategy with robust error correction is reported for encoding a record amount of information (including images, text and audio files) in DNA strands; a ‘DNA archive’ has been synthesized, shipped from the USA to Germany, sequenced and the information read. Long-term DNA archives make sense. This multidisciplinary study in synthetic biology both proposes and demonstrates a system for the DNA-based storage of digital information. Digital information is being produced at an ever-growing rate, requiring an increasing commitment to ongoing maintenance of digital media in the archives. Surprisingly, this provides a niche for DNA, which can serve as a dense and stable information-storage medium. Nick Goldman et al. report an efficient and scalable strategy with robust error correction for encoding a record amount of information (including images, text and audio files) in DNA strands. After synthesizing a 'DNA archive' and shipping it from California to Germany, the DNA was sequenced and the information read. At the current rate of DNA synthesis cost reduction, DNA-based information storage is expected to become cost effective within a decade for archives likely to be accessed only rarely, after about 50 years. Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage [1] because of its capacity for high-density information encoding, longevity under easily achieved conditions [2, 3, 4] and proven track record as an information bearer.
Previous DNA-based information storage approaches have encoded only trivial amounts of information [5, 6, 7] or were not amenable to scaling-up [8], and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival [9]. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information [10] of 5.2 × 10^6 bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.

Journal Article

Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast

by Sedlazeck, Fritz J.; Bähler, Jürg; Jeffares, Daniel C. in 38/61, 631/114/2785, 631/208/212

2017

Large structural variations (SVs) within genomes are more challenging to identify than smaller genetic variants but may substantially contribute to phenotypic diversity and evolution. We analyse the effects of SVs on gene expression, quantitative traits and intrinsic reproductive isolation in the yeast Schizosaccharomyces pombe. We establish a high-quality curated catalogue of SVs in the genomes of a worldwide library of S. pombe strains, including duplications, deletions, inversions and translocations. We show that copy number variants (CNVs) show a variety of genetic signals consistent with rapid turnover. These transient CNVs produce stoichiometric effects on gene expression both within and outside the duplicated regions. CNVs make substantial contributions to quantitative traits, most notably intracellular amino acid concentrations, growth under stress and sugar utilization in winemaking, whereas rearrangements are strongly associated with reproductive isolation. Collectively, these findings have broad implications for evolution and for our understanding of quantitative traits including complex human diseases. Fission yeast Schizosaccharomyces pombe has diverse traits. Jeffares et al. characterize large copy number variations (CNVs) and rearrangements in S. pombe, and show that CNVs are transient with effects on quantitative traits and gene expression, whereas rearrangements influence intrinsic reproductive isolation.

Journal Article

Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference

by Ledergerber, Christian; Herrero, Javier; Gil, Manuel in Algorithms, Classification - methods, Comparative analysis

2015

Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

Journal Article

Protein length distribution is remarkably uniform across the tree of life

by Glover, Natasha M.; Dessimoz, Christophe; Nevers, Yannis in Amino Acid Sequence, Animal Genetics and Genomics, Archaea

2023

Background: In every living species, the function of a protein depends on its organization of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species but has so far been scarcely studied. Results: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. Conclusions: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more uniform than previously thought. Furthermore, we also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.

Journal Article

AI and the democratization of knowledge

by Dessimoz, Christophe; Thomas, Paul D. in 631/114, 706/648, 706/648/1496

2024

The solution of the longstanding “protein folding problem” in 2021 showcased the transformative capabilities of AI in advancing the biomedical sciences. AI was characterized as successfully learning from protein structure data, which then spurred a more general call for AI-ready datasets to drive forward medical research. Here, we argue that it is the broad availability of knowledge, not just data, that is required to fuel further advances in AI in the scientific domain. This represents a quantum leap in a trend toward knowledge democratization that had already been developing in the biomedical sciences: knowledge is no longer primarily applied by specialists in a sub-field of biomedicine, but rather multidisciplinary teams, diverse biomedical research programs, and now machine learning. The development and application of explicit knowledge representations underpinning democratization is becoming a core scientific activity, and more investment in this activity is required if we are to achieve the promise of AI.

Journal Article

Approximate Bayesian Computation

by Numminen, Elina; Sunnåker, Mikael; Dessimoz, Christophe in Algorithms, Approximation theory, Bayes Theorem

2013

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics. In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models. For simple models, an analytical formula for the likelihood function can typically be derived. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. ABC methods bypass the evaluation of the likelihood function. In this way, ABC methods widen the realm of models for which statistical inference can be considered. ABC methods are mathematically well-founded, but they inevitably make assumptions and approximations whose impact needs to be carefully assessed. Furthermore, the wider application domain of ABC exacerbates the challenges of parameter estimation and model selection. ABC has rapidly gained popularity over the last years and in particular for the analysis of complex problems arising in biological sciences (e.g., in population genetics, ecology, epidemiology, and systems biology).
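How ABC methods bypass the likelihood function can be seen in the simplest variant, rejection sampling: draw a parameter from the prior, simulate data under it, and keep the draw only if a summary of the simulated data lands close to the same summary of the observed data. The sketch below is an illustration under stated assumptions and is not taken from the cited article: the Gaussian toy model, the uniform prior, the sample mean as summary statistic, the tolerance and all function names are hypothetical choices.

```python
import random
import statistics


def abc_rejection(observed, prior_sampler, simulator, summary,
                  tolerance, n_draws, rng):
    """Basic ABC rejection sampler (illustrative sketch).

    Keeps each prior draw whose simulated summary statistic lies within
    `tolerance` of the observed one; the likelihood is never evaluated.
    """
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)                       # draw from prior
        simulated = simulator(theta, len(observed), rng)  # forward-simulate
        if abs(summary(simulated) - s_obs) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(0)

# Hypothetical toy problem: infer the mean of a unit-variance Gaussian.
observed = [rng.gauss(3.0, 1.0) for _ in range(50)]

posterior_sample = abc_rejection(
    observed,
    prior_sampler=lambda r: r.uniform(-10.0, 10.0),
    simulator=lambda theta, n, r: [r.gauss(theta, 1.0) for _ in range(n)],
    summary=statistics.fmean,
    tolerance=0.2,
    n_draws=5000,
    rng=rng,
)

print(len(posterior_sample), "accepted draws")
print("approximate posterior mean:", statistics.fmean(posterior_sample))
```

The tolerance and the choice of summary statistic are exactly the kind of approximations the abstract says must be carefully assessed: a looser tolerance accepts more draws but yields a cruder approximation of the posterior.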

Journal Article
