Catalogue Search | MBRL

by Stanke, Mario , Brůna, Tomáš , Gabriel, Lars in Accuracy , Algorithms , Analysis

2021

Background BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. Results We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

Journal Article

Galba: genome annotation with miniprot and AUGUSTUS

by Stanke, Mario , Nenasheva, Natalia , Brůna, Tomáš in Accuracy , Acute coronary syndrome , Algorithms

2023

Background The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Results Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Conclusions Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Journal Article

Global, highly specific and fast filtering of alignment seeds

by Migliorelli, Giovanna , Stanke, Mario , Ebel, Matthis in Accuracy , Algorithms , Alignment

2022

Background An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. Conclusions An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
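
The spaced seed idea mentioned in the abstract above can be illustrated with a minimal sketch. It covers only the plain spaced-seed step (exact matches required at selected positions of a window), not the geometric hashing filter the article introduces. The function names, the pattern notation ('1' for a position that must match, '0' for a position that is ignored), and the example sequences are assumptions made for illustration and are not taken from the article.

```python
from collections import defaultdict


def spaced_seed_key(seq, pos, pattern):
    """Collect the characters of seq at the '1' positions of the window starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


def find_seeds(a, b, pattern):
    """Return position pairs (i, j) where a and b agree at every '1' position of the pattern."""
    span = len(pattern)
    # Index every window of the first sequence by its spaced key.
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[spaced_seed_key(a, i, pattern)].append(i)
    # Look up every window of the second sequence in that index.
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(spaced_seed_key(b, j, pattern), []):
            hits.append((i, j))
    return hits


# The two example sequences differ at index 2, which the pattern "1101" ignores.
spaced_hits = find_seeds("ACGTACGT", "ACCTACGT", "1101")
contiguous_hits = find_seeds("ACGTACGT", "ACCTACGT", "1111")
```

In this toy input the pair (0, 0) is reported under the spaced pattern but not under the contiguous one, because the mismatch falls on an ignored position.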

Journal Article

Wild tobacco genomes reveal the evolution of nicotine biosynthesis

by Gase, Klaus , Gaquerel, Emmanuel , Lyons, Eric in Alkaloids , Alkaloids - biosynthesis , Base Sequence

2017

Nicotine, the signature alkaloid of Nicotiana species responsible for the addictive properties of human tobacco smoking, functions as a defensive neurotoxin against attacking herbivores. However, the evolution of the genetic features that contributed to the assembly of the nicotine biosynthetic pathway remains unknown. We sequenced and assembled genomes of two wild tobaccos, Nicotiana attenuata (2.5 Gb) and Nicotiana obtusifolia (1.5 Gb), two ecological models for investigating adaptive traits in nature. We show that after the Solanaceae whole-genome triplication event, a repertoire of rapidly expanding transposable elements (TEs) bloated these Nicotiana genomes, promoted expression divergences among duplicated genes, and contributed to the evolution of herbivory-induced signaling and defenses, including nicotine biosynthesis. The biosynthetic machinery that allows for nicotine synthesis in the roots evolved from the stepwise duplications of two ancient primary metabolic pathways: the polyamine and nicotinamide adenine dinucleotide (NAD) pathways. In contrast to the duplication of the polyamine pathway that is shared among several solanaceous genera producing polyamine-derived tropane alkaloids, we found that lineage-specific duplications within the NAD pathway and the evolution of root-specific expression of the duplicated Solanaceae-specific ethylene response factor that activates the expression of all nicotine biosynthetic genes resulted in the innovative and efficient production of nicotine in the genus Nicotiana. Transcription factor binding motifs derived from TEs may have contributed to the coexpression of nicotine biosynthetic pathway genes and coordinated the metabolic flux. Together, these results provide evidence that TEs and gene duplications facilitated the emergence of a key metabolic innovation relevant to plant fitness.

Journal Article

Gene Transfer from Bacteria and Archaea Facilitated Evolution of an Extremophilic Eukaryote

by Baker, Brett J. , Carr, Kevin , Ternes, Chad M. in Adaptation, Physiological - genetics , Adenosine triphosphatases , Adenosine Triphosphatases - genetics

2013

Some microbial eukaryotes, such as the extremophilic red alga Galdieria sulphuraria, live in hot, toxic metal-rich, acidic environments. To elucidate the underlying molecular mechanisms of adaptation, we sequenced the 13.7-megabase genome of G. sulphuraria. This alga shows an enormous metabolic flexibility, growing either photoautotrophically or heterotrophically on more than 50 carbon sources. Environmental adaptation seems to have been facilitated by horizontal gene transfer from various bacteria and archaea, often followed by gene family expansion. At least 5% of protein-coding genes of G. sulphuraria were probably acquired horizontally. These proteins are involved in ecologically important processes ranging from heavy-metal detoxification to glycerol uptake and metabolism. Thus, our findings show that a pan-domain gene pool has facilitated environmental adaptation in this unicellular eukaryote.

Journal Article

Enhanced genome assembly and a new official gene set for Tribolium castaneum

by Damm, Carsten , Ulrich, Julia , Vargas Jentzsch, Iris M. in Alternative splicing , Animal Genetics and Genomics , Annotations

2020

Background The red flour beetle Tribolium castaneum has emerged as an important model organism for the study of gene function in development and physiology, for ecological and evolutionary genomics, for pest control and a plethora of other topics. RNA interference (RNAi), transgenesis and genome editing are well established and the resources for genome-wide RNAi screening have become available in this model. All these techniques depend on a high quality genome assembly and precise gene models. However, the first version of the genome assembly was generated by Sanger sequencing, and with a small set of RNA sequence data limiting annotation quality. Results Here, we present an improved genome assembly (Tcas5.2) and an enhanced genome annotation resulting in a new official gene set (OGS3) for Tribolium castaneum, which significantly increase the quality of the genomic resources. By adding large-distance jumping library DNA sequencing to join scaffolds and fill small gaps, the gaps in the genome assembly were reduced and the N50 increased to 4753 kbp. The precision of the gene models was enhanced by the use of a large body of RNA-Seq reads of different life history stages and tissue types, leading to the discovery of 1452 novel gene sequences. We also added new features such as alternative splicing, well defined UTRs and microRNA target predictions. For quality control, 399 gene models were evaluated by manual inspection. The current gene set was submitted to Genbank and accepted as a RefSeq genome by NCBI. Conclusions The new genome assembly (Tcas5.2) and the official gene set (OGS3) provide enhanced genomic resources for genetic work in Tribolium castaneum. The much improved information on transcription start sites supports transgenic and gene editing approaches. Further, novel types of information such as splice variants and microRNA target genes open additional possibilities for analysis.

Journal Article

Discovery and revision of Arabidopsis genes by proteogenomics

by Stanke, Mario , Castellana, Natalie E , Shen, Zhouxin in Amino acid sequence , amino acid sequences , Amino acids

2008

Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.

Journal Article

VARUS: sampling complementary RNA reads from the sequence read archive

by Stanke, Mario , Bruhn, Willy , Becker, Felix in Algorithms , Animals , Antisense RNA

2019

Background Vast amounts of next generation sequencing RNA data has been deposited in archives, accompanying very diverse original studies. The data is readily available also for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of available experiments, sequencing runs and reads for this purpose is a nontrivial task and complicated by the inhomogeneity of the data. Results This article presents the software VARUS that selects, downloads and aligns reads from NCBI’s Sequence Read Archive, given only the species’ binomial name and genome. VARUS automatically chooses runs from among all archived runs to randomly select subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species VARUS achieved both a higher sensitivity and specificity with a lower number of downloaded reads than when runs were manually selected. At the example of twelve eukaryotic genomes, we show that RNA-Seq that was sampled with VARUS is well-suited for fully-automatic genome annotation with BRAKER. Conclusions With VARUS, genome annotation can be automatized to the extent that not even the selection and quality control of RNA-Seq has to be done manually. This introduces the possibility to have fully automatized genome annotation loops over potentially many species without incurring a loss of accuracy over a manually supervised annotation process.

Journal Article

Verticillium transcription activator of adhesion Vta2 suppresses microsclerotia formation and is required for systemic infection of plant roots

by Kusch, Harald , Braus, Gerhard H , Tech, Maike in Adhesins , Adhesion , Appressoria

2014

Six transcription regulatory genes of the Verticillium plant pathogen, which reprogrammed nonadherent budding yeasts for adhesion, were isolated by a genetic screen to identify control elements for early plant infection. Verticillium transcription activator of adhesion Vta2 is highly conserved in filamentous fungi but not present in yeasts. The Magnaporthe grisea ortholog conidiation regulator Con7 controls the formation of appressoria which are absent in Verticillium species. Vta2 was analyzed by using genetics, cell biology, transcriptomics, secretome proteomics and plant pathogenicity assays. Nuclear Vta2 activates the expression of the adhesin‐encoding yeast flocculin genes FLO1 and FLO11. Vta2 is required for fungal growth of Verticillium where it is a positive regulator of conidiation. Vta2 is mandatory for accurate timing and suppression of microsclerotia as resting structures. Vta2 controls expression of 270 transcripts, including 10 putative genes for adhesins and 57 for secreted proteins. Vta2 controls the level of 125 secreted proteins, including putative adhesins or effector molecules and a secreted catalase‐peroxidase. Vta2 is a major regulator of fungal pathogenesis, and controls host‐plant root infection and H₂O₂ detoxification. Verticillium impaired in Vta2 is unable to colonize plants and induce disease symptoms. Vta2 represents an interesting target for controlling the growth and development of these vascular pathogens.

Journal Article

The role of recombination in the emergence of a complex and dynamic HIV epidemic

by Foley, Brian , Stanke, Mario , Bulla, Ingo in Analysis , Antibodies , BASIC BIOLOGICAL SCIENCES

2010

Background Inter-subtype recombinants dominate the HIV epidemics in three geographical regions. To better understand the role of HIV recombinants in shaping the current HIV epidemic, we here present the results of a large-scale subtyping analysis of 9435 HIV-1 sequences that involve subtypes A, B, C, G, F and the epidemiologically important recombinants derived from three continents. Results The circulating recombinant form CRF02_AG, common in West Central Africa, appears to result from recombination events that occurred early in the divergence between subtypes A and G, followed by additional recent recombination events that contribute to the breakpoint pattern defining the current recombinant lineage. This finding also corrects a recent claim that G is a recombinant and a descendant of CRF02, which was suggested to be a pure subtype. The BC and BF recombinants in China and South America, respectively, are derived from recent recombination between contemporary parental lineages. Shared breakpoints in South America BF recombinants indicate that the HIV-1 epidemics in Argentina and Brazil are not independent. Therefore, the contemporary HIV-1 epidemic has recombinant lineages of both ancient and more recent origins. Conclusions Taken together, we show that these recombinant lineages, which are highly prevalent in the current HIV epidemic, are a mixture of ancient and recent recombination. The HIV pandemic is moving towards having increasing complexity and higher prevalence of recombinant forms, sometimes existing as "families" of related forms. We find that the classification of some CRF designations need to be revised as a consequence of (1) an estimated > 5% error in the original subtype assignments deposited in the Los Alamos sequence database; (2) an increasing number of CRFs are defined while they do not readily fit into groupings for molecular epidemiology and vaccine design; and (3) a dynamic HIV epidemic context.

Journal Article
