Catalogue Search | MBRL

Improved transcriptome assembly using a hybrid of long and short reads with StringTie

by Pertea, Geo , Shumate, Alaina , Pertea, Mihaela in Algorithms , Animals , Biology and Life Sciences

2022

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Journal Article

LiftoffTools: a toolkit for comparing gene annotations mapped between genome assemblies version 2; peer review: 1 approved, 1 approved with reservations

by Shumate, Alaina , Salzberg, Steven in Amino acids , Annotations , Bioinformatics

2022

In 2020 we published Liftoff, which was the first standalone tool specifically designed for transferring gene annotations between genome assemblies of the same or closely related species. While the gene content is expected to be very similar in closely related genomes, the differences may be biologically consequential, and a computational method to extract all gene-related differences should prove useful in the analysis of such genomes. Here we present LiftoffTools, a toolkit to automate the detection and analysis of gene sequence variants, synteny, and gene copy number changes. We provide a description of the toolkit and an example of its use comparing genes mapped between two human genome assemblies.

Journal Article

CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise

by Pertea, Geo , Pandey, Akhilesh , Salzberg, Steven L. in Adenoviruses , Amino Acid Sequence , Animal Genetics and Genomics

2018

We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.

Journal Article

Curated variation benchmarks for challenging medically relevant autosomal genes

by Sedlazeck, Fritz J. , Shumate, Alaina , Harris, Lindsay in 631/114/2416 , 631/114/2785/2302 , 631/208/212/2301

2022

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome. Variant detection in problematic genes is facilitated with a curated benchmark.

Journal Article

Chromosome-Scale Assembly of the Bread Wheat Genome Reveals Thousands of Additional Gene Copies

by Shumate, Alaina , Zimin, Aleksey V , Alonge, Michael in Agricultural research , Annotations , Assembly

2020

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.

Journal Article

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

by Erdogdu, Beril , Chao, Kuan-Hao , Minkin, Ilia in Algorithms , Animal Genetics and Genomics , Annotations

2023

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess.

Journal Article

Assembly and annotation of an Ashkenazi human reference genome

by Salzberg, Steven L. , Wagner, Justin M. , Salit, Marc L. in Animal Genetics and Genomics , Annotations , Bioinformatics

2020

Background: Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases. Results: Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes. Conclusions: The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.

Journal Article

The genome of the American groundhog, Marmota monax version 1; peer review: 2 approved

by Bhaskaran, Manoj , Shumate, Alaina , Salzberg, Steven L in Animals , Annotations , Base Sequence

2020

We sequenced the genome of the North American groundhog, Marmota monax, also known as the woodchuck. Our sequencing strategy included a combination of short, high-quality Illumina reads plus long reads generated by both Pacific Biosciences and Oxford Nanopore instruments. Assembly of the combined data produced a genome of 2.74 Gbp in total length, with an N50 contig size of 1,094,236 bp. To annotate the genome, we mapped the genes from another M. monax genome and from the closely related Alpine marmot, Marmota marmota, onto our assembly, resulting in 20,559 annotated protein-coding genes and 28,135 transcripts. The genome assembly and annotation are available in GenBank under BioProject PRJNA587092.

Journal Article

LiftoffTools: a toolkit for comparing gene annotations mapped between genome assemblies version 1; peer review: 1 approved with reservations

by Shumate, Alaina , Salzberg, Steven

2022

In 2020 we published Liftoff, which was the first standalone tool specifically designed for transferring gene annotations between genome assemblies of the same or closely related species. While the gene content is expected to be very similar in closely related genomes, the differences may be biologically consequential, and a computational method to extract all gene-related differences should prove useful in the analysis of such genomes. Here we present LiftoffTools, a toolkit to automate the detection and analysis of gene sequence variants, synteny, and gene copy number changes. We provide a description of the toolkit and an example of its use comparing genes mapped between two human genome assemblies.

Journal Article

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

by Sović, Ivan , Koren, Sergey , Wood, Jonathan M. D. in 631/114/2785 , 631/1647/794 , 631/208/212/2302

2022

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies. The work describes the validation and polishing strategies developed by the telomere-to-telomere consortium for evaluating and improving the first complete human genome assembly.

Journal Article
