Catalogue Search | MBRL

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

by Çelik, Muhammed Hasan; Kondratova, Liudmyla; Ren, Xingjie in 631/114/2184, 631/1647/2217/2018, 631/1647/514/1949

2024

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis. This Registered Report presents the results of the Long-read RNA-Seq Genome Annotation Assessment Project, which is a community effort for benchmarking long-read methods for transcriptome analyses, including transcript isoform detection, quantification and de novo transcript detection.

Journal Article

Detecting circular RNAs: bioinformatic and experimental challenges

by Szabo, Linda; Salzman, Julia in 631/114, 631/208/1792, 631/337/1645/1946

2016

Key Points: In 2012, genome-wide statistical analysis of splicing led to the discovery of the global expression of circular RNA (circRNA) in eukaryotes and found that, in hundreds of human genes, circRNA constitutes the major isoform. circRNA expression was previously overlooked owing to a combination of biases in library preparation and heuristic filters imposed by algorithms to detect unannotated splicing events. Assigning reads to the correct splice junction is complicated by experimental artefacts, sequence homology and degenerate sequences at exon boundaries. Even accurate assignment to annotated splice junctions, a seemingly straightforward task compared with identifying unannotated splice events, has not been solved. Common RNA sequencing (RNA-seq) protocols introduce technical artefacts that can appear to be putative novel splice events, including circRNA. Statistical approaches can be used to test for these artefacts to avoid high false-positive rates, without the reduced sensitivity that comes with applying stringent bioinformatic filters. Read count is an unreliable metric when assessing whether a splice junction is truly expressed. Statistical approaches that reduce reliance on read count have improved the accuracy of novel linear splice detection, enabled the discovery of circRNAs spliced by the U12 (minor) spliceosome, and reduced false-positive circRNA owing to highly expressed homologous genes. There is little overlap in the predictions between published circRNA detection algorithms, and the field lacks a clear gold standard for assessing the accuracy of their genome-wide predictions. RNase R resistance is useful for validating a predicted circRNA, but more work is needed on normalization and appropriate enrichment tests for RNase R to be useful for assessing genome-wide accuracy. The ubiquitous expression of circRNA, as well as high circRNA expression from specific genes, is conserved across highly diverged eukaryotes. Conservation, as well as evidence of tissue- or development-specific regulation, provides circumstantial evidence that circRNAs are functional, although the function of most remains unknown.

Circular RNAs (circRNAs) are pervasively expressed in eukaryotic genomes, representing the major transcript isoform for many genes. In this article, the authors review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches used by published algorithms to address these biases.

The pervasive expression of circular RNAs (circRNAs) is a recently discovered feature of gene expression in highly diverged eukaryotes. Numerous algorithms that are used to detect genome-wide circRNA expression from RNA sequencing (RNA-seq) data have been developed in the past few years, but there is little overlap in their predictions and no clear gold-standard method to assess the accuracy of these algorithms. We review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches to address these biases. We conclude with a discussion of the current experimental progress on the topic.

Journal Article

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

by Chougule, Kapeel; Agda, Jireh R. A.; Ou, Shujun in Accuracy, Animal Genetics and Genomics, Animals

2019

Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA
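For readers unfamiliar with the metrics named in the abstract above, the following is a minimal sketch of their standard textbook definitions, computed from true/false positive and negative counts. It is not taken from the EDTA code base; the function name, the example counts, and the choice of counting unit (for instance annotated bases or whole elements) are assumptions made for illustration only.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def annotation_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),            # recall
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "fdr": _ratio(fp, tp + fp),                    # 1 - precision
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts, purely for illustration.
metrics = annotation_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that FDR and precision always sum to one, so reporting both is a convenience rather than independent evidence.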

Journal Article

Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis

by Ji, Zhicheng; Hou, Wenpin in 631/114/1305, 631/114/2397, 631/1647/794

2024

Here we demonstrate that the large language model GPT-4 can accurately annotate cell types using marker gene information in single-cell RNA sequencing analysis. When evaluated across hundreds of tissue and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations. This capability can considerably reduce the effort and expertise required for cell type annotation. Additionally, we have developed an R software package GPTCelltype for GPT-4’s automated cell type annotation. This study evaluates the performance of GPT-4 in single-cell type annotation.

Journal Article

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

by Conesa, Ana; Arzalluz-Luque, Angeles; Amorín, Rocío in 631/114/2164, 631/114/794, Annotations

2024

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses. SQANTI3 offers a flexible tool for quality control, curation and annotation of long-read RNA sequencing data.

Journal Article

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

by Perovic, Vladimir R.; Veljkovic, Nevena; Antczak, Magdalena in Animal Genetics and Genomics, Animals, Annotations

2019

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

Journal Article

The Ensembl Variant Effect Predictor

by Ritchie, Graham R. S.; Cunningham, Fiona; Riat, Harpreet Singh in Animal Genetics and Genomics, Annotations, Bioinformatics

2016

The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

Journal Article

Next-generation transcriptome assembly

by Wang, Zhong; Martin, Jeffrey A. in 631/208/212/2019, 631/208/514/1949, 631/208/514/2254

2011

Key Points: The protocols used for library construction, sequencing and data pre-processing can have a great impact on the quality of an assembled transcriptome and the accuracy of gene expression quantification. Before starting an RNA sequencing (RNA-seq) experiment, one should carefully consider using protocols that are strand-specific, that remove ribosomal RNA and that do not require PCR amplification of the template. Strand-specific RNA-seq protocols are important for correctly assembling overlapping transcripts, especially for compact genomes. The reference-based, or ab initio, assembly strategy requires a reference genome and uses far fewer computing resources than the de novo strategy. However, the quality of the genome and the ability of the short-read aligner to align reads across introns will directly influence the accuracy of the assembled transcripts when using the reference-based strategy. The de novo assembly strategy does not use a reference genome but instead uses a De Bruijn graph to represent overlaps between sequences and assemble transcripts. Most de novo approaches require significant computing resources: random access memory (RAM) is the typical limitation. However, de novo assemblers can assemble trans-spliced genes and novel transcripts that are not present in the genome assembly. To take full advantage of the current assembly strategies, a combined assembly approach should be considered that leverages the strengths of reference-based and de novo assembly strategies. Most transcriptome assemblers are still being developed, and the results from these programs should be evaluated using unbiased quantitative metrics. Transcriptome assembly involves an informatics approach to solve an experimental limitation. As sequencing strategies continually improve, it may no longer be necessary in the near future to assemble transcriptomes, as the read length will be longer than any individual transcript.

Advances in sequencing technologies, assembly algorithms and computing power are making it feasible to assemble the entire transcriptome from short RNA reads. The article reviews the transcriptome assembly strategies, their advantages and limitations and how to apply them effectively.

Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches (reference-based, de novo and combined strategies), along with some perspectives on transcriptome assembly in the near future.
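The De Bruijn graph mentioned in the record above can be illustrated with a toy sketch. This is only a simplified teaching example under stated assumptions, not the method of any particular assembler: the function name, the reads and the value of k are all hypothetical, and real tools additionally track k-mer coverage, handle sequencing errors and both strands, and traverse the graph to emit transcripts.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Build a toy De Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer found in a read contributes an
    edge from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Three overlapping toy reads (invented for illustration).
toy_reads = ["ATGGC", "TGGCA", "GGCAT"]
toy_graph = build_de_bruijn_graph(toy_reads, k=3)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Because overlapping reads share k-mers, they collapse onto the same edges, which is how the graph represents overlaps without comparing every pair of reads directly.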

Journal Article

A crowdsourcing open platform for literature curation in UniProt

by Wu, Cathy H.; Wang, Yuqi; Arighi, Cecilia N. in Amino acid sequence, Amino Acid Sequence - genetics, Annotations

2021

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.

Journal Article

Quality assessment of gene repertoire annotations with OMArk

by Glover, Natasha M.; Warwick Vesztrocy, Alex; Nevers, Yannis in 631/114/1767, 631/114/2184, 631/181/735

2025

In the era of biodiversity genomics, it is crucial to ensure that annotations of protein-coding gene repertoires are accurate. State-of-the-art tools to assess genome annotations measure the completeness of a gene repertoire but are blind to other errors, such as gene overprediction or contamination. We introduce OMArk, a software package that relies on fast, alignment-free sequence comparisons between a query proteome and precomputed gene families across the tree of life. OMArk assesses not only the completeness but also the consistency of the gene repertoire as a whole relative to closely related species and reports likely contamination events. Analysis of 1,805 UniProt Eukaryotic Reference Proteomes with OMArk demonstrated strong evidence of contamination in 73 proteomes and identified error propagation in avian gene annotation resulting from the use of a fragmented zebra finch proteome as a reference. This study illustrates the importance of comparing and prioritizing proteomes based on their quality measures. A new tool checks the quality of gene annotations in genome sequences.

Journal Article
