8,367 result(s) for "Molecular Sequence Annotation"
The status of the human gene catalogue
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings. Although the catalogue of human protein-coding genes is nearing completion, the number of non-coding RNA genes remains highly uncertain, and for all genes much work remains to be done to understand their functions.
Detecting circular RNAs: bioinformatic and experimental challenges
Key Points: In 2012, genome-wide statistical analysis of splicing led to the discovery of the global expression of circular RNA (circRNA) in eukaryotes and found that, in hundreds of human genes, circRNA constitutes the major isoform. circRNA expression was previously overlooked owing to a combination of biases in library preparation and heuristic filters imposed by algorithms to detect unannotated splicing events. Assigning reads to the correct splice junction is complicated by experimental artefacts, sequence homology and degenerate sequences at exon boundaries. Even accurate assignment to annotated splice junctions, a seemingly straightforward task compared with identifying unannotated splice events, has not been solved. Common RNA sequencing (RNA-seq) protocols introduce technical artefacts that can appear to be putative novel splice events, including circRNA. Statistical approaches can be used to test for these artefacts to avoid high false-positive rates, without the reduced sensitivity that comes with applying stringent bioinformatic filters. Read count is an unreliable metric when assessing whether a splice junction is truly expressed. Statistical approaches that reduce reliance on read count have improved the accuracy of novel linear splice detection, enabled the discovery of circRNAs spliced by the U12 (minor) spliceosome, and reduced false-positive circRNA owing to highly expressed homologous genes. There is little overlap in the predictions between published circRNA detection algorithms, and the field lacks a clear gold standard for assessing the accuracy of their genome-wide predictions. RNase R resistance is useful for validating a predicted circRNA, but more work is needed on normalization and appropriate enrichment tests for RNase R to be useful for assessing genome-wide accuracy. The ubiquitous expression of circRNA, as well as high circRNA expression from specific genes, is conserved across highly diverged eukaryotes. Conservation, as well as evidence of tissue- or development-specific regulation, provides circumstantial evidence that circRNAs are functional, although the function of most remains unknown.
Circular RNAs (circRNAs) are pervasively expressed in eukaryotic genomes, representing the major transcript isoform for many genes. In this article, the authors review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches used by published algorithms to address these biases.
The pervasive expression of circular RNAs (circRNAs) is a recently discovered feature of gene expression in highly diverged eukaryotes. Numerous algorithms that are used to detect genome-wide circRNA expression from RNA sequencing (RNA-seq) data have been developed in the past few years, but there is little overlap in their predictions and no clear gold-standard method to assess the accuracy of these algorithms. We review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches to address these biases. We conclude with a discussion of the current experimental progress on the topic.
The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens
Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Next-generation transcriptome assembly
Key Points: The protocols used for library construction, sequencing and data pre-processing can have a great impact on the quality of an assembled transcriptome and the accuracy of gene expression quantification. Before starting an RNA sequencing (RNA-seq) experiment, one should carefully consider using protocols that are strand-specific, that remove ribosomal RNA and that do not require PCR amplification of the template. Strand-specific RNA-seq protocols are important for correctly assembling overlapping transcripts, especially for compact genomes. The reference-based, or ab initio, assembly strategy requires a reference genome and uses far fewer computing resources than the de novo strategy. However, the quality of the genome and the ability of the short-read aligner to align reads across introns will directly influence the accuracy of the assembled transcripts when using the reference-based strategy. The de novo assembly strategy does not use a reference genome but instead uses a De Bruijn graph to represent overlaps between sequences and assemble transcripts. Most de novo approaches require significant computing resources: random access memory (RAM) is the typical limitation. However, de novo assemblers can assemble trans-spliced genes and novel transcripts that are not present in the genome assembly. To take full advantage of the current assembly strategies, a combined assembly approach should be considered that leverages the strengths of reference-based and de novo assembly strategies. Most transcriptome assemblers are still being developed, and the results from these programs should be evaluated using unbiased quantitative metrics. Transcriptome assembly involves an informatics approach to solve an experimental limitation. As sequencing strategies continually improve, it may no longer be necessary in the near future to assemble transcriptomes, as the read length will be longer than any individual transcript.
Advances in sequencing technologies, assembly algorithms and computing power are making it feasible to assemble the entire transcriptome from short RNA reads. The article reviews the transcriptome assembly strategies, their advantages and limitations and how to apply them effectively. Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches — reference-based, de novo and combined strategies — along with some perspectives on transcriptome assembly in the near future.
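The De Bruijn graph idea mentioned in the entry above can be illustrated with a toy sketch. This is not taken from the reviewed article or from any particular assembler: the function names `build_de_bruijn` and `walk_unbranched`, the example reads and the choice of k = 4 are all assumptions made for illustration. Each distinct k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is spelled out by following nodes that have exactly one successor.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it.

    One edge is created per distinct k-mer observed in the reads.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Spell a sequence by following successors while the path does not branch.

    Simplification: only out-degree is checked, and the walk stops if a
    node repeats, so cycles cannot loop forever.
    """
    seq = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in seen:
            break
        seen.add(node)
        seq += node[-1]
    return seq


# Hypothetical overlapping short reads drawn from one toy transcript.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

Real transcriptome assemblers additionally handle sequencing errors, coverage, branching caused by alternative isoforms, and both strands; storing the k-mer graph for billions of reads is what drives the memory requirements noted above.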
A promoter-level mammalian expression atlas
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly ‘housekeeping’, whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.
A study from the FANTOM consortium using single-molecule cDNA sequencing of transcription start sites and their usage in human and mouse primary cells, cell lines and tissues reveals insights into the specificity and diversity of transcription patterns across different mammalian cell types.
Mapping the human transcription
FANTOM5 (standing for functional annotation of the mammalian genome 5) is the fifth major stage of a major international collaboration that aims to dissect the transcriptional regulatory networks that define every human cell type. Two Articles in this issue of Nature present some of the project's latest results.
The first paper uses the FANTOM5 panel of tissue and primary cell samples to define an atlas of active, in vivo bidirectionally transcribed enhancers across the human body. These authors show that bidirectional capped RNAs are a signature feature of active enhancers and identify more than 40,000 enhancer candidates from over 800 human cell and tissue samples. The enhancer atlas is used to compare regulatory programs between different cell types and identify disease-associated regulatory SNPs, and will be a resource for studies on cell-type-specific enhancers. In the second paper, single-molecule sequencing is used to map human and mouse transcription start sites and their usage in a panel of distinct human and mouse primary cells, cell lines and tissues to produce the most comprehensive mammalian gene expression atlas to date. The data provide a plethora of insights into open reading frames and promoters across different cell types in addition to valuable annotation of mammalian cell-type-specific transcriptomes.
Twelve quick steps for genome assembly and annotation in the classroom
Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.
Genome Sequence of the Tsetse Fly (Glossina morsitans): Vector of African Trypanosomiasis
Tsetse flies are the sole vectors of human African trypanosomiasis throughout sub-Saharan Africa. Both sexes of adult tsetse feed exclusively on blood and contribute to disease transmission. Notable differences between tsetse and other disease vectors include obligate microbial symbioses, viviparous reproduction, and lactation. Here, we describe the sequence and annotation of the 366-megabase Glossina morsitans morsitans genome. Analysis of the genome and the 12,308 predicted protein-encoding genes led to multiple discoveries, including chromosomal integrations of bacterial (Wolbachia) genome sequences, a family of lactation-specific proteins, reduced complement of host pathogen recognition proteins, and reduced olfaction/chemosensory associated genes. These genome data provide a foundation for research into trypanosomiasis prevention and yield important insights with broad implications for multiple aspects of tsetse biology.
Haplotype phasing: existing methods and new developments
Key Points: Haplotype phase may be generated through either computational or experimental methods. Computational phasing is simple and inexpensive and results in good accuracy for common variants over small regions. Computational phasing of closely related individuals (such as parent–offspring trios) results in high accuracy at a high proportion of sites because of the additional information provided by Mendelian constraints. Although specialized software for analysing complex relationships is somewhat limited, good results can be obtained by treating the related individuals as if they were unrelated when performing computational phasing. A new development in computational phasing of unrelated individuals is the detection and use of segments of identity-by-descent that arise from distant relationships. In their current form, these methods are only suitable for small, isolated populations, but improvements in algorithms may lead to applicability to large samples from outbred populations. Experimental phasing has a very high accuracy at a high proportion of sites and can phase de novo or very rare variants without the need to obtain data from closely related individuals. Experimental phasing currently adds substantially to the cost of generating the genotype or sequence data (at least doubling the cost) and requires technical expertise, additional preparation time and, in some cases, specialized equipment.
The authors review the experimental and computational approaches for determining haplotype phase, focusing on statistical methods, the factors that influence the strategy used and the value of using information on identity-by-descent.
Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational phasing.
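The role of Mendelian constraints in trio phasing, mentioned in the entry above, can be shown with a minimal sketch. It is an illustration under simplifying assumptions, not a method from the reviewed article: sites are biallelic with alleles coded 0 and 1, genotypes are unordered pairs, each site is treated independently, and the function names `phase_child_site` and `phase_trio` are hypothetical. A site is phased only when exactly one assignment of a paternal and a maternal allele reproduces the child's genotype.

```python
def phase_child_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for the child at one site.

    Returns None when the transmission is ambiguous (for example, all
    three individuals heterozygous) or when the genotypes are not
    consistent with Mendelian inheritance.
    """
    options = {
        (p, m)
        for p in set(father)
        for m in set(mother)
        if sorted((p, m)) == sorted(child)
    }
    if len(options) == 1:
        return options.pop()
    return None


def phase_trio(child_gts, father_gts, mother_gts):
    """Phase each site independently using the parents' genotypes."""
    return [
        phase_child_site(c, f, m)
        for c, f, m in zip(child_gts, father_gts, mother_gts)
    ]


# Hypothetical genotypes at four sites.
child = [(0, 1), (0, 1), (1, 1), (0, 1)]
father = [(0, 0), (0, 1), (0, 1), (1, 1)]
mother = [(0, 1), (0, 1), (1, 1), (0, 0)]
phased = phase_trio(child, father, mother)
print(phased)  # [(0, 1), None, (1, 1), (1, 0)]
```

The second site stays unphased because all three individuals are heterozygous there; sites like this are where statistical methods or experimental phasing are still needed.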
Comparative genomics and community curation further improve gene annotations in the nematode Pristionchus pacificus
Background: Nematode model organisms such as Caenorhabditis elegans and Pristionchus pacificus are powerful systems for studying the evolution of gene function at a mechanistic level. However, the identification of P. pacificus orthologs of candidate genes known from C. elegans is complicated by the discrepancy in the quality of gene annotations, a common problem in nematode and invertebrate genomics. Results: Here, we combine comparative genomic screens for suspicious gene models with community-based curation to further improve the quality of gene annotations in P. pacificus. We extend previous curations of one-to-one orthologs to larger gene families and also orphan genes. Cross-species comparisons of protein lengths, screens for atypical domain combinations and species-specific orphan genes resulted in 4311 candidate genes that were subject to community-based curation. Corrections for 2946 gene models were implemented in a new version of the P. pacificus gene annotations. The new set of gene annotations contains 28,896 genes and has a single-copy ortholog completeness level of 97.6%. Conclusions: Our work demonstrates the effectiveness of comparative genomic screens to identify suspicious gene models and the scalability of community-based approaches to improve the quality of thousands of gene models. Similar community-based approaches can help to improve the quality of gene annotations in other invertebrate species, including parasitic nematodes.
SEED Servers: High-Performance Access to the SEED Genomes, Annotations, and Metabolic Models
The remarkable advance in sequencing technology and the rising interest in medical and environmental microbiology, biotechnology, and synthetic biology resulted in a deluge of published microbial genomes. Yet, genome annotation, comparison, and modeling remain a major bottleneck to the translation of sequence information into biological knowledge, hence computational analysis tools are continuously being developed for rapid genome annotation and interpretation. Among the earliest, most comprehensive resources for prokaryotic genome analysis, the SEED project, initiated in 2003 as an integration of genomic data and analysis tools, now contains >5,000 complete genomes, a constantly updated set of curated annotations embodied in a large and growing collection of encoded subsystems, a derived set of protein families, and hundreds of genome-scale metabolic models. Until recently, however, maintaining current copies of the SEED code and data at remote locations has been a pressing issue. To allow high-performance remote access to the SEED database, we developed the SEED Servers (http://www.theseed.org/servers): four network-based servers intended to expose the data in the underlying relational database, support basic annotation services, offer programmatic access to the capabilities of the RAST annotation server, and provide access to a growing collection of metabolic models that support flux balance analysis. The SEED servers offer open access to regularly updated data, the ability to annotate prokaryotic genomes, the ability to create metabolic reconstructions and detailed models of metabolism, and access to hundreds of existing metabolic models. This work offers and supports a framework upon which other groups can build independent research efforts. 
Large integrations of genomic data represent one of the major intellectual resources driving research in biology, and programmatic access to the SEED data will provide significant utility to a broad collection of potential users.