Search Results

79 results for "Busby, Ben"
Magic-BLAST, an accurate RNA-seq aligner for long and short reads
Background: Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes, but few programs can align RNA-seq reads to a genome and accurately discover introns, especially from long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. Results: Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate how accurately Magic-BLAST maps short and long sequences and discovers introns on real RNA-seq data sets from PacBio, Roche, and Illumina runs and on six benchmarks, and we compare it to other popular aligners. Additionally, we examine alignments of idealized human RefSeq mRNA sequences that perfectly match the genome. Conclusions: We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile, robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file, accept a FASTQ file as input, or automatically retrieve an accession from the SRA repository at the NCBI.
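
The last point above (FASTQ input, SRA retrieval, alignment against a BLAST database) is easy to script. Below is a minimal sketch that drives Magic-BLAST from Python; it assumes the magicblast and makeblastdb executables are on PATH, and the file names, database name, and the accession SRR0000000 are placeholders. Verify the flag spellings against your installed Magic-BLAST version.

```python
import subprocess

# Build a nucleotide BLAST database from a reference FASTA (standard BLAST+ step).
subprocess.run(
    ["makeblastdb", "-in", "genome.fa", "-dbtype", "nucl", "-out", "genome_db"],
    check=True,
)

# Align local FASTQ reads against the database (Magic-BLAST emits SAM).
subprocess.run(
    ["magicblast", "-query", "reads.fastq", "-infmt", "fastq",
     "-db", "genome_db", "-out", "aligned.sam"],
    check=True,
)

# Or let Magic-BLAST fetch a run directly from SRA (hypothetical accession).
subprocess.run(
    ["magicblast", "-sra", "SRR0000000", "-db", "genome_db", "-out", "sra_aligned.sam"],
    check=True,
)
```
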
Predicting drug-metagenome interactions: Variation in the microbial β-glucuronidase level in the human gut metagenomes
Characterizing the gut microbiota in terms of its capacity to interfere with drug metabolism is necessary to achieve drug efficacy and safety. Although examples of drug-microbiome interactions are well documented, little has been reported about computational pipelines for systematically identifying and characterizing bacterial enzymes that process particular classes of drugs. The goal of our study is to develop a computational approach that compiles drugs whose metabolism may be influenced by a particular class of microbial enzymes and that quantifies the variability in the collective level of those enzymes among individuals. The present paper describes this approach using microbial β-glucuronidases as an example: enzymes that break down drug-glucuronide conjugates and thereby reactivate the drugs or their metabolites. We identified 100 medications that may be metabolized by β-glucuronidases from the gut microbiome, including morphine, estrogen, ibuprofen, midazolam, and their structural analogues. Analysis of metagenomic data available through the Sequence Read Archive (SRA) showed that the level of β-glucuronidase in gut metagenomes was higher in males than in females, providing a potential explanation for the sex-based differences in efficacy and toxicity reported for several drugs in previous studies. Our analysis also showed that infant gut metagenomes at birth and at 12 months of age have higher levels of β-glucuronidase than the metagenomes of their mothers; we discuss the implications of this variability in the context of breastfeeding and infant hyperbilirubinemia. Overall, despite the important limitations discussed in this paper, our analysis provides useful insights into the role of the human gut metagenome in the variability of drug response among individuals. Importantly, this approach exploits drug and metagenome data available in public databases, as well as open-source cheminformatics and bioinformatics tools, to predict drug-metagenome interactions.
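
The aggregation step described above (a collective β-glucuronidase level per metagenome, compared across groups) can be sketched in a few lines. The input schema below, a hypothetical metagenome_hits.csv with one row per gene hit per sample, is an assumption for illustration, not the paper's actual file layout.

```python
import csv
from collections import defaultdict

# Assumed columns: sample_id, sex, gene_family, abundance_rpkm
# (one row per gene hit per sample, abundance already length/depth normalized).
per_sample = defaultdict(float)
sex_of = {}
with open("metagenome_hits.csv") as fh:
    for row in csv.DictReader(fh):
        if row["gene_family"] == "beta-glucuronidase":
            per_sample[row["sample_id"]] += float(row["abundance_rpkm"])
            sex_of[row["sample_id"]] = row["sex"]

# Mean collective beta-glucuronidase level per group, as in the male/female comparison.
groups = defaultdict(list)
for sample, total in per_sample.items():
    groups[sex_of[sample]].append(total)

for sex, totals in sorted(groups.items()):
    print(sex, sum(totals) / len(totals))
```
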
The GIAB genomic stratifications resource for human reference genomes
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications”: BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions that are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a Snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate that this resource will enable precise risk-reward calculations when building sequencing pipelines for any of the commonly used reference genomes. In summary, the GIAB genomic stratifications resource defines challenging regions in three commonly used human reference genomes, including the first complete human genome (CHM13), and helps users understand the strengths and weaknesses of sequencing and analysis methods.
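
Because the stratifications are plain BED files, asking whether a call falls into a given context takes only a few lines. The sketch below is a generic interval lookup, not part of the GIAB pipeline; the file name and coordinate are hypothetical, and it assumes the BED intervals are sorted and merged (non-overlapping), as stratification files typically are.

```python
import bisect
from collections import defaultdict

def load_bed(path):
    """Load a BED file (chrom, start, end; 0-based, half-open) into
    per-chromosome sorted interval lists."""
    regions = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            chrom, start, end = line.split()[:3]
            regions[chrom].append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def in_stratification(regions, chrom, pos):
    """True if a 0-based position falls inside any interval on its chromosome.
    Assumes intervals are merged, so at most one can contain the position."""
    intervals = regions.get(chrom, [])
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# Hypothetical usage: flag a variant that lands in a hard-to-map stratification.
strat = load_bed("GRCh38_lowmappability.bed")
print(in_stratification(strat, "chr1", 1234567))
```
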
The challenge of chromatin model comparison and validation: A project from the first international 4D Nucleome Hackathon
Computational modeling of chromatin structure is highly complex: chromatin is hierarchically organized, governed by diverse biophysical principles, and inherently dynamic. Chromatin structure modeling can be based on diverse approaches and assumptions, making it essential to determine how different methods influence the modeling outcomes. We conducted a project at the NIH-funded 4D Nucleome Hackathon, held March 18–21, 2024, at the University of Washington in Seattle, USA. The hackathon provided a rare opportunity to gather an international, multi-institutional, and unbiased group of experts to discuss, understand, and undertake the challenges of chromatin model comparison and validation. Here we give an overview of the current state of the 3D chromatin field and discuss our efforts to run and validate the models. We represented chromatin models as distance matrices and calculated Spearman correlation coefficients to estimate differences between models, as well as between models and experimental data. In addition, we discuss challenges in chromatin structure modeling: 1) differences in the biophysical aspects and scales that models capture complicate model comparison; 2) the large diversity of experimental data (e.g., population-based, single-cell, protein-specific), which differ in mathematical properties, heatmap smoothness, noise, and resolution, complicates model validation; 3) comprehensive research on chromatin structure requires combined expertise in biology, bioinformatics, and physics; and 4) bioinformatic software, often developed in academic settings, suffers from insufficient support and documentation. We also emphasize the importance of establishing guidelines for software development and standardization.
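
The comparison metric named above, Spearman correlation between distance-matrix representations of models, is easy to reproduce. Here is a minimal sketch using SciPy; the two random 3D structures are placeholders for real chromatin models, and this illustrates the metric rather than the hackathon's actual code.

```python
import numpy as np
from scipy.stats import spearmanr

def model_similarity(dist_a: np.ndarray, dist_b: np.ndarray) -> float:
    """Spearman correlation between two models represented as symmetric
    pairwise-distance matrices over the same set of loci."""
    iu = np.triu_indices_from(dist_a, k=1)  # upper triangle, diagonal excluded
    rho, _pvalue = spearmanr(dist_a[iu], dist_b[iu])
    return rho

# Placeholder "models": random 3D coordinates for 100 loci each.
rng = np.random.default_rng(0)
pts_a, pts_b = rng.random((100, 3)), rng.random((100, 3))
dist_a = np.linalg.norm(pts_a[:, None] - pts_a[None, :], axis=-1)
dist_b = np.linalg.norm(pts_b[:, None] - pts_b[None, :], axis=-1)
print(model_similarity(dist_a, dist_b))
```
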
An international consensus on effective, inclusive, and career-spanning short-format training in the life sciences and beyond
Science, technology, engineering, mathematics, and medicine (STEMM) fields change rapidly and are increasingly interdisciplinary. Commonly, STEMM practitioners use short-format training (SFT) such as workshops and short courses for upskilling and reskilling, but unaddressed challenges limit SFT's effectiveness and inclusiveness. Education researchers, students in SFT courses, and organizations have called for research and strategies that can strengthen SFT in terms of effectiveness, inclusiveness, and accessibility across multiple dimensions. This paper describes the project that resulted in a consensus set of 14 actionable recommendations to systematically strengthen SFT. A diverse international group of 30 experts in education, accessibility, and the life sciences came together from 10 countries to develop recommendations that can help strengthen SFT globally. Participants, including representatives of some of the largest life science training programs worldwide, synthesized findings from the educational sciences and drew on their programs' collective experience. The 14 recommendations were derived through a Delphi method, in which consensus was achieved in real time as the group completed a series of meetings and tasks designed to elicit specific recommendations. The recommendations cover the breadth of SFT contexts and stakeholder groups and include actions for instructors (e.g., make equity and inclusion an ethical obligation), programs (e.g., centralize infrastructure for assessment and evaluation), and organizations and funders (e.g., professionalize the training of SFT instructors; deploy SFT to counter inequity). The recommendations are aligned with a purpose-built framework, “The Bicycle Principles,” that prioritizes evidence-based teaching, inclusiveness, and equity, as well as the ability to scale, share, and sustain SFT. We also describe how the Bicycle Principles and recommendations are consistent with educational change theories and can overcome systemic barriers to delivering consistently effective, inclusive, and career-spanning SFT.
geneHummus: an R package to define gene families and their expression in legumes and beyond
Background: During the last decade, plant biotechnology laboratories have sparked a monumental revolution with the rapid development of next-generation sequencing technologies at affordable prices. Soon, these sequencing technologies and the assembly of whole genomes will extend beyond plant computational biologists and become commonplace within the plant biology disciplines. The current availability of large-scale genomic resources for non-traditional plant model systems (the so-called 'orphan crops') is enabling the construction of high-density integrated physical and genetic linkage maps with potential applications in plant breeding. The newly available fully sequenced plant genomes represent an incredible opportunity for comparative analyses that may reveal new aspects of genome biology and evolution. Analyzing the expansion and evolution of gene families across species is a common approach to inferring biological functions. To date, the extent and role of gene families in plants have only been partially addressed, and many gene families remain to be investigated. Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family, typically combining numerous BLAST searches with manual data cleaning. Given the increasing abundance of genome sequences and the agronomic interest in plant gene families, the field needs a clear, automated annotation tool. Results: Here we present the geneHummus package, an R-based pipeline for the identification and characterization of plant gene families. The impact of this pipeline comes from a reduction in hands-on annotation time combined with high specificity and sensitivity in extracting only the target-family proteins from the RefSeq database and providing their conserved domain architectures based on SPARCLE. As a case study, we focused on the auxin response factor (ARF) gene family in Cicer arietinum (chickpea) and other legumes. Conclusion: We anticipate that our pipeline should be suitable for any taxonomic plant family, and likely other gene families, vastly improving the speed and ease of genomic data processing.
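
geneHummus itself is an R package, so purely as an illustration of its first step (collecting candidate RefSeq proteins for a family in a target species), here is a rough Python analogue using Biopython's Entrez module. The query string and retmax value are assumptions for the ARF/chickpea case study, and the email address is a placeholder.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact address

# Illustrative analogue of geneHummus's first step: search RefSeq proteins
# for a gene family in a target species (query terms are assumptions).
query = (
    '"auxin response factor"[Title] AND "Cicer arietinum"[Organism] '
    "AND refseq[Filter]"
)
handle = Entrez.esearch(db="protein", term=query, retmax=200)
record = Entrez.read(handle)
handle.close()

print(record["Count"], "candidate ARF protein records")
print(record["IdList"][:5])
```
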
DangerTrack: A scoring system to detect difficult-to-assess regions
Over recent years, multiple groups have shown that structural variants, repeats, and problems with the underlying genome assembly can dramatically affect read mapping, variant calling, and the overall reliability of single-nucleotide polymorphism calls. This project set out to develop an easy-to-use track for examining structural variant and repeat regions. This track, DangerTrack, can be displayed alongside the existing Genome Reference Consortium assembly tracks to warn clinicians and biologists when variants of interest may be incorrectly called, of dubious quality, or located on an insertion or copy-number expansion. While mapping and variant calling can be automated, in our opinion these regions, when they are of interest to a particular clinical or research group, warrant careful examination, potentially involving localized reassembly. DangerTrack is available at https://github.com/DCGenomics/DangerTrack.
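
A track like this can also be queried programmatically. The sketch below shows one plausible way to screen a variant against a DangerTrack-style scored BED file; the four-column format, file name, coordinate, and the cutoff of 50 are all assumptions for illustration, so consult the repository for the actual format and recommended thresholds.

```python
import csv

def load_scores(path):
    """Assumed BED-like layout: chrom, start, end, score (tab-separated)."""
    scores = []
    with open(path) as fh:
        for chrom, start, end, score in csv.reader(fh, delimiter="\t"):
            scores.append((chrom, int(start), int(end), float(score)))
    return scores

def danger_score(scores, chrom, pos):
    """Return the score of the interval covering a 0-based position, else 0."""
    for c, start, end, score in scores:
        if c == chrom and start <= pos < end:
            return score
    return 0.0

# Hypothetical usage with an arbitrary coordinate and an assumed cutoff.
track = load_scores("DangerTrack.bed")
if danger_score(track, "chr7", 55181378) > 50:
    print("Variant falls in a difficult-to-assess region; review manually.")
```
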
Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping [version 1; peer review: not peer reviewed]
In genomics, bioinformatics, and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software that addresses these gaps. The only two rules for the NCBI-assisted hackathons run so far are that 1) data must either be housed in public data repositories or be deposited in such repositories shortly after the hackathon's conclusion, and 2) all software comprising the final pipeline must be open source or open use. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service that provides free public repositories for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.
Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive [version 2; peer review: 2 approved]
The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hampered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA, facilitating secondary analyses of the SRA's human RNA-seq data. The first tool, the Case-Control Finder, finds suitable case and control samples for a given disease or condition, with cases and controls matched by tissue or cell type. The second tool, the Series Finder, finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time. These tools were the result of a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.
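
To make the Case-Control Finder's matching idea concrete, here is a minimal sketch of tissue-matched case/control grouping over MetaSRA-style annotations. The record layout and the run accessions are invented for illustration; the actual notebooks work against the real MetaSRA ontology annotations and SRA metadata.

```python
from collections import defaultdict

# Invented MetaSRA-style records: each SRA run annotated with a condition
# and a tissue/cell-type term.
samples = [
    {"run": "SRR000001", "condition": "melanoma", "tissue": "skin"},
    {"run": "SRR000002", "condition": "healthy", "tissue": "skin"},
    {"run": "SRR000003", "condition": "melanoma", "tissue": "lung"},
    {"run": "SRR000004", "condition": "healthy", "tissue": "lung"},
]

def case_control_groups(samples, disease):
    """Group runs by tissue, keeping only tissues with both cases of the given
    disease and healthy controls (the Case-Control Finder's matching idea)."""
    by_tissue = defaultdict(lambda: {"case": [], "control": []})
    for s in samples:
        if s["condition"] == disease:
            by_tissue[s["tissue"]]["case"].append(s["run"])
        elif s["condition"] == "healthy":
            by_tissue[s["tissue"]]["control"].append(s["run"])
    return {t: g for t, g in by_tissue.items() if g["case"] and g["control"]}

print(case_control_groups(samples, "melanoma"))
```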