Catalogue Search | MBRL

Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

by Yeung, Ka Yee , Hoang, Varik , Hung, Ling-Hong in Accessibility , Analysis , Annotations

2025

Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, the user must be able to process their new data in the same manner for these comparisons to be useful. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe that entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and exclusions of details accumulate as the documentation inevitably lags behind code revisions. Our goal is to enhance the utility of the GDC by converting the SOPs into an accessible and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies to illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and executable form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation. 
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.

Journal Article

MorPhiC Consortium: towards functional characterization of all human genes

by Li, Sheng , Smedley, Damian , Skarnes, William C. in 45/41 , 631/136/532/2064/2158 , 631/1647/1407/2163

2025

Recent advances in functional genomics and human cellular models have substantially enhanced our understanding of the structure and regulation of the human genome. However, our grasp of the molecular functions of human genes remains incomplete and biased towards specific gene classes. The Molecular Phenotypes of Null Alleles in Cells (MorPhiC) Consortium aims to address this gap by creating a comprehensive catalogue of the molecular and cellular phenotypes associated with null alleles of all human genes using in vitro multicellular systems. In this Perspective, we present the strategic vision of the MorPhiC Consortium and discuss various strategies for generating null alleles, as well as the challenges involved. We describe the cellular models and scalable phenotypic readouts that will be used in the consortium’s initial phase, focusing on 1,000 protein-coding genes. The resulting molecular and cellular data will be compiled into a catalogue of null-allele phenotypes. The methodologies developed in this phase will establish best practices for extending these approaches to all human protein-coding genes. The resources generated—including engineered cell lines, plasmids, phenotypic data, genomic information and computational tools—will be made available to the broader research community to facilitate deeper insights into human gene functions. This Perspective discusses strategies and challenges for the Molecular Phenotypes of Null Alleles in Cells (MorPhiC) Consortium as it aims to catalogue the molecular and cellular phenotypes associated with null alleles of all human genes.

Journal Article

User-friendly scheduler using a hybrid architecture and supercomputing for big data processing

by Yeung, Ka Yee , Hung, Ling-Hong , McKeever, Patrick in Bioinformatics

2025

The exponential growth of omics data requires novel strategies for storing, transferring, and processing it. We present a scheduler based on the Temporal.io workflow framework that enables two key optimizations of bioinformatics workflows. First, we enable users to transparently map workflow steps to diverse execution environments, including high-performance computing (HPC) resources managed by the SLURM resource manager, through an easy-to-use graphical user interface. Second, we enable asynchronous execution of workflows, a feature which guarantees that workflows will achieve reasonable resource utilization even when the scheduler cannot make use of a system's full RAM and CPU resources. In addition, we propose a universal, platform-agnostic JSON representation of workflows that allows platform-specific execution details to be abstracted away from the core scientific logic. Our work includes a custom executor plugin that supports translation of workflows from an external language, such as Nextflow, to our universal JSON format. Finally, we develop a graphical user interface to make our scheduler easy to use for non-technical users. When benchmarked on a bulk RNA sequencing workflow, these features reduced the cost and time requirements. We illustrated the merits of our cross-platform method using credit allocations from federally funded supercomputers.

Journal Article

Uncovering the Pathophysiological Pattern of Expression from Integrated Analysis across Uniformly Processed RNA Sequencing COVID-19 Datasets

by Cole, Andrew , Yeung, Ka Yee , Hung, Ling-Hong

2025

Post-acute sequelae of SARS-CoV-2 infection (PASC) affects millions globally, yet the molecular mechanisms underlying acute COVID-19 and its chronic sequelae remain poorly understood. We performed an integrative transcriptomic analysis of three independent RNA-seq datasets, capturing the complete COVID-19 pathophysiology from health through acute severe infection to post-acute sequelae and mortality (n=142 total samples). We implemented a containerized analytical pipeline spanning data download, quantification, and differential gene expression analysis to uniformly process these three RNA-seq datasets. Our analysis reveals striking molecular dichotomies contrasting disease phases with profound clinical implications. Acute severe/critical COVID-19 reveals predominant enrichment of TNF-α signaling via NF-κB pathways (normalized enrichment score >2.5, FDR <0.001), reflecting a cytokine storm pathophysiology characterized by rapid inflammatory developments involving IL-6, TNF-α, and anti-apoptotic responses. In contrast, PASC patients exhibit dominant enrichment of Myc Targets V1 and Oxidative Phosphorylation pathways (NES >2.2, FDR <0.005), indicating important shifts toward cellular adaptation. Pathway signature analysis identifies core differentially expressed genes that reliably distinguish disease phases, thereby offering objective biomarkers for precision diagnosis and monitoring. These findings establish a comprehensive molecular framework distinguishing acute inflammatory from chronic metabolic COVID-19 phases, with potential clinical applicability. TNF-α/NF-κB pathway signatures identify patients at risk for severe disease progression, while Myc/OXPHOS signatures allow objective PASC diagnosis, addressing current reliance on subjective and eliminative diagnosis. This integrative analytical framework has utility beyond COVID-19, offering an applicable approach for precision medicine implementation across other disease processes.
This study transforms COVID-19 from a symptom-based to a molecularly-defined disease spectrum, enabling precision diagnosis, prognostic monitoring, classification, and targeted therapeutic possibilities based on pathway-specific biomarkers rather than subjective clinical assessments.

Journal Article

Graphical and Interactive Spatial Proteomics Image Analysis Workflow

by Hung, Ling-Hong , Smythe, Kimberly S , Yeung, Cecilia Cs in Bioinformatics

2025

Spatial proteomics provides a spatially resolved view of protein expression and localization within cells and tissues by mapping the location and abundance of proteins. There is a need for fully integrated end-to-end imaging workflows for spatial proteomic analysis that are flexible, high-throughput, and support graphical and interactive visualizations. We present a modular and interactive spatial proteomic image analysis workflow with individually containerized steps that empowers biomedical researchers to reproducibly execute and customize complex analyses. Our workflow consists of cell segmentation, unsupervised clustering, validation of clusters on the image, and visualization of cell type clustering results. Users can utilize a form-based graphical interface to execute and customize multi-step workflows with a single click or interactively adjust image processing steps within the workflow, apply workflows to various datasets, and modify input parameters as needed. We illustrated the functionality of our workflow using a cancer imaging dataset consisting of a tissue microarray (TMA) stained by high-plex immunohistochemistry. This TMA contained a variety of cancer and tissue cell types to assess the broad applicability of this workflow to different biopsy and tissue types.

Journal Article

Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

by Yeung, Ka Yee , Hoang, Varik , Schmitz, Robert in Bioinformatics , Datasets , DNA sequencing

2024

Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, the user must be able to process their new data in the same manner for these comparisons to be useful. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe that entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and exclusions of details accumulate as the documentation inevitably lags behind code revisions. We address this problem by converting the SOPs into a downloadable and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies to illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and easily executed form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation. 
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.
Competing Interest Statement: LHH and KYY have equity interest in Biodepot LLC, which receives compensation from NCI SBIR contract numbers 75N91020C00009 and 75N91021C00022. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.
Footnotes: In this revision, we updated the content to reflect the latest data releases from the NCI Genomic Data Commons. We also made our contributions in this work clearer by revising the title, abstract and introduction. In addition, we re-tested our workflows, cleaned up the GitHub repository, added documentation, and included only the workflows that work.

Paper
