Catalogue Search | MBRL

Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

by Yeung, Ka Yee; Hoang, Varik; Hung, Ling-Hong in Accessibility, Analysis, Annotations

2025

Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, the user must be able to process their new data in the same manner for these comparisons to be useful. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe that entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and exclusions of details accumulate as the documentation inevitably lags behind code revisions. Our goal is to enhance the utility of the GDC by converting the SOPs into an accessible and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies to illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and executable form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation. 
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.

Journal Article

Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

by Yeung, Ka Yee; Hoang, Varik; Schmitz, Robert in Bioinformatics, Datasets, DNA sequencing

2024

Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, the user must be able to process their new data in the same manner for these comparisons to be useful. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe that entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and exclusions of details accumulate as the documentation inevitably lags behind code revisions. We address this problem by converting the SOPs into a downloadable and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies to illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and easily executed form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation. 
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.

Competing Interest Statement: LHH and KYY have equity interest in Biodepot LLC, which receives compensation from NCI SBIR contract numbers 75N91020C00009 and 75N91021C00022. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.

Footnotes: In this revision, we updated the content to reflect the latest data releases from the NCI Genomic Data Commons. We also made our contributions in this work clearer by revising the title, abstract, and introduction. In addition, we re-tested our workflows, cleaned up the GitHub repository, added documentation, and included only the workflows that work.

Paper

Container Profiler: Profiling Resource Utilization of Containerized Big Data Pipelines

by Schooley, Raymond; Arumilli, Niharika; Perez, David in Bioinformatics, Containers, Gene sequencing

2023

This paper presents the Container Profiler, a software tool that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of containerized tasks, collecting over fifty Linux operating system metrics at the virtual machine, container, and process levels. The Container Profiler supports time series profiling at a configurable sampling interval to enable continuous monitoring of the resources consumed by containerized tasks and pipelines. To investigate the utility of the Container Profiler, we profile the resource utilization requirements of a multi-stage bioinformatics analytical pipeline (RNA sequencing using unique molecular identifiers). We examine profiling metrics to assess patterns of CPU, disk, and network resource utilization across the different stages of the pipeline. We also quantify the profiling overhead of our Container Profiler tool to assess the impact of profiling a running pipeline with different levels of profiling granularity, verifying that impacts are negligible. The Container Profiler provides a useful tool that can be used to continuously monitor the resource consumption of long and complex containerized applications that run locally or on the cloud. This can help identify bottlenecks where more resources are needed to improve performance.

Paper
