Catalogue Search | MBRL

Enabling cross-indication protein expression analysis using a curated pan-cancer dataset and a tailored workflow

by Hurt, Elaine , Pullman, Benjamin S. , Zhong, Wenyan in 631/114 , 631/67 , Algorithms

2026

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) recently generated harmonized genomic, transcriptomic, proteomic, and clinical data for over 1,000 tumors across 10 cohorts to facilitate pan-cancer discovery research. However, protein expression comparison across CPTAC cohorts remains challenging due to non-uniform missing data and varying protein expression distribution patterns across tumor types. To enable the cancer research community to conduct robust cross-cohort protein expression analysis, we present a curated and normalized pan-cancer protein expression dataset derived from the CPTAC pan-cancer study. Our workflow integrates systematic filtering, various missing data handling and normalization strategies. We developed a novel algorithm to select robustly expressed proteins in tumors within any CPTAC cohort; applied a cohort hybrid imputation approach to protein abundance values from FragPipe within each cohort, based on protein expression distribution patterns; and calculated intensity-based absolute quantification using protein abundance values and applied both global and smooth quantile normalization methods. Our analysis demonstrates that global quantile normalization surpasses both smooth quantile normalization and no normalization, as evidenced by its higher rank correlation across cancer cohorts between CPTAC and TCGA for selected proteins. The findings suggest that combining cohort hybrid imputation with global quantile normalization is an effective method for creating a normalized CPTAC pan-cancer protein dataset, which can facilitate the study of protein expression across different cancer types and accelerate cancer research.

Journal Article
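
The global quantile normalization named in the abstract above can be illustrated with a minimal sketch. It is a generic textbook version, not the authors' workflow: the function name `global_quantile_normalize`, the proteins-by-samples matrix layout, and the assumption that missing values have already been imputed are all illustrative choices that are not taken from the record.

```python
import numpy as np


def global_quantile_normalize(x: np.ndarray) -> np.ndarray:
    """Force every column (sample) of a proteins x samples matrix onto
    one shared distribution. Assumes no missing values (hypothetical
    input: values already imputed)."""
    # Reference distribution: mean across samples of the k-th smallest value.
    reference = np.sort(x, axis=0).mean(axis=1)
    # 0-based rank of each value within its own column
    # (ties are broken by position, a simplification).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


# Toy matrix: 4 proteins (rows) x 3 samples (columns), no ties.
abundances = np.array(
    [
        [5.0, 4.0, 3.0],
        [2.0, 1.0, 4.0],
        [3.0, 3.0, 6.0],
        [4.0, 2.0, 8.0],
    ]
)
normalized = global_quantile_normalize(abundances)
```

After normalization every sample has the same set of values, while the rank order of proteins within each sample is unchanged.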

MassIVE.quant: a community resource of quantitative mass spectrometry–based proteomics datasets

by Teo, Guo Ci , Pavlou, Maria , Wollscheid, Bernd in 631/114/2402 , 631/114/794 , Algorithms

2020

MassIVE.quant is a repository infrastructure and data resource for reproducible quantitative mass spectrometry–based proteomics, which is compatible with all mass spectrometry data acquisition types and computational analysis tools. A branch structure enables MassIVE.quant to systematically store raw experimental data, metadata of the experimental design, scripts of the quantitative analysis workflow, intermediate input and output files, as well as alternative reanalyses of the same dataset. MassIVE.quant is a data repository and data resource for reproducible quantitative mass spectrometry–based proteomics.

Journal Article

Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS)

by Qiao, Wanqiong , Peter, Inga , Desnick, Robert J in Analysis , Animal Genetics and Genomics , Biomedical and Life Sciences

2015

Background: DNA methylation has essential roles in transcriptional regulation, imprinting, X chromosome inactivation and other cellular processes, and aberrant CpG methylation is directly involved in the pathogenesis of human imprinting disorders and many cancers. To address the need for a quantitative and highly multiplexed bisulfite sequencing method with long read lengths for targeted CpG methylation analysis, we developed single-molecule real-time bisulfite sequencing (SMRT-BS). Results: Optimized bisulfite conversion and PCR conditions enabled the amplification of DNA fragments up to ~1.5 kb, and subjecting overlapping 625–1491 bp amplicons to SMRT-BS indicated high reproducibility across all amplicon lengths (r = 0.972) and low standard deviations (≤0.10) between individual CpG sites sequenced in triplicate. Higher variability in CpG methylation quantitation was correlated with reduced sequencing depth, particularly for intermediately methylated regions. SMRT-BS was validated by orthogonal bisulfite-based microarray (r = 0.906; 42 CpG sites) and second generation sequencing (r = 0.933; 174 CpG sites); however, longer SMRT-BS amplicons (>1.0 kb) had reduced, but very acceptable, correlation with both orthogonal methods (r = 0.836-0.897 and r = 0.892-0.927, respectively) compared to amplicons less than ~1.0 kb (r = 0.940-0.951 and r = 0.948-0.963, respectively). Multiplexing utility was assessed by simultaneously subjecting four distinct CpG island amplicons (702–866 bp; 325 CpGs) and 30 hematological malignancy cell lines to SMRT-BS (average depth of 110X), which identified a spectrum of highly quantitative methylation levels across all interrogated CpG sites and cell lines. Conclusions: SMRT-BS is a novel, accurate and cost-effective targeted CpG methylation method that is amenable to a high degree of multiplexing with minimal clonal PCR artifacts. Increased sequencing depth is necessary when interrogating longer amplicons (>1.0 kb) and the previously reported bisulfite sequencing PCR bias towards unmethylated DNA should be considered when measuring intermediately methylated regions. Coupled with an optimized bisulfite PCR protocol, SMRT-BS is capable of interrogating ~1.5 kb amplicons, which theoretically can cover ~91% of CpG islands in the human genome.

Journal Article

Rare variant associations with plasma protein levels in the UK Biobank

by Viollet, Coralie , Petrovski, Slavé , Mitchell, Jonathan in 45/23 , 45/43 , 631/208/205

2023

Integrating human genomics and proteomics can help elucidate disease mechanisms, identify clinical biomarkers and discover drug targets [1–4]. Because previous proteogenomic studies have focused on common variation via genome-wide association studies, the contribution of rare variants to the plasma proteome remains largely unknown. Here we identify associations between rare protein-coding variants and 2,923 plasma protein abundances measured in 49,736 UK Biobank individuals. Our variant-level exome-wide association study identified 5,433 rare genotype–protein associations, of which 81% were undetected in a previous genome-wide association study of the same cohort [5]. We then looked at aggregate signals using gene-level collapsing analysis, which revealed 1,962 gene–protein associations. Of the 691 gene-level signals from protein-truncating variants, 99.4% were associated with decreased protein levels. STAB1 and STAB2, encoding scavenger receptors involved in plasma protein clearance, emerged as pleiotropic loci, with 77 and 41 protein associations, respectively. We demonstrate the utility of our publicly accessible resource through several applications. These include detailing an allelic series in NLRC4, identifying potential biomarkers for a fatty liver disease-associated variant in HSD17B13 and bolstering phenome-wide association studies by integrating protein quantitative trait loci with protein-truncating variants in collapsing analyses. Finally, we uncover distinct proteomic consequences of clonal haematopoiesis (CH), including an association between TET2-CH and increased FLT3 levels. Our results highlight a considerable role for rare variation in plasma protein abundance and the value of proteogenomics in therapeutic discovery. A set of three papers in Nature reports a new proteomics resource from the UK Biobank and initial analysis of common and rare genetic variant associations with plasma protein levels.

Journal Article

Universal Spectrum Identifier for mass spectra

by Deutsch, Eric W , Van Den Bossche, Tim , Bittremieux, Wout in Ions , Mass spectra , Mass spectrometry

2021

Mass spectra provide the ultimate evidence to support the findings of mass spectrometry proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained in datasets deposited to public proteomics repositories. USI enables greater transparency of spectral evidence, with more than 1 billion USI identifications from over 3 billion spectra already available through ProteomeXchange repositories. Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to mass spectra deposited to public repositories or contained in public spectral libraries.

Journal Article

GNPS Dashboard: collaborative exploration of mass spectrometry data in the web browser

by Schmid, Robin , Garg, Neha , Cummings, Dale A. in 631/114/2398 , 631/1647/296 , 631/1647/320

2022

Access to web-based platforms has enabled scientists to perform research remotely. A critical aspect of mass spectrometry data analysis is the inspection, analysis, and visualization of the raw data to validate data quality and confirm statistical observations. We developed the GNPS Dashboard, a web-based data visualization tool, to facilitate synchronous collaborative inspection, visualization, and analysis of private and public mass spectrometry data remotely.

Journal Article

Genetic architecture of telomere length in 462,666 UK Biobank whole-genome sequences

by Deevi, Sri V. V. , Vitsios, Dimitrios , Samani, Nilesh J. in 45/43 , 631/208/212 , 692/699/1541/1990

2024

Telomeres protect chromosome ends from damage and their length is linked with human disease and aging. We developed a joint telomere length metric, combining quantitative PCR and whole-genome sequencing measurements from 462,666 UK Biobank participants. This metric increased SNP heritability, suggesting that it better captures genetic regulation of telomere length. Exome-wide rare-variant and gene-level collapsing association studies identified 64 variants and 30 genes significantly associated with telomere length, including allelic series in ACD and RTEL1. Notably, 16% of these genes are known drivers of clonal hematopoiesis—an age-related somatic mosaicism associated with myeloid cancers and several nonmalignant diseases. Somatic variant analyses revealed gene-specific associations with telomere length, including lengthened telomeres in individuals with large SRSF2-mutant clones, compared with shortened telomeres in individuals with clonal expansions driven by other genes. Collectively, our findings demonstrate the impact of rare variants on telomere length, with larger effects observed among genes also associated with clonal hematopoiesis. Genome-wide association analysis of an improved telomere length score, calculated from quantitative PCR and whole-genome sequencing measurements in 462,666 individuals in the UK Biobank, identifies novel genes and variants underlying this trait.

Journal Article

What Can Be Learned from Repository-Scale Public Mass Spectrometry Data?

by Pullman, Benjamin in Bioinformatics , Computer science

2022

High-throughput tandem mass spectrometry has enabled the detection and identification of over 75% of all human proteins predicted to result in translated gene products, drawing on tens of terabytes of public data across thousands of datasets. This thesis explores what we can learn from this, as well as the challenges that arise when considering proteomics data at a repository scale. First, we consider validating what is known, through resources to build, curate, and explore both FDR-controlled and user-submitted libraries. Second, we present a tool that automates the application of strict community guideline criteria to any set of search results, including peak quality and novel FDR controls. Third, we introduce a method to illuminate the extent of what is not yet known, using a new clustering approach designed to model peptide diversity by explicitly modeling spectrum coelutions. Fourth, we develop a method for extremely fast single-spectrum searches against spectrum repositories consisting of billions of spectra, both to confirm or refute knowledge base IDs and to discover spectra similar to those that are consistently unidentified.

Dissertation

microbeMASST: a taxonomically informed mass spectrometry search tool for microbial metabolomics data

by Schmid, Robin , Oles, Renee , Knight, Rob in 101/58 , 631/114 , 631/326

2024

microbeMASST, a taxonomically informed mass spectrometry (MS) search tool, tackles limited microbial metabolite annotation in untargeted metabolomics experiments. Leveraging a curated database of >60,000 microbial monocultures, users can search known and unknown MS/MS spectra and link them to their respective microbial producers via MS/MS fragmentation patterns. Identification of microbe-derived metabolites and relative producers without a priori knowledge will vastly enhance the understanding of microorganisms’ role in ecology and human health. microbeMASST is a tool to associate known and unknown metabolites to microbial producers leveraging untargeted metabolomics data.

Journal Article

Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

by Papakonstantinou, Yannis , Li, Yuliang , Pullman, Benjamin in Algorithms , Computer vision , Mass spectrometry

2021

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It provides novel optimality guarantees that exhibit good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that multiple real-world data sets from mass spectrometry, natural language processing, and computer vision exhibit a certain form of data skewness and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.

Journal Article
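
To make the query semantics in the abstract above concrete, here is a brute-force baseline for a cosine threshold query. It is deliberately not the index-based Threshold Algorithm variant the paper develops: it scans every vector. The function name `cosine_threshold_query`, the inclusive comparison at θ, and the requirement that all vectors be non-zero are assumptions made for this sketch.

```python
import numpy as np


def cosine_threshold_query(database, query, theta):
    """Return the indices of database rows whose cosine similarity to
    `query` is at least `theta`. Brute-force scan; assumes non-zero
    vectors (hypothetical helper, not the paper's algorithm)."""
    db = np.asarray(database, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    db_unit = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    similarities = db_unit @ q_unit
    # Inclusive boundary is an assumption of this sketch.
    return [int(i) for i in np.flatnonzero(similarities >= theta)]


# Toy database of four 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
hits = cosine_threshold_query(vectors, [1.0, 0.0], theta=0.8)
```

A scan like this costs time proportional to the whole database for every query, which is the cost that partial traversal of an inverted index is meant to avoid.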
