Catalogue Search | MBRL

An augmented GSNMF model for complete deconvolution of bulk RNA-seq data

by Wang, Xue , Chen, Duan , Li, Shaoyu in Algorithms , Alzheimer's disease , Animals

2025

Performing complete deconvolution analysis for bulk RNA-seq data to obtain both cell type specific gene expression profiles (GEP) and relative cell abundances is a challenging task. One of the fundamental models used, nonnegative matrix factorization (NMF), is mathematically ill-posed. Although several complete deconvolution methods have been developed and their estimates compared to ground truth for some datasets appear promising, a comprehensive understanding of how to circumvent the ill-posedness and improve solution accuracy is still lacking. In this paper, we first investigate the necessary requirements for a given dataset to satisfy the solvability conditions in NMF theory. Even with solvability conditions, the "unique" solutions of NMF are still subject to a rescaling matrix. Therefore, we provide estimates of the converged local minima and the possible rescaling matrix, based on informative initial conditions. Using these strategies, we develop a new pipeline of pseudo-bulk tissue data augmented, geometric structure guided NMF model (GSNMF+). In our approach, pseudo-bulk tissue data are generated from pseudo cellular compositions simulated from statistical distributions and single-cell RNAseq (scRNAseq) data, and then mixed with the original dataset. The constituent matrices of the hybrid dataset then satisfy the weak solvability conditions of NMF. Furthermore, an estimated rescaling matrix is used to adjust the minimizer of the NMF, which is expected to reduce the root mean square errors of the solutions. Our algorithms are tested on several realistic bulk-tissue datasets and have shown significant improvements in scenarios with singular cellular compositions.
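The rescaling ambiguity mentioned in this abstract can be illustrated with a small NumPy sketch. It is not the GSNMF+ pipeline: the matrix sizes, the Dirichlet-simulated compositions, the gamma-distributed profiles, and the sum-to-one least-squares fix are all assumptions chosen for illustration. It shows that a pseudo-bulk matrix X = G C is fitted equally well by (G D, D⁻¹ C) for any positive diagonal D, and that knowing each sample's proportions sum to 1 is enough here to estimate D.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_types, n_samples = 50, 3, 8  # hypothetical sizes

# Hypothetical cell-type-specific expression profiles (genes x cell types).
G = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))

# Pseudo cellular compositions (cell types x samples): each column is a
# Dirichlet draw, so it is nonnegative and sums to 1.
C = rng.dirichlet(alpha=np.ones(n_types), size=n_samples).T

# Pseudo-bulk expression: every sample is a mixture of the profiles.
X = G @ C

# Rescaling ambiguity: any positive diagonal D gives another exact factorization.
D = np.diag([0.5, 2.0, 4.0])
G_scaled = G @ D
C_scaled = np.linalg.inv(D) @ C
same_fit = np.allclose(G_scaled @ C_scaled, X)

# Assumed fix: find d so that the columns of diag(d) @ C_scaled sum to 1,
# i.e. solve C_scaled.T @ d = 1 in the least-squares sense.
d_hat, *_ = np.linalg.lstsq(C_scaled.T, np.ones(n_samples), rcond=None)
C_recovered = np.diag(d_hat) @ C_scaled

print("same reconstruction:", same_fit)
print("estimated rescaling:", np.round(d_hat, 3))
```

In this noise-free toy setting the constraint pins the scale down exactly; with real data the estimate would only be approximate.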

Journal Article

Modeling overdispersion heterogeneity in differential expression analysis using mixtures

by Robin, Stéphane , Viroli, Cinzia , Bonafede, Elisabetta in binomial distribution , BIOMETRIC METHODOLOGY , Biometrics

2016

Next-generation sequencing technologies now constitute a method of choice to measure gene expression. Data to analyze are read counts, commonly modeled using negative binomial distributions. A relevant issue associated with this probabilistic framework is the reliable estimation of the overdispersion parameter, reinforced by the limited number of replicates generally observable for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead, we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show through a wide simulation study that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it reaches the nominal value for the type-I error, while keeping elevated discriminative power between differentially and not differentially expressed genes. The method is finally illustrated on prostate cancer RNA-Seq data.
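A short simulation can make the overdispersion problem concrete. This is a sketch under stated assumptions, not the mixture model proposed in the article: it assumes the common parameterisation Var = mu + phi * mu^2, uses a simple method-of-moments estimator, and the helper names `simulate_nb` and `moment_overdispersion` are hypothetical. It compares an estimate from many observations with per-gene estimates from only three replicates.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_nb(mu, phi, size, rng):
    """Draw negative binomial counts with mean mu and variance mu + phi * mu**2."""
    n = 1.0 / phi          # NumPy's "number of successes" parameter
    p = n / (n + mu)       # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)

def moment_overdispersion(counts):
    """Method-of-moments estimate of phi along the last axis."""
    m = counts.mean(axis=-1)
    v = counts.var(axis=-1, ddof=1)
    return (v - m) / m**2

mu, phi = 100.0, 0.2  # hypothetical true mean and overdispersion

# Many observations: the estimate is close to the true value.
phi_big = moment_overdispersion(simulate_nb(mu, phi, 200_000, rng))

# 2000 genes with only 3 replicates each: per-gene estimates scatter widely.
phi_small = moment_overdispersion(simulate_nb(mu, phi, (2000, 3), rng))

print("large-sample estimate:", round(float(phi_big), 3))
print("spread of 3-replicate estimates (std):", round(float(phi_small.std()), 3))
```

The wide spread in the three-replicate case is the motivation for letting genes share information.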

Journal Article

scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets

by Speed, Terence P. , Yang, Pengyi , Ormerod, John T. in Algorithms , Animals , Collection

2019

Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.

Journal Article

A comprehensive workflow for optimizing RNA-seq data analysis

by Ren, Shu-Ning , Li, Yun , Wang, Hou-Ling in Accuracy , Alternative Splicing , Analysis

2024

Background: Current RNA-seq analysis software tends to use similar parameters across different species without considering species-specific differences. However, the suitability and accuracy of these tools may vary when analyzing data from different species, such as humans, animals, plants, fungi, and bacteria. For most laboratory researchers lacking a background in information science, determining how to construct an analysis workflow that meets their specific needs from the array of complex analytical tools available poses a significant challenge. Results: By utilizing RNA-seq data from plants, animals, and fungi, it was observed that different analytical tools demonstrate some variations in performance when applied to different species. A comprehensive experiment was conducted specifically for analyzing plant pathogenic fungal data, focusing on differential gene analysis as the ultimate goal. In this study, 288 pipelines using different tools were applied to analyze five fungal RNA-seq datasets, and the performance of their results was evaluated based on simulation. This led to the establishment of a relatively universal and superior fungal RNA-seq analysis pipeline that can serve as a reference, and certain standards for selecting analysis tools were derived. Additionally, we compared various tools for alternative splicing analysis. The results based on simulated data indicated that rMATS remained the optimal choice, although consideration could be given to supplementing with tools such as SpliceWiz. Conclusion: The experimental results demonstrate that, in comparison to the default software parameter configurations, the analysis combination results after tuning can provide more accurate biological insights. It is beneficial to carefully select suitable analysis software based on the data, rather than indiscriminately choosing tools, in order to achieve high-quality analysis results more efficiently.

Journal Article

Soft graph clustering for single-cell RNA sequencing data

by Ning, Zhiyuan , Wu, Min , Xu, Ping in Algorithms , Annotations , Bioinformatics

2025

Background: Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high dimensionality, high sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, one major challenge for GNN-based methods is their reliance on hard graph constructions derived from similarity matrices. These constructions introduce difficulties when applied to scRNA-seq data due to: (i) the simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss; and (ii) the presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. Results: To tackle these challenges, we introduce scSGC, a Soft Graph Clustering method for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder designed to effectively handle the sparsity and dropout issues in scRNA-seq data; (ii) a dual-channel cut-informed soft graph embedding module, constructed through deep graph-cut information, capturing continuous similarities between cells while preserving the intrinsic data structures of scRNA-seq; and (iii) an optimal transport-based clustering optimization module, achieving optimal delineation of cell populations while maintaining high biological relevance.
Conclusion: By integrating dual-channel cut-informed soft graph representation learning, a ZINB-based feature autoencoder, and optimal transport-driven clustering optimization, scSGC effectively overcomes the challenges associated with traditional hard graph constructions in GNN methods. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
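The contrast between hard and soft graphs described in this abstract can be sketched in a few lines. This is only an illustration, not the scSGC construction: the random cell embeddings, the cosine similarity, and the 0.2 threshold are assumptions. The hard graph thresholds similarities into binary edges, while the soft graph keeps the continuous weights.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical low-dimensional embeddings for 6 cells.
cells = rng.normal(size=(6, 20))

def cosine_similarity(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

S = cosine_similarity(cells)
np.fill_diagonal(S, 0.0)  # ignore self-similarity

threshold = 0.2  # assumed cut-off for the hard graph

# Hard graph: binary adjacency, all similarity magnitudes are discarded.
hard = (S > threshold).astype(float)

# Soft graph: keep nonnegative similarities as continuous edge weights.
soft = np.clip(S, 0.0, 1.0)

print("distinct edge values, hard graph:", len(np.unique(hard)))
print("distinct edge values, soft graph:", len(np.unique(soft)))
```

Any downstream method that consumes `soft` sees graded relationships between cells instead of an all-or-nothing edge list.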

Journal Article

Robust subspace structure discovery for cell type identification in scRNA-seq data

by Wei, Xindian , Wu, Si , Shen, Wenjun in Algorithms , Bioinformatics , Biomedical and Life Sciences

2025

Single-cell RNA sequencing (scRNA-seq) technology has transformed gene expression studies by enabling analysis at the individual cell level, offering unprecedented insights into cellular heterogeneity. A key challenge in scRNA-seq data analysis is cell type identification, which requires grouping cells with similar gene expression profiles using unsupervised clustering methods. However, the high dimensionality, inherent noise, and significant sparsity of scRNA-seq data present substantial obstacles to accurately determining relationships among cell samples. To address these challenges, we propose a novel deep subspace clustering approach for cell type identification that captures a more reliable subspace structure from scRNA-seq data. Our method leverages a robust self-representation learning framework to effectively characterize and learn the underlying cluster structure. This framework is optimized through an integrated strategy combining a structure-guided approach with an optimal transport algorithm, enhancing the robustness of the subspace clustering process. By mitigating the effects of noise and sparsity in scRNA-seq data, this approach enables more accurate cell clustering. Experimental results on 18 real scRNA-seq datasets demonstrate that our method outperforms several state-of-the-art clustering approaches tailored for scRNA-seq data, excelling in both accuracy and interpretability.

Journal Article

Ensemble machine learning-based pre-trained annotation approach for scRNA-seq data using gradient boosting with genetic optimizer

by Ead, Waleed M. , Lu, Jian , Qiu, Yushan in Accuracy , Algorithms , Analysis

2025

Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of gene expression by allowing researchers to analyze the transcriptomes of individual cells. This technology provides unprecedented insights into cellular heterogeneity, cellular states, and biological processes at a single-cell resolution. The problem of single-cell RNA annotation involves assigning meaningful labels or annotations to each cell in the scRNA-seq dataset, indicating its corresponding cell type, state, or biological function. Current annotation methods are often challenged by limited source data quality, sensitivity to batch effects, and poor adaptability to uncharacterized cell types. We propose an ensemble machine learning-based pre-trained annotation framework that integrates gradient boosting and genetic optimization for robust feature selection. The proposed method uses ensemble learning to enhance annotation accuracy under data scarcity, addressing limitations in existing supervised methods by leveraging a combination of multiple annotated datasets and feature alignment strategies. Through comprehensive benchmarking across varied biological contexts, we demonstrate that the proposed approach significantly improves annotation accuracy and generalization across different scRNA-seq platforms, especially under conditions of reduced reference data. Results confirm its versatility and resilience in accurately annotating cell types, even under reduced data conditions, establishing it as a powerful tool for cell-type classification in scRNA-seq data.

Journal Article

scMASKGAN: masked multi-scale CNN and attention-enhanced GAN for scRNA-seq dropout imputation

by Xu, Li , Li, Hanxiao , Cong, Xiaohong in Algorithms , Artificial neural networks , Bioinformatics

2025

Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but dropout events, where gene expression is undetected in individual cells, present a significant challenge. We propose scMASKGAN, which transforms matrix imputation into a pixel restoration task to improve the recovery of missing gene expression data. Specifically, we integrate masking, convolutional neural networks (CNNs), attention mechanisms, and residual networks (ResNets) to effectively address dropout events in scRNA-seq data. The masking mechanism ensures the preservation of complete cellular information, while convolution and attention mechanisms are employed to capture both global and local features. Residual networks augment feature representation and effectively mitigate the risk of model overfitting. Additionally, cell-type labels are incorporated as constraints to guide the methods in learning more accurate cellular features. Finally, multiple experiments were conducted to evaluate the methods’ performance using seven different data types and scRNA-seq data from ten neuroblastoma samples. The results demonstrate that the data imputed by scMASKGAN not only perform excellently across various evaluation metrics but also significantly enhance the effectiveness of downstream analyses, enabling a more comprehensive exploration of underlying biological information.

Journal Article

scIRT: Imputation and Dimensionality Reduction for Single-Cell RNA-Seq Data by Combining NMF with SMOTE

by Mou, Yunwen , Li, Shuchao , Ji, Guoli in Algorithms , Analysis , Animals

2026

The establishment and development of single-cell RNA-sequencing (scRNA-seq) technology has accelerated the analysis of cell genome characteristics down to the single-cell level. Despite the rapid development of scRNA-seq technology, a complete gene expression matrix cannot be obtained in biological experiments, and the scRNA-seq data obtained from experiments also have a high dropout rate. Unfortunately, gene expression analysis and clustering tools require a complete matrix of gene expression values for classification or clustering calculations. Most imputation methods focus on the impact of the imputed high-dimensional expression matrix on clustering and cannot obtain the low-dimensional representation matrix, which may have an even better guiding effect on clustering. To this end, we designed an iterative imputation pipeline called scIRT to estimate dropout events for scRNA-seq and achieve dimensionality reduction simultaneously by combining the synthetic minority over-sampling technique (SMOTE) and non-negative matrix factorization (NMF). The adaptation of SMOTE effectively imputes missing data, while NMF performs dimensionality reduction and feature extraction on high-dimensional data. Using several scRNA-seq datasets, we demonstrated that this new approach achieved better and more robust performance than the existing approaches. We also compared the different effects of the imputed matrix and the low-dimensional representation matrix on clustering. scIRT is a tool that can be used to preprocess scRNA-seq data. It can effectively recover missing data from scRNA-seq to facilitate downstream analyses such as cell type clustering and visualization.

Journal Article

Winnow-KAN: single-cell RNA-seq location recovery with small-gene-set spatial transcriptomics

by Zhang, Qihuang , Zhang, Yuyang in Algorithms , Bioinformatics , Biomedical and Life Sciences

2025

Keywords: Cell mapping, Deep Learning, Kolmogorov-Arnold network, Single-cell RNA-seq, Spatial transcriptomics.

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity. However, its collection process prevents the investigation of tissue organization due to the lack of spatial origins for the cells. Recent advances in computational methods have addressed this gap by leveraging spatial transcriptomics, which simultaneously profiles gene expression and spatial coordinates. While these state-of-the-art methods demonstrate excellent performance in cell location recovery, their effectiveness is often specific to the particular pair of scRNA-seq and spatial transcriptomics datasets used, limiting their scalability to larger datasets and generalizability to external query scRNA-seq data. In this study, we demonstrate the feasibility of leveraging a novel model architecture to address the redundancy in scRNA-seq datasets and facilitate prediction with a much smaller set of genes. We present Winnow-KAN, a method designed to reduce the number of required gene variables in cell-mapping tasks. Built on a modified structure of the Kolmogorov-Arnold Network, Winnow-KAN leverages the Kolmogorov-Arnold Representation Theorem to facilitate location predictions using fewer features than those required by multi-layer perceptron-based methods. Winnow-KAN includes a selector layer that reduces the size of the gene set used for prediction, enabling the model to recover query scRNA-seq data with performance comparable to MLP-based approaches that rely on the full gene set. We benchmarked Winnow-KAN using multiple datasets generated from brain and cancer tissues, derived from platforms such as Visium and MERFISH.

Journal Article
