Catalogue Search | MBRL

hist2RNA: An Efficient Deep Learning Architecture to Predict Gene Expression from Breast Cancer Histopathology Images

by Graham, Peter H.; Mondol, Raktim Kumar; Millar, Ewan K. A. in Annotations, Breast cancer, Cancer therapies

2023

Gene expression can be used to subtype breast cancer with improved prediction of risk of recurrence and treatment responsiveness over that obtained using routine immunohistochemistry (IHC). However, in the clinic, molecular profiling is primarily used for ER+ breast cancer, which is costly, tissue destructive, requires specialised platforms, and takes several weeks to obtain a result. Deep learning algorithms can effectively extract morphological patterns in digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose a new, computationally efficient approach called hist2RNA inspired by bulk RNA sequencing techniques to predict the expression of 138 genes (incorporated from 6 commercially available molecular profiling tests), including luminal PAM50 subtype, from hematoxylin and eosin (H&E)-stained whole slide images (WSIs). The training phase involves the aggregation of extracted features for each patient from a pretrained model to predict gene expression at the patient level using annotated H&E images from The Cancer Genome Atlas (TCGA, n = 335). We demonstrate successful gene prediction on a held-out test set (n = 160, corr = 0.82 across patients, corr = 0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n = 498) with known IHC and survival information. Our model is able to predict gene expression and luminal PAM50 subtype (Luminal A versus Luminal B) on the TMA dataset with prognostic significance for overall survival in univariate analysis (c-index = 0.56, hazard ratio = 2.16 (95% CI 1.12–3.06), p < 5 × 10⁻³), and independent significance in multivariate analysis incorporating standard clinicopathological variables (c-index = 0.65, hazard ratio = 1.87 (95% CI 1.30–2.68), p < 5 × 10⁻³). The proposed strategy achieves superior performance while requiring less training time, resulting in less energy consumption and computational cost compared to patch-based models. Additionally, hist2RNA predicts gene expression that has potential to determine luminal molecular subtypes which correlates with overall survival, without the need for expensive molecular testing.
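The abstract reports two different correlations, one "across patients" and one "across genes". The sketch below shows one plausible way to compute such a pair from a patients × genes matrix; the exact definition used by the authors is not given in this record, so the function names, the averaging step, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_corr_per_patient(truth, pred):
    """Correlate each patient's gene profile (one row), then average."""
    return sum(pearson(t, p) for t, p in zip(truth, pred)) / len(truth)


def mean_corr_per_gene(truth, pred):
    """Correlate each gene (one column) over patients, then average."""
    truth_cols, pred_cols = list(zip(*truth)), list(zip(*pred))
    return sum(pearson(t, p) for t, p in zip(truth_cols, pred_cols)) / len(truth_cols)


# Toy data (hypothetical): rows are patients, columns are genes.
truth = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
pred = [[2, 6, 11], [3, 6, 10], [2, 7, 10]]

per_patient = mean_corr_per_patient(truth, pred)
per_gene = mean_corr_per_gene(truth, pred)
```

In this toy case the predictions track each patient's overall profile closely but do not rank patients correctly within a gene, so the per-patient value is high while the per-gene value is near zero. That illustrates why the two figures can differ widely.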

Journal Article

A comparison of computational methods for expression forecasting

by Yang, Yunxiao; Battle, Alexis; Weinstock, Joshua S. in Animal Genetics and Genomics, Benchmarks v2.0, Bioinformatics

2025

Diverse machine learning methods promise to forecast gene expression changes in response to novel genetic perturbations. However, these methods’ accuracy is not well characterized. We created a benchmarking platform that combines a panel of 11 large-scale perturbation datasets with an expression forecasting software engine that encompasses or interfaces to a wide variety of methods. We used our platform to assess methods, parameters, and sources of auxiliary data, finding that it is uncommon for expression forecasting methods to outperform simple baselines. Our platform will serve as a resource to improve methods and to identify contexts in which expression forecasting can succeed.

Journal Article

ENGEP: advancing spatial transcriptomics with accurate unmeasured gene expression prediction

by Yang, Shi-Tong; Zhang, Xiao-Fei in Accuracy, Animal Genetics and Genomics, Bioinformatics

2023

Imaging-based spatial transcriptomics techniques provide valuable spatial and gene expression information at single-cell resolution. However, their current capability is restricted to profiling a limited number of genes per sample, resulting in most of the transcriptome remaining unmeasured. To overcome this challenge, we develop ENGEP, an ensemble learning-based tool that predicts unmeasured gene expression in spatial transcriptomics data by using multiple single-cell RNA sequencing datasets as references. ENGEP outperforms current state-of-the-art tools and brings biological insight by accurately predicting unmeasured genes. ENGEP has exceptional efficiency in terms of runtime and memory usage, making it scalable for analyzing large datasets.

Journal Article

Link Between Individual Codon Frequencies and Protein Expression: Going Beyond Codon Adaptation Index

by Zaytsev, Konstantin; Bogatyreva, Natalya; Fedorov, Alexey in Accuracy, Algorithms, Amino acids

2024

An important role of a particular synonymous codon composition of a gene in its expression level is well known. There are a number of algorithms optimizing codon usage of recombinant genes to maximize their expression in host cells. Nevertheless, the underlying mechanism remains unsolved and is of significant relevance. In the realm of modern biotechnology, directing protein production to a specific level is crucial for metabolic engineering, genome rewriting and a growing number of other applications. In this study, we propose two new simple statistical and empirical methods for predicting the protein expression level from the nucleotide sequence of the corresponding gene: Codon Expression Index Score (CEIS) and Codon Productivity Score (CPS). Both of these methods are based on the influence of each individual codon in the gene on the overall expression level of the encoded protein and the frequencies of isoacceptors in the species. Our predictions achieve a correlation level of up to r = 0.7 with experimentally measured quantitative proteome data of Escherichia coli, which is superior to any previously proposed methods. Our work helps understand how codons determine protein abundances. Based on these methods, it is possible to design proteins optimized for expression in a particular organism.
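The CEIS and CPS formulas are not given in this record, so they are not reproduced here. As a minimal sketch of the kind of input any per-codon score starts from, the following splits a coding sequence into codons and computes their relative frequencies; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def codon_frequencies(cds):
    """Return the relative frequency of each codon in a coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    # Non-overlapping triplets, read in frame from the first base.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical four-codon sequence: ATG GCT GCT TAA
freqs = codon_frequencies("ATGGCTGCTTAA")
```

A score of the kind described above would then weight these per-codon frequencies by some measure of each codon's influence on expression, which is the part specific to the published methods.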

Journal Article

Epigenetic Element-Based Transcriptome-Wide Association Study Identifies Novel Genes for Bipolar Disorder

by Yao, Shi; Wang, Jia-Hao; Guo, Yan in Adult, Bipolar disorder, Bipolar Disorder - genetics

2021

Since the bipolar disorder (BD) signals identified by genome-wide association study (GWAS) often reside in the non-coding regions, understanding the biological relevance of these genetic loci has proven to be complicated. Transcriptome-wide association studies (TWAS) provide a powerful approach to identify novel disease risk genes and uncover possible causal genes at loci identified previously by GWAS. However, these methods did not consider the importance of epigenetic regulation in gene expression. Here, we developed a novel epigenetic element-based transcriptome-wide association study (ETWAS) that tested the effects of genetic variants on gene expression levels with the epigenetic features as prior and further mediated the association between predicted expression and BD. We conducted an ETWAS consisting of 20 352 cases and 31 358 controls and identified 44 transcriptome-wide significant hits. We found 14 conditionally independent genes, and 10 genes that had not previously been implicated in BD were regarded as novel candidate genes, such as ASB16 in the cerebellar hemisphere (P = 9.29 × 10⁻⁸). We demonstrated that several genome-wide significant signals from the BD GWAS were driven by genetically regulated expression, and NEK4 explained 90.1% of the GWAS signal. Additionally, ETWAS-identified genes could explain heritability beyond that explained by GWAS-associated SNPs (P = 5.60 × 10⁻⁶⁶). By querying the SNPs in the final models of identified genes in phenome databases, we identified several phenotypes previously associated with BD, such as schizophrenia and depression. In conclusion, ETWAS is a powerful method, and we identified several novel candidate genes associated with BD.

Journal Article

Proformer: a hybrid macaron transformer model predicts expression values from promoter sequences

by Kwak, Il-Youp; Garry, Daniel J.; Gong, Wuming in Algorithms, Bar codes, Bioinformatics

2024

The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines the expression values. We developed an end-to-end transformer encoder architecture named Proformer to predict the expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture, where two half-step feed forward (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and strand embedding (forward strand vs. reverse complemented strand) as the sequence input. Moreover, Proformer introduced multiple expression heads with mask filling to prevent the transformer models from collapsing when training on a relatively small amount of data. We empirically determined that this design had significantly better performance than the conventional design such as using the global pooling layer as the output layer for the regression task. These analyses support the notion that Proformer provides a novel method of learning and enhances our understanding of how cis-regulatory sequences determine the expression values.
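The input step described above, sliding k-mers taken from a one-hot encoded sequence, can be sketched in a few lines. This is not the Proformer code: the alphabet order, the function names, and the choice to encode unknown bases as all zeros are assumptions for illustration, and the learned embedding that follows this step is omitted.

```python
ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq):
    """One-hot encode a DNA string; bases outside ACGT become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def sliding_kmers(encoded, k):
    """Return every length-k window of the encoded sequence, stride 1."""
    return [encoded[i:i + k] for i in range(len(encoded) - k + 1)]


encoded = one_hot("ACGTA")
windows = sliding_kmers(encoded, k=3)  # windows cover ACG, CGT, GTA
```

Each window is a k × 4 block; a model of the kind described would flatten or project each block into a continuous embedding vector before adding positional and strand information.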

Journal Article

Prediction of BAP1 Expression in Uveal Melanoma Using Densely-Connected Deep Classification Networks

by Qi, Xingqun; Zhang, Guanhong; Zhou, Xiaoguang in Artificial intelligence, BRCA1 protein, Classification

2019

Uveal melanoma is the most common primary intraocular malignancy in adults, with nearly half of all patients eventually developing metastases, which are invariably fatal. Manual assessment of the level of expression of the tumor suppressor BRCA1-associated protein 1 (BAP1) in tumor cell nuclei can identify patients with a high risk of developing metastases, but may suffer from poor reproducibility. In this study, we verified whether artificial intelligence could predict manual assessments of BAP1 expression in 47 enucleated eyes with uveal melanoma, collected from one European and one American referral center. Digitally scanned pathology slides were divided into 8176 patches, each with a size of 256 × 256 pixels. These were in turn divided into a training cohort of 6800 patches and a validation cohort of 1376 patches. A densely-connected classification network based on deep learning was then applied to each patch. This achieved a sensitivity of 97.1%, a specificity of 98.1%, an overall diagnostic accuracy of 97.1%, and an F1-score of 97.8% for the prediction of BAP1 expression in individual high resolution patches, and slightly less with lower resolution. The area under the receiver operating characteristic (ROC) curves of the deep learning model achieved an average of 0.99. On a full tumor level, our network classified all 47 tumors identically with an ophthalmic pathologist. We conclude that this deep learning model provides an accurate and reproducible method for the prediction of BAP1 expression in uveal melanoma.
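The patch-level figures quoted above (sensitivity, specificity, accuracy, F1-score) all derive from the four cells of a binary confusion matrix. The sketch below applies the standard definitions; the counts are invented for illustration and are not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of precision and recall
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these counts, sensitivity is 90/100, specificity is 95/100, accuracy is 185/200, and F1 is 180/195.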

Journal Article

Multi‐Scale Mapping of Gene Expression from Whole‐slide Images for Identifying Phenotype‐Associated Subpopulations

by Guo, Yujia; Xie, Jiajing; Zhi, Tong

2026

Discovery of phenotype‐associated subpopulations is critical for targeted therapies and prognostic biomarker discovery, which requires multi‐scale gene expression. Deep learning advancements have enabled cost‐effective genetic alteration inference from whole‐slide images (WSIs), but most methods operate at a single scale. This study presents BiSCALE, a deep‐learning framework that predicts gene expression from WSIs at both tissue (bulk) and near‐cellular (spot) levels and links these predictions to clinical phenotypes. The framework integrates a WSI foundation encoder with a Vision–Mamba fusion module and a two‐stage training strategy to bridge scale and distribution differences between bulk and spot data. Trained on 2109 bulk tumor samples and 141 000 spatial transcriptomics spots across three cancer types, BiSCALE outperforms established bulk and spatial baselines, generalizes well to independent cohorts, and demonstrates strong concordance between predicted bulk and spot expression profiles. It recovers biologically relevant pathway activity and supports downstream applications, including patient‐level risk stratification from bulk WSIs and spot‐level cell‐identity annotation. BiSCALE also identifies phenotype‐associated subpopulations, including niches linked to recurrence and hypoxia. These results establish BiSCALE as a cost‐effective approach for multi‐scale gene analysis and phenotype‐associated feature discovery from routine pathology. All code used in this study is available at: https://github.com/Hailong‐Zheng/BiSCALE .

Journal Article

Predicting Gene Expression Responses to Cold in Arabidopsis thaliana Using Natural Variation in DNA Sequence

by Takou, Margarita; Lasky, Jesse R.; Bellis, Emily S. in Adaptation, Arabidopsis - genetics, Arabidopsis thaliana

2025

Background/Objectives: The evolution of gene expression responses is a critical component of population adaptation to variable environments. Predicting how DNA sequence influences expression is challenging because the genotype-to-phenotype map is not well resolved for cis-regulatory elements, transcription factor binding, regulatory interactions, and epigenetic features, not to mention how these factors respond to the environment. Methods: We tested if flexible machine learning models could learn some of the underlying cis-regulatory genotype-to-phenotype map to predict expression response to a specific environment. We tested this approach using cold-responsive transcriptome profiles in five Arabidopsis thaliana natural accessions. Results: We first tested for evidence that cis regulation plays a role in environmental response, finding 14 and 15 motifs that were significantly enriched within the up- and downstream regions of cold-responsive differentially regulated genes (DEGs). We next applied convolutional neural networks (CNNs), which learn de novo cis-regulatory motifs in DNA sequences to predict expression response to cold. We found that CNNs predicted differential expression with moderate accuracy, with evidence that predictions were hindered by the biological complexity of regulation and the large potential regulatory code. Conclusions: Overall, approaches for predicting DEGs between specific environments based only on proximate DNA sequences require further development. It may be necessary to incorporate additional biological information into models to generate accurate predictions that will be useful to population biologists.

Journal Article

Transfer learning with pre-trained language models for protein expression level prediction in Escherichia coli

by Li, Haoran; Liao, Xiaoping; Yang, Chunhe in Accuracy, Adaptation, Amino acid sequence

2026

Accurately predicting recombinant protein expression in Escherichia coli remains a long-standing challenge due to the multifactorial nature of gene regulation and translation. Existing computational approaches typically emphasize either codon usage or protein sequence features, limiting predictive accuracy and generalizability. Here we present TLCP-EPE, a transfer learning framework that, for the first time, fuses codon- and protein-level pre-trained language models to jointly capture determinants of expression. By fine-tuning CaLM and ProtT5 with low-rank adaptation (LoRA) and integrating their embeddings through a BiGRU-MLP predictor, TLCP-EPE learns expression-aware representations that outperform state-of-the-art methods. Across two independent test datasets, TLCP-EPE achieved robust performance (AUC 0.835 on codon data; AUC 0.713 on protein data), consistently surpassing conventional codon-based metrics and deep learning baselines. Our results demonstrate that dual-modal modeling of codon and protein sequences enables more accurate and generalizable prediction of expression levels, providing a powerful foundation for rational protein design and biomanufacturing applications.

Journal Article
