Catalogue Search | MBRL

Crop phenotype prediction using SNP context and whole-genome feature embedding based on DNABERT-2

by Wang, Ting; Wang, Chao; Cui, Yunpeng in Accuracy, Agricultural research, Algorithms

2026

Background: Modern agriculture demands precise genomic prediction to accelerate elite crop breeding, yet traditional genomic prediction approaches, such as genomic best linear unbiased prediction (GBLUP) and Bayesian methods, focus primarily on the cumulative effect of individual SNPs, thus neglecting the concerted influence that the surrounding sequence context has on the phenotype. Methods: To overcome these limitations, we propose two novel feature embedding modes (SNP-context and whole-genome) based on DNABERT-2, a cross-species genomic foundation model that uses self-attention mechanisms and transfer learning to automatically identify conserved sequence features across diverse evolutionary lineages without prior biological assumptions. The whole-genome feature embedding aggregates genomic information at a global scale by pooling vectors from chunked sequences processed by DNABERT-2, whereas the context feature embedding captures local information by directly encoding variable-length (500–3000 bp) sequences centered on target SNPs. To reduce noise in the high-dimensional feature embeddings, we employed principal component analysis (PCA) and partial least squares (PLS) to project the features into a lower-dimensional space. We generated two kinds of feature embedding for three crop datasets (rice413, rice395, and maize301), investigated the impact of 500–3000 bp flanking SNP contexts on phenotypic prediction, and compared prediction accuracy variations across algorithms at 4–768 feature dimensions among the PCA, PLS, and no dimensionality reduction strategies. Results: The results demonstrate that machine learning (ML) algorithms operating under the SNP-context embedding mode achieve greater accuracy and lower mean absolute errors (MAEs) than traditional SNP features, with performance peaking at optimal context lengths that proved to be trait-dependent (e.g., 1000 bp to 3000 bp), particularly for traits with low-to-moderate heritability (H² ∈ (0.2, 0.7]). In contrast, using whole-genome embeddings as input for ML can further improve the prediction accuracy for highly heritable traits (H² ∈ (0.7, 1.0]), even outperforming state-of-the-art deep learning models (such as DNNGP and ResGS) that rely on SNP markers. Conclusions: The proposed feature embedding methods, which leverage DNABERT-2 to capture the contextual features of SNPs, effectively overcome the limitations of traditional prediction models. This study demonstrates that the SNP-context mode is superior for traits with low-to-moderate heritability, while the whole-genome embedding mode excels for highly heritable ones. Our work provides plant breeders with a flexible and powerful analytical framework, enabling them to select the most suitable phenotypic prediction method based on the complexity of the target trait, thereby accelerating genetic gain in the breeding of elite crop varieties.
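Illustrative sketch (not taken from the article): the abstract above describes pooling per-chunk vectors into one whole-genome vector and then reducing dimensionality with PCA. The Python below shows one way this could look. Mean pooling, the function names `pool_chunks` and `pca_project`, and the random arrays standing in for DNABERT-2 outputs are all assumptions made for illustration; the authors' actual pooling rule and pipeline may differ.

```python
import numpy as np


def pool_chunks(chunk_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool per-chunk vectors (n_chunks, dim) into one vector (dim,).

    Mean pooling is an assumption; the abstract only says vectors are pooled.
    """
    return chunk_embeddings.mean(axis=0)


def pca_project(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project samples (n_samples, dim) onto the top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Placeholder data: 20 samples, each with 50 chunk vectors of width 768.
# Real inputs would be embeddings produced by the foundation model.
samples = [rng.normal(size=(50, 768)) for _ in range(20)]
genome_features = np.stack([pool_chunks(s) for s in samples])  # shape (20, 768)
reduced = pca_project(genome_features, n_components=4)         # shape (20, 4)
```

The reduced matrix could then be passed to any regression model; which model suits a given trait is the question the article itself investigates.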

Journal Article

DEEPKRIGING

by Sun, Ying; Reich, Brian J.; Li, Yuxiao

2024

In spatial statistics, a common objective is to predict values of a spatial process at unobserved locations by exploiting spatial dependence. Kriging provides the best linear unbiased predictor using covariance functions, and is often associated with Gaussian processes. However, for nonlinear predictions for non-Gaussian and categorical data, the Kriging prediction is no longer optimal, and the associated variance is often overly optimistic. Although deep neural networks (DNNs) are widely used for general classification and prediction, they have not been studied thoroughly for data with spatial dependence. In this work, we propose a novel DNN structure for spatial prediction, where we capture the spatial dependence by adding an embedding layer of spatial coordinates with basis functions. We show in theory and simulation studies that the proposed DeepKriging method has a direct link to Kriging in the Gaussian case, and has multiple advantages over Kriging for non-Gaussian and nonstationary data. That is, it provides nonlinear predictions, and thus has smaller approximation errors. Furthermore, it does not require operations on covariance matrices, and thus is scalable for large data sets. With sufficiently many hidden neurons, the proposed method provides an optimal prediction in terms of model capacity. In addition, we quantify prediction uncertainties based on density prediction, without assuming a data distribution. Finally, we apply the method to PM2.5 concentrations across the continental United States.
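Illustrative sketch (not taken from the article): the abstract above mentions an embedding layer that expands spatial coordinates with basis functions. The Python below shows what such an expansion could look like. Gaussian radial basis functions, the 3 x 3 grid of centres, the bandwidth value, and the name `rbf_embedding` are assumptions chosen for illustration; the abstract does not say which basis functions the authors use.

```python
import numpy as np


def rbf_embedding(coords: np.ndarray, centers: np.ndarray, bandwidth: float) -> np.ndarray:
    """Expand coordinates (n, 2) into Gaussian basis features (n, k).

    Each output column measures closeness to one of the k centres.
    """
    diff = coords[:, None, :] - centers[None, :, :]   # shape (n, k, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                # shape (n, k)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


# Hypothetical 3 x 3 grid of basis centres on the unit square.
grid = np.linspace(0.0, 1.0, 3)
centers = np.array([(x, y) for x in grid for y in grid])

coords = np.array([[0.0, 0.0], [0.5, 0.5], [0.9, 0.1]])
phi = rbf_embedding(coords, centers, bandwidth=0.5)   # shape (3, 9)
```

The resulting feature matrix would be fed to an ordinary feed-forward network in place of the raw coordinates, so spatial dependence is carried by the features and no covariance matrix has to be inverted.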

Journal Article

Learning General and Specific Embedding with Transformer for Few-Shot Object Detection

by Liu, Tongliang; Zhang, Jing; Tao, Dacheng in Artificial Intelligence, Computer Imaging, Computer Science

2025

Few-shot object detection (FSOD) studies how to detect novel objects with few annotated examples effectively. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly combine both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings at the general, intermediate, and specific levels, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.

Journal Article

Deep learning and multi-omics approach to predict drug responses in cancer

by Wang, Conghao; Lye, Xintong; Rajapakse, Jagath C. in Ablation, Algorithms, Antineoplastic drugs

2022

Background: Cancers are genetically heterogeneous, so anticancer drugs show varying degrees of effectiveness on patients due to their differing genetic profiles. Knowing patients' responses to numerous cancer drugs is necessary for personalized cancer treatment. By using molecular profiles of cancer cell lines available from the Cancer Cell Line Encyclopedia (CCLE) and anticancer drug responses available in the Genomics of Drug Sensitivity in Cancer (GDSC), we build computational models to predict anticancer drug responses from molecular features. Results: We propose a novel deep neural network model that integrates multi-omics data available as gene expressions, copy number variations, gene mutations, reverse phase protein array expressions, and metabolomics expressions, in order to predict cellular responses to known anti-cancer drugs. We employ a novel graph embedding layer that incorporates interactome data as prior information for prediction. Moreover, we propose a novel attention layer that effectively combines different omics features, taking their interactions into account. The network outperformed feedforward neural networks and achieved an R² value of 0.90 for the prediction of drug responses from the cancer cell line data available in CCLE and GDSC. Conclusion: The outstanding results of our experiments demonstrate that the proposed method is capable of capturing the interactions of genes and proteins, and integrating multi-omics features effectively. Furthermore, both the results of ablation studies and the investigations of the attention layer imply that gene mutation has a greater influence on the prediction of drug responses than other omics data types. Therefore, we conclude that our approach not only predicts the anti-cancer drug response precisely but also provides insights into the reaction mechanisms of cancer cell lines and drugs.

Journal Article

Probabilistic-Based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising

by Lyu, Xianqiang; Hou, Junhui in Embedded systems, Embedding, Image reconstruction

2024

The high-dimensional nature of the 4-D light field (LF) poses great challenges in achieving efficient and effective feature embedding, which severely impacts the performance of downstream tasks. To tackle this crucial issue, in contrast to existing methods with empirically-designed architectures, we propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns in a probability space for fully capturing spatial-angular information. Building upon the proposed PFE, we then leverage the intrinsic linear imaging model of the coded aperture camera to construct a cycle-consistent 4-D LF reconstruction network from coded measurements. Moreover, we incorporate PFE into an iterative optimization framework for 4-D LF denoising. Our extensive experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images, both quantitatively and qualitatively, when compared with state-of-the-art methods. The source code will be publicly available at https://github.com/lyuxianqiang/LFCA-CR-NET.

Journal Article

Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE

by Wang, Chao; Zou, Quan in Amino Acid Sequence, Amino acids, Analysis

2023

Background: Protein solubility is a precondition for efficient heterologous protein expression at the basis of most industrial applications and for functional interpretation in basic research. However, recurrent formation of inclusion bodies is still an inevitable roadblock in protein science and industry, where only about a quarter of proteins can be successfully expressed in soluble form. Despite numerous solubility prediction models having been developed over time, their performance remains unsatisfactory in the context of the current strong increase in available protein sequences. Hence, it is imperative to develop novel and highly accurate predictors that enable the prioritization of highly soluble proteins to reduce the cost of actual experimental work. Results: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and distributed representation of amino acids. Comparison results showed that the proposed model achieved more accurate and balanced performance than existing tools. Furthermore, we explored specific features that have a dominant impact on the model performance as well as their interaction effects. Conclusions: DeepSoluE is suitable for the prediction of protein solubility in E. coli; it serves as a bioinformatics tool for prescreening of potentially soluble targets to reduce the cost of wet-experimental studies. The publicly available webserver is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/.

Journal Article

An integration of deep learning with feature embedding for protein–protein interaction prediction

by Diao, Yanyu; Yao, Yu; Zhu, Huaixu in Accuracy, Amino acid sequence, Amino acids

2019

Protein–protein interactions are closely relevant to protein function and drug discovery. Hence, accurately identifying protein–protein interactions will help us to understand the underlying molecular mechanisms and significantly facilitate drug discovery. However, the majority of existing computational methods for protein–protein interaction prediction focus on feature extraction and the combination of features, and there have been limited gains from the state-of-the-art models. In this work, a new residue representation method named Res2vec is designed for protein sequence representation. Residue representations obtained by Res2vec describe residue-residue interactions from raw sequence more precisely and supply more effective inputs for the downstream deep learning model. Combining effective feature embedding with powerful deep learning techniques, our method provides a general computational pipeline to infer protein–protein interactions, even when protein structure knowledge is entirely unknown. The proposed method DeepFE-PPI is evaluated on the S. cerevisiae and human datasets. The experimental results show that DeepFE-PPI achieves 94.78% (accuracy), 92.99% (recall), 96.45% (precision), 89.62% (Matthews correlation coefficient, MCC) and 98.71% (accuracy), 98.54% (recall), 98.77% (precision), 97.43% (MCC), respectively. In addition, we also evaluate the performance of DeepFE-PPI on five independent species datasets and all the results are superior to the existing methods. The comparisons show that DeepFE-PPI is capable of predicting protein–protein interactions by a novel residue representation method and a deep learning classification framework at an acceptable level of accuracy. The codes along with instructions to reproduce this work are available from https://github.com/xal2019/DeepFE-PPI.
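Illustrative sketch (not taken from the article): the four scores quoted in the abstract above (accuracy, recall, precision, MCC) are standard functions of a binary confusion matrix. The Python below computes them from counts. The function name `binary_metrics` and the example counts are hypothetical and are not the article's data.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, recall, precision, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "mcc": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the classes may be imbalanced.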

Journal Article

Attention-based convolutional neural network for deep face recognition

by Chen, Jiazhong; Ling, Hefei; Huang, Junrui in Artificial neural networks, Computer Communication Networks, Computer Science

2020

Discriminative feature embedding is essential in large-scale face recognition. In this paper, we propose an attention-based convolutional neural network (ACNN) for discriminative face feature embedding, which aims to decrease the information redundancy among channels and focus on the most informative components of spatial feature maps. More specifically, the proposed attention module consists of a channel attention block and a spatial attention block, which adaptively aggregate the feature maps in both channel and spatial domains to learn the inter-channel relationship matrix and the inter-spatial relationship matrix; matrix multiplications are then conducted to obtain a refined and robust face feature. With the proposed attention module, standard convolutional neural networks (CNNs) such as ResNet-50 and ResNet-101 gain more discriminative power for deep face recognition. The experiments on Labelled Faces in the Wild (LFW), Age Database (AgeDB), Celebrities in Frontal Profile (CFP) and MegaFace Challenge 1 (MF1) show that our proposed ACNN architecture consistently outperforms naive CNNs and achieves state-of-the-art performance.

Journal Article

Parameterized hypercomplex convolutional network for accurate protein backbone torsion angle prediction

by Wei, Shujia; Zhang, Lei; Yang, Wei in 631/337, 631/535/1267, 631/61

2024

Predicting the backbone torsion angles corresponding to each residue of a protein from its amino acid sequence alone is a challenging problem in computational biology. Existing torsion angle predictors mainly use profile features, which are generated by performing time-consuming multiple sequence alignments, for torsion angle prediction. Compared with traditional profile features, embedding features from pretrained protein language models have significant advantages in prediction performance and computational speed. However, embedding features usually have higher dimensions and different embedding features have significantly different dimensions. To this end, we design a novel parameter-efficient deep torsion angle predictor, PHAngle, specifically for embedding features. PHAngle is a parameterized hypercomplex convolutional network consisting of parameterized hypercomplex linear and convolutional layers whose weight parameters can be characterized as the sum of Kronecker products. Experimental results on six benchmark test sets including TEST2016, TEST2018, TEST2020_HQ, CASP12, CASP13 and CASP-FM demonstrate that PHAngle achieves state-of-the-art torsion angle prediction performance with the fewest parameters compared to the nine existing methods. The source code and datasets are available at https://github.com/fengtuan/PHAngle.
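Illustrative sketch (not taken from the article): the abstract above says the layer weights can be characterized as a sum of Kronecker products. The Python below builds such a weight matrix and compares its parameter count with a dense layer of the same size. The shapes, the choice of n = 4 terms, and the name `kronecker_sum_weight` are assumptions for illustration, not the PHAngle configuration.

```python
import numpy as np


def kronecker_sum_weight(a: np.ndarray, f: np.ndarray) -> np.ndarray:
    """Build W = sum_i kron(a[i], f[i]).

    a has shape (n, n, n) and f has shape (n, out // n, in // n),
    so W has shape (out, in).
    """
    return sum(np.kron(a[i], f[i]) for i in range(a.shape[0]))


n, out_dim, in_dim = 4, 64, 128          # hypothetical sizes
rng = np.random.default_rng(0)
a = rng.normal(size=(n, n, n))
f = rng.normal(size=(n, out_dim // n, in_dim // n))

w = kronecker_sum_weight(a, f)           # shape (64, 128)
n_params = a.size + f.size               # parameters actually stored
dense_params = out_dim * in_dim          # parameters of an unconstrained layer
```

With these sizes the factorized form stores 2112 values against 8192 for the dense matrix, which shows where the parameter savings of this kind of layer come from.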

Journal Article

Evaluating the Performance of wav2vec Embedding for Parkinson's Disease Detection

by Příhoda, David; Klempíř, Ondřej; Krupička, Radim in Classification, Datasets, deep learning

2023

Speech impairment is one of the most serious manifestations of Parkinson's disease (PD). Sophisticated language/speech models have already demonstrated impressive performance on a variety of tasks, including classification. By analysing large amounts of data from a given setting, these models can identify patterns that would be difficult for clinicians to detect. We focus on evaluating the performance of a large self-supervised speech representation model, wav2vec, for PD classification. Based on the computed wav2vec embedding for each available speech signal, we calculated two sets of 512 derived features, wav2vec-sum and wav2vec-mean. Unlike traditional signal processing methods, this approach can learn a suitable representation of the signal directly from the data without requiring manual or hand-crafted feature extraction. Using an ensemble random forest classifier, we evaluated the embedding-based features on three different healthy vs. PD datasets (a dataset in which participants rhythmically repeat the syllable /pa/, an Italian dataset, and an English dataset). The obtained results showed that the wav2vec signal representation was accurate, with a minimum area under the receiver operating characteristic curve (AUROC) of 0.77 for the /pa/ task and the best AUROC of 0.98 for the Italian speech classification. The findings highlight the potential generalisability of the wav2vec features and their performance in cross-database scenarios.
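Illustrative sketch (not taken from the article): the abstract above derives two 512-value feature sets, wav2vec-sum and wav2vec-mean, from each recording's embedding. The Python below shows one plausible reading, summing and averaging frame-level vectors over time. The pooling axis, the name `pool_frames`, and the random array standing in for real wav2vec output are assumptions.

```python
import numpy as np


def pool_frames(frame_embeddings: np.ndarray) -> tuple:
    """Reduce frame-level embeddings (n_frames, 512) to two fixed-length vectors.

    Returns (sum over frames, mean over frames), each of shape (512,).
    """
    return frame_embeddings.sum(axis=0), frame_embeddings.mean(axis=0)


rng = np.random.default_rng(0)
# Placeholder for one recording: 120 frames of 512-dimensional embeddings.
frames = rng.normal(size=(120, 512))
wav2vec_sum, wav2vec_mean = pool_frames(frames)
```

Under this reading, the sum grows with recording length while the mean does not, so the two feature sets differ only by a per-recording scale factor.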

Journal Article
