Catalogue Search | MBRL

SpliceFinder: ab initio prediction of splice sites using convolutional neural network

by Wang, Ruohan; Wang, Jianping; Li, Shuaicheng in Algorithms, Analysis, Animals

2019

Background: Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur there. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives, but it misses non-canonical splice sites. Results: We have designed SpliceFinder, based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder generates about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion: Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
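The abstract above contrasts two ideas: pre-selecting every GT/AG dinucleotide as a candidate, and scanning a long sequence with a sliding window. The following is a minimal sketch of both, not SpliceFinder's actual code. The function names, the window width, and the toy sequences are assumptions chosen for illustration.

```python
def canonical_candidates(seq):
    """Return (position, kind) for every GT or AG dinucleotide in seq.

    This mirrors the candidate pre-selection step described in the abstract:
    most hits are not real splice sites, which is why false positives are common.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer == "GT":
            hits.append((i, "donor-like"))
        elif dimer == "AG":
            hits.append((i, "acceptor-like"))
    return hits


def sliding_windows(seq, width, step=1):
    """Yield (start, window) for every full-width window along seq.

    In a window-based predictor, each window would be passed to a classifier;
    the classifier itself is outside the scope of this sketch.
    """
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


if __name__ == "__main__":
    toy = "AAGTCAGG"  # hypothetical toy sequence
    print(canonical_candidates(toy))
    print(list(sliding_windows("ACGTAC", width=4)))
```

A window-based scan does not depend on the GT/AG filter, which is one way a predictor can still reach non-canonical sites.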

Journal Article


Spliceator: multi-species splice site prediction using convolutional neural networks

by Scalzitti, Nicolas; Kress, Arnaud; Orhand, Romain in Algorithms, Analysis, Annotations

2021

Background: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.
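Convolutional networks for sequence data generally need the nucleotides turned into numbers first. The abstract does not state which encoding Spliceator uses, so the sketch below shows one-hot encoding only as a common convention. The name `one_hot` and the choice to map unknown bases to all zeros are assumptions.

```python
# Channel order A, C, G, T is an arbitrary but common choice (assumption).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(seq):
    """Encode a DNA string as a list of 4-tuples, one per base.

    Bases outside A/C/G/T (for example N) become an all-zero vector.
    """
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


if __name__ == "__main__":
    print(one_hot("AcN"))
```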

Journal Article


Impact of U2-type introns on splice site prediction in A. thaliana species using deep learning

by De Neve, Wesley; Depuydt, Stephen; Van Messem, Arnout in Acceptor sites, Algorithms, Applied Mathematics

2025

Background: Splice site prediction in plant genomes poses substantial challenges that can be addressed using deep learning models. U2-type introns are especially useful for such studies given their ubiquity in plant genomes and the availability of rich datasets. We formulated two hypotheses: one proposing that short introns may enhance prediction effectiveness due to reduced spatial complexity, and another suggesting that sequences with multiple introns provide a richer context for splicing events. Results: Our findings demonstrate that (1) models trained on datasets containing shorter introns achieve improved effectiveness for acceptor splice sites, but not for donor splice sites, indicating a more nuanced relationship between intron length and splice site prediction than initially hypothesized, and (2) models trained on datasets with multiple introns per sequence show higher effectiveness compared to those trained on datasets with a single intron per sequence. Notably, among the 402 bp sequences analyzed, 72% contained single introns while 28% contained multiple introns for donor sites (36,399 versus 13,987 sequences), with similar proportions observed for acceptor sites (37,236 versus 14,112 sequences). These computational insights align with biological observations, particularly regarding the conserved spatial relationship between branch points and acceptor splice sites, as well as the synergistic effects of multiple introns on splicing efficiency. Conclusions: The obtained results contribute to a deeper understanding of how intronic features influence splice site prediction and suggest that future prediction models should consider factors such as intron length, multiplicity, and the spatial arrangement of splice-related signals.
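The single-intron versus multiple-intron proportions quoted in the abstract can be checked directly from the sequence counts it gives. The short calculation below uses only those four counts; the function name is an assumption.

```python
def single_intron_share(single, multiple):
    """Percentage of sequences that contain exactly one intron."""
    return 100.0 * single / (single + multiple)


# Counts taken from the abstract (single-intron versus multiple-intron sequences).
donor = single_intron_share(36_399, 13_987)
acceptor = single_intron_share(37_236, 14_112)

print(f"donor sites:    {donor:.1f}% single-intron, {100 - donor:.1f}% multiple-intron")
print(f"acceptor sites: {acceptor:.1f}% single-intron, {100 - acceptor:.1f}% multiple-intron")
```

The donor figure rounds to the stated 72%/28% split, and the acceptor figure is within about half a percentage point of it, consistent with the "similar proportions" wording.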

Journal Article


OpenSpliceAI provides an efficient modular implementation of SpliceAI enabling easy retraining across nonhuman species

by Chao, Kuan-Hao; Salzberg, Steven L; Pertea, Mihaela in Animals, Computational and Systems Biology, Computational Biology - methods

2025

The SpliceAI deep learning system is currently one of the most accurate methods for identifying splicing signals directly from DNA sequences. However, its utility is limited by its reliance on older software frameworks and human-centric training data. Here, we introduce OpenSpliceAI, a trainable, open-source version of SpliceAI implemented in PyTorch to address these challenges. OpenSpliceAI supports both training from scratch and transfer learning, enabling seamless retraining on species-specific datasets and mitigating human-centric biases. Our experiments show that it achieves faster processing speeds and lower memory usage than the original SpliceAI code, allowing large-scale analyses of extensive genomic regions on a single GPU. Additionally, OpenSpliceAI’s flexible architecture makes for easier integration with established machine learning ecosystems, simplifying the development of custom splicing models for different species and applications. We demonstrate that OpenSpliceAI’s output is highly concordant with SpliceAI. In silico mutagenesis analyses confirm that both models rely on similar sequence features, and calibration experiments demonstrate similar score probability estimates.
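In silico mutagenesis, mentioned in the abstract above, substitutes each base in turn and records how the model's score changes. The sketch below illustrates the procedure with a deliberately trivial scoring function. `toy_donor_score` is a hypothetical stand-in, not SpliceAI or OpenSpliceAI, and none of these names come from either package.

```python
def toy_donor_score(seq):
    """Hypothetical stand-in for a trained model.

    Returns 1.0 when 'GT' occupies positions 3-4 of seq, otherwise 0.0.
    """
    return 1.0 if seq[3:5] == "GT" else 0.0


def in_silico_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    ref_score = score_fn(seq)
    deltas = {}
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, alt)] = score_fn(mutant) - ref_score
    return deltas


if __name__ == "__main__":
    deltas = in_silico_mutagenesis("CAGGTAAG", toy_donor_score)
    # Positions whose substitution changes the score are the ones the scorer relies on.
    print(sorted(pos for (pos, _), d in deltas.items() if d != 0.0))
```

Comparing such per-position score changes between two models is one way to check whether they rely on similar sequence features.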

Journal Article


DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks

by Zhang, Hongyan; Zhu, Lei; Zhu, Xinghui in Annotations, Arabidopsis - genetics, Arabidopsis thaliana

2024

The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving average accuracies of 96.57% (donor) and 95.82% (acceptor) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, with relative reductions in average error rate of at least 4.2% (donor) and 11.6% (acceptor). We used the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies of 94.89% and 94.25% for donor and acceptor sites, respectively. These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
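The abstract reports both accuracies and relative reductions in error rate, which are easy to confuse. The sketch below shows how a relative error reduction is conventionally computed from two accuracies. The baseline accuracy used in the example is hypothetical and is not a figure from the paper, and the paper's own averaging procedure may differ.

```python
def relative_error_reduction(acc_model, acc_baseline):
    """Relative reduction in error rate (percent), given two accuracies in percent.

    Error rate is taken as 100 - accuracy.
    """
    err_model = 100.0 - acc_model
    err_baseline = 100.0 - acc_baseline
    return 100.0 * (err_baseline - err_model) / err_baseline


if __name__ == "__main__":
    model_acc = 96.57      # donor accuracy reported in the abstract
    baseline_acc = 96.00   # HYPOTHETICAL baseline, for illustration only
    print(f"absolute gain: {model_acc - baseline_acc:.2f} points")
    print(f"relative error reduction: {relative_error_reduction(model_acc, baseline_acc):.2f}%")
```

When accuracies are already high, a gain of a fraction of a percentage point can correspond to a much larger relative reduction in errors.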

Journal Article


Novel LYST Variants Lead to Aberrant Splicing in a Patient with Chediak–Higashi Syndrome

by Abasov, Ruslan; Raykina, Elena; Rodina, Yulia in Albinism, Case Report, Chediak-Higashi syndrome

2025

Background: The advent of next-generation sequencing (NGS) has revolutionized the analysis of genetic data, enabling rapid identification of pathogenic variants in patients with inborn errors of immunity (IEI). However, the use of NGS-based technologies is sometimes associated with challenges in evaluating the clinical significance of novel genetic variants. Methods: In silico prediction tools, such as the SpliceAI neural network, are often used as a first-tier approach for the primary examination of genetic variants of uncertain clinical significance. Such tools allow us to parse genetic data and highlight potential splice-altering variants. Further variant assessment requires precise RNA assessment by agarose gel electrophoresis and/or cDNA Sanger sequencing. Results: We found two novel heterozygous variants in the coding region of the LYST gene (c.10104G>T, c.10894A>G) in an individual with a typical clinical presentation of Chediak–Higashi syndrome (CHS). The SpliceAI neural network predicted both variants as probably splice-altering. cDNA assessment by agarose gel electrophoresis revealed abnormally shortened splicing products for each variant, and cDNA Sanger sequencing demonstrated that the c.10104G>T and c.10894A>G substitutions resulted in a shortening of exons 44 and 49 by 41 and 47 bp, respectively. Both mutations probably lead to a frameshift and the formation of a premature termination codon. This, in turn, may disrupt the structure and/or function of the LYST protein. Conclusions: We identified two novel variants in the LYST gene, predicted to be deleterious by the SpliceAI neural network. Agarose gel cDNA electrophoresis and cDNA Sanger sequencing allowed us to verify inappropriate splicing patterns and establish these variants as disease-causing.
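The link between the reported exon shortenings and a probable frameshift follows from codon arithmetic: removing a number of bases that is not a multiple of three shifts the reading frame downstream. A minimal check, using only the two lengths given in the abstract (the function name is an assumption):

```python
def causes_frameshift(deleted_bp):
    """True when removing deleted_bp bases from a coding sequence shifts the reading frame."""
    return deleted_bp % 3 != 0


# (exon number, bases lost) as reported in the abstract
for exon, lost in ((44, 41), (49, 47)):
    print(f"exon {exon}: {lost} bp lost, remainder {lost % 3}, frameshift={causes_frameshift(lost)}")
```

Neither 41 nor 47 is divisible by three, which agrees with the abstract's statement that both variants probably cause a frameshift.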

Journal Article


Prediction of Back-splicing sites for CircRNA formation based on convolutional neural networks

by Shen, Zhen; Liu, Wei; Yuan, Lin in Algorithms, Animal Genetics and Genomics, Artificial neural networks

2022

Background: Circular RNAs (CircRNAs) play critical roles in gene expression regulation and disease development. Understanding the mechanism that regulates CircRNA formation can help reveal the role of CircRNAs in these biological processes. Back-splicing is important for CircRNA formation, and predicting back-splicing sites helps uncover how CircRNAs form. Several methods have been proposed for back-splicing site prediction or circRNA-related prediction tasks, but their performance was constrained by a poor ability to learn and use features. Results: In this study, CircCNN is proposed to predict pre-mRNA back-splicing sites. A convolutional neural network and batch normalization are the main components of CircCNN. Experimental results on three datasets show that CircCNN outperforms other baseline models. Moreover, PPM (Position Probability Matrix) features extracted by CircCNN were converted to motifs. Further analysis reveals that some of the motifs found by CircCNN match known motifs involved in gene expression regulation, and that the distribution of motifs and special short sequences is important for pre-mRNA back-splicing. Conclusions: In general, the findings of this study provide a new direction for exploring CircRNA-related gene expression regulatory mechanisms and identifying potential targets for complex malignant diseases. The datasets and source code of this study are freely available at: https://github.com/szhh521/CircCNN.
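A position probability matrix (PPM) records, for each position of a motif, how often each base occurs there. The sketch below builds a PPM from a handful of aligned sequences and reads off a consensus motif. It is a generic illustration: how CircCNN derives its PPMs from the network is not described in the abstract, and the toy sequences and function names are assumptions.

```python
from collections import Counter


def position_probability_matrix(seqs):
    """Return a list of {base: probability} dicts, one per aligned position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    ppm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        ppm.append({base: counts.get(base, 0) / len(seqs) for base in "ACGT"})
    return ppm


def consensus(ppm):
    """Most probable base at each position, joined into a motif string."""
    return "".join(max(column, key=column.get) for column in ppm)


if __name__ == "__main__":
    aligned = ["AGGT", "AGGT", "CGGT", "AGAT"]  # hypothetical aligned sites
    ppm = position_probability_matrix(aligned)
    print(ppm[0])
    print(consensus(ppm))
```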

Journal Article


Splice site identification in human genome using random forest

by Ozen, Mustafa; Aydin, Nizamettin; Pashaei, Elham in Algorithms, Biological and Medical Physics, Biomedical Engineering and Bioengineering

2017

Gene identification has become an increasingly important task owing to developments stemming from the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so the development of new methods that detect splice sites accurately is crucial. Machine learning classifiers are used to detect splice sites. The performance of those classifiers depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods, and the support vector machine (SVM) is known as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, the performance of RF for feature selection and classification in the splice site domain has been investigated. We proposed three methods, namely MM1-RF, MM2-RF and MCM-RF, which combine RF with the first-order Markov model (MM1), the second-order Markov model (MM2), and the Markov chain model (MCM). We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also evaluated the proposed methods against other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM when the same Markovian encoding methods are used on both donor and acceptor datasets. Furthermore, the RF classifier is much faster than the SVM classifier in detecting splice sites.
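A first-order Markov encoding replaces each nucleotide with the probability of seeing it given the preceding nucleotide at that position, as estimated from training sequences. The sketch below is one plausible reading of such an encoding. The paper's exact MM1 definition is not given in the abstract, so the position-specific tables, the pseudocount, and all names here are assumptions.

```python
from collections import defaultdict


def fit_mm1(seqs, pseudocount=1.0):
    """Estimate position-specific P(current base | previous base) from aligned sequences.

    Returns one table per position 1..L-1, keyed by (previous, current).
    """
    length = len(seqs[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for s in seqs:
        for i in range(1, length):
            counts[i - 1][(s[i - 1], s[i])] += 1.0
    model = []
    for pos_counts in counts:
        table = {}
        for prev in "ACGT":
            total = sum(pos_counts[(prev, cur)] for cur in "ACGT") + 4 * pseudocount
            for cur in "ACGT":
                table[(prev, cur)] = (pos_counts[(prev, cur)] + pseudocount) / total
        model.append(table)
    return model


def encode_mm1(seq, model):
    """Turn a sequence into a feature vector of conditional probabilities."""
    return [model[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]


if __name__ == "__main__":
    training = ["AGGT", "AGGT", "AGAT"]  # hypothetical true-site windows
    model = fit_mm1(training)
    print(encode_mm1("AGGT", model))
```

The resulting numeric vector is the kind of input a classifier such as a random forest or an SVM could be trained on.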

Journal Article


Comprehensive genomic analysis of PKHD1 mutations in ARPKD cohorts

by Somlo, S; Gubler, M-C; Guay-Woodford, L M in Amino acids, ARPKD, autosomal recessive polycystic kidney disease

2005

Journal Article


Two new methods for DNA splice site prediction based on neuro-fuzzy network and clustering

by Kia, Mohammad; Manzuri Shalmani, Mohammad Taghi; Moghimi, Fahimeh in Artificial Intelligence, Computational Biology/Bioinformatics, Computational Science and Engineering

2013

Genetic disorders such as cancer and birth defects are a great threat to human life. Since these diseases were first recognized, much research has been carried out to identify them and find cures. These disorders affect genes and appear as abnormal traits in an organism. To recognize abnormal genes, we need to predict splice sites in a DNA signal; we can then process the genetic code between two consecutive splice sites and analyze the trait that it represents. In addition to abnormal genes and their consequent disorders, we can also identify normal human traits such as physical and mental features. The primary issue is therefore to estimate splice sites precisely. In this paper, we introduce two new methods that use a neuro-fuzzy network and clustering for DNA splice site prediction. Instead of using the raw nucleotide sequence as input to the neural network, a survey of the first batch of nucleotide sequences in the true and false categories of the input is carried out, and the neuro-fuzzy network is trained on the similarities and dissimilarities of the selected sequences. In addition, the sequences of the large input data are clustered into smaller categories to improve prediction, since they are in fact spliced by different mechanisms. Experimental results show that these improvements increase the recognition rate of splice sites.

Journal Article
