Catalogue Search | MBRL

Modeling aspects of the language of life through transfer-learning protein sequences

by Rost, Burkhard , Elnaggar, Ahmed , Nechaev, Dmitrii in Algorithms , Amino Acid Sequence , Amino acids

2019

Background Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome . Both these problems are addressed by the new methodology introduced here. Results We introduced a novel way to represent protein sequences as continuous vectors ( embeddings ) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec ( Seq uence-to- Vec tor) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even did beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis. Conclusion Transfer-learning succeeded to extract information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences better than any features suggested by textbooks and prediction methods. The exception is evolutionary information, however, that information is not available on the level of a single sequence.

Journal Article

Share this book

Add to My Shelf

DPDDI: a deep predictor for drug-drug interactions

by Zhang, Shao-Wu , Shi, Jian-Yu , Feng, Yue-Hua in Agglomeration , Algorithms , Analysis

2020

Background The treatment of complex diseases by taking multiple drugs becomes increasingly popular. However, drug-drug interactions (DDIs) may give rise to the risk of unanticipated adverse effects and even unknown toxicity. DDI detection in the wet lab is expensive and time-consuming. Thus, it is highly desired to develop the computational methods for predicting DDIs. Generally, most of the existing computational methods predict DDIs by extracting the chemical and biological features of drugs from diverse drug-related properties, however some drug properties are costly to obtain and not available in many cases. Results In this work, we presented a novel method (namely DPDDI) to predict DDIs by extracting the network structure features of drugs from DDI network with graph convolution network (GCN), and the deep neural network (DNN) model as a predictor. GCN learns the low-dimensional feature representations of drugs by capturing the topological relationship of drugs in DDI network. DNN predictor concatenates the latent feature vectors of any two drugs as the feature vector of the corresponding drug pairs to train a DNN for predicting the potential drug-drug interactions. Experiment results show that, the newly proposed DPDDI method outperforms four other state-of-the-art methods; the GCN-derived latent features include more DDI information than other features derived from chemical, biological or anatomical properties of drugs; and the concatenation feature aggregation operator is better than two other feature aggregation operators (i.e., inner product and summation). The results in case studies confirm that DPDDI achieves reasonable performance in predicting new DDIs. Conclusion We proposed an effective and robust method DPDDI to predict the potential DDIs by utilizing the DDI network information without considering the drug properties (i.e., drug chemical and biological properties). The method should also be useful in other DDI-related scenarios, such as the detection of unexpected side effects, and the guidance of drug combination.

Journal Article

Share this book

Add to My Shelf

Graph-based prediction of Protein-protein interactions with attributed signed graph embedding

by Song, Dandan , Yang, Fang , Lin, Huakang in Accuracy , Algorithms , Bioinformatics

2020

Background Protein-protein interactions (PPIs) are central to many biological processes. Considering that the experimental methods for identifying PPIs are time-consuming and expensive, it is important to develop automated computational methods to better predict PPIs. Various machine learning methods have been proposed, including a deep learning technique which is sequence-based that has achieved promising results. However, it only focuses on sequence information while ignoring the structural information of PPI networks. Structural information of PPI networks such as their degree, position, and neighboring nodes in a graph has been proved to be informative in PPI prediction. Results Facing the challenge of representing graph information, we introduce an improved graph representation learning method. Our model can study PPI prediction based on both sequence information and graph structure. Moreover, our study takes advantage of a representation learning model and employs a graph-based deep learning method for PPI prediction, which shows superiority over existing sequence-based methods. Statistically, Our method achieves state-of-the-art accuracy of 99.15% on Human protein reference database (HPRD) dataset and also obtains best results on Database of Interacting Protein (DIP) Human, Drosophila , Escherichia coli ( E. coli ), and Caenorhabditis elegans ( C. elegan ) datasets. Conclusion Here, we introduce signed variational graph auto-encoder (S-VGAE), an improved graph representation learning method, to automatically learn to encode graph structure into low-dimensional embeddings. Experimental results demonstrate that our method outperforms other existing sequence-based methods on several datasets. We also prove the robustness of our model for very sparse networks and the generalization for a new dataset that consists of four datasets: HPRD, E.coli , C.elegan , and Drosophila .

Journal Article

Share this book

Add to My Shelf

Comprehensive ensemble in QSAR prediction for drug discovery

by Yoon, Sungroh , Jo, Jeonghee , Bae, Ho in Algorithms , Analysis , Bioinformatics

2019

Background Quantitative structure-activity relationship (QSAR) is a computational modeling method for revealing relationships between structural properties of chemical compounds and biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, the most prevalent approach random forest and other ensemble approaches in QSAR prediction limit their model diversity to a single subject. Results The proposed ensemble method consistently outperformed thirteen individual models on 19 bioassay datasets and demonstrated superiority over other ensemble approaches that are limited to a single subject. The comprehensive ensemble method is publicly available at http://data.snu.ac.kr/QSAR/ . Conclusions We propose a comprehensive ensemble method that builds multi-subject diversified models and combines them through second-level meta-learning. In addition, we propose an end-to-end neural network-based individual classifier that can automatically extract sequential features from a simplified molecular-input line-entry system (SMILES). The proposed individual models did not show impressive results as a single model, but it was considered the most important predictor when combined, according to the interpretation of the meta-learning.

Journal Article

Share this book

Add to My Shelf

Prediction of heart disease and classifiers’ sensitivity analysis

by Almustafa, Khaled Mohamad in Algorithms , Bayesian analysis , Bioinformatics

2020

Background Heart disease (HD) is one of the most common diseases nowadays, and an early diagnosis of such a disease is a crucial task for many health care providers to prevent their patients for such a disease and to save lives. In this paper, a comparative analysis of different classifiers was performed for the classification of the Heart Disease dataset in order to correctly classify and or predict HD cases with minimal attributes. The set contains 76 attributes including the class attribute, for 1025 patients collected from Cleveland, Hungary, Switzerland, and Long Beach, but in this paper, only a subset of 14 attributes are used, and each attribute has a given set value. The algorithms used K- Nearest Neighbor (K-NN), Naive Bayes, Decision tree J48, JRip, SVM, Adaboost, Stochastic Gradient Decent (SGD) and Decision Table (DT) classifiers to show the performance of the selected classifications algorithms to best classify, and or predict, the HD cases. Results It was shown that using different classification algorithms for the classification of the HD dataset gives very promising results in term of the classification accuracy for the K-NN (K = 1), Decision tree J48 and JRip classifiers with accuracy of classification of 99.7073, 98.0488 and 97.2683% respectively. A feature extraction method was performed using Classifier Subset Evaluator on the HD dataset, and results show enhanced performance in term of the classification accuracy for K-NN ( N = 1) and Decision Table classifiers to 100 and 93.8537% respectively after using the selected features by only applying a combination of up to 4 attributes instead of 13 attributes for the predication of the HD cases. Conclusion Different classifiers were used and compared to classify the HD dataset, and we concluded the benefit of having a reliable feature selection method for HD disease prediction with using minimal number of attributes instead of having to consider all available ones.

Journal Article

Share this book

Add to My Shelf

ProteinNet: a standardized data set for machine learning of protein structure

by AlQuraishi, Mohammed in Accessibility , Algorithms , Amino Acid Sequence

2019

Background Rapid progress in deep learning has spurred its application to bioinformatics problems including protein structure prediction and design. In classic machine learning problems like computer vision, progress has been driven by standardized data sets that facilitate fair assessment of new methods and lower the barrier to entry for non-domain experts. While data sets of protein sequence and structure exist, they lack certain components critical for machine learning, including high-quality multiple sequence alignments and insulated training/validation splits that account for deep but only weakly detectable homology across protein space. Results We created the ProteinNet series of data sets to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships. ProteinNet integrates sequence, structure, and evolutionary information in programmatically accessible file formats tailored for machine learning frameworks. Multiple sequence alignments of all structurally characterized proteins were created using substantial high-performance computing resources. Standardized data splits were also generated to emulate the difficulty of past CASP (Critical Assessment of protein Structure Prediction) experiments by resetting protein sequence and structure space to the historical states that preceded six prior CASPs. Utilizing sensitive evolution-based distance metrics to segregate distantly related proteins, we have additionally created validation sets distinct from the official CASP sets that faithfully mimic their difficulty. Conclusion ProteinNet represents a comprehensive and accessible resource for training and assessing machine-learned models of protein structure.

Journal Article

Share this book

Add to My Shelf

Amino Acid Encoding for Deep Learning Applications

by Bromberg, Yana , Lenz, Tobias , Wendorff, Mareike in Algorithms , Amino acid encoding , Amino acids

2020

Background: The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially, when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction – a process called ‘end-to-end learning’ – has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. Results: By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. Conclusion: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.

Journal Article

Share this book

Add to My Shelf

FeatureSelect: a software for feature selection based on machine learning approaches

by Masoudi-Sobhanzadeh, Yosef , Masoudi-Nejad, Ali , Motieghader, Habib in Algorithms , Artificial intelligence , Bioinformatics

2019

Background Feature selection, as a preprocessing stage, is a challenging problem in various sciences such as biology, engineering, computer science, and other fields. For this purpose, some studies have introduced tools and softwares such as WEKA. Meanwhile, these tools or softwares are based on filter methods which have lower performance relative to wrapper methods. In this paper, we address this limitation and introduce a software application called FeatureSelect. In addition to filter methods, FeatureSelect consists of optimisation algorithms and three types of learners. It provides a user-friendly and straightforward method of feature selection for use in any kind of research, and can easily be applied to any type of balanced and unbalanced data based on several score functions like accuracy, sensitivity, specificity, etc. Results In addition to our previously introduced optimisation algorithm (WCC), a total of 10 efficient, well-known and recently developed algorithms have been implemented in FeatureSelect. We applied our software to a range of different datasets and evaluated the performance of its algorithms. Acquired results show that the performances of algorithms are varying on different datasets, but WCC, LCA, FOA, and LA are suitable than others in the overall state. The results also show that wrapper methods are better than filter methods. Conclusions FeatureSelect is a feature or gene selection software application which is based on wrapper methods. Furthermore, it includes some popular filter methods and generates various comparison diagrams and statistical measurements. It is available from GitHub ( https://github.com/LBBSoft/FeatureSelect ) and is free open source software under an MIT license.

Journal Article

Share this book

Add to My Shelf

Novel deep learning model for more accurate prediction of drug-drug interaction effects

by Lee, Geonhee , Park, Chihyun , Ahn, Jaegyoon in Algorithms , Autoencoder , Bioinformatics

2019

Background Predicting the effect of drug-drug interactions (DDIs) precisely is important for safer and more effective drug co-prescription. Many computational approaches to predict the effect of DDIs have been proposed, with the aim of reducing the effort of identifying these interactions in vivo or in vitro, but room remains for improvement in prediction performance. Results In this study, we propose a novel deep learning model to predict the effect of DDIs more accurately.. The proposed model uses autoencoders and a deep feed-forward network that are trained using the structural similarity profiles (SSP), Gene Ontology (GO) term similarity profiles (GSP), and target gene similarity profiles (TSP) of known drug pairs to predict the pharmacological effects of DDIs. The results show that GSP and TSP increase the prediction accuracy when using SSP alone, and the autoencoder is more effective than PCA for reducing the dimensions of each profile. Our model showed better performance than the existing methods, and identified a number of novel DDIs that are supported by medical databases or existing research. Conclusions We present a novel deep learning model for more accurate prediction of DDIs and their effects, which may assist in future research to discover novel DDIs and their pharmacological effects.

Journal Article

Share this book

Add to My Shelf

Malaria parasite detection in thick blood smear microscopic images using modified YOLOV3 and YOLOV4 models

by Aliy, Mohammed , Fante, Kinde Anlay , Abdurahman, Fetulhak in Algorithms , Animals , Bioinformatics

2021

Background Manual microscopic examination of Leishman/Giemsa stained thin and thick blood smear is still the “gold standard” for malaria diagnosis. One of the drawbacks of this method is that its accuracy, consistency, and diagnosis speed depend on microscopists’ diagnostic and technical skills. It is difficult to get highly skilled microscopists in remote areas of developing countries. To alleviate this problem, in this paper, we propose to investigate state-of-the-art one-stage and two-stage object detection algorithms for automated malaria parasite screening from microscopic image of thick blood slides. Results YOLOV3 and YOLOV4 models, which are state-of-the-art object detectors in accuracy and speed, are not optimized for detecting small objects such as malaria parasites in microscopic images. We modify these models by increasing feature scale and adding more detection layers to enhance their capability of detecting small objects without notably decreasing detection speed. We propose one modified YOLOV4 model, called YOLOV4-MOD and two modified models of YOLOV3, which are called YOLOV3-MOD1 and YOLOV3-MOD2. Besides, new anchor box sizes are generated using K-means clustering algorithm to exploit the potential of these models in small object detection. The performance of the modified YOLOV3 and YOLOV4 models were evaluated on a publicly available malaria dataset. These models have achieved state-of-the-art accuracy by exceeding performance of their original versions, Faster R-CNN, and SSD in terms of mean average precision (mAP), recall, precision, F1 score, and average IOU. YOLOV4-MOD has achieved the best detection accuracy among all the other models with a mAP of 96.32%. YOLOV3-MOD2 and YOLOV3-MOD1 have achieved mAP of 96.14% and 95.46%, respectively. Conclusions The experimental results of this study demonstrate that performance of modified YOLOV3 and YOLOV4 models are highly promising for detecting malaria parasites from images captured by a smartphone camera over the microscope eyepiece. The proposed system is suitable for deployment in low-resource setting areas.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter