Catalogue Search | MBRL

Comprehensive ensemble in QSAR prediction for drug discovery

by Yoon, Sungroh , Jo, Jeonghee , Bae, Ho in Algorithms , Bioinformatics , Biomedical and Life Sciences

2019

Background: Quantitative structure-activity relationship (QSAR) modeling is a computational method for revealing relationships between the structural properties of chemical compounds and their biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome these constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, the most prevalent approach, random forest, and other ensemble approaches in QSAR prediction limit their model diversity to a single subject. Results: The proposed ensemble method consistently outperformed thirteen individual models on 19 bioassay datasets and demonstrated superiority over other ensemble approaches that are limited to a single subject. The comprehensive ensemble method is publicly available at http://data.snu.ac.kr/QSAR/. Conclusions: We propose a comprehensive ensemble method that builds multi-subject diversified models and combines them through second-level meta-learning. In addition, we propose an end-to-end neural network-based individual classifier that can automatically extract sequential features from a simplified molecular-input line-entry system (SMILES). The proposed individual model did not show impressive results as a single model, but it was considered the most important predictor when combined, according to the interpretation of the meta-learning.

Journal Article

Prediction of the sequence-specific cleavage activity of Cas9 variants

by Seonwoo, Min , Kim Hyongbum Henry , Kim Nahye in Cleavage , Computer applications , Genomes

2020

Several Streptococcus pyogenes Cas9 (SpCas9) variants have been developed to improve the enzyme’s specificity or to alter or broaden its protospacer-adjacent motif (PAM) compatibility, but selecting the optimal variant for a given target sequence and application remains difficult. To build computational models to predict the sequence-specific activity of 13 SpCas9 variants, we first assessed their cleavage efficiency at 26,891 target sequences. We found that, of the 256 possible four-nucleotide NNNN sequences, 156 can be used as a PAM by at least one of the SpCas9 variants. For the high-fidelity variants, overall activity could be ranked as SpCas9 ≥ Sniper-Cas9 > eSpCas9(1.1) > SpCas9-HF1 > HypaCas9 ≈ xCas9 >> evoCas9, whereas their overall specificities could be ranked as evoCas9 >> HypaCas9 ≥ SpCas9-HF1 ≈ eSpCas9(1.1) > xCas9 > Sniper-Cas9 > SpCas9. Using these data, we developed 16 deep-learning-based computational models that accurately predict the activity of these variants at any target sequence. Deep-learning models predict the Cas9 variant with optimal activity and specificity for any target sequence.
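The figure of 256 possible four-nucleotide sequences follows from 4 bases at each of 4 positions (4^4). The sketch below is only an illustration of that count, not code from the study: the function names and the example pattern "NGGN" are assumptions chosen for the demonstration, and only literal bases and the wildcard N are supported.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_pams(length=4):
    """Enumerate every nucleotide sequence of the given length."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=length)]


def matches_pattern(pam, pattern):
    """Return True if pam matches pattern, where N in the pattern means any base.

    Hypothetical helper: handles only literal bases and N, not the full IUPAC code.
    """
    return len(pam) == len(pattern) and all(
        symbol == "N" or symbol == base for base, symbol in zip(pam, pattern)
    )


pams = all_pams(4)
# Illustrative pattern only: fixes positions 2 and 3 to G and leaves the rest free.
ngg_like = [p for p in pams if matches_pattern(p, "NGGN")]

print(len(pams))      # 256 candidate four-nucleotide sequences
print(len(ngg_like))  # 16 of them match the illustrative NGGN pattern
```

Which of the 256 candidates actually work as a PAM for a given variant is an experimental result reported in the article and cannot be derived from this enumeration.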

Journal Article

Sequence-specific prediction of the efficiencies of adenine and cytosine base editors

by Kim Younggwang , Shin Jeong Hong , Seonwoo, Min in Adenine , Computer applications , Cytosine

2020

Base editors, including adenine base editors (ABEs) and cytosine base editors (CBEs), are widely used to induce point mutations. However, determining whether a specific nucleotide in its genomic context can be edited requires time-consuming experiments. Furthermore, when the editable window contains multiple target nucleotides, various genotypic products can be generated. To develop computational tools to predict base-editing efficiency and outcome product frequencies, we first evaluated the efficiencies of an ABE and a CBE and the outcome product frequencies at 13,504 and 14,157 target sequences, respectively, in human cells. We found that there were only modest asymmetric correlations between the activities of the base editors and Cas9 at the same targets. Using deep-learning-based computational modeling, we built tools to predict the efficiencies and outcome frequencies of ABE- and CBE-directed editing at any target sequence, with Pearson correlations ranging from 0.50 to 0.95. These tools and results will facilitate modeling and therapeutic correction of genetic diseases by base editing. The activity of adenine or cytosine base editors at specific target nucleotides is predicted computationally.
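The Pearson correlations quoted above compare predicted and measured values. As a reminder of what that statistic computes, here is a minimal sketch; the function name and the numbers are made up for illustration and are not data or code from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up editing efficiencies (percent), purely illustrative.
measured = [5.0, 12.0, 30.0, 44.0, 61.0]
predicted = [8.0, 10.0, 27.0, 50.0, 55.0]
r = pearson(measured, predicted)
print(round(r, 3))
```

A value near 1 means the predictions rise and fall with the measurements; it does not by itself show that the predicted magnitudes are accurate.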

Journal Article

DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing

by Yoon, Sungroh , Weissman, Tsachy , Moon, Taesup in Algorithms , Automation , Base Sequence

2017

We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.

Journal Article

Mass spectra prediction with structural motif-based graph neural networks

by Yoon, Sungroh , Jo, Jeonghee , Park, Jiwon in 639/166 , 639/301 , 639/638

2024

Mass spectra, which are agglomerations of ionized fragments from targeted molecules, play a crucial role across various fields for the identification of molecular structures. A prevalent analysis method involves spectral library searches, where unknown spectra are cross-referenced with a database. The effectiveness of such search-based approaches, however, is restricted by the scope of the existing mass spectra database, underscoring the need to expand the database via mass spectra prediction. In this research, we propose the Motif-based Mass Spectrum prediction Network (MoMS-Net), a GNN-based architecture that predicts mass spectra patterns using the structural motif information of the molecule. MoMS-Net represents both a molecule and its substructures as graphs, which facilitates the incorporation of long-range dependencies while using less memory than the graph transformer model. We evaluated our model on various types of mass spectra and demonstrated its validity and its superiority over conventional models.

Journal Article

Generation of a more efficient prime editor 2 by addition of the Rad51 DNA-binding domain

by Myungjae Song , Jung Min Lim , Seonwoo Min in 42/41 , 631/1647/1511 , 631/1647/1513/1967/3196

2021

Although prime editing is a promising genome editing method, the efficiency of prime editor 2 (PE2) is often insufficient. Here we generate a more efficient variant of PE2, named hyPE2, by adding the Rad51 DNA-binding domain. When tested at endogenous sites, hyPE2 shows a median of 1.5- or 1.4-fold (range, 0.99- to 2.6-fold) higher efficiencies than PE2; furthermore, at sites where PE2-induced prime editing is very inefficient (efficiency < 1%), hyPE2 enables prime editing with efficiencies ranging from 1.1% to 2.9% at up to 34% of target sequences, potentially facilitating prime editing applications. While prime editing is a promising technology, PE2 systems often have low efficiency. Here the authors fuse a Rad51 DNA-binding domain to create hyPE2 with improved editing efficiency.

Journal Article

Protein transfer learning improves identification of heat shock protein families

by Yoon, Sungroh , Min, Seonwoo , Kim, HyunGi in Algorithms , Analysis , Artificial intelligence

2021

Heat shock proteins (HSPs) play a pivotal role as molecular chaperones against unfavorable conditions. Although HSPs are of great importance, their computational identification remains a significant challenge. Previous studies have two major limitations. First, they relied heavily on amino acid composition features, which inevitably limited their prediction performance. Second, their prediction performance was overestimated because of the independent two-stage evaluations and train-test data redundancy. To overcome these limitations, we introduce two novel deep learning algorithms: (1) time-efficient DeepHSP and (2) high-performance DeeperHSP. We propose a convolutional neural network (CNN)-based DeepHSP that classifies both non-HSPs and six HSP families simultaneously. It outperforms state-of-the-art algorithms, despite taking 14–15 times less time for both training and inference. We further improve the performance of DeepHSP by taking advantage of protein transfer learning. While DeepHSP is trained on raw protein sequences, DeeperHSP is trained on top of pre-trained protein representations. Therefore, DeeperHSP remarkably outperforms state-of-the-art algorithms increasing F1 scores in both cross-validation and independent test experiments by 20% and 10%, respectively. We envision that the proposed algorithms can provide a proteome-wide prediction of HSPs and help in various downstream analyses for pathology and clinical research.

Journal Article

RNA design rules from a massive open laboratory

by Azizyan, Martin , Treuille, Adrien , Lee, Minjae in Algorithms , artificial intelligence , Au pairs

2014

Self-assembling RNA molecules present compelling substrates for the rational interrogation and control of living systems. However, imperfect in silico models—even at the secondary structure level—hinder the design of new RNAs that function properly when synthesized. Here, we present a unique and potentially general approach to such empirical problems: the Massive Open Laboratory. The EteRNA project connects 37,000 enthusiasts to RNA design puzzles through an online interface. Uniquely, EteRNA participants not only manipulate simulated molecules but also control a remote experimental pipeline for high-throughput RNA synthesis and structure mapping. We show herein that the EteRNA community leveraged dozens of cycles of continuous wet laboratory feedback to learn strategies for solving in vitro RNA design problems on which automated methods fail. The top strategies—including several previously unrecognized negative design rules—were distilled by machine learning into an algorithm, EteRNABot. Over a rigorous 1-year testing phase, both the EteRNA community and EteRNABot significantly outperformed prior algorithms in a dozen RNA secondary structure design tests, including the creation of dendrimer-like structures and scaffolds for small molecule sensors. These results show that an online community can carry out large-scale experiments, hypothesis generation, and algorithm design to create practical advances in empirical science.

Journal Article

CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing

by Yoon, Sungroh , Kwon, Sunyoung , Lee, Byunghan in Algorithms , Ambient intelligence , Assembly

2014

Merging the forward and reverse reads from paired-end sequencing is a critical task that can significantly improve the performance of downstream tasks, such as genome assembly and mapping, by providing them with virtually elongated reads. However, due to the inherent limitations of most paired-end sequencers, the chance of observing erroneous bases grows rapidly as the end of a read is approached, which becomes a critical hurdle for accurately merging paired-end reads. Although there exist several sophisticated approaches to this problem, their performance in terms of quality of merging often remains unsatisfactory. To address this issue, here we present a context-aware scheme for paired-end reads (CASPER): a computational method to rapidly and robustly merge overlapping paired-end reads. Being particularly well suited to amplicon sequencing applications, CASPER is thoroughly tested with both simulated and real high-throughput amplicon sequencing data. According to our experimental results, CASPER significantly outperforms existing state-of-the-art paired-end merging tools in terms of accuracy and robustness. CASPER also exploits the parallelism in the task of paired-end merging and effectively speeds up by multithreading. CASPER is freely available for academic use at http://best.snu.ac.kr/casper.
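To make the merging task concrete, the sketch below joins a forward read and a reverse read at their longest exact overlap. This is a deliberately naive baseline and not CASPER's context-aware method: it assumes error-free reads, and the function names, the min_overlap parameter, and the example fragment are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=5):
    """Merge a read pair at the longest exact suffix/prefix overlap.

    The reverse read is first reverse-complemented so both reads are on the
    same strand. Returns the merged sequence, or None if no exact overlap of
    at least min_overlap bases exists.
    """
    rc = reverse_complement(reverse)
    for k in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None


# Hypothetical 20-base fragment sequenced from both ends with a 6-base overlap.
fragment = "ACGTTGCAAGGCTTAACCGG"
forward = fragment[:14]
reverse = reverse_complement(fragment[8:])
print(merge_pair(forward, reverse) == fragment)  # True
```

Exact matching breaks down as soon as the overlapping region contains a sequencing error, which is most likely near read ends; that failure mode is the problem the abstract describes.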

Journal Article

High-throughput analysis of the activities of xCas9, SpCas9-NG and SpCas9 at matched and mismatched target sequences in human cells

by Huang, Tony P. , Kim, Hyongbum Henry , Min, Seonwoo in 45/41 , 631/1647/1511 , 631/1647/1513/1967/3196

2020

The applications of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing can be limited by a lack of compatible protospacer adjacent motifs (PAMs), insufficient on-target activity and off-target effects. Here, we report an extensive comparison of the PAM-sequence compatibilities and the on-target and off-target activities of Cas9 from Streptococcus pyogenes (SpCas9) and the SpCas9 variants xCas9 and SpCas9-NG (which are known to have broader PAM compatibility than SpCas9) at 26,478 lentivirally integrated target sequences and 78 endogenous target sites in human cells. We found that xCas9 has the lowest tolerance for mismatched target sequences and that SpCas9-NG has the broadest PAM compatibility. We also show, on the basis of newly identified non-NGG PAM sequences, that SpCas9-NG and SpCas9 can edit six previously unedited endogenous sites associated with genetic diseases. Moreover, we provide deep-learning models that predict the activities of xCas9 and SpCas9-NG at the target sequences. The resulting deeper understanding of the activities of xCas9, SpCas9-NG and SpCas9 in human cells should facilitate their use. A comparison of compatibilities in protospacer adjacent motifs and of on-target and off-target activities of Streptococcus pyogenes Cas9 variants at endogenous sites in human cells enables the editing of new genomic sites associated with genetic diseases.

Journal Article
