Catalogue Search | MBRL

Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

by Zhebrak, Alexander , Artamonov, Aleksey , Tatanov, Oktai in benchmark , Datasets , Deep learning

2020

Generative models are becoming a tool of choice for exploring the molecular space. These models learn on a large training dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or training semi-supervised predictive models in the downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize training and comparison of molecular generative models. MOSES provides training and testing datasets, and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models and suggest using our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.

Journal Article


Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework

by Khrabrov, Kuzma , Kadurin, Artur , Ganeeva, Veronika in Augmentations , Benchmarks , Chemistry

2025

The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for the assessment of chemistry LMs of different natures. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method assesses the similarity between the embedding of a molecule, that of its SMILES variation, and that of another molecule. Experiments indicate that the tested ChemLLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and the classification and regression tasks of the MoleculeNet benchmark. We show that the change in results after SMILES string variations aligns with the proposed AMORE framework.

Scientific contribution: We present AMORE, a framework for evaluating chemical language models (ChemLMs) based on their inner embedding space. AMORE uses augmentations that reformulate the SMILES strings of molecular structures to assess chemical representations. The proposed framework allows evaluation of ChemLMs without expensive manually annotated data.
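The retrieval idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the top-1 accuracy score, and the toy two-dimensional embeddings are all assumptions made for illustration. Each augmented SMILES embedding is used as a query, and a hit is counted when its nearest original embedding belongs to the same molecule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(original, augmented):
    """Fraction of augmented embeddings whose most similar original
    embedding is that of their own source molecule.

    original[i] and augmented[i] are assumed to embed the same molecule.
    """
    hits = 0
    for i, query in enumerate(augmented):
        scores = [cosine(query, ref) for ref in original]
        best = max(range(len(original)), key=scores.__getitem__)
        hits += best == i
    return hits / len(augmented)


# Toy embeddings (hypothetical): the second augmented vector has drifted
# towards the first molecule, so it is retrieved incorrectly.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = [[0.9, 0.1], [0.8, 0.3], [0.9, 1.1]]
accuracy = top1_retrieval_accuracy(original, augmented)
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

A model that is robust to SMILES variations would score close to 1.0 under this kind of check; lower values indicate that equivalent representations land far apart in the embedding space.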

Journal Article


Molecular Generation for Desired Transcriptome Changes With Adversarial Autoencoders

by Zhebrak, Alexander , Shayakhmetov, Rim , Aliper, Alexander in adversarial autoencoders , conditional generation , Datasets

2020

Gene expression profiles are useful for assessing the efficacy and side effects of drugs. In this paper, we propose a new generative model that infers drug molecules that could induce a desired change in gene expression. Our model, the Bidirectional Adversarial Autoencoder, explicitly separates cellular processes captured in gene expression changes into two feature sets: those related and those unrelated to the drug incubation. The model uses the related features to produce a drug hypothesis. We have validated our model on the LINCS L1000 dataset by generating molecular structures in the SMILES format for the desired transcriptional response. In the experiments, we have shown that the proposed model can generate novel molecular structures that could induce a given gene expression change or predict a gene expression difference after incubation of a given molecular structure. The code of the model is available at https://github.com/insilicomedicine/BiAAE.

Journal Article


Doping position estimation for FeRh-based alloys

by Rumiantsev, Egor , Peresypkin, Nikita D. , Tsypin, Artem in 639/638/298 , 639/638/298/920 , 639/638/563/606

2024

FeRh-based alloys have attracted significant attention due to their magnetic phase transition and significant magnetocaloric effects. These properties position them as promising candidates for fundamental research and practical applications, including magnetic cooling and targeted drug delivery. The study of FeRh alloys, particularly those where Rhodium or Iron atoms are substituted with other transition metals, is crucial as certain substitutions preserve the alloy’s magnetocaloric properties. However, even within a specific structural type and without considering competing phases, determining which atom (Fe or Rh) is replaced upon introducing a third element remains unclear. This paper addresses this ambiguity through ab initio calculations. We propose an approach to predict whether a dopant will replace Fe or Rh, offering insights into the electronic and structural factors influencing the substitution. Additionally, we present a dataset of ab initio calculations on doped FeRh alloys, which will support future data-driven modeling efforts. Our findings not only advance the understanding of FeRh-based alloys but also contribute to the design of novel materials for experimental and industrial applications.

Journal Article


LAGNet: better electron density prediction for LCAO-based data and drug-like substances

by Rumiantsev, Egor , Ushenin, Konstantin , Khrabrov, Kuzma in AI in Drug Discovery , Chemistry , Chemistry and Materials Science

2025

The electron density is an important object in quantum chemistry that is crucial for many downstream tasks in drug design. Recent deep learning approaches predict the electron density around a molecule from atom types and atom positions. Most of these methods use the plane wave (PW) numerical method as a source of ground-truth training data. However, the drug design field mostly uses the Linear Combination of Atomic Orbitals (LCAO) for computation of quantum properties. In this study, we focus on prediction of the electron density for drug-like substances and on training neural networks with LCAO-based datasets. Our experiments show that proper handling of large amplitudes of core orbitals is crucial for training on LCAO-based data. We propose to store the electron density with the standard grids instead of the uniform grid. This allowed us to reduce the number of probing points per molecule by a factor of 43 and the storage space requirements by a factor of 8. Finally, we propose a novel architecture based on the DeepDFT model that we name LAGNet. It is specifically designed and tuned for drug-like substances and the ∇²DFT dataset.

Scientific contribution: We propose a core suppression model to correctly handle core orbitals and train neural networks on LCAO-based data with atoms of the 3rd and 4th periods. We show that using the standard grid instead of the uniform grid drastically reduces the number of electron density probing points and data storage requirements. Finally, we propose the LAGNet model, which achieves better results on drug-like substances than the equivariant DeepDFT model.

Journal Article


A conformational benchmark for optical property prediction with solvent-aware graph neural networks

by Potapov, Denis , Ushenin, Konstantin , Tsypin, Artem in 639/638/439/945 , 639/638/630 , Accuracy

2026

Accurately predicting optical spectra of molecules is essential for creating better OLED emitters, solar-cell dyes, and fluorescent probes. Traditional methods, such as time-dependent density-functional theory, are computationally expensive and often inaccurate. Current Graph Neural Network (GNN) approaches for optical properties prediction are faster and offer better performance. Still, they operate on 2D graphs and ignore the 3D geometrical features that control excited-state behavior. We present nablaColors-3D, a rigorously curated dataset for the prediction of optical properties consisting of 26369 chromophore-solvent pairs with three conformations optimized at different levels of quantum theory. Based on this dataset, we establish a scaffold-split benchmark for 3D GNNs and systematically quantify how the fidelity of geometry optimization affects accuracy. Furthermore, we propose a solvent-aware modification for pretrained SE(3)-invariant architectures. Our best model, built on UniMol+, achieves MAE of 15.97 nm on a held-out test set, improving the previous state of the art by more than 30%. Current graph neural networks (GNNs) for the prediction of optical properties in molecules operate on 2D graphs, potentially overlooking 3D geometrical features underlying excited-state behaviour. Here, the authors present nablaColors, a curated dataset for the prediction of optical properties consisting of 26,369 chromophore-solvent pairs with three conformations optimized at different levels of theory, establishing a scaffold-split benchmark for 3D GNNs, and propose a solvent-aware modification for pretrained SE(3)-invariant GNN architectures.

Journal Article


Addendum: Molecular Generation for Desired Transcriptome Changes With Adversarial Autoencoders

by Zhebrak, Alexander , Shayakhmetov, Rim , Aliper, Alexander in adversarial autoencoders , conditional generation , deep learning

2020

Journal Article


Erratum: Addendum: Molecular Generation for Desired Transcriptome Changes With Adversarial Autoencoders

by Zhebrak, Alexander , Shayakhmetov, Rim , Aliper, Alexander

2020

[This corrects the article.]

Journal Article


∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

by Rumiantsev, Egor , Ushenin, Konstantin , Tutubalina, Elena in Benchmarks , Chemistry , Datasets

2024

Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called ∇²DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level (ωB97X-D/def2-SVP) for each conformation. Moreover, ∇²DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.

Paper


Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer

by Miftahutdinov, Zulfat , Tutubalina, Elena , Kudrin, Roman in Clinical trials , Coders , Data mining

2021

Concept normalization in free-form texts is a crucial step in every text-mining pipeline. Neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art results in the biomedical domain. In the context of drug discovery and development, clinical trials are necessary to establish the efficacy and safety of drugs. We investigate the effectiveness of transferring concept normalization from the general biomedical domain to the clinical trials domain in a zero-shot setting with an absence of labeled data. We propose a simple and effective two-stage neural approach based on fine-tuned BERT architectures. In the first stage, we train a metric learning model that optimizes relative similarity of mentions and concepts via triplet loss. The model is trained on available labeled corpora of scientific abstracts to obtain vector embeddings of concept names and entity mentions from texts. In the second stage, we find the closest concept name representation in an embedding space to a given clinical mention. We evaluated several models, including state-of-the-art architectures, on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. Extensive experiments validate the effectiveness of our approach in knowledge transfer from the scientific literature to clinical trials.
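The two stages described in this abstract can be sketched as follows. This is an illustrative outline only, not the authors' code: the Euclidean distance, the margin value, the function names, the concept identifiers, and the toy two-dimensional vectors are all assumptions. In practice the vectors would come from a fine-tuned BERT encoder.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Stage 1 objective: pull a mention (anchor) towards its correct
    concept (positive) and push it away from a wrong one (negative)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def nearest_concept(mention_vec, concept_vecs):
    """Stage 2: return the id of the concept whose embedding is closest
    to the mention embedding."""
    return min(concept_vecs, key=lambda cid: euclidean(mention_vec, concept_vecs[cid]))


# Hypothetical concept embeddings and one clinical mention embedding.
concepts = {"C001": [0.0, 0.0], "C002": [3.0, 4.0]}
mention = [0.5, 0.0]

loss = triplet_loss(mention, concepts["C001"], concepts["C002"])
linked = nearest_concept(mention, concepts)
print(loss, linked)
```

The loss is zero whenever the correct concept is already closer than the wrong one by at least the margin, so training effort concentrates on mentions that are still confusable.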

Paper
