DrugBERT: a BERT-based approach integrating LDA topic embedding and efficacy-aware mechanism for predicting anti-tumor drug efficacy

by Wang, Hongqiang; Xie, Xinping; Jiang, Xiaodong in Accuracy, Algorithms, Antimitotic agents

2025

Background: Because tumors are genetically heterogeneous, personalized medicine has become a central focus of cancer research. Accurately predicting a patient's drug response before treatment remains a critical challenge for the field. Methods: This paper proposes DrugBERT, a BERT-based framework that integrates LDA topic embedding and a drug efficacy-aware mechanism to predict the efficacy of antitumor drugs. The method incorporates LDA-generated topic embeddings into the BERT language model as a semantic enhancement module and introduces a drug efficacy-aware attention mechanism to prioritize efficacy-related semantic features. An LSTM is used to capture long-range dependencies in clinical text data. In addition, the SMOTE algorithm synthesizes minority-class samples to address data imbalance. Results: DrugBERT demonstrated remarkable performance on a dataset of 958 patients with non-small cell cancer treated with antitumor drugs. When validated on an independent dataset of 266 bowel cancer patients, the model achieved a 3% improvement in AUC over previous methods, indicating robust generalization. Conclusions: DrugBERT can help predict the efficacy of antitumor drugs from clinical text and generalizes well to an independent cohort. These findings highlight its potential for optimizing personalized therapeutic strategies through language models.
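The SMOTE step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote_sketch`, the toy 2-D points, and the choice of `k` are assumptions made for illustration. It shows only the core idea of SMOTE, which is to create a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration; not taken from the DrugBERT paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority-class feature vectors (assumed values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Each synthetic point lies on the segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.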

Journal Article

Automatic Topic Title Assignment with Word Embedding

by Romano, Maurizio; Zammarchi, Gianpaolo; Conversano, Claudio in Appropriateness, Assignment, Bioinformatics

2024

In this paper, we propose TAWE (title assignment with word embedding), a new method to automatically assign titles to topics inferred from sets of documents. The method combines the output of topic modeling, performed with latent Dirichlet allocation (LDA) or another suitable method, with a word embedding representation of words in a vector space. This representation preserves the meaning of the words and makes it possible to find the word that best represents the topic. The procedure has two steps: first, a cleaned text is used to build the LDA model and infer a desired number of latent topics; second, a reasonable number of words and their weights are extracted from each topic and represented in n-dimensional space using word embedding. Based on the selected weighted words, a centroid is computed, and the closest word is chosen as the title of the topic. To test the method, we used a collection of tweets about climate change downloaded from the Twitter accounts of several major newspapers. Results showed that TAWE is a suitable method for automatically assigning a topic title.
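The second step described in the abstract (weighted centroid, then closest word) can be sketched as follows. This is an illustrative sketch, not the TAWE implementation: the 3-dimensional embeddings, the topic weights, and the function names are made up, and cosine similarity is an assumed choice because the abstract does not say which distance is used.

```python
import math

def weighted_centroid(words, weights, embeddings):
    """Weighted mean of the embedding vectors of the topic's top words."""
    total = sum(weights)
    dim = len(embeddings[words[0]])
    return [
        sum(w * embeddings[t][i] for t, w in zip(words, weights)) / total
        for i in range(dim)
    ]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assign_title(words, weights, embeddings):
    """Pick the vocabulary word whose vector is closest to the centroid."""
    centroid = weighted_centroid(words, weights, embeddings)
    return max(embeddings, key=lambda t: cosine(embeddings[t], centroid))

# Toy embedding table (assumed values, 3 dimensions for readability).
embeddings = {
    "climate": [0.9, 0.1, 0.0],
    "warming": [0.8, 0.2, 0.1],
    "emissions": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
    "election": [0.1, 0.9, 0.1],
}
# Top words of one topic with their topic-model weights (assumed values).
topic_words = ["warming", "emissions", "climate"]
topic_weights = [0.5, 0.3, 0.2]

title = assign_title(topic_words, topic_weights, embeddings)
print(title)  # prints "warming"
```

Here the candidate titles are all words in the embedding table, so in principle a word outside the topic's top words could be selected if it sits nearest to the centroid.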

Journal Article

Structural–Semantic Term Weighting for Interpretable Topic Modeling with Higher Coherence and Lower Token Overlap

by Konnikov, Evgenii; Yakob, Polina; Golikov, Gleb in Bibliometrics, Coherence, coherence value

2026

Topic modeling of large news streams is widely used to reconstruct economic and political narratives, which requires coherent topics with low lexical overlap while remaining interpretable to domain experts. We propose TF-SYN-NER-Rel, a structural–semantic term weighting scheme that extends classical TF-IDF by integrating positional, syntactic, factual, and named-entity coefficients derived from morphosyntactic and dependency parses of Russian news texts. The method is embedded into a standard Latent Dirichlet Allocation (LDA) pipeline and evaluated on a large Russian-language news corpus from the online archive of Moskovsky Komsomolets (over 600,000 documents), with political, financial, and sports subsets obtained via dictionary-based expert labeling. For each subset, TF-SYN-NER-Rel is compared with standard TF-IDF under identical LDA settings, and topic quality is assessed using the C_v coherence metric. To assess robustness, we repeat model training across multiple random initializations and report aggregate coherence statistics. Quantitative results show that TF-SYN-NER-Rel improves coherence and yields smoother, more stable coherence curves across the number of topics. Qualitative analysis indicates reduced lexical overlap between topics and clearer separation of event-centered and institutional themes, especially in political and financial news. Overall, the proposed pipeline relies on CPU-based NLP tools and sparse linear algebra, providing a computationally lightweight and interpretable complement to embedding- and LLM-based topic modeling in large-scale news monitoring.
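The general idea of extending TF-IDF with extra per-term coefficients can be sketched as below. The abstract does not state how TF-SYN-NER-Rel combines its positional, syntactic, factual, and named-entity coefficients, so this sketch assumes a single multiplicative boost per term; the `boosts` table, the toy documents, and both function names are hypothetical and do not reproduce the authors' scheme.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classical TF-IDF: relative term frequency times log inverse document frequency."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

def boosted_tf_idf(docs, boosts):
    """Assumed extension: scale each TF-IDF weight by a per-term coefficient.

    Terms without an entry in `boosts` keep their classical weight (factor 1.0).
    """
    return [
        {t: w * boosts.get(t, 1.0) for t, w in doc_weights.items()}
        for doc_weights in tf_idf(docs)
    ]

# Toy tokenized documents and a hypothetical boost for one term.
docs = [
    ["bank", "raises", "rates"],
    ["bank", "wins", "match"],
    ["rates", "fall"],
]
boosts = {"bank": 1.5}

plain = tf_idf(docs)
boosted = boosted_tf_idf(docs, boosts)
print(plain[0])
print(boosted[0])
```

The reweighted document-term values would then feed a standard LDA pipeline in place of the plain TF-IDF values.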

Journal Article
