Catalogue Search | MBRL

Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning

by Hong, Liang , Zhang, Liang , Tan, Pan in 631/114/1305 , 631/114/2397 , 631/114/2410

2024

Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP’s superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering. In this work, the authors proposed a few-shot learning approach that can efficiently optimize protein language models for fitness prediction. It combines the techniques of meta-transfer learning, learning to rank, and parameter-efficient fine-tuning.

Journal Article

Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study

by Zhou, Huixue , Xiao, Yongkang , Yang, Han in AI Language Models in Health Care , Artificial Intelligence, Machine Learning, and Natural Language Processing for Public Health , Care and treatment

2025

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieves the best 71% on MedMCQA (medical multiple-choice question answering dataset), Vicuna-13B achieves 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieves the best 70% among all, showing the potential for better performance across different tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through our two proposed ensemble strategies. Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test, with a yes, no, or maybe answer for each question), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test, 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of two ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering. Both ensemble methods outperformed individual LLMs across all 3 datasets. 
Specifically, compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yields even higher accuracies of 38.01% (+5.98%) for MedMCQA, 96.36% (+1.09%) for PubMedQA, and 38.13% (+0.87%) for MedQA-USMLE. The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. By effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.
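
The idea of a weighted majority vote over several models can be illustrated with a minimal Python sketch. This is not the authors' Boosting-based implementation: the function name, the model names, and the weights below are hypothetical, and the weights are simply given rather than learned by boosting.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the answer label with the largest total weight.

    predictions: dict mapping model name -> predicted answer label
    weights: dict mapping model name -> non-negative weight
             (hypothetical values; a boosting procedure would learn them)
    """
    totals = defaultdict(float)
    for model, answer in predictions.items():
        totals[answer] += weights.get(model, 0.0)
    # Pick the highest total; break ties by the alphabetically smallest label.
    return min(totals, key=lambda label: (-totals[label], label))


# Hypothetical example: one strongly weighted model outvotes two weaker ones.
example_predictions = {"model_a": "yes", "model_b": "no", "model_c": "no"}
example_weights = {"model_a": 0.7, "model_b": 0.3, "model_c": 0.3}
print(weighted_majority_vote(example_predictions, example_weights))
```

With equal weights the same function reduces to a plain majority vote, which is why the weighting step is where an adaptive scheme would do its work.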

Journal Article

Harnessing protein language model for structure-based discovery of highly efficient and robust PET hydrolases

by Hong, Liang , Zheng, Lirong , Jiang, Shifeng in 119/118 , 631/1647/48 , 631/45/607/1164

2025

Plastic waste, particularly polyethylene terephthalate (PET), presents significant environmental challenges, driving extensive research into enzymatic biodegradation. However, existing PET hydrolases (PETases) are limited by narrow sequence diversity and suboptimal performance. This study introduces VenusMine, a protein discovery pipeline that integrates protein language models (PLMs) with a representation tree to identify PETases based on structural similarity using sequence information. Using the crystal structure of IsPETase as a template, VenusMine identifies and clusters target proteins. Candidates are further screened using PLM-based assessments of solubility and thermostability, leading to the selection of 34 proteins for biochemical validation. Results reveal that 14 candidates exhibit PET degradation activity across 30–60 °C. Notably, a PET hydrolase from Kibdelosporangium banguiense (KbPETase) demonstrates a melting temperature (Tm) 32 °C higher than IsPETase and exhibits the highest PET degradation activity within 30–65 °C among wild-type PETases. KbPETase also surpasses FastPETase and LCC in catalytic efficiency. X-ray crystallography and molecular dynamics simulations show that KbPETase possesses a conserved catalytic domain and enhanced intramolecular interactions, underpinning its improved functionality and thermostability. This work demonstrates a novel deep learning approach for discovering natural PETases with enhanced properties. The environmental threats posed by PET waste have spurred interest in biodegradable technologies. Here, the authors develop VenusMine, a deep learning pipeline, to discover KbPETase, a highly efficient and thermostable PET hydrolase, and elucidate the molecular basis underlying its performance.

Journal Article

Spectrum-Effect Relationship Between Antioxidant and Anti-inflammatory Effects of Banxia Baizhu Tianma Decoction: An Identification Method of Active Substances With Endothelial Cell Protective Effect

by Shi, Haiyan , Li, Mingchen , Xu, Nan in Acids , active substances from natural resource , anti-inflammatory

2022

Banxia Baizhu Tianma decoction (BBTD), a six-herb Chinese medicine formula first described in approximately 1732 AD, is commonly prescribed for Hypertension with Phlegm-dampness Stagnation (HPDS) as an adjuvant therapy in China. Obesity is an important risk factor for the increasing prevalence of hypertension year by year in China. In Traditional Chinese medicine, obesity is often differentiated as the syndrome of excessive phlegm-dampness. Vascular endothelial cell injury plays an important role in the development and occurrence of HPDS. In this study, the protective effects of 18 batches of BBTD samples from different origins on HUVEC cells were evaluated, including antioxidant and anti-inflammatory activities. Ultrahigh performance liquid chromatography (UPLC) was used to establish fingerprints, and combined with pharmacodynamic indexes, the protective components of BBTD on endothelial cells were analyzed. Antioxidant and anti-inflammatory activities were evaluated by ROS and hs-CRP models, respectively. Hierarchical cluster analysis (HCA) and bivariate correlation analysis (BCA) were used to investigate the potential correlation between chemical components and endothelial cell protection. The results indicated that BBTD could reduce ROS and hs-CRP levels in HUVEC cells, and the pharmacological activities in 18 batches of BBTD samples were significantly different. The results of BCA indicated that Gastrodin, Liquiritin, Hesperidin, Isoliquiritin, Hesperetin, and Isoliquiritigenin might be the active constituents to activate ROS and suppress hs-CRP as determined by spectrum-effect relationships. The antioxidant and anti-inflammatory activities of the 6 components at different concentrations were verified, and the results showed that all of them had good antioxidant and anti-inflammatory activities in a concentration-dependent manner. 
This study showed that activity determination and spectral correlation can be used to search for active substances in Chinese medicine formulas and provide data support for quality control of Traditional Chinese medicine (TCM).

Journal Article

Optimizing Furniture Retail Strategies: Insights from Cross-Platform Consumer Sentiment and Topic Modeling

by Shi, Yuanyuan , Zhao, Erlong , Li, Mingchen in Artificial intelligence , Cluster analysis , Clustering

2025

Rapid advancements in artificial intelligence and the Internet of Things (IoT) have fueled the growth of furniture, transforming traditional home environments into intelligent living spaces. As consumer adoption accelerates, understanding user concerns and sentiment trends becomes crucial for brands to refine product offerings and enhance market competitiveness. This study systematically investigates consumer concerns and sentiment trends toward furniture products by analyzing user-generated reviews across two major e-commerce platforms: Jingdong and Taobao. Leveraging advanced text-mining methods including TF-IDF keyword extraction, hierarchical clustering, Graph of Words–Latent Dirichlet Allocation (GoW-LDA) topic modeling, and BERT-based sentiment analysis, this research identifies critical user preferences, product satisfaction factors, and platform-specific behavioral patterns. Results reveal distinct cross-platform differences; Jingdong users prioritize service quality, brand trust, and logistical efficiency, whereas Taobao users emphasize product aesthetics, material selection, and cost-effectiveness. The sentiment analysis demonstrates that Jingdong users exhibit more consistent and positive feedback, while sentiment on Taobao displays higher variability due to product-quality discrepancies and price sensitivity.

Journal Article

An interval constraint-based trading strategy with social sentiment for the stock market

by Wei, Yunjie , Lin, Wencan , Li, Mingchen in COVID-19 era , Deep learning , Economics

2024

Developing effective strategies to earn excess returns in the stock market is a cutting-edge topic in the field of economics. At the same time, stock price forecasting that supports trading strategies is considered one of the most challenging tasks. Therefore, this study analyzes and extracts news media data, expert comments, social opinion data, and pandemic text data using natural language processing, and then combines the data with a deep learning model to forecast future stock price patterns based on historical stock prices. An interval constraint-based trading strategy is constructed. Using data from several typical stocks in the Chinese stock market during the COVID-19 period, the empirical studies and trading simulations show, first, that the sentiment composite index and the deep learning model can improve the accuracy of stock price forecasting. Second, the interval constraint-based trading strategy based on the proposed approach can effectively enhance returns and thus, can assist investors in decision-making.

Journal Article

Gpbar1-mediated SIRT1-PGC-1α axis maintains mitochondrial homeostasis and mitigates renal injury in obstructive jaundice

by Luo, Kai , Chen, Haiyang , Sun, Xu in 631/80/304 , 631/80/86 , 692/4020/4021

2025

Obstructive jaundice (OJ)-induced kidney injury has a high mortality rate, severely affecting patient prognosis. Gpbar1, a bile acid receptor, plays a key role in maintaining tissue homeostasis in organs such as the liver and pancreas during OJ. However, its role in OJ-induced kidney injury remains unexplored. This study investigated the protective role of Gpbar1 in OJ-induced kidney injury. Sprague-Dawley rats underwent common bile duct ligation to establish an OJ model. Organ damage was evaluated by pathological examination, TUNEL staining, and liver/kidney function tests to assess both OJ model establishment and kidney injury. Mitochondrial function changes were assessed through electron microscopy and SOD, MDA, GSH, and ATP detection. Immunohistochemistry (IHC) and Western blot (WB) were used to assess Gpbar1, SIRT1, and PGC-1α expression. HK-2 cells were treated with deoxycholic acid (DCA) to establish a renal tubular epithelial cell injury model. Lentiviral vectors were used to overexpress or knock down Gpbar1, combined with interventions using SIRT1 and PGC-1α agonists and inhibitors. The Gpbar1-SIRT1-PGC-1α axis was validated by qRT-PCR and WB. The protective role of the Gpbar1-SIRT1-PGC-1α axis in OJ-induced kidney injury was studied using CCK-8, transmission electron microscopy, ROS detection, and mitochondrial membrane potential assays. In the rat OJ model, the model group exhibited injury-related pathological changes compared with the control group. Liver and kidney function markers and TUNEL-positive cells significantly increased, and structural and functional damage in the kidneys occurred. Mitochondrial structural disorder occurred in the kidneys, with significant reductions in SOD, GSH, and ATP levels, while MDA levels were significantly increased, indicating impaired antioxidant capacity and energy metabolism dysfunction. IHC, WB, and qRT-PCR revealed that protein and mRNA levels of Gpbar1, SIRT1, and PGC-1α in kidney tissues were lower in the model group. 
In the cellular model, DCA treatment and Gpbar1 knockdown significantly reduced cell viability, caused mitochondrial structural disorder, increased ROS levels and decreased JC-1 ratio, while Gpbar1 overexpression reversed these changes. After treatment with the SIRT1 inhibitor EX527, PGC-1α expression significantly decreased. We used SIRT1 inhibitors, activators and PGC-1α inhibitors to conduct positive and negative regulation experiments and confirmed the hierarchical regulatory effect of Gpbar1 on SIRT1-PGC-1α. Gpbar1 influences oxidative stress resistance via the SIRT1-PGC-1α axis, promotes mitochondrial functional homeostasis, and alleviates kidney injury induced by obstructive jaundice.

Journal Article

SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering

by Xiong, Yi , Fan, Guisheng , Hong, Liang in Amino acid sequence , Chemistry , Chemistry and Materials Science

2023

Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting an attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages the data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for the higher-order variants (> 4 mutation sites), when fine-tuned using only a small number of experimental mutation data (< 50). The proposed strategy is of great practical value because the required experimental effort, i.e., producing a few tens of experimental mutation data on a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.

Journal Article

PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications

by Fan, Guisheng , Hong, Liang , Tan, Yang in Algorithms , Amino acid sequence , Amino acids

2024

Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining. Scientific contribution: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
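
The way byte-pair encoding turns frequently co-occurring residues into single tokens can be shown with a small Python sketch. It is an illustrative toy under stated assumptions, not the PETA code: the function names, the tie-breaking rule, and the three short sequences are all hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(raw_sequences, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single-residue tokens."""
    sequences = [list(seq) for seq in raw_sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        counts = Counter()
        for tokens in sequences:
            counts.update(zip(tokens, tokens[1:]))
        if not counts:
            break
        # Most frequent pair; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here for determinism).
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        sequences = [merge_pair(tokens, best) for tokens in sequences]
    return merges, sequences


# Hypothetical toy sequences in one-letter amino acid code.
toy_sequences = ["MKVLA", "MKVLG", "AMKV"]
toy_merges, toy_tokenized = learn_merges(toy_sequences, 2)
print(toy_merges)
print(toy_tokenized)
```

Each learned merge adds one entry to the vocabulary on top of the single-residue tokens, so the number of merges directly controls the vocabulary size that the abstract reports tuning.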

Journal Article

Hot central-plant recycling technology: A systematic review on raw materials and performance-influencing factors

by Xing, Chengwei , Chang, Zhibin , Li, Mingchen in Aggregates , Asphalt mixes , Asphalt pavements

2025

The hot central-plant recycling (HCPR) technology has attracted wide attention from researchers in pavement engineering because of its excellent economic benefits and positive environmental outcomes. Over the last few years, the application of HCPR technology in highway construction and maintenance has increasingly expanded. However, despite this wider adoption, critical issues concerning the composition of raw materials and performance-influencing factors of hot central-plant recycled asphalt mixtures (HCPRAM) necessitate careful consideration and deeper understanding. Therefore, conducting a detailed interpretation and systematic analysis of each component material and performance-influencing factor holds tremendous significance for advancing the design methodology, optimizing the production process, and enhancing the overall quality and durability of HCPRAM. This paper comprehensively reviews the current state-of-the-art research pertaining to the raw material composition and the crucial factors affecting the road performance of HCPRAM. Firstly, the functions of recycled asphalt pavement (RAP) materials, virgin asphalt, virgin aggregates, rejuvenators, and fibers in the mixtures are introduced. Then, the factors influencing the performance of HCPRAM are described in detail, covering both internal and external factors. Finally, the paper further discusses persistent challenges and knowledge gaps identified in the current research landscape of HCPR technology. Based on this critical analysis, pertinent recommendations are suggested to guide productive avenues for future research and development.

Journal Article
