941 results for "Chen, Qingyu"
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings-which involve the learning of vector representations of concepts using machine learning models-have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is the largest among publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation of its kind. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings by a large margin in identifying similar and related concepts.
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
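The core operation such concept embeddings support is ranking concepts by vector similarity. A minimal sketch of that lookup, using invented three-dimensional vectors and hypothetical concept IDs rather than the actual BioConceptVec files:

```python
# Nearest-neighbor lookup over concept embeddings via cosine similarity.
# The vectors and IDs below are made up for illustration; real
# BioConceptVec vectors are high-dimensional and loaded from its
# distribution files.
import math

embeddings = {
    "Gene_7157": [0.9, 0.1, 0.2],        # hypothetical vector
    "Gene_4193": [0.8, 0.2, 0.3],        # hypothetical vector
    "Chemical_D000082": [-0.5, 0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_related(concept, k=1):
    """Rank all other concepts by cosine similarity to `concept`."""
    query = embeddings[concept]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != concept]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]

print(most_related("Gene_7157"))
```

The same ranking, applied over all 400,000+ concepts, underlies the intrinsic drug-gene and gene-gene relatedness evaluations described above.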
Artificial intelligence enables precision diagnosis of cervical cytology grades and cervical cancer
Cervical cancer is a significant global health issue, with its prevalence and prognosis highlighting the importance of early screening for effective prevention. This research aimed to create and validate an artificial intelligence cervical cancer screening (AICCS) system for grading cervical cytology. The AICCS system was trained and validated using various datasets, including retrospective, prospective, and randomized observational trial data, involving a total of 16,056 participants. It utilized two artificial intelligence (AI) models: one for detecting cells at the patch level and another for classifying whole-slide images (WSIs). The AICCS consistently showed high accuracy in predicting cytology grades across different datasets. In the prospective assessment, it achieved an area under the curve (AUC) of 0.947, a sensitivity of 0.946, a specificity of 0.890, and an accuracy of 0.892. Remarkably, the randomized observational trial revealed that the AICCS-assisted cytopathologists had a significantly higher AUC, specificity, and accuracy than cytopathologists alone, with a notable 13.3% enhancement in sensitivity. Thus, AICCS holds promise as an additional tool for accurate and efficient cervical cancer screening. Cervical screening is a key method for detecting cervical cancer, but is limited by pathologist detection. Here, the authors use artificial intelligence to predict cytology grades from whole slide images.
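The two-stage structure described above (patch-level cell detection feeding a slide-level grade) can be sketched with stand-in components. The threshold detector and the fraction-based grading rule below are illustrative placeholders, not the authors' models; the grade labels follow standard Bethesda terminology:

```python
# Stage 1 stand-in: a patch-level "detector" that flags patches whose
# abnormality score passes a threshold (a real system uses a trained CNN).
def detect_abnormal_patches(patch_scores, threshold=0.5):
    return [s for s in patch_scores if s >= threshold]

# Stage 2 stand-in: a slide-level classifier that maps the fraction of
# abnormal patches to a cytology grade. Cutoffs here are invented.
def grade_slide(patch_scores):
    abnormal = detect_abnormal_patches(patch_scores)
    frac = len(abnormal) / len(patch_scores)
    if frac < 0.05:
        return "NILM"   # negative for intraepithelial lesion or malignancy
    elif frac < 0.20:
        return "LSIL"   # low-grade squamous intraepithelial lesion
    return "HSIL"       # high-grade squamous intraepithelial lesion

print(grade_slide([0.1, 0.2, 0.7, 0.9, 0.1, 0.05]))
```

The design point is the decoupling: the patch model can be trained on cell-level annotations while the slide model aggregates its outputs, which is what lets a WSI with millions of pixels be graded from local evidence.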
Benchmarking large language models for biomedical natural language processing applications and recommendations
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs—GPT and LLaMA representatives—on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, hallucinations, and perform cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP. Baseline performance, benchmarks, and guidance for LLMs in biomedicine are limited. The authors assess four LLMs on 12 tasks, establish baselines, examine hallucinations, and provide recommendations for optimal LLM use.
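At its core, the comparison above scores each model on each benchmark with a shared metric. A minimal harness, with invented model outputs (the real study spans 12 benchmarks and richer metrics than accuracy):

```python
# Score multiple models against one benchmark's gold labels.
# Model names and predictions below are illustrative, not study data.
def accuracy(preds, golds):
    """Fraction of predictions matching the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

benchmark_gold = ["yes", "no", "yes", "maybe"]
model_outputs = {
    "bert-finetuned": ["yes", "no", "yes", "no"],
    "gpt-zero-shot": ["yes", "yes", "no", "maybe"],
}
scores = {m: accuracy(p, benchmark_gold) for m, p in model_outputs.items()}
print(scores)
```

Extending the loop over many benchmarks and prompt settings (zero-shot, few-shot, fine-tuned) yields the kind of comparison table the study reports; the toy numbers here mirror its finding that fine-tuned models often lead.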
AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning
Clinical calculators play a vital role in healthcare, but their utilization is often hindered by usability and dissemination challenges. We introduce AgentMD, a novel language agent capable of curating and applying clinical calculators across various clinical contexts. As a tool builder, AgentMD first uses PubMed to curate a diverse set of 2,164 executable clinical calculators with over 85% accuracy for quality checks and over 90% pass rate for unit tests. As a tool user, AgentMD autonomously selects and applies the relevant clinical calculators. Our evaluations show that AgentMD significantly outperforms GPT-4 for risk prediction (87.7% vs. 40.9% in accuracy). Results on 698 real-world emergency department notes confirm that AgentMD accurately computes medical risks at an individual level. Moreover, AgentMD can provide population-level insights for institutional risk management. Our study illustrates the capabilities of language agents to curate and utilize clinical calculators for both individual patient care and at-scale healthcare analytics. Clinical calculators play a vital role in healthcare, but their utilization remains to be optimized. The authors present AgentMD, an AI agent that can autonomously curate clinical calculators from medical literature and apply them to various use cases.
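For a flavor of what an "executable clinical calculator" looks like, here is the standard CHA2DS2-VASc stroke-risk score written as a plain function. The scoring rules are the published ones; the code itself is an illustration, not AgentMD's actual curated tool:

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """Return the CHA2DS2-VASc score (0-9) for stroke risk in atrial fibrillation."""
    score = 0
    score += 1 if chf else 0               # C: congestive heart failure
    score += 1 if hypertension else 0      # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A: age
    score += 1 if diabetes else 0          # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0     # S2: prior stroke or TIA
    score += 1 if vascular_disease else 0  # V: vascular disease
    score += 1 if female else 0            # Sc: sex category (female)
    return score

# A 78-year-old woman with hypertension: 2 (age) + 1 (HTN) + 1 (sex) = 4
print(cha2ds2_vasc(78, True, False, True, False, False, False))
```

Curating thousands of such functions from the literature, then selecting and invoking the right one per patient note, is what the agent automates.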
Applicability of Susceptibility Model for Rock and Loess Earthquake Landslides in the Eastern Tibetan Plateau
It is crucial to explore a suitable landslide susceptibility model with an excellent prediction capability for rapid evaluation and disaster relief in seismic regions with different lithological features. In this study, we selected two typical seismic events, the Jiuzhaigou and Minxian earthquakes, which occurred in the Alpine karst and loess regions, respectively. Eight influencing factors and five models were chosen to calculate landslide susceptibility: the information (I) model, certainty factor (CF) model, logistic regression (LR) model, I + LR coupling model, and CF + LR coupling model. Then, the accuracy and the landslide susceptibility distribution of these models were assessed by the area under the curve (AUC) and distribution criteria. Finally, the model with high accuracy and good applicability for the rock landslide or loess landslide regions was selected. Our results showed that the accuracy of the coupling models is higher than that of the single models. Except for the LR model, the landslide susceptibility distribution for the above-mentioned models is consistent with universal cognition. Among them, the I + LR model obtains the best comprehensive results for assessing the distribution and accuracy of both rock and loess landslide susceptibility, which is helpful for disaster relief and policy-making, and it can also provide useful scientific data for post-seismic reconstruction and restoration.
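The information (I) model referenced above assigns each class j of a conditioning factor an information value I_j = ln((N_j/N) / (S_j/S)), where N_j/N is the share of landslide cells falling in class j and S_j/S the share of map area; in an I + LR coupling, these I_j values replace raw factor classes as logistic-regression covariates. A sketch with invented cell counts:

```python
# Information-value model for landslide susceptibility factors.
# Counts below are hypothetical, not the study's data.
import math

def information_value(landslide_cells, class_cells, total_landslides, total_cells):
    """I_j = ln((N_j/N) / (S_j/S)); positive values favor landslide occurrence."""
    landslide_ratio = landslide_cells / total_landslides
    area_ratio = class_cells / total_cells
    return math.log(landslide_ratio / area_ratio)

# Slope-angle classes (degrees) with (landslide cells, total cells) per class:
N, S = 200, 10_000   # total landslide cells, total map cells
classes = {"0-10": (20, 4000), "10-25": (60, 3500), ">25": (120, 2500)}
iv = {c: information_value(n, s, N, S) for c, (n, s) in classes.items()}
print(iv)
```

Steep slopes concentrate landslides relative to their area, so their I value is positive; gentle slopes get a negative value, and a cell's covariate vector for the LR stage is simply the I values of the classes it falls in.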
Ascle—A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study
Medical texts present significant domain-specific challenges, and manually curating these texts is a time-consuming and labor-intensive process. To address this, natural language processing (NLP) algorithms have been developed to automate text processing. In the biomedical field, various toolkits for text processing exist, which have greatly improved the efficiency of handling unstructured text. However, these existing toolkits tend to emphasize different perspectives, and none of them offer generation capabilities, leaving a significant gap in the current offerings. This study aims to describe the development and preliminary evaluation of Ascle. Ascle is tailored for biomedical researchers and clinical staff, offering an easy-to-use, all-in-one solution that requires minimal programming expertise. For the first time, Ascle provides 4 advanced and challenging generative functions: question-answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. We fine-tuned 32 domain-specific language models and evaluated them thoroughly on 27 established benchmarks. In addition, for the question-answering task, we developed a retrieval-augmented generation (RAG) framework for large language models that incorporated a medical knowledge graph with ranking techniques to enhance the reliability of generated answers. Additionally, we conducted a physician validation to assess the quality of generated content beyond automated metrics. The fine-tuned models and RAG framework consistently enhanced text generation tasks. For example, the fine-tuned models improved the machine translation task by 20.27 BLEU points. In the question-answering task, the RAG framework raised the ROUGE-L score by 18% over the vanilla models.
Physician validation of generated answers showed high scores for readability (4.95/5) and relevancy (4.43/5), and lower scores for accuracy (3.90/5) and completeness (3.31/5). This study introduces the development and evaluation of Ascle, a user-friendly NLP toolkit designed for medical text generation. All code is publicly available through the Ascle GitHub repository. All fine-tuned language models can be accessed through Hugging Face.
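The retrieval step of a RAG pipeline like the one described can be sketched simply: rank knowledge snippets against the question, then prepend the top hits to the model prompt. Real systems (including Ascle's, per the abstract) use embedding similarity and a knowledge graph; the token-overlap scorer and snippets below are deliberately simple stand-ins:

```python
# Toy retrieval-augmented prompt construction.
def overlap_score(question, snippet):
    """Crude relevance: fraction of question tokens found in the snippet."""
    q = set(question.lower().split())
    s = set(snippet.lower().split())
    return len(q & s) / len(q)

def retrieve(question, snippets, k=1):
    """Return the top-k snippets ranked by overlap with the question."""
    return sorted(snippets, key=lambda s: overlap_score(question, s),
                  reverse=True)[:k]

def build_prompt(question, snippets):
    """Prepend retrieved context to the question for the generator model."""
    context = "\n".join(retrieve(question, snippets))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

kb = [
    "metformin is a first-line treatment for type 2 diabetes",
    "aspirin inhibits platelet aggregation",
]
print(build_prompt("what treats type 2 diabetes", kb))
```

Grounding the generator in retrieved text is what lets the framework trade a little prompt length for the reliability gains the abstract reports.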
Medical foundation large language models for comprehensive text analysis and beyond
Recent advancements in large language models (LLMs) show significant potential in medical applications but are hindered by limited specialized medical knowledge. We present Me-LLaMA, a family of open-source medical LLMs integrating extensive domain-specific knowledge with robust instruction-following capabilities. Me-LLaMA is developed through continual pretraining and instruction tuning of LLaMA2 models using diverse biomedical and clinical data sources (e.g., biomedical literature and clinical notes). We evaluated Me-LLaMA on six text analysis tasks using 12 benchmarks (e.g., PubMedQA and MIMIC-CXR) and assessed its clinical utility in complex case diagnosis through automatic and human evaluations. Me-LLaMA outperforms existing open medical LLMs in zero-shot and supervised settings and surpasses ChatGPT and GPT-4 after task-specific instruction tuning for most text analysis tasks. Its performance is also comparable to ChatGPT and GPT-4 for diagnosing complex clinical cases. Our findings highlight the importance of combining domain-specific continual pretraining with instruction tuning to enhance performance in medical LLMs.
Endocrine disruptors reprogram hepatic metabolic and immune gene networks to promote hepatocellular carcinoma
Endocrine-disrupting chemicals (EDCs) are increasingly recognized as environmental contributors to hepatocellular carcinoma (HCC), yet their molecular mechanisms remain poorly understood. This study integrates toxicogenomic, transcriptomic, genetic, and single-cell RNA sequencing data to elucidate how EDCs reprogram hepatic metabolic and immune networks to promote tumorigenesis. By intersecting 5797 EDC-responsive genes with 946 HCC differentially expressed genes, 513 overlapping candidates were identified, enriched in pathways involving hormone signaling, xenobiotic metabolism, lipid regulation, and inflammation. Genetic evidence supported five genes (ESR1, TP53I3, PLIN2, SLC6A12, and SOCS2) as key determinants of HCC susceptibility. These genes exhibited experimentally supported interactions with multiple EDCs, including bisphenol A, diethylhexyl phthalate, and cadmium chloride, implicating them as convergent molecular targets of environmental exposures. Single-cell transcriptomic analysis revealed cell-type-specific expression, notably SOCS2 in endothelial cells and PLIN2 in myeloid populations, while ESR1 displayed sex-dimorphic expression patterns consistent with disrupted estrogen signaling in female HCC. These findings indicate that chronic EDC exposure perturbs hormonal, metabolic, and immune homeostasis, driving hepatic carcinogenesis through coordinated gene network reprogramming. The integrative multi-omics framework presented here provides novel mechanistic insight into the environmental etiology of liver cancer and identifies candidate biomarkers for exposure-linked prevention strategies.
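The candidate-selection step above (5797 EDC-responsive genes ∩ 946 HCC differentially expressed genes → 513 candidates) is a plain set intersection. A sketch with a handful of illustrative gene symbols in place of the study's full lists:

```python
# Intersect two gene sets to find shared candidates.
# These tiny sets are illustrative; the study's lists are far larger.
edc_responsive = {"ESR1", "TP53I3", "PLIN2", "SLC6A12", "SOCS2", "CYP1A1"}
hcc_degs = {"ESR1", "PLIN2", "SOCS2", "AFP", "GPC3"}

candidates = edc_responsive & hcc_degs  # genes in both sets
print(sorted(candidates))
```

Downstream enrichment and genetic analyses then filter such candidates to the five genes the study highlights.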
Direct Generation and Non-Hermitian Regulation of Energy-Time-Polarization-Hyper-Entangled Quadphotons
Entangled multiphotons are an ideal resource for quantum information technology. Here, narrow-bandwidth hyper-entangled quadphotons are theoretically demonstrated by quantizing degenerate Zeeman sub-states through spontaneous eight-wave mixing (EWM) in a hot 85Rb vapor. Polarization-based energy-time entanglement (output) under multiple polarized dressings is presented in detail, with uncorrelated photons and Raman scattering suppressed. High-dimensional entanglement is achieved via the passive non-Hermitian characteristic, and the EWM-based quadphoton is a genuine quadphoton with quadripartite entanglement. A high quadphoton production rate is achieved through the co-action of four strong input fields and the electromagnetically induced transparency (EIT) slow-light effect. The atomic passive non-Hermitian characteristic provides the system with acute coherent tunability around exceptional points (EPs). The results unveil multiple coherent channels (~8) inducing oscillations with multiple periods (~19) in quantum correlations, and a high-dimensional (~8) four-body entangled quantum network (capacity ~65,536). Coexistent hyper- and high-dimensional entanglements facilitate high quantum information capacity. The system can be switched among three working states by regulating the passive non-Hermitian characteristic via triple polarized dressing. The research provides a promising approach for applying hyper-entangled multiphotons to tunable quantum networks with high information capacity, whose multi-partite entanglement and multiple-degree-of-freedom properties help optimize the accuracy of quantum sensors.
Surface Deformation Associated with the 22 August 1902 Mw 7.7 Atushi Earthquake in the Southwestern Tian Shan, Revealed from Multiple Remote Sensing Data
The 22 August 1902 Mw 7.7 Atushi earthquake is the most disastrous seismic event in the southwestern Tian Shan. However, the spatial distribution of its surface rupture zones and the geometric features of the surface deformation remain unclear, and the seismogenic fault is still controversial. Based on geologic and geomorphic interpretations of multiple remote sensing imaging data and high-resolution DEM data derived from UAV imaging, complemented by field investigations, we mapped two sub-parallel NEE-trending surface rupture zones with a total length of 108 km: a ~60 km zone along the pre-existing Atushi fault (ATF) and a ~48 km zone along the Keketamu fault (KTF). The surface deformation is mainly characterized by bedrock scarps, hanging-wall collapse scarps, pressure ridges, and thrust-related fold scarps along the two south-dipping thrust faults, which we define as the seismogenic structure of the 1902 Mw 7.7 Atushi earthquake. We therefore propose a cascading-rupture model to explain the multiple rupture zones generated by the 1902 Mw 7.7 Atushi earthquake. Moreover, these advanced remote sensing mapping techniques provide a promising approach to recovering the geometric and geomorphic features of surface deformation caused by large seismic events in arid and semi-arid regions.