Catalogue Search | MBRL

A comparative study of an on premise AutoML solution for medical image classification

by Elangovan, Kabilan , Lim, Gilbert , Ting, Daniel in 631/67/1813 , 631/67/2321 , 692/699/1785

2024

Automated machine learning (AutoML) allows for the simplified application of machine learning to real-world problems, by the implicit handling of necessary steps such as data pre-processing, feature engineering, model selection and hyperparameter optimization. This has encouraged its use in medical applications such as imaging. However, the impact of common parameter choices such as the number of trials allowed, and the resolution of the input images, has not been comprehensively explored in existing literature. We therefore benchmark AutoKeras (AK), an open-source AutoML framework, against several bespoke deep learning architectures, on five public medical datasets representing a wide range of imaging modalities. It was found that AK could outperform the bespoke models in general, although at the cost of increased training time. Moreover, our experiments suggest that a large number of trials and higher resolutions may not be necessary for optimal performance to be achieved.

Journal Article

Comparison of the Quality of Discharge Letters Written by Large Language Models and Junior Clinicians: Single-Blinded Study

by Lim, Daniel Yan Zheng , Tan, Ting Fang , Ong, Jasmine Chiat Ling in Archives & records , Artificial intelligence , Biopsy

2024

Discharge letters are a critical component in the continuity of care between specialists and primary care providers. However, these letters are time-consuming to write, underprioritized in comparison to direct clinical care, and are often tasked to junior doctors. Prior studies assessing the quality of discharge summaries written for inpatient hospital admissions show inadequacies in many domains. Large language models such as GPT have the ability to summarize large volumes of unstructured free text such as electronic medical records and have the potential to automate such tasks, providing time savings and consistency in quality. The aim of this study was to assess the performance of GPT-4 in generating discharge letters written from urology specialist outpatient clinics to primary care providers and to compare their quality against letters written by junior clinicians. Fictional electronic records were written by physicians simulating 5 common urology outpatient cases with long-term follow-up. Records comprised simulated consultation notes, referral letters and replies, and relevant discharge summaries from inpatient admissions. GPT-4 was tasked to write discharge letters for these cases with a specified target audience of primary care providers who would be continuing the patient's care. Prompts were written for safety, content, and style. Concurrently, junior clinicians were provided with the same case records and instructional prompts. GPT-4 output was assessed for instances of hallucination. A blinded panel of primary care physicians then evaluated the letters using a standardized questionnaire tool. GPT-4 outperformed human counterparts in information provision (mean 4.32, SD 0.95 vs 3.70, SD 1.27; P=.03) and had no instances of hallucination. 
There were no statistically significant differences in the mean clarity (4.16, SD 0.95 vs 3.68, SD 1.24; P=.12), collegiality (4.36, SD 1.00 vs 3.84, SD 1.22; P=.05), conciseness (3.60, SD 1.12 vs 3.64, SD 1.27; P=.71), follow-up recommendations (4.16, SD 1.03 vs 3.72, SD 1.13; P=.08), and overall satisfaction (3.96, SD 1.14 vs 3.62, SD 1.34; P=.36) between the letters generated by GPT-4 and humans, respectively. Discharge letters written by GPT-4 had equivalent quality to those written by junior clinicians, without any hallucinations. This study provides a proof of concept that large language models can be useful and safe tools in clinical documentation.

Journal Article

Democratizing Artificial Intelligence Imaging Analysis With Automated Machine Learning: Tutorial

by Gutierrez, Laura , Elangovan, Kabilan , Tan, Iris in Artificial Intelligence , Artificial intelligence literacy , Automation

2023

Deep learning–based clinical imaging analysis underlies diagnostic artificial intelligence (AI) models, which can match or even exceed the performance of clinical experts, having the potential to revolutionize clinical practice. A wide variety of automated machine learning (autoML) platforms lower the technical barrier to entry to deep learning, extending AI capabilities to clinicians with limited technical expertise, and even autonomous foundation models such as multimodal large language models. Here, we provide a technical overview of autoML with descriptions of how autoML may be applied in education, research, and clinical practice. Each stage of the process of conducting an autoML project is outlined, with an emphasis on ethical and technical best practices. Specifically, data acquisition, data partitioning, model training, model validation, analysis, and model deployment are considered. The strengths and limitations of available code-free, code-minimal, and code-intensive autoML platforms are considered. AutoML has great potential to democratize AI in medicine, improving AI literacy by enabling “hands-on” education. AutoML may serve as a useful adjunct in research by facilitating rapid testing and benchmarking before significant computational resources are committed. AutoML may also be applied in clinical contexts, provided regulatory requirements are met. The abstraction by autoML of arduous aspects of AI engineering promotes prioritization of data set curation, supporting the transition from conventional model-driven approaches to data-centric development. To fulfill its potential, clinicians must be educated on how to apply these technologies ethically, rigorously, and effectively; this tutorial represents a comprehensive summary of relevant considerations.

Journal Article

Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness

by Sia, Alex Tiong Heng , Ong, Jasmine Chiat Ling , Elangovan, Kabilan in 692/700/139 , 692/700/1750 , Accuracy

2025

Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval Augmented Generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (e.g., GPT3.5, GPT4, GPT4o, Gemini, Llama2, Llama3, and Claude) were tested across 14 clinical scenarios. A total of 3234 responses were generated and compared to 448 human-generated answers. The GPT4 LLM-RAG model with international guidelines generated answers within 20 s and achieved the highest accuracy, which was significantly better than human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model exhibited an absence of hallucinations and produced more consistent output than humans. This study underscores the potential of GPT-4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.

Journal Article

Large Language Models in Randomized Controlled Trials Design: Observational Study

by Pyle, Alexandra , Ong, Jasmine Chiat Ling , Elangovan, Kabilan in AI Language Models in Health Care , Applications of AI , Artificial Intelligence

2025

Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored. This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards. We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts by comparing them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analysis using natural language processing-based methods, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR), for objective scoring on corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity. The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%).
Natural language processing statistical analysis reported BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18 on average objective scoring of LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs in scores across all domains, indicating strong clinical alignment. Specifically, both original and LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs. Moreover, LLM-based design ranked noninferior to original designs in registered RCTs in multiple domains. In particular, LLMs enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates. LLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.

Journal Article

A scoping review on generative AI and large language models in mitigating medication related harm

by Ong, Jasmine Chiat Ling , Elangovan, Kabilan , Tan, Nichole Yue Ting in 692/308 , 692/700 , Biomedicine

2025

Medication-related harm has a significant impact on global healthcare costs and patient outcomes. Generative artificial intelligence (GenAI) and large language models (LLMs) have emerged as promising tools in mitigating risks of medication-related harm. This review evaluates the scope and effectiveness of GenAI and LLMs in reducing medication-related harm. We screened 4 databases for literature published from 1st January 2012 to 15th October 2024. A total of 3988 articles were identified, and 30 met the criteria for inclusion into the final review. Generative AI and LLMs were applied in three key areas: drug-drug interaction identification and prediction, clinical decision support, and pharmacovigilance. While the performance and utility of these models varied, they generally showed promise in early identification, classification of adverse drug events, and supporting decision-making for medication management. However, no studies tested these models prospectively, suggesting a need for further investigation into integration and real-world application.

Journal Article

Vision-language large learning model, GPT4V, accurately classifies the Boston Bowel Preparation Scale score

by Tan, Chee Kiat , Lim, Daniel Yan Zheng , Ong, Jasmine Chiat Ling in Application programming interface , Artificial Intelligence , Automation

2025

Introduction: Large learning models (LLMs) such as GPT are advanced artificial intelligence (AI) models. Originally developed for natural language processing, they have been adapted for multi-modal tasks with vision-language input. One clinically relevant task is scoring the Boston Bowel Preparation Scale (BBPS). While traditional AI techniques use large amounts of data for training, we hypothesise that vision-language LLM can perform this task with fewer examples. Methods: We used the GPT4V vision-language LLM developed by OpenAI, via the OpenAI application programming interface. A standardised prompt instructed the model to grade BBPS with contextual references extracted from the original paper describing the BBPS by Lai et al (GIE 2009). Performance was tested on the HyperKvasir dataset, an open dataset for automated BBPS grading. Results: Of 1794 images, GPT4V returned valid results for 1772 (98%). It had an accuracy of 0.84 for two-class classification (BBPS 0–1 vs 2–3) and 0.74 for four-class classification (BBPS 0, 1, 2, 3). Macro-averaged F1 scores were 0.81 and 0.63, respectively. Qualitatively, most errors arose from misclassification of BBPS 1 as 2. These results compare favourably with current methods using large amounts of training data, which achieve an accuracy in the range of 0.8–0.9. Conclusion: This study provides proof-of-concept that a vision-language LLM is able to perform BBPS classification accurately, without large training datasets. This represents a paradigm shift in AI classification methods in medicine, where many diseases lack sufficient data to train traditional AI models. An LLM with appropriate examples may be used in such cases.

Journal Article

Large language models in medicine

by Gutierrez, Laura , Tan, Ting Fang , Elangovan, Kabilan in 692/308/575 , 692/700/1719 , Artificial intelligence

2023

Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners. This review explains how large language models (LLMs), such as ChatGPT, are developed and discusses their strengths and limitations in the context of potential clinical applications.

Journal Article

Development and evaluation of a lightweight large language model chatbot for medication enquiry

by Zhong, Ryan Jian , Kwan, Yu Heng , Ng, Lit Soo in Biology and Life Sciences , Computer and Information Sciences , Engineering and Technology

2025

Large Language Models (LLMs) show promise in augmenting digital health applications. However, development and scaling of large models face computational constraints, data security concerns and limitations of internet accessibility in some regions. We developed and tested Med-Pal, a medical domain-specific LLM-chatbot fine-tuned with a fine-grained, expert curated medication-enquiry dataset consisting of 1,100 question and answer pairs. We trained and validated five light-weight, open-source LLMs of smaller parameter size (7 billion or less) on a validation dataset of 231 medication-related enquiries. We introduce SCORE, a set of LLM-specific evaluation criteria for clinical adjudication of LLM responses, performed by a multidisciplinary expert team. The best performing light-weight LLM was chosen as Med-Pal for further engineering with guard-railing against adversarial prompts. Med-Pal outperformed Biomistral and Meerkat, achieving 71.9% high-quality responses in a separate testing dataset. Med-Pal’s light-weight architecture, clinical alignment and safety guardrails enable implementation under varied settings, including those with limited digital infrastructure.

Journal Article

When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations

by Elangovan, Kabilan , Ting, Daniel in Chest , Drift , Image classification

2026

Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.

Paper
