Catalogue Search | MBRL

Evaluating and addressing demographic disparities in medical large language models: a systematic review

by Sakhuja, Ankit , Horowitz, Carol R. , Agbareia, Reem in Accuracy , Algorithms , Artificial intelligence

2025

Background Large language models are increasingly evaluated for use in healthcare. However, concerns about their impact on disparities persist. This study reviews current research on demographic biases in large language models to identify prevalent bias types, assess measurement methods, and evaluate mitigation strategies. Methods We conducted a systematic review, searching publications from January 2018 to July 2024 across five databases. We included peer-reviewed studies evaluating demographic biases in large language models, focusing on gender, race, ethnicity, age, and other factors. Study quality was assessed using the Joanna Briggs Institute Critical Appraisal Tools. Results Our review included 24 studies. Of these, 22 (91.7%) identified biases. Gender bias was the most prevalent, reported in 15 of 16 studies (93.7%). Racial or ethnic biases were observed in 10 of 11 studies (90.9%). Only two studies found minimal or no bias in certain contexts. Mitigation strategies mainly included prompt engineering, with varying effectiveness. However, these findings are tempered by a potential publication bias, as studies with negative results are less frequently published. Conclusion Biases are observed in large language models across various medical domains. While bias detection is improving, effective mitigation strategies are still developing. As LLMs increasingly influence critical decisions, addressing these biases and their resultant disparities is essential for ensuring fair artificial intelligence systems. Future research should focus on a wider range of demographic factors, intersectional analyses, and non-Western cultural contexts.

Journal Article

Large language models in medicine: A review of current clinical trials across healthcare applications

by Klang, Eyal , Omar, Mahmud , Nadkarni, Girish N. in Artificial intelligence , Biology and Life Sciences , Chatbots

2024

This review analyzes current clinical trials investigating large language models’ (LLMs) applications in healthcare. We identified 27 trials (5 published and 22 ongoing) across 4 main clinical applications: patient care, data handling, decision support, and research assistance. Our analysis reveals diverse LLM uses, from clinical documentation to medical decision-making. Published trials show promise but highlight accuracy concerns. Ongoing studies explore novel applications like patient education and informed consent. Most trials occur in the United States of America and China. We discuss the challenges of evaluating rapidly evolving LLMs through clinical trials and identify gaps in current research. This review aims to inform future studies and guide the integration of LLMs into clinical practice.

Journal Article

The IL-23/IL-17 axis in Behçet’s syndrome pathogenesis: from immunological perspectives to therapeutic implications

by Hassan, Fadi , Omar, Mahmud , Naffaa, Mohammad E. in Adaptive immunity , Animals , Antibodies, Monoclonal, Humanized - therapeutic use

2026

Behçet’s Syndrome (BS) is a systemic vasculitis characterized by variable vessel involvement and an elusive etiology. Immunogenetic studies strongly implicate the IL-23/IL-17 axis, which bridges innate and adaptive immunity and orchestrates type 17 T-cell responses that modulate neutrophil function; the neutrophil is a central player in both BS clinical features and immunopathology. Additionally, the contribution of Th1 cytokines, such as interferon gamma (IFNγ) and tumor necrosis factor alpha (TNFα), reflects the broader immune plasticity observed in BS pathophysiology. Despite the immunogenetic evidence incriminating the IL-23/IL-17 axis, clinical evidence confirming a role for IL-23/IL-17 inhibition in BS therapy is still limited and includes disappointing results with secukinumab in trials for Behçet’s uveitis. However, emerging evidence from small-scale retrospective studies, prospective trials, and case reports indicates that IL-23/IL-17 axis inhibition may benefit mucocutaneous and articular manifestations, as well as neuro-Behçet’s disease; in addition, the licensed PDE4 inhibitor apremilast regulates multiple aspects of the IL-23/IL-17 axis and neutrophil biology. Interestingly, anti-IL-17 therapy has been linked to BS induction. Herein, we discuss IL-23/IL-17 axis inhibition in BS and why it should be used cautiously and be limited to mucocutaneous and/or articular manifestations at this juncture. Further randomized controlled trials are imperative to dissect the IL-23/IL-17 axis in BS, including high-dose anti-IL-23 therapy, given that neutrophils are an abundant source of IL-23, and to consider novel strategies including IL-23R antagonism.

Journal Article

Sociodemographic biases in medical decision making by large language models

by Bragazzi, Nicola Luigi , Horowitz, Carol R. , Agbareia, Reem in 692/700/3935 , 692/700/478 , Adult

2025

Large language models (LLMs) show promise in healthcare, but concerns remain that they may produce medically unjustified clinical care recommendations reflecting the influence of patients’ sociodemographic characteristics. We evaluated nine LLMs, analyzing over 1.7 million model-generated outputs from 1,000 emergency department cases (500 real and 500 synthetic). Each case was presented in 32 variations (31 sociodemographic groups plus a control) while holding clinical details constant. Compared to both a physician-derived baseline and each model’s own control case without sociodemographic identifiers, cases labeled as Black or unhoused or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions or mental health evaluations. For example, certain cases labeled as being from LGBTQIA+ subgroups were recommended mental health assessments approximately six to seven times more often than clinically indicated. Similarly, cases labeled as having high-income status received significantly more recommendations ( P < 0.001) for advanced imaging tests such as computed tomography and magnetic resonance imaging, while low- and middle-income-labeled cases were often limited to basic or no further testing. After applying multiple-hypothesis corrections, these key differences persisted. Their magnitude was not supported by clinical reasoning or guidelines, suggesting that they may reflect model-driven bias, which could eventually lead to health disparities rather than acceptable clinical variation. Our findings, observed in both proprietary and open-source models, underscore the need for robust bias evaluation and mitigation strategies to ensure that LLM-driven medical advice remains equitable and patient centered. 
A panel of nine LLMs was exposed to simulated clinical cases with switched sociodemographic features exploring ethnic, social, sexual orientation and gender dimensions and showed differences in recommendations for patient treatment, referral and follow-up based only on these features.

Journal Article

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

by Klang, Eyal , Agbareia, Reem , Omar, Mahmud in Accuracy , Application programming interface , Benchmarking - methods

2025

The ability of large language models (LLMs) to assess their own confidence when answering questions in the biomedical realm remains underexplored. This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses. We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model, GPT-4o, had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model, Qwen2-7B, showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings.
Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.

Journal Article

Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support

by Bragazzi, Nicola Luigi , Nadkarni, Girish N. , Charney, Alexander in 631/114 , 692/308 , Automation

2025

Background Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as “hallucinations”). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors. Methods We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions—differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a “hallucination”. Results Hallucination rates range from 50 % to 82 % across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66 % to 44 % ( p < 0.001). For the best-performing model, GPT-4o, rates decline from 53 % to 23 % ( p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination. Conclusions LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them. 
Plain language summary Large language models (LLMs), such as ChatGPT, are artificial intelligence-based computer programs that generate text based on the information they were trained on. We test six large language models with 300 pieces of text similar to those written by doctors as clinical notes, but containing a single fake lab value, sign, or disease. We find that the LLMs repeat or elaborate on the planted error in up to 83 % of cases. Adopting strategies to prevent the impact of inappropriate instructions can halve the rate but does not eliminate the risk of errors. Our results highlight that caution should be taken when using LLMs to interpret clinical notes.
Omar et al. test six leading large language models with 300 doctor-designed clinical vignettes containing a single fake lab value, sign or disease. They show that the models repeat or elaborate on the planted error in up to 83 % of cases, with a simple mitigation prompt halving the rate but not eliminating the risk.

Journal Article

Big data- and machine learning-based analysis of a global pharmacovigilance database enables the discovery of sex-specific differences in the safety profile of dual IL4/IL13 blockade

by Bragazzi, Nicola Luigi , Watad, Abdulla , Sharif, Kassem in Asthma , atopic dermatitis , big data analytics

2023

Background: Due to its apparent efficacy and safety, dupilumab, a monoclonal antibody that blocks Interleukin 4 (IL-4) and Interleukin 13 (IL-13), has been approved for treating T-helper 2 (Th2) disorders. However, adverse effects like local injection site reactions, conjunctivitis, headaches, and nasopharyngitis have been reported. Sex differences are known to influence both adaptive and innate immune responses and, thus, may have a bearing on the occurrence of these adverse effects. Nevertheless, the literature lacks a comprehensive exploration of this influence, a gap this study aims to bridge. Materials and Methods: A comprehensive data mining of VigiBase, the World Health Organization (WHO) global pharmacovigilance database, which contains case safety reports of adverse drug reactions (ADRs), was performed to test for sex-specific safety responses to dual IL4/IL13 blockade by dupilumab. The information component (IC), a measure of the disproportionality of ADR occurrence, was evaluated and compared between males and females to identify potential sexual dimorphism. Results: Of the 94,065 ADRs recorded in the WHO global pharmacovigilance database, 2,001 (57.4%) were reported among female dupilumab users, and 1,768 (50.7%) were among males. Immune/autoimmune T-helper 1 (Th1)-, innate- and T-helper 17 (Th17)-driven diseases and degenerative ones were consistently reported with a stronger association with dupilumab in males than females. Some adverse events were more robustly associated with dupilumab in females. Conclusion: Dupilumab has an excellent safety profile, even though some ADRs may occur. The risk is higher among male patients; further studies, including ad hoc studies, are needed to establish causality.

Journal Article

Limited Clinical Impact of Genetic Associations between Celiac Disease and Type 2 Inflammatory Diseases: Insights from Mendelian Randomization

by Nassar, Salih , Sharif, Kassem , Omar, Mahmud in Allergic rhinitis , Asthma , Atopic dermatitis

2024

Background: Celiac disease, a gluten-triggered autoimmune disorder, is known for its systemic inflammatory effects. Its genetic associations with type 2 inflammatory diseases like asthma, allergic rhinitis, and atopic dermatitis remain unclear, prompting this study to explore their potential genetic interplay. Methods: Utilizing two-sample Mendelian randomization (TSMR), we examined the genetic associations using 15 genetic instruments from GWAS datasets. Our analysis focused on celiac disease and its relation to asthma, allergic rhinitis, atopic dermatitis, and IgE-mediated food allergies. A power analysis was conducted to determine the study’s detection capabilities, and odds ratios (ORs) with 95% confidence intervals (CIs) were calculated using various MR methods. Results: Our Mendelian randomization analysis identified statistically significant genetic associations between celiac disease and several type 2 inflammatory diseases, although these were practically insignificant. Specifically, celiac disease was associated with a slight increase in the risk of atopic dermatitis (OR = 1.037) and a minor protective effect against asthma (OR = 0.97). The link with allergic rhinitis was statistically detectable (OR = 1.002) but practically negligible. Despite robust statistical confirmation through various sensitivity analyses, all observed effects remained within the range of practical equivalence (ROPE). Conclusions: Our study identifies potential genetic associations between celiac disease and certain type 2 inflammatory diseases. However, these associations, predominantly within the ROPE range, suggest only limited clinical implications. These findings highlight the need for cautious interpretation and indicate that further exploration for clinical applications may not be warranted at this stage.

Journal Article

Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review

by Klang, Eyal , Nassar, Salih , Omar, Mahmud in Accuracy , Bias , Boolean

2025

In recent years, natural language processing (NLP) has been transformed by the introduction of large language models (LLMs). This review provides an update on NLP and LLM applications and challenges in gastroenterology and hepatology. Registered with PROSPERO (CRD42024542275) and adhering to PRISMA guidelines, we searched six databases for relevant studies published from 2003 to 2024, ultimately including 57 studies. The included studies show an increase in relevant publications in 2023-2024 compared to previous years, reflecting growing interest in newer models such as GPT-3 and GPT-4. The results demonstrate that NLP models have enhanced data extraction from electronic health records and other unstructured medical data sources. Key findings include high precision in identifying disease characteristics from unstructured reports and ongoing improvement in clinical decision-making. Risk of bias assessments using ROBINS-I, QUADAS-2, and PROBAST tools confirmed the methodological robustness of the included studies. NLP and LLMs can enhance diagnosis and treatment in gastroenterology and hepatology. They enable extraction of data from unstructured medical records, such as endoscopy reports and patient notes, and can enhance clinical decision-making. Despite these advancements, integrating these tools into routine practice is still challenging. Future work should prospectively demonstrate real-world value.

Journal Article

The association between psoriasis, psoriasis severity, and inflammatory bowel disease: a population-based analysis

by Zacay, Galia , Qassem, Roula , Watad, Abdulla in Crohn's disease , Health maintenance organizations , HMOs

2024

Background: The skin–gut axis, characterized by bidirectional communication between the skin and gut, plays a crucial role in the pathogenesis of psoriasis and inflammatory bowel diseases (IBD). Objectives: We aimed to explore the association between psoriasis and IBD and identify predictors associated with IBD development among patients with psoriasis. Design: Retrospective cohort study. Methods: A retrospective study which utilized an electronic database from the Meuhedet Health Maintenance Organization (MHMO) in Israel. Psoriasis was categorized as severe if any systemic agent or phototherapy was administered. Univariate and multivariate logistic regressions were used to identify specific predictors for IBD, with adjustments made for potential confounders. The study received approval from the Ethical Committee of the MHMO. Results: In total, 61,003 adult patients who were diagnosed with psoriasis between 2000 and 2022 were included. Among them, 1495/61,003 patients (2.4%) were diagnosed with IBD, as compared to 3834/244,012 patients (1.6%) in the non-psoriasis group [adjusted odds ratio (OR): 1.47; 95% confidence interval (CI): 1.37–1.56; p < 0.001]. Increased age (OR: 1.01; 95% CI: 1.01–1.02; p < 0.001), male gender (OR: 1.22; 95% CI: 1.03–1.45; p = 0.024), and Jewish ethnicity (OR: 2.5; 95% CI: 1.2–4.1; p < 0.001) were identified as significant risk factors for IBD. Spondyloarthropathies, including psoriatic arthritis (OR: 2.27; 95% CI: 1.86–2.77; p < 0.001) and ankylosing spondylitis (OR: 2.82; 95% CI: 1.5–5.32; p < 0.05), were associated with a higher prevalence of IBD. Furthermore, severe psoriasis was significantly associated with a higher likelihood of IBD, compared to mild psoriasis (OR: 16.03; 95% CI: 11.02–23.34; p < 0.001). Conclusion: A significant association between psoriasis and IBD was demonstrated, including its subtypes: Crohn’s disease and ulcerative colitis. 
Moreover, this association may depend on psoriasis severity as determined by the treatment used. This association warrants further investigation and implies a potential need for closer monitoring of patients with severe psoriasis.
Plain language summary Association between psoriatic disease severity and risk of inflammatory bowel diseases
1- The gut and skin barriers play an integral role in psoriasis and inflammatory bowel disease (IBD) development.
2- Shared genetic and environmental factors could explain the association between both diseases.
3- We report an increased association between psoriasis and IBD, a relationship that is more pronounced in patients with severe psoriasis.
4- Patients with spondyloarthritis-related diseases have a stronger association with IBD.

Journal Article
