Search Results Heading

MBRLSearchResults

mbrl.module.common.modules.added.book.to.shelf
Title added to your shelf!
View what I already have on My Shelf.
Oops! Something went wrong.
Oops! Something went wrong.
While trying to add the title to your shelf something went wrong :( Kindly try again later!
Are you sure you want to remove the book from the shelf?
Oops! Something went wrong.
Oops! Something went wrong.
While trying to remove the title from your shelf something went wrong :( Kindly try again later!
    Done
    Filters
    Reset
  • Discipline
      Discipline
      Clear All
      Discipline
  • Is Peer Reviewed
      Is Peer Reviewed
      Clear All
      Is Peer Reviewed
  • Item Type
      Item Type
      Clear All
      Item Type
  • Subject
      Subject
      Clear All
      Subject
  • Year
      Year
      Clear All
      From:
      -
      To:
  • More Filters
320 result(s) for "Generative Language Models Including ChatGPT"
Sort by:
ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis
Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3). GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study
The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia. This study aimed to evaluate the ability of large language models (LLMs) in comparison to mental health professionals to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes. Vignettes were inputted into LLMs interfaces and assessed 10 times by 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions. For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs believed untreated schizophrenia would remain static or worsen without professional treatment. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4. The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals when considering the \"with treatment\" condition is a demonstration of the potential of this technology in providing professional clinical prognosis. The pessimistic assessment of ChatGPT-3.5 is a disturbing finding since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study
Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows. This study aimed to evaluate the diagnostic accuracy of ChatGPT-4o and 4 open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, and Llama3.1:70b) for rare diseases, assesses the language effect on diagnostic performance, and explore retrieval augmented generation (RAG) and chain-of-thought (CoT) reasoning. We extracted clinical manifestations of 121 rare diseases from China's inaugural rare disease catalog. ChatGPT-4o generated a primary and 5 differential diagnoses, while 4 LLMs were assessed in both English and Chinese contexts. The lowest-performing model underwent RAG and CoT re-evaluation. Diagnostic accuracy was compared via the McNemar test. A survey evaluated 11 clinicians' familiarity with rare diseases. ChatGPT-4o demonstrated the highest diagnostic accuracy with 90.1%. Language effects varied across models: qwen2.5:7b showed comparable performance in Chinese (51.2%) and English (47.9%; χ²1=0.32, P=.57), whereas Llama3.1:8b exhibited significantly higher English accuracy (67.8% vs 31.4%; χ²1=40.20, P<.001). Among larger models, qwen2.5:72b maintained cross-lingual consistency considering the odds ratio (OR; Chinese: 82.6% vs English: 83.5%; OR 0.88, 95% CI 0.27-2.76,P=1.000), contrasting with Llama3.1:70b's language-dependent variation (Chinese: 80.2% vs English: 90.1%; OR 0.29,95% CI 0.08-0.83, P=.02). Cross-model comparisons revealed Llama3.1:8b underperformed qwen2.5:7b in Chinese (χ²1=13.22,P<.001) but surpassed it in English (χ²1=13.92,P<.001). No significant differences were observed between qwen2.5:72b and Llama3.1:70b (English: OR 0.33, P=.08; Chinese: OR 1.5, 95% CI 0.48-5.12,P=.07); qwen2.5:72b matched ChatGPT-4o's performance in both languages (English: OR 0.33, P=.08; Chinese: OR 0.44, P=.09); Llama3.1:70b mirrored ChatGPT-4o's English accuracy (OR 1, P=1.000) but lagged in Chinese (OR 0.33; P=.02). RAG implementation enhanced qwen2.5:7b's accuracy to 79.3% (χ²1=31.11, P<.001) with 85.9% retrieval precision. The distilled model Deepseek-R1:7b markedly underperformed (9.9% vs qwen2.5:7b; χ²1=42.19, P<.001). Clinician surveys revealed significant knowledge gaps in rare disease management. ChatGPT-4o demonstrated superior diagnostic performance for rare diseases. While Llama3.1:8b demonstrates viability for localized deployment in resource-constrained English diagnostic workflows, Chinese applications require larger models to achieve comparable diagnostic accuracy. This urgency is heightened by the release of open-source models like DeepSeek-R1, which may see rapid adoption without thorough validation. Successful clinical implementation of LLMs requires 3 core elements: model parameterization, user language, and pretraining data. The integration of RAG significantly enhanced open-source LLM accuracy for rare disease diagnosis, although caution remains warranted for low-parameter reasoning models showing substantial performance limitations. We recommend hospital IT departments and policymakers prioritize language relevance in model selection and consider integrating RAG with curated knowledge bases to enhance diagnostic utility in constrained settings, while exercising caution with low-parameter models.
Artificial Intelligence Tools for Automating Evidence Synthesis: Scoping Review
Rapidly and accurately synthesizing large volumes of evidence is a time- and resource-intensive process. Once published, reviews often risk becoming outdated, limiting their usefulness for decision makers. Recent advancements in artificial intelligence (AI) have enabled researchers to automate stages of the evidence synthesis process, from literature searching and screening to data extraction and analysis. As previous reviews on this topic have been published, a significant number of tools have been further developed and evaluated. Furthermore, as generative AI increasingly automates evidence synthesis, understanding how it is studied and applied is crucial, given both its benefits and risks. This review aimed to map the current landscape of evaluated AI tools used to automate evidence synthesis. Following the Joanna Briggs Institute methodology for scoping reviews, we searched Ovid MEDLINE, Ovid Embase, Scopus, and Web of Science in February 2025 and conducted a gray literature search in April 2025. We included articles published in any language from January 2021 onward. Two reviewers independently screened citations using Rayyan, and data were extracted based on study design and key AI-related technical features. We identified 7841 unique citations through database searches and 19 records through gray literature searching. A total of 222 articles were included in the review. We identified 65 AI tools and 25 open-source models or machine learning (ML) algorithms that automate parts of or the whole evidence synthesis pathway. A total of 54.1% (n=120) of the studies were published in 2024, reflecting a trend toward researching general-purpose large language models (LLMs) for evidence synthesis automation. The most popular tool studied was generative pretrained transformer models, including its conversational interface ChatGPT (n=70, 31.5%). Moreover, 31.1% (n=69) studied tools automated by traditional ML algorithms. No studies compared traditional ML tools to LLM-based tools. In addition, 61.7% (n=137) and 26.1% (n=58) studied AI-assisted automation of title and abstract screening and data extraction, respectively, the 2 most intensive stages and, therefore, amenable to automation. Technical performance outcomes were the most frequently reported, with only 4.1% (n=9) of studies reporting time- or workload-specific outcomes. Few studies pragmatically evaluated AI tools in real-world evidence synthesis settings. This review comprehensively captures the broad, evolving suite of AI automation tools available to support evidence synthesis, leveraged by increasingly complex AI approaches that range from traditional ML to LLMs. The notable shift toward studying general-purpose generative AI tools reflects how these technologies are actively transforming evidence synthesis practice. The lack of studies in our review comparing different AI approaches for specific automation stages or evaluating their effectiveness pragmatically represents a significant research gap. Optimal tool selection will likely depend on the review topic and methodology and researcher priorities. While they offer potential for reducing workload, ongoing evaluation to mitigate AI bias and to ensure the integrity of reviews is essential for safeguarding evidence-based decision-making.
Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieves the best 71% on MedMCQA (medical multiple-choice question answering dataset), Vicuna-13B achieves 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieves the best 70% among all, showing the potential for better performance across different tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, combining multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through our proposed two ensemble strategies. Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test, with yes, no, or maybe answered for each question), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test, 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of two ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering. Both ensemble methods outperformed individual LLMs across all 3 datasets. Specifically comparing the best individual LLM, the Boosting-based Majority Weighted Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yields even higher accuracies of 38.01% (+5.98%) for MedMCQA, 96.36% (+1.09%) for PubMedQA, and 38.13% (+0.87%) for MedQA-USMLE. The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. Through effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.
Prompt Engineering in Clinical Practice: Tutorial for Clinicians
Large language models (LLMs), such as OpenAI’s GPT series and Google’s PaLM, are transforming health care by improving clinical decision-making, enhancing patient communication, and simplifying administrative tasks. However, their performance relies heavily on prompt design, as small changes in wording or structure can greatly impact output quality. This presents challenges for clinicians who are not experts in natural language processing (NLP). This tutorial combines prompt engineering techniques tailored for clinical use, covering methods like zero-shot prompting, one-shot prompting, few-shot prompting, chain-of-thought prompting, self-consistency prompting, generated knowledge prompting, and meta-prompting. We provide actionable guidance on defining objectives, applying core principles, iterative prompt refinement, and integration into interoperable electronic health record (EHR) systems. This framework helps clinicians leverage LLMs to improve decision-making, streamline documentation, and enhance patient communication while maintaining ethical standards and ensuring patient safety.
Performance Evaluation of GPT-5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: Retrospective Comparative Study
Large language models (LLMs) demonstrate potential in the laboratory, yet rigorous clinical evaluation remains limited. The opacity of LLM decision-making constrains their safe application in interpreting complete blood count (CBC) reports for hematologic diseases. This study aimed to conduct an exploratory evaluation of GPT-5, Grok 4, and DeepSeek R1 in interpreting real-world CBC reports, particularly their reasoning capabilities and clinical safety. This single-center retrospective study analyzed 100 CBC reports from initial-visit patients with hematologic conditions. After responses were generated by the 3 LLMs using standardized Chinese prompts, four trained laboratory physicians blindly evaluated them across 6 quality and 5 task dimensions. Interrater reliability was assessed using intraclass correlation coefficients (ICCs), and performance differences were assessed based on 4-rater consensus scores and Friedman and Wilcoxon tests. For task 4 (ablation analysis), the McNemar test was used to compare top-1 diagnostic concordance with the gold-standard diagnosis within each model, with and without initial clinical suspicion in the prompt. Error types and distributions were documented during the task evaluation. DeepSeek R1 demonstrated excellent interrater reliability across most quality dimensions (ICC ≥0.75). In the quality dimension, DeepSeek R1 significantly outperformed the other models in comprehensiveness, accuracy, clarity, relevance, and practicality. In the task 4 evaluation, GPT-5 demonstrated the highest concordance (93/100, 93%) with gold-standard diagnoses, followed by DeepSeek R1 (92/100, 92%) and Grok 4 (89/100, 89%). After removing the initial clinical suspicion, these rates decreased to 79% (79/100), 77% (77/100), and 72% (72/100), representing statistically significant within-model reductions for all models (P<.001). Post hoc error analysis revealed distinct patterns across task dimensions. GPT-5 exhibited 12 hallucinations in the analyzer alert processing task; DeepSeek R1 demonstrated 1 hallucination in the abnormal item identification task, whereas Grok 4 displayed none. All models exhibited reasoning errors and varying degrees of deficiencies in the correlation analysis and preliminary diagnosis tasks, characterized by unwarranted inferences of disease status from isolated results without clinical integration. Grok 4 generated 9 reasoning errors in the clinical management task by providing generic recommendations not tailored to case-specific CBC data, potentially compromising individualized treatment decisions. While current LLMs demonstrate potential for interpreting CBC reports in hematologic diseases, they show performance heterogeneity across models. The ablation study findings underscore the necessity of integrating clinical context for accurate laboratory test interpretation. Low scores, hallucinations, and reasoning errors in model outputs indicate that current clinical deployment requires human oversight and quality control. As this single-center, Chinese-language exploratory assessment provides only preliminary, possibly context-dependent evidence, multicenter, cross-lingual prospective validation is needed to delineate the practical boundaries and safety standards for clinical deployment.
Performance of Large Language Models in Diagnosing Rare Hematologic Diseases and the Impact of Their Diagnostic Outputs on Physicians: Combined Retrospective and Prospective Study
Rare hematologic diseases are frequently underdiagnosed or misdiagnosed due to their clinical complexity. Whether new-generation large language models (LLMs), particularly those using chain-of-thought reasoning, can improve diagnostic accuracy remains unclear. This study aimed to evaluate the diagnostic performance of new-generation commercial LLMs in rare hematologic diseases and to determine whether the LLM output enhances physicians' diagnostic accuracy. We conducted a 2-phase study. In the retrospective phase, we evaluated 7 mainstream LLMs on 158 nonpublic real-world admission records covering 9 rare hematologic diseases, assessed diagnostic performance using top-10 accuracy and mean reciprocal rank (MRR), and evaluated ranking stability via Jaccard similarity and entropy. Spearman rank correlation was used to examine the association between physicians' diagnoses and LLM-generated outputs. In the prospective phase, 28 physicians with varying levels of experience diagnosed 5 cases each, gaining access to LLM-generated diagnoses across 3 sequential steps to assess whether LLMs can improve diagnostic accuracy. In the retrospective phase, ChatGPT-o1-preview demonstrated the highest top-10 accuracy (70.3%) and MRR (0.577), and DeepSeek-R1 ranked second. Diagnostic performance was low for amyloid light-chain (AL) amyloidosis; Castleman disease; Erdheim-Chester disease; and polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome. Interestingly, higher accuracy often correlated with lower ranking stability across most LLMs. The physician performance showed a strong correlation with both top-10 accuracy (ρ=0.565) and MRR (ρ=0.650). In the prospective phase, LLMs significantly improved the diagnostic accuracy of less-experienced physicians; no significant benefit was observed for specialists. However, when LLMs generated biased responses, physician performance often failed to improve or even declined. Without fine-tuning, new-generation commercial LLMs, particularly those with chain-of-thought reasoning, can identify diagnoses of rare hematologic diseases with high accuracy and significantly enhance the diagnostic performance of less-experienced physicians. Nevertheless, biased LLM outputs may mislead clinicians, highlighting the need for critical appraisal and cautious clinical integration with appropriate safeguard systems.
Large Language Models in Neurological Practice: Real-World Study
Large language models (LLMs) such as ChatGPT (OpenAI) and Gemini (Google) are increasingly explored for their potential in medical diagnostics, including neurology. Their real-world applicability remains inadequately assessed, particularly in clinical workflows where nuanced decision-making is required. This study aims to evaluate the diagnostic accuracy and appropriateness of clinical recommendations provided by not-specifically-trained, freely available ChatGPT and Gemini, compared to neurologists, using real-world clinical cases. This study consisted of an experimental evaluation of LLMs' diagnostic performance presenting real-world neurology cases to ChatGPT and Gemini, comparing their performance with that of clinical neurologists. The study was conducted simulating a first visit using information from anonymized patient records from the Neurology Department of the ASST Santi Paolo e Carlo Hospital, ensuring a real-world clinical context. The study involved a cohort of 28 anonymized patient cases covering a range of neurological conditions and diagnostic complexities representative of daily clinical practice. The primary outcome was diagnostic accuracy of both neurologists and LLMs, defined as concordance with discharge diagnoses. Secondary outcomes included the appropriateness of recommended diagnostic tests, interrater agreement, and the extent of additional prompting required for accurate responses. Neurologists achieved a diagnostic accuracy of 75%, outperforming ChatGPT (54%) and Gemini (46%). Both LLMs demonstrated limitations in nuanced clinical reasoning and overprescribed diagnostic tests in 17%-25% of cases. In addition, complex or ambiguous cases required further prompting to refine artificial intelligence-generated responses. Interrater reliability analysis using Fleiss Kappa showed a moderate-to-substantial level of agreement among raters (κ=0.47, SE 0.077; z=6.14, P<.001), indicating agreement between raters. While LLMs show potential as supportive tools in neurology, they currently lack the depth required for independent clinical decision-making when using freely available LLMs without previous specific training. The moderate agreement observed among human raters underscores the variability even in expert judgment and highlights the importance of rigorous validation when integrating artificial intelligence tools into clinical workflows. Future research should focus on refining LLM capabilities and developing evaluation methodologies that reflect the complexities of real-world neurological practice, ensuring effective, responsible, and safe use of such promising technologies.
Using Large Language Models to Assess the Consistency of Randomized Controlled Trials on AI Interventions With CONSORT-AI: Cross-Sectional Survey
Chatbots based on large language models (LLMs) have shown promise in evaluating the consistency of research. Previously, researchers used LLM to assess if randomized controlled trial (RCT) abstracts adhered to the CONSORT-Abstract guidelines. However, the consistency of artificial intelligence (AI) interventional RCTs aligning with the CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) standards by LLMs remains unclear. The aim of this study is to identify the consistency of RCTs on AI interventions with CONSORT-AI using chatbots based on LLMs. This cross-sectional study employed 6 LLM models to assess the consistency of RCTs on AI interventions. The sample selection is based on articles published in JAMA Network Open, which included a total of 41 RCTs. All queries were submitted to LLMs through an application programming interface with a temperature setting of 0 to ensure deterministic responses. One researcher posed the questions to each model, while another independently verified the responses for validity before recording the results. The Overall Consistency Score (OCS), recall, inter-rater reliability, and consistency of contents were analyzed. We found gpt-4-0125-preview has the best average OCS on the basis of the results obtained by JAMA Network Open authors and by us (86.5%, 95% CI 82.5%-90.5% and 81.6%, 95% CI 77.6%-85.6%, respectively), followed by gpt-4-1106-preview (80.3%, 95% CI 76.3%-84.3% and 78.0%, 95% CI 74.0%-82.0%, respectively). The model with the worst average OCS is gpt-3.5-turbo-0125 on the basis of the results obtained by JAMA Network Open authors and by us (61.9%, 95% CI 57.9%-65.9% and 63.0%, 95% CI 59.0%-67.0%, respectively). Among the 11 unique items of CONSORT-AI, Item 2 (\"State the inclusion and exclusion criteria at the level of the input data\") received the poorest overall evaluation across the 6 models, with an average OCS of 48.8%. For other items, those with an average OCS greater than 80% across the 6 models included Items 1, 5, 8, and 9. GPT-4 variants demonstrate strong performance in assessing the consistency of RCTs with CONSORT-AI. Nonetheless, refining the prompts could enhance the precision and consistency of the outcomes. While AI tools like GPT-4 variants are valuable, they are not yet fully autonomous in addressing complex and nuanced tasks such as adherence to CONSORT-AI standards. Therefore, integrating AI with higher levels of human supervision and expertise will be crucial to ensuring more reliable and efficient evaluations, ultimately advancing the quality of medical research.