Benchmarking large language models against clinicians across hospital levels in cardiovascular decision-making: a cross-sectional vignette-based study

by Zhang, Zixi; Dai, Yongguo; Liu, Qiming; Liu, Chan; Tu, Tao; Ma, Yingxu; Xiao, Yichao; Lin, Qiuzhen; Wang, Cancan

in 631/114 / 692/308 / 692/700 / Accuracy / Adult / Artificial intelligence / Benchmarking / Cardiovascular Diseases - diagnosis / Chatbots / ChatGPT 4.0 / China / Clinical Competence / Clinical decision-making / Clinical Decision-Making - methods / Cross-Sectional Studies / Decision making / DeepSeek-R1 / Female / Hospitals / Humanities and Social Sciences / Humans / Language / Large Language Models / Male / Memory / Middle Aged / multidisciplinary / Multiple choice / Science / Science (multidisciplinary) / Sensitivity analysis

2025

Journal Article

Overview

Large language models (LLMs) have shown strong performance on standardized medical examinations, yet direct comparisons with human clinicians remain limited. This study benchmarked DeepSeek-R1 and ChatGPT 4.0 against cardiovascular clinicians from different hospital levels in China. We conducted a cross-sectional, vignette-based assessment consisting of 100 standardized cardiovascular multiple-choice questions covering four competency domains: clinical reasoning (CR), frontier updates (FU), basic memory (BM), and emergency decision (ED). Thirty clinicians from six hospitals (three primary and three tertiary) were compared with the two LLMs. Each model answered every question five times, and run-to-run consistency was evaluated. Mean differences (LLM − clinician) with 95% confidence intervals (CIs) were estimated using nonparametric bootstrap resampling (10,000 iterations). Clinicians achieved a mean total score of 69.7 ± 7.9, whereas DeepSeek-R1 and ChatGPT 4.0 scored 97 and 95, respectively. The mean total score differences were +27.3 points (95% CI 24.4–30.1) for DeepSeek-R1 and +25.3 points (95% CI 22.4–28.1) for ChatGPT 4.0. Both models outperformed clinicians in CR, FU, BM, and ED. Run-to-run agreement was high (DeepSeek-R1 κ = 0.73; ChatGPT 4.0 κ = 0.76). LLMs substantially outperformed clinicians in knowledge- and decision-based tasks while approaching clinician-level performance in CR. These findings suggest that LLMs may complement clinical expertise and enhance diagnostic consistency across hospital levels.
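The bootstrap step mentioned in the overview can be illustrated with a minimal sketch. It assumes a percentile bootstrap that resamples clinicians with replacement and treats the LLM total as a fixed score; the record does not state which bootstrap variant the authors used. The function name `bootstrap_mean_diff_ci` and the clinician scores below are hypothetical and are not the study's data.

```python
import random
import statistics


def bootstrap_mean_diff_ci(llm_score, clinician_scores, n_iter=10_000,
                           alpha=0.05, seed=0):
    """Percentile-bootstrap CI for (LLM score - mean clinician score).

    Assumption: clinicians are resampled with replacement and the LLM
    score is held fixed. This is an illustration, not the authors' code.
    """
    rng = random.Random(seed)
    n = len(clinician_scores)
    diffs = []
    for _ in range(n_iter):
        # Draw n clinicians with replacement and recompute the difference.
        resample = [clinician_scores[rng.randrange(n)] for _ in range(n)]
        diffs.append(llm_score - statistics.fmean(resample))
    diffs.sort()
    # Percentile bounds taken from the sorted bootstrap distribution.
    lower = diffs[int((alpha / 2) * n_iter)]
    upper = diffs[int((1 - alpha / 2) * n_iter) - 1]
    point = llm_score - statistics.fmean(clinician_scores)
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical total scores for 30 clinicians (placeholder values).
    clinicians = [60, 65, 70, 75, 80] * 6
    point, lower, upper = bootstrap_mean_diff_ci(97, clinicians)
    print(f"difference = {point:+.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Fixing the seed keeps the interval reproducible between runs; with real data the interval width depends on the spread of the clinician scores.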

Publisher

Nature Publishing Group UK, Nature Publishing Group, Nature Portfolio

Subject

631/114 / 692/308 / 692/700 / Accuracy / Adult / Artificial intelligence / Benchmarking