Performance Evaluation of GPT-5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: Retrospective Comparative Study
by Zhou, Suming; Ren, Chunyun; Fan, Lina; Yu, Qian; Qi, Xinglun; Ye, Xianfei; Yang, Dagan
in Analysis / Applications of AI / Artificial Intelligence / Blood Cell Count / Blood diseases / Clinical Information and Decision Making / Decision-making / Evidence-based medicine / Foundation Models and Their Applications in AI / Generative Artificial Intelligence / Generative Language Models Including ChatGPT / Hematologic Diseases - blood / Hematologic Diseases - diagnosis / Humans / Large Language Models / Medical care / Medical errors / Medical research / Medicine, Experimental / Original Paper / Quality management / Reproducibility of Results / Research Instruments, Questionnaires, and Tools / Retrospective Studies
2026
Journal Article
Overview
Large language models (LLMs) demonstrate potential in the laboratory, yet rigorous clinical evaluation remains limited. The opacity of LLM decision-making constrains their safe application in interpreting complete blood count (CBC) reports for hematologic diseases.
This study aimed to conduct an exploratory evaluation of GPT-5, Grok 4, and DeepSeek R1 in interpreting real-world CBC reports, particularly their reasoning capabilities and clinical safety.
This single-center retrospective study analyzed 100 CBC reports from initial-visit patients with hematologic conditions. After the 3 LLMs generated responses to standardized Chinese prompts, 4 trained laboratory physicians blindly evaluated the responses across 6 quality and 5 task dimensions. Interrater reliability was assessed using intraclass correlation coefficients (ICCs), and performance differences were assessed using 4-rater consensus scores with Friedman and Wilcoxon tests. For task 4 (ablation analysis), the McNemar test was used to compare top-1 diagnostic concordance with the gold-standard diagnosis within each model, with and without initial clinical suspicion in the prompt. Error types and distributions were documented during the task evaluation.
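As an illustration of the within-model comparison described above, the following is a minimal sketch of an exact (binomial) McNemar test on paired outcomes. It is not the authors' analysis code, and the abstract does not say whether the exact or the chi-square variant was used. The function name `mcnemar_exact` and the discordant-pair counts `b` and `c` are hypothetical, since the abstract reports only marginal concordance rates.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases correct WITH clinical suspicion but wrong WITHOUT it
    c: cases wrong WITH clinical suspicion but correct WITHOUT it
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair is equally likely to fall either way,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only; not taken from the study.
p_value = mcnemar_exact(b=15, c=3)
print(f"exact McNemar p = {p_value:.4f}")
```

Because only discordant pairs contribute, the same net change in concordance can yield different P values depending on how many cases flipped in each direction.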
DeepSeek R1 demonstrated excellent interrater reliability across most quality dimensions (ICC ≥0.75). Across the quality dimensions, DeepSeek R1 significantly outperformed the other models in comprehensiveness, accuracy, clarity, relevance, and practicality. In the task 4 evaluation, GPT-5 demonstrated the highest concordance (93/100, 93%) with gold-standard diagnoses, followed by DeepSeek R1 (92/100, 92%) and Grok 4 (89/100, 89%). After removing the initial clinical suspicion, these rates decreased to 79% (79/100), 77% (77/100), and 72% (72/100), respectively, representing statistically significant within-model reductions for all models (P<.001). Post hoc error analysis revealed distinct patterns across task dimensions. GPT-5 exhibited 12 hallucinations in the analyzer alert processing task; DeepSeek R1 demonstrated 1 hallucination in the abnormal item identification task, whereas Grok 4 displayed none. All models exhibited reasoning errors and varying degrees of deficiencies in the correlation analysis and preliminary diagnosis tasks, characterized by unwarranted inferences of disease status from isolated results without clinical integration. Grok 4 generated 9 reasoning errors in the clinical management task by providing generic recommendations not tailored to case-specific CBC data, potentially compromising individualized treatment decisions.
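The size of the ablation effect can be read directly from the concordance counts reported above. The short sketch below computes the absolute drop (in percentage points) and the relative drop for each model. The variable names are illustrative, and the relative-drop figure is a derived quantity that the abstract does not itself report.

```python
# Top-1 concordance counts out of 100 reports, as reported in the abstract:
# (with initial clinical suspicion, without initial clinical suspicion)
N_REPORTS = 100
concordance = {
    "GPT-5": (93, 79),
    "DeepSeek R1": (92, 77),
    "Grok 4": (89, 72),
}

absolute_drop = {}  # percentage points lost when suspicion is removed
relative_drop = {}  # share of originally concordant cases that were lost
for model, (with_ctx, without_ctx) in concordance.items():
    absolute_drop[model] = 100 * (with_ctx - without_ctx) / N_REPORTS
    relative_drop[model] = 100 * (with_ctx - without_ctx) / with_ctx

for model in concordance:
    print(f"{model}: -{absolute_drop[model]:.1f} points, "
          f"-{relative_drop[model]:.1f}% relative")
# Absolute drops: GPT-5 14.0, DeepSeek R1 15.0, Grok 4 17.0 percentage points
```

On these figures, the models' ranking is the same with and without the clinical suspicion, and the gap between the best and worst model widens from 4 to 7 percentage points.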
While current LLMs demonstrate potential for interpreting CBC reports in hematologic diseases, they show performance heterogeneity across models. The ablation study findings underscore the necessity of integrating clinical context for accurate laboratory test interpretation. Low scores, hallucinations, and reasoning errors in model outputs indicate that current clinical deployment requires human oversight and quality control. As this single-center, Chinese-language exploratory assessment provides only preliminary, possibly context-dependent evidence, multicenter, cross-lingual prospective validation is needed to delineate the practical boundaries and safety standards for clinical deployment.
Publisher
Journal of Medical Internet Research, JMIR Publications Inc, JMIR Publications