Catalogue Search | MBRL

Large Language Models for Transforming Healthcare: A Perspective on DeepSeek‐R1

by Zhou, Jinsong , Chen, Yingcong , He, Sixu in Accuracy , AI for healthcare , AI interpretability

2025

DeepSeek‐R1 is an open‐source Large Language Model (LLM) with advanced reasoning capabilities. It has gained significant attention for its advantages, including low cost and visualized reasoning steps. Recent reasoning LLMs such as ChatGPT‐o1 have demonstrated considerable reasoning potential, but the closed‐source nature of existing models limits customization and transparency, presenting substantial barriers to their integration into healthcare systems. This gap motivates the exploration of DeepSeek‐R1 in the medical field. Thus, we comprehensively review the transformative potential, applications, and challenges of DeepSeek‐R1 in healthcare. Specifically, we investigate how DeepSeek‐R1 can enhance clinical decision support, patient engagement, and medical education in support of clinical practice, outpatient care, and medical research. Furthermore, we critically evaluate challenges including modality limitations (text‐only), hallucination risks, and ethical issues, particularly related to patient autonomy and safety‐focused recommendations. By assessing DeepSeek‐R1's integration potential, this perspective highlights promising opportunities for advancing medical AI while emphasizing necessary improvements to maximize clinical reliability and ethical compliance. This paper provides valuable guidance for future research directions and elucidates practical application scenarios for DeepSeek‐R1's successful integration into healthcare settings. This paper explores the potential of DeepSeek‐R1, an open‐source LLM with transparent reasoning and low deployment costs, especially for clinical decision support, patient engagement, and medical education. We highlight integration opportunities in healthcare while discussing challenges such as hallucinations, ethical concerns, and text‐only modality, offering guidance for future research directions and responsible adoption of reasoning LLMs in medical settings.

Journal Article


Preliminary evaluation of DeepSeek-R1 and GPT-5.3 in selected PET/CT clinical scenarios: patient preparation, report interpretation, and diagnostic reasoning

by Tianyue Li , Runze Duan , Lu Zheng in [18F]FDG PET/CT , artificial intelligence , Chatbot

2026

Objective: To evaluate the performance of DeepSeek (R1 version), an open-source large language model, in three core clinical scenarios (answering patients’ common questions, interpreting PET/CT reports with follow-up inquiries, and diagnosing complex cases) and to compare it with GPT-5.3, in order to verify the clinical applicability of DeepSeek-R1 as an alternative AI assistant. Methods: A total of 39 standardized tasks were assigned to both models, including responding to 15 frequently asked questions about [18F]FDG PET/CT, interpreting 12 anonymized reports of lung cancer and lymphoma (with follow-up inquiries regarding tumor staging or treatment), and providing primary and differential diagnoses for 10 difficult cases. Both models were accessed via their official platforms with default parameters, and all prompts and evaluation criteria were kept identical for cross-model comparison. Two senior nuclear medicine physicians independently rated the model responses using a 4-point standardized scale (assessing appropriateness, helpfulness, inter-trial consistency, and reference validity) and a binary scale for empathy; Cohen’s Kappa coefficient was used to evaluate inter-rater agreement. McNemar’s test was used to compare paired proportions of appropriateness, empathy, and response inconsistency between the two models. Results: Across the 39 tasks, DeepSeek-R1 achieved 94.9% appropriateness and 100% helpfulness. Specifically, 91.7% of responses to follow-up inquiries about tumor staging or treatment were rated empathetic. However, 7.7% of regenerated responses showed substantial inconsistencies, primarily in tumor staging, and only 37% of cited references were fully valid, with 11.1% being invalid. GPT-5.3 exhibited equivalent core performance to DeepSeek-R1 with 94.9% appropriateness and 100% helpfulness, a slightly lower substantial inconsistency rate (5.1%), favorable reference validity (33% fully valid, 7.4% invalid), but a notably lower empathy score (66.7%) for follow-up inquiries. McNemar tests showed identical appropriateness (p = 1.00) and no significant difference in inconsistency (p = 1.00, 95% CI 0.60–14.80) between models. Although DeepSeek-R1 had higher empathy, the difference was not significant (p = 0.25, 95% CI 0.09–0.66). For the 10 identical difficult cases, both models reached 10% primary diagnosis accuracy and 60% differential diagnosis accuracy. Conclusion: DeepSeek-R1 and GPT-5.3 have complementary strengths but similar reference hallucination issues and cannot replace clinicians. DeepSeek-R1 is a cost-effective auxiliary tool, with future optimization needed for consistency, diagnostic accuracy and reference validity.
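As an illustration of the paired comparison named in this abstract, a two-sided exact McNemar test can be sketched from the counts of discordant pairs (tasks on which the two models were rated differently). This is only a sketch: the function name and the example counts are hypothetical, and the abstract does not say whether the authors used the exact or the chi-square form of the test.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: paired items where only the first model received the rating
    c: paired items where only the second model received the rating
    (Both names are illustrative; concordant pairs do not enter the test.)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the study.
p_value = mcnemar_exact(3, 0)
print(p_value)
```

Only the discordant pairs carry information here, which is why small paired samples often yield non-significant results even when the raw percentages look different.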

Journal Article


A Research Landscape of Agentic AI and Large Language Models: Applications, Challenges and Future Directions

by Mastoi, Qurat-ul-ain , Jhanjhi, N. Z. , Pillai, Thulasyammal Ramiah in Accountability , Agentic AI , Agentic artificial intelligence

2025

Agentic AI and Large Language Models (LLMs) are transforming how language is understood and generated while reshaping decision-making, automation, and research practices. LLMs provide underlying reasoning capabilities, and Agentic AI systems use them to perform tasks through interactions with external tools, services, and Application Programming Interfaces (APIs). Based on a structured scoping review and thematic analysis, this study identifies that core challenges of LLMs, relating to security, privacy and trust, misinformation, misuse and bias, energy consumption, transparency and explainability, and value alignment, can propagate into Agentic AI. Beyond these inherited concerns, Agentic AI introduces new challenges, including context management, security, privacy and trust, goal misalignment, opaque decision-making, limited human oversight, multi-agent coordination, ethical and legal accountability, and long-term safety. We analyse the applications of Agentic AI powered by LLMs across six domains: education, healthcare, cybersecurity, autonomous vehicles, e-commerce, and customer service, to reveal their real-world impact. Furthermore, we demonstrate some LLM limitations using DeepSeek-R1 and GPT-4o. To the best of our knowledge, this is the first comprehensive study to integrate the challenges and applications of LLMs and Agentic AI within a single forward-looking research landscape that promotes interdisciplinary research and responsible advancement of this emerging field.

Journal Article


Cognitive artificial intelligence for automated reservoir analysis and prediction of porosity, permeability, and fluid saturation

by Guilianno, Fossong , Okengwu, Kingsley Onyekwere , Okengwu, Ugochi Adaku in 639/166 , 639/4077 , 639/705

2026

This study develops a cognitive computing framework for reservoir characterisation and exploration optimisation in the Gabo Field, Niger Delta, with a focus on enhancing predictive accuracy for porosity, permeability, and fluid saturation. This research work aims to overcome the limitations related to geological complexity, shortage of available data, and complexity-related heterogeneity in the deltaic depositional environments. This work proposes a workflow that combines physics-based petrophysical calculations, Random Forest regression (RF), and DeepSeek-R1 cognitive perception to improve predictive abilities, estimate uncertainties, and provide actionable intelligence related to reservoir management. Exploratory data analysis involved Akima spline interpolation and Isolation Forest algorithms. The RF model achieved superior predictive performance, with R² values above 0.98 for all predicted properties, and RMSE values below accepted thresholds. Predicted porosity values ranged between 0.18 and 0.25, clustering at 0.22–0.24, while permeability extended up to ~5230 mD, with several zones exceeding 500 mD, highlighting strong flow potential. Water saturation ranged between 0.25 and 0.45, suggesting favourable hydrocarbon saturation. Uncertainty quantification revealed low prediction errors (0.0062 v/v for porosity, 0.0040 log(mD) for permeability, and 0.0106 for saturation), confirming robustness and reliability. DeepSeek-R1 cognitive evaluation identified potential bypassed pay zones and provided recommendations for enhanced recovery, including infill drilling and targeted waterflooding in high-permeability intervals. The integration of physics-based calculations, advanced machine learning, and cognitive computing demonstrates significant improvements in reservoir characterisation in geologically complex settings. This study delivers not only high predictive accuracy but also expert-level recommendations for exploration and field development. The proposed workflow contributes to digital intelligence in oil and gas exploration.

Journal Article


Benchmarking large language models against clinicians across hospital levels in cardiovascular decision-making: a cross-sectional vignette-based study

by Zhang, Zixi , Dai, Yongguo , Liu, Qiming in 631/114 , 692/308 , 692/700

2025

Large language models (LLMs) have shown strong performance on standardized medical examinations, yet their comparative clinical relevance against human clinicians remains limited. This study benchmarked the performance of DeepSeek-R1 and ChatGPT 4.0 against cardiovascular clinicians from different hospital levels in China. We conducted a cross-sectional, vignette-based assessment consisting of 100 standardized cardiovascular multiple-choice questions covering four competency domains: clinical reasoning (CR), frontier updates (FU), basic memory (BM), and emergency decision (ED). Thirty clinicians from six hospitals (three primary and three tertiary) were compared with two LLMs. Each question was executed five times per model, and run-to-run consistency was evaluated. Mean differences (LLM − clinician) with 95% confidence intervals (CIs) were estimated using nonparametric bootstrap resampling (10,000 iterations). Clinicians achieved a mean total score of 69.7 ± 7.9, whereas DeepSeek-R1 and ChatGPT 4.0 scored 97 and 95, respectively. The mean total score differences were +27.3 points (95% CI 24.4–30.1) for DeepSeek-R1 and +25.3 points (95% CI 22.4–28.1) for ChatGPT 4.0. Both models outperformed clinicians in CR, FU, BM, and ED. Run-to-run agreement was high (DeepSeek-R1 κ = 0.73; ChatGPT 4.0 κ = 0.76). LLMs substantially outperformed clinicians in knowledge- and decision-based tasks while approaching clinician-level performance in CR. These findings suggest that LLMs may complement clinical expertise and enhance diagnostic consistency across hospital levels.
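The nonparametric bootstrap mentioned in this abstract can be sketched as a percentile interval for the mean difference (LLM − clinician). This is a minimal sketch under stated assumptions: the abstract does not specify the resampling unit or the interval type, so the code assumes clinicians are resampled with replacement, the LLM score is treated as fixed, and a percentile interval is used. The function name and the example scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(clinician_scores, llm_score,
                           n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (llm_score - mean clinician score).

    Assumes clinicians are the resampling unit and the LLM score is fixed;
    both are illustrative choices, not details confirmed by the abstract.
    """
    rng = random.Random(seed)
    n = len(clinician_scores)
    diffs = []
    for _ in range(n_boot):
        # Draw n clinicians with replacement and recompute the difference.
        sample = [clinician_scores[rng.randrange(n)] for _ in range(n)]
        diffs.append(llm_score - mean(sample))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = llm_score - mean(clinician_scores)
    return point, lower, upper


# Hypothetical clinician totals, not the study's data.
scores = [62, 75, 70, 68, 81, 59, 73, 66, 77, 64]
print(bootstrap_mean_diff_ci(scores, llm_score=97, n_boot=2_000))
```

A fixed seed keeps the interval reproducible; with few clinicians the percentile interval can be narrow relative to the true uncertainty, which is one reason bias-corrected variants are sometimes preferred.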

Journal Article


Evaluation of DeepSeek-R1 and ChatGPT-4o on the Chinese national medical licensing examination: a multi-year comparative study

by Cao, Yan , Long, Ziwen , Tang, Hanfei in 631/114 , 639/705 , 692/308

2026

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and reasoning. However, their real-world applicability in high-stakes medical assessments remains underexplored, particularly in non-English contexts. This study aims to evaluate the performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination (NMLE), a comprehensive benchmark of medical knowledge and clinical reasoning. We evaluated the performance of ChatGPT-4o and DeepSeek-R1 on the Chinese National Medical Licensing Examination (2019–2021) using question-level binary accuracy (correct = 1, incorrect = 0) as the outcome. A generalized linear mixed model (GLMM) with a binomial distribution and logit link was used to examine fixed effects of model type, year, and subject unit, including their interactions, while accounting for random intercepts across questions. Post hoc pairwise comparisons were conducted to assess differences across model–year interactions. DeepSeek-R1 significantly outperformed ChatGPT-4o overall (β = −1.829, p < 0.001). Temporal analysis revealed a significant decline in ChatGPT-4o’s accuracy from 2019 to 2021 (p < 0.05), whereas DeepSeek-R1 appeared to maintain a more stable performance. Subject-wise, Unit 3 showed the highest accuracy (β = 0.344, p = 0.001) compared to Unit 1. A significant interaction in 2020 (β = −0.567, p = 0.009) indicated an amplified performance gap between the two models. These results highlight the importance of model selection and domain adaptation. Further investigation is needed to account for potential confounding factors, such as variations in question difficulty or language biases over time, which could also influence these trends. This longitudinal evaluation highlights the potential and limitations of LLMs in medical licensing contexts. While current models demonstrate promising results, further fine-tuning is necessary for clinical applicability. The NMLE offers a robust benchmark for future development of trustworthy AI-assisted medical decision support tools in non-English settings.

Journal Article


Exploring the use of large language models for classification, clinical interpretation, and treatment recommendation in breast tumor patient records

by Sun, Qian , Shao, Rongjun , Chen, Yuanlong in 639/705/117 , 692/4028/546 , Accuracy

2025

This study aims to investigate and compare the diagnostic performance, disease interpretation reliability, and treatment recommendation capabilities of multiple advanced large language models (GPT-4o, DeepSeek-R1, and DeepSeek-V3) in breast tumor cases. It retrospectively collected comprehensive clinical records of patients with breast tumors treated at Taizhou Cancer Hospital between January and April 2024. The study evaluated the accuracy of tumor classification (benign vs. malignant), the quality of disease interpretation, and the appropriateness of treatment recommendations generated by each model. To assess the clinical interpretability and utility of the models, a comprehensive performance analysis was conducted using statistical methods. A total of 45 patients with breast tumors were included, comprising 37 benign and 8 malignant cases (43 females, 2 males). GPT-4o achieved the highest area under the curve (AUC) for tumor classification (AUC = 0.848), outperforming DeepSeek-R1 (AUC = 0.736) and DeepSeek-V3 (AUC = 0.723). However, DeLong’s test indicated that the differences in AUCs among the models were not statistically significant (p > 0.05). In addition, subjective evaluations by doctors indicated that DeepSeek-R1 received the highest scores for disease interpretation (4.73 ± 0.46) and treatment recommendations (4.70 ± 0.51), with consistent ratings.

Journal Article


Large language models could be applied in personalized out-of-hospital management for breast cancer: a prospective randomized single blind study

by Chen, Zikang , Zhou, Yulu , Lv, Xudong in 692/308 , 692/699/67 , 692/700

2025

Personalized out-of-hospital management could significantly improve the quality of life of breast cancer patients. We aimed to evaluate the accuracy, effectiveness, safety, personalization and emotional care of Large Language Models (LLMs) in the out-of-hospital management of breast cancer. We established a data cleaning and classification pipeline to summarize three major scenarios of out-of-hospital management. Authentic electronic health record (EHR) datasets were generated from 10 patients, with identifying information masked, in the Breast Cancer Database of the Affiliated Sir Run Run Shaw Hospital, Zhejiang University. We then matched the EHR datasets with the three out-of-hospital management scenarios to create 100 virtual patients (VPs), for which GPT-o3 and DeepSeek-R1 (DS-R1) generated conversations. Further, four human specialists rated the LLM responses in five dimensions using a Likert scale. As of April 1, 2025, the four specialists had rated the conversations between the LLMs and the 100 VPs. The results demonstrate that both DS-R1 and GPT-o3 performed well, with scores primarily concentrated at 3 and 4 points. We found statistically significant differences between DS-R1 and GPT-o3 in accuracy, personalization, and emotional care (P < 0.01). However, the P-values for effectiveness and safety were 0.231 and 0.086, respectively. Furthermore, DS-R1 generated approximately 1.8 times as many tokens in the same time at lower economic cost, and it also had a shorter response time than GPT-o3. GPT-o3 and DS-R1 demonstrated personalized, empathetic, and accurate performance in the out-of-hospital management of breast cancer patients. DS-R1 had better overall performance than GPT-o3, especially in personalization, emotional care and accuracy. More research is warranted on developing LLMs with embedded domain-specific knowledge to reduce drawbacks such as hallucinated or verbose responses.

Journal Article


Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology

by Liu, Fangcen , Zhu, Lijing , Xin, Kai

2025

Introduction: Malignant tumors represent a significant public health threat, and the integration of artificial intelligence in health care is increasingly becoming a priority. Many oncology institutions are already considering the use of DeepSeek-R1 to assist doctors in making complex medical decisions. However, there remains a lack of sufficient evidence regarding the accuracy, consistency, and cost-efficiency of DeepSeek-R1 and its distilled models in oncology decision-making. This study aims to fill this gap by evaluating the performance and cost-effectiveness of DeepSeek-R1 and its distilled models in oncology, providing critical insights into their potential for clinical integration. Objectives: This study aimed to systematically evaluate the performance, consistency, and cost-efficiency of the open-source large language model (LLM) DeepSeek-R1 and its distilled variants in the context of oncology decision-making, using a benchmark derived from the MedQA dataset. Methods: A custom oncology question set containing 1,206 multiple choice questions was curated from MedQA. Seven models, including DeepSeek-R1 and six distilled versions, were evaluated using an automated testing framework. Accuracy, consistency, latency, and token consumption were compared across models. Statistical tests, including McNemar and Wilcoxon signed-rank, were used to assess differences in performance. Questions were also categorized into clinical task types (diagnosis, treatment, triage, and follow-up) for subgroup analysis. Results: DeepSeek-R1 achieved the highest performance (accuracy: 91.38%; consistency: 90.47%), whereas DeepSeek-R1-Distill-Qwen-32B was the only distilled model to exceed both metrics at the 0.8 threshold (accuracy: 88.72%; consistency: 81.44%). DeepSeek-R1 demonstrated significantly higher accuracy than its distilled counterpart (p<0.05), particularly in diagnosis- and treatment-related tasks (p<0.05). 
However, it also exhibited significantly greater latency and token consumption. A Cohen's kappa value of 0.575 indicated moderate agreement between the two models. Conclusion: DeepSeek-R1 is more suitable for high-stakes oncology tasks requiring high accuracy and consistency, whereas DeepSeek-R1-Distill-Qwen-32B offers a cost-effective alternative for use in outpatient or resource-limited settings. These findings support a task- and resource-adaptive deployment strategy for LLMs in clinical oncology.
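The agreement statistic reported in this abstract, Cohen's kappa, corrects the raw agreement rate between two sets of answers for the agreement expected by chance. The following is a minimal sketch of that calculation; the function name and the example labels are hypothetical, and the abstract does not state exactly which paired outcomes the authors fed into the statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical per-question outcomes (1 = correct, 0 = incorrect).
model_a = [1, 1, 0, 0]
model_b = [1, 0, 0, 0]
print(cohens_kappa(model_a, model_b))
```

Because kappa discounts chance agreement, two models that are both correct on most questions can still show only moderate kappa when their errors fall on different questions.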

Journal Article


Comparative assessment of quality, consistency, and reference accuracy of MIH-related clinical information generated by ChatGPT-4o and DeepSeek R1

by Sarıoğlu, Derya , Uçar Gündoğar, Zübeyde in Artificial intelligence , Chatbots , ChatGPT-4o

2026

Background: Molar incisor hypomineralization (MIH) is a clinically challenging developmental enamel defect that requires accurate diagnosis and nuanced management decisions. Although large language models (LLMs) are increasingly used as sources of clinical information in dentistry, the quality, consistency, and reference accuracy of their MIH-related explanations remain largely unexplored. To date, no study has systematically compared the clinical quality or citation accuracy of LLM-generated MIH information. This study comparatively evaluated how two widely used LLMs generate clinical information across repeated sessions. Methods: Twenty open-ended MIH questions were developed in accordance with current clinical guidelines and organized into four categories: diagnosis, etiology, treatment, and differential diagnosis. ChatGPT-4o and DeepSeek R1 were each prompted with all questions during three independent sessions (morning, afternoon, evening) on the same day, generating a total of 120 responses. All responses were anonymized and evaluated by 24 calibrated pediatric dentists using the five-point Global Quality Scale (GQS). References provided by the models were independently verified by two reviewers and categorized as real or fabricated. Statistical analyses included Shapiro–Wilk tests, paired t-tests, Wilcoxon signed-rank tests, repeated-measures ANOVA, Friedman tests, Holm-corrected post-hoc comparisons, and ICC(2,1), with significance set at p < 0.05. Results: Across all time points and all four MIH-related categories, DeepSeek R1 consistently achieved significantly higher GQS scores than ChatGPT-4o (all adjusted p < 0.005). Mean score differences ranged from +0.36 to +0.71, with the largest gap observed for etiology questions in the evening session. When overall scores were examined, DeepSeek R1 (4.44 ± 0.54) again outperformed ChatGPT-4o (3.99 ± 0.59) (t(23) = 11.83, p < 0.0001). Both models showed statistically significant but clinically small session-related variations, with acceptable reliability indicated by ICC values (0.72 for ChatGPT-4o; 0.77 for DeepSeek R1). Reference verification revealed notable fabrication rates in both models: ChatGPT-4o provided 46.5% real and 53.5% fake references, while DeepSeek R1 provided 34.2% real and 65.8% fake references. Conclusions: DeepSeek R1 and ChatGPT-4o each demonstrated distinct strengths in generating MIH-related clinical explanations, with DeepSeek providing more detailed and context-focused responses and ChatGPT-4o producing clearer, more structured overviews. Although the score differences were modest, they reflect meaningful variations when applied to a condition as diagnostically complex as MIH. Both models showed acceptable temporal stability; however, their substantial rates of fabricated references underscore the need for careful expert oversight. Overall, while LLMs may support early learning, patient communication, or preliminary clinical orientation, neither model currently meets the accuracy or citation standards required for autonomous clinical use in pediatric dentistry.

Journal Article

