Catalogue Search | MBRL

Performance of ChatGPT and Microsoft Copilot in Bing in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports

by Du, Yanran , Zhou, JianQiao , Xu, Jiale in 692/700 , 692/700/1421/1860 , ChatGPT

2025

The aim was to evaluate and compare the performance of publicly available ChatGPT-3.5, ChatGPT-4.0 and Microsoft Copilot in Bing (Copilot) in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports. Twenty questions related to obstetric ultrasound were answered and 110 obstetric ultrasound reports were analyzed by ChatGPT-3.5, ChatGPT-4.0 and Copilot, with each question and report posed to each model three times at different times. The accuracy and consistency of each response to the twenty questions and of each report analysis were evaluated and compared. In answering the twenty questions, both ChatGPT-3.5 and ChatGPT-4.0 outperformed Copilot in accuracy (95.0% vs. 80.0%) and consistency (90.0% and 85.0% vs. 75.0%), although the differences were not statistically significant. When analyzing obstetric ultrasound reports, ChatGPT-3.5 and ChatGPT-4.0 demonstrated superior accuracy compared to Copilot (P < 0.05), and all three showed high consistency and the ability to provide recommendations. Overall accuracy was 83.86% for ChatGPT-3.5, 84.13% for ChatGPT-4.0 and 77.51% for Copilot; overall consistency was 87.30%, 93.65% and 90.48%, respectively. These large language models (ChatGPT-3.5, ChatGPT-4.0 and Copilot) have the potential to assist clinical workflows by enhancing patient education and patient clinical communication around common obstetric ultrasound issues. With inconsistent and sometimes inaccurate responses, along with cybersecurity concerns, physician supervision is crucial in the use of these models.

Journal Article

Large Language Models in Cellulose Biopolymer Studies: Evaluating ChatGPT and Microsoft Copilot for Information and Reference Accuracy

by Rahman, Tanmay , Ahmed, Shoeb , Kasera, Nitesh Kumar in Accuracy , Biopolymers , Cellulose

2026

With the increasing reliance on large language models (LLMs) for scientific research, it is critical to assess their reliability in specialized fields such as biopolymer science, particularly with respect to the verifiability of the references. This study examines the performance of two widely used LLMs, ChatGPT (GPT‐4 omni) and Microsoft Copilot (GPT‐4), in responding to questions and citing the references for the answers on cellulose biopolymers. The questions are set at three cognitive levels (beginner, intermediate, and expert), and the accuracy of the responses and references provided by the models is assessed. Results show that ChatGPT outperforms Copilot at all cognitive levels, particularly in addressing mathematical problems. ChatGPT achieves 91.9%, 91.9%, and 88.6% accuracy at the beginner, intermediate, and expert levels, respectively, whereas Copilot achieves 82.4%, 82.5%, and 71.1% accuracy. However, the analysis of references reveals critical shortcomings in both models. While Copilot tends to cite more recent journal articles, ChatGPT often relies on older ones. In many cases, references are incomplete, fabricated, or lack proper context, highlighting the persistent challenge of verifying AI‐generated citations. Overall, Copilot outperforms ChatGPT in providing correct references: ChatGPT provides 6%, 36%, and 45% fabricated references at the beginner, intermediate, and expert levels, respectively, whereas Copilot delivers 13%, 9%, and 7% fabricated references at these levels. This work emphasizes that while LLMs hold promise in supporting scientific inquiry in biopolymers, their current limitations in response accuracy and citation reliability need to be addressed before they can serve as dependable tools for scholarly work.

Journal Article

Comparative analysis of AI chatbot (ChatGPT-4.0 and Microsoft Copilot) and expert responses to common orthodontic questions: patient and orthodontist evaluations

by Salmanpour, Farhad , Camcı, Hasan , Geniş, Ömer in Accuracy , Adult , Artificial Intelligence

2025

Objective: The aim of this study was to evaluate the adequacy of responses provided by experts and artificial intelligence-based chatbots (ChatGPT-4.0 and Microsoft Copilot) to frequently asked orthodontic questions, utilizing scores assigned by patients and orthodontists. Methods: Fifteen questions were randomly selected from the FAQ section of the American Association of Orthodontists (AAO) website, addressing common concerns related to orthodontic treatments, patient care, and post-treatment guidelines. Expert responses, along with those from ChatGPT-4.0 and Microsoft Copilot, were presented in a survey format via Google Forms. Fifty-two orthodontists and 102 patients rated the three responses for each question on a scale from 1 (least adequate) to 10 (most adequate). The findings were analyzed comparatively within and between groups. Results: Expert responses consistently received the highest scores from both patients and orthodontists, particularly in critical areas such as Questions 1, 2, 4, 9, and 11, where they significantly outperformed chatbots (P < 0.05). Patients generally rated expert responses higher than those of chatbots, underscoring the reliability of clinical expertise. However, ChatGPT-4.0 showed competitive performance in some questions, achieving its highest score in Question 14 (8.16 ± 1.24), but scored significantly lower than experts in several key areas (P < 0.05). Microsoft Copilot generally received the lowest scores, although it demonstrated statistically comparable performance to other groups in certain questions, such as Questions 3 and 12 (P > 0.05). Conclusions: Overall, the scores for ChatGPT-4.0 and Microsoft Copilot were deemed acceptable (6.0 and above). However, both patients and orthodontists generally rated the expert responses as more adequate. This suggests that current chatbots do not yet match the theoretical adequacy of expert opinions.

Journal Article

Can artificial intelligence pass the test? Evaluating chatbot scores on pediatric gastroenterology board‐style questions

by Engelhard, Matthew M. , Patel, Reshma , Greenberg, Rachel G. in Artificial intelligence , Chatbots , ChatGPT‐4o

2026

Objectives: The American Academy of Pediatrics (AAP) Pediatrics Review and Education Program (PREP)® Gastroenterology (GI) Self‐Assessments help pediatric gastroenterologists and trainees prepare for subspecialty board exams by providing peer‐reviewed questions and critiques based on American Board of Pediatrics content specifications. These assessments test knowledge of material aligned with the pediatric gastroenterology board exams. While artificial intelligence (AI) chatbots have passed various medical board exams, their ability to pass the pediatric GI boards remains untested. This study assesses the performance of Microsoft Copilot and OpenAI ChatGPT‐3.5 and 4o on the 2022–2024 AAP PREP® GI Self‐Assessments. Methods: A total of 216 AAP PREP® GI Self‐Assessment questions from 2022 to 2024 were entered into three AI chatbots (Microsoft Copilot, OpenAI ChatGPT‐3.5, and ChatGPT‐4o). Scores were compared with the passing score (> 65%) and first‐time test takers' scores from the AAP for 2022–2024. Results: OpenAI ChatGPT‐4o and Microsoft Copilot scored above 65% (pass) on all three PREP® GI Self‐Assessments from 2022 to 2024. OpenAI ChatGPT‐3.5 passed the 2023 and 2024 assessments but did not pass the 2022 assessment. The chatbots collectively scored best in anatomy, motility, and mouth and esophageal disorders, and scored poorly in physiology, pharmacology, liver, stomach and duodenum disorders. Conclusions: OpenAI ChatGPT‐4o and Microsoft Copilot consistently passed the PREP® GI Self‐Assessments from 2022 to 2024, showing potential for good performance on the pediatric GI boards. OpenAI ChatGPT‐3.5 had limitations, passing only the 2023 and 2024 assessments. Overall, advanced AI chatbots show potential to pass the Pediatric GI board exam. What is Known: Artificial intelligence (AI) chatbots have passed various medical board exams. Their ability to pass the pediatric gastroenterology boards or board‐style questions has not yet been evaluated. What is New: OpenAI ChatGPT‐4o and Microsoft Copilot passed the 2022–2024 Pediatrics Review and Education Program (PREP)® Gastroenterology Self‐Assessments. OpenAI ChatGPT‐3.5 showed limitations, passing only the 2023 and 2024 assessments. These AI chatbots showed potential to pass the pediatric gastroenterology board exam.

Journal Article

Assessing The Performance of Artificial Intelligence Models In Autism Spectrum Disorder: Accuracy and Readability of ChatGPT, Gemini, and Microsoft Copilot

by Calişkan, Yasin , Haşimoğlu, Abas in Accuracy , Artificial intelligence , Autism

2026

Objective: Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder affecting social communication and involving restricted, repetitive behaviors. With AI tools like ChatGPT-4, Gemini, and Microsoft Copilot becoming increasingly popular information sources for healthcare professionals and families, this study aimed to evaluate and compare their accuracy and readability when responding to ASD-related questions. Methods: In this cross-sectional study, we presented 88 questions (45 Frequently Asked Questions [FAQs] and 43 guideline-based) to the three AI models. We sourced questions from social media, parent forums, and clinical guidelines. Two blinded child psychiatrists evaluated response accuracy using a four-grade scale, while readability was assessed using four established indices: Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, and Flesch Reading Ease. Results: For FAQs, accuracy rates showed significant differences (p=0.001): Gemini (100%), ChatGPT-4 (95.6%), and Microsoft Copilot (71.1%). For guideline-based questions, accuracy also varied significantly (p=0.010): Gemini (86.0%), ChatGPT-4 (83.7%), and Microsoft Copilot (55.8%). Interestingly, Microsoft Copilot provided the most readable FAQ responses, while Gemini offered the most balanced readability for guideline-based questions. Conclusion: Our findings show that Gemini and ChatGPT-4 are highly accurate for ASD information, particularly for complex scientific content, while Microsoft Copilot produced more accessible text despite lower accuracy. These results suggest different models may better serve different audiences: healthcare professionals might benefit from the precision of Gemini or ChatGPT-4, while general users might prefer Copilot's readability, highlighting opportunities for improving both reliability and accessibility in healthcare communication.

Journal Article

AI in Home Care-Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study

by Matarredona, Valerie , Tella, Susanna , Pérez-Esteve, Clara in Adults , Aged , Aging

2025

The aging population presents an accomplishment for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in-home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. However, the advent of digital resources has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored. We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were also compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity, and analyzed differences between LLMs across all evaluated domains. 
The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60% (6/10) of the cases, and GPT-3.5 doing so in 80% (8/10). When compared to the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although not yet surpassing professional instruction quality, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.

Journal Article

Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence

by Makrygiannakis, Miltiadis A. , Kaklamanos, Eleftherios G. , Arhakis, Aristidis in Accuracy , Algorithms , Artificial Intelligence

2025

Purpose: The use of large language models (LLMs) in generative artificial intelligence (AI) is rapidly increasing in dentistry. However, their reliability is yet to be fully established. This study aims to evaluate the diagnostic accuracy, clinical applicability, and patient education potential of LLMs in paediatric dentistry, by evaluating the responses of six LLMs: Google AI’s Gemini and Gemini Advanced, OpenAI’s ChatGPT-3.5, -4o and -4, and Microsoft’s Copilot. Methods: Ten open-type clinical questions relevant to paediatric dentistry were posed to the LLMs. The responses were graded by two independent evaluators from 0 to 10 using a detailed rubric. After 4 weeks, answers were reevaluated to assess intra-evaluator reliability. Statistical comparisons used Friedman’s, Wilcoxon’s and Kruskal–Wallis tests to identify the model that provided the most comprehensive, accurate, explicit and relevant answers. Results: Variations in results were noted. ChatGPT-4 answers were scored as the best (average score 8.08), followed by the answers of Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32) and Copilot (5.41). Statistical analysis revealed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots managed to achieve a score level above 6.5 on all queries. Conclusion: This study demonstrates the potential use of large language models (LLMs) in supporting evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should critically use AI models as supportive tools and not as a substitute for overall scientific knowledge and critical thinking.

Journal Article

Predictions from Generative Artificial Intelligence Models: Towards a New Benchmark in Forecasting Practice

by Hassani, Hossein , Silva, Emmanuel Sirimal in Accuracy , Analysis , Artificial intelligence

2024

This paper aims to determine whether there is a case for promoting a new benchmark for forecasting practice via the innovative application of generative artificial intelligence (Gen-AI) for predicting the future. Today, forecasts can be generated via Gen-AI models without the need for an in-depth understanding of forecasting theory, practice, or coding. Therefore, using three datasets, we present a comparative analysis of forecasts from Gen-AI models against forecasts from seven univariate and automated models from the forecast package in R, covering both parametric and non-parametric forecasting techniques. In some cases, we find statistically significant evidence to conclude that forecasts from Gen-AI models can outperform forecasts from popular benchmarks like seasonal ARIMA, seasonal naïve, exponential smoothing, and Theta forecasts (to name a few). Our findings also indicate that the accuracy of forecasts from Gen-AI models can vary not only based on the underlying data structure but also on the quality of prompt engineering (thus highlighting the continued importance of forecasting education), with the forecast accuracy appearing to improve at longer horizons. Therefore, we find some evidence towards promoting forecasts from Gen-AI models as benchmarks in future forecasting practice. However, at present, users are cautioned against reliability issues and Gen-AI being a black box in some cases.

Journal Article

User-centric AI: evaluating the usability of generative AI applications through user reviews on app stores

by Alabduljabbar, Reham in Artificial intelligence , Computational linguistics , Computer software industry

2024

This article presents a usability evaluation and comparison of generative AI applications through the analysis of user reviews from popular digital marketplaces, specifically Apple’s App Store and Google Play. The study aims to bridge the research gap in real-world usability assessments of generative AI tools. A total of 11,549 reviews were extracted and analyzed from January to March 2024 for five generative AI apps: ChatGPT, Bing AI, Microsoft Copilot, Gemini AI, and Da Vinci AI. The dataset has been made publicly available, allowing for further analysis by other researchers. The evaluation follows ISO 9241 usability standards, focusing on effectiveness, efficiency, and user satisfaction. This study is believed to be the first usability evaluation for generative AI applications using user reviews across digital marketplaces. The results show that ChatGPT achieved the highest compound usability scores among Android and iOS users, with scores of 0.504 and 0.462, respectively. Conversely, Gemini AI scored the lowest among Android apps at 0.016, and Da Vinci AI had the lowest among iOS apps at 0.275. Satisfaction scores were critical in usability assessments, with ChatGPT obtaining the highest rates of 0.590 for Android and 0.565 for iOS, while Gemini AI had the lowest satisfaction rate at −0.138 for Android users. The findings revealed usability issues related to ease of use, functionality, and reliability in generative AI tools, providing valuable insights from user opinions and feedback. Based on the analysis, actionable recommendations were proposed to enhance the usability of generative AI tools, aiming to address identified usability issues and improve the overall user experience. This study contributes to a deeper understanding of user experiences and offers valuable guidance for enhancing the usability of generative AI applications.

Journal Article

STUDYING IT EDUCATORS’ SATISFACTION WITH USING MICROSOFT COPILOT CHAT TO PERFORM PROFESSIONAL TASKS

by Spirin, Oleh , Osadcha, Kateryna , Osadchyi, Viacheslav in Effectiveness , Feasibility studies , Generative artificial intelligence

2025

This study aims to examine IT educators’ opinions on using Microsoft Copilot Chat for their professional tasks. The significance of this research lies in the increasing influence of generative AI technologies on learning and the necessity to evaluate their feasibility. The study employs an expert survey method based on a rating scale, with 18 participating experts. The results indicate varying levels of satisfaction among experts with Microsoft Copilot Chat responses depending on the type of task. The highest-rated tasks were trivia on a certain topic (4.67), unit test generation (4.50), code optimisation (4.44), creating the content for slides on a certain topic (4.44), and creating a comparative table between different items (4.27). The tasks with the lowest ratings were creating a logo for the conference (3.22), grading essays based on rubrics (3.17), identifying a logical fallacy in a particular article (3.00), converting the text in an image to a format that can be copied and pasted (2.88), and creating a mind map to illustrate concepts (2.70). Therefore, using Microsoft Copilot Chat for these low-rated tasks is not currently recommended. We used the SPSS Statistics suite to calculate Cronbach’s Alpha and Cronbach’s Alpha Based on Standardised Items. Based on the analysis of the experts’ responses, ratings were collected for each professional task for which a prompt was provided. The study’s practical significance lies in demonstrating to educators the capabilities of Microsoft Copilot Chat in performing their routine professional tasks. 
It has been particularly effective in several areas, including: administrative tasks (writing speeches, planning routes), assessment (developing tests, tasks for formative and summative assessment), communication (preparing information materials), lesson planning (generating ideas, creating graphic materials), programming assistance (explaining and optimising code), scientific activities (creating bibliographies, analysing articles), and others (e.g. playing intellectual games on the relevant topic). Future research opportunities are proposed, including the development of advanced training programs for IT educators on integrating AI into their professional practices and an examination of the effectiveness of these programs.

Journal Article
