Catalogue Search | MBRL
31 result(s) for "Google Bard"
From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing
by Dergaa, Ismail; Ben Saad, Helmi; Zmijewski, Piotr
in Artificial intelligence, Authenticity, chatbot
2023
Natural language processing (NLP) has been studied in computing for decades. Recent technological advancements have led to the development of sophisticated artificial intelligence (AI) models, such as Chat Generative Pre-trained Transformer (ChatGPT). These models can perform a range of language tasks and generate human-like responses, which offers exciting prospects for academic efficiency. This manuscript aims to (i) explore the potential benefits and threats of ChatGPT and other NLP technologies in academic writing and research publications; (ii) highlight the ethical considerations involved in using these tools; and (iii) consider the impact they may have on the authenticity and credibility of academic work. The study involved a literature review of relevant scholarly articles published in peer-reviewed journals indexed in Scopus as quartile 1. The search used keywords such as "ChatGPT," "AI-generated text," "academic writing," and "natural language processing." The analysis was carried out using a quasi-qualitative approach, which involved reading and critically evaluating the sources and identifying relevant data to support the research questions. The study found that ChatGPT and other NLP technologies have the potential to enhance academic writing and research efficiency. However, their use also raises concerns about the authenticity and credibility of academic work. The study highlights the need for comprehensive discussion of the potential uses, threats, and limitations of these tools, and recommends that academics exercise caution and transparency when using them, keeping ethical and academic principles, human intelligence, and critical thinking at the forefront of the research process.
Journal Article
Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study
by Floudas, Charalampos S; Iannantuono, Giovanni Maria; Choo-Wosoba, Hyoyoung
in Cancer, Chatbots, Cross-sectional studies
2024
Background
The capability of large language models (LLMs) to understand and generate human-readable text has prompted the investigation of their potential as educational and management tools for patients with cancer and healthcare providers.
Materials and Methods
We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to 4 domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each section). Questions were manually submitted to LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently.
Results
ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (P < .0001). The number of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (P < .0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .04). Regarding readability, the proportion of highly readable answers was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (P = .02).
Conclusion
ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all 3 LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
This cross-sectional study assessed the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to immuno-oncology.
Journal Article
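The study above had two reviewers evaluate each LLM answer independently. As a minimal, hypothetical sketch (Cohen's kappa is a common way to quantify inter-rater agreement in such two-reviewer designs, though the abstract does not say the authors used it), with invented placeholder ratings rather than the study's data:

```python
# Hypothetical sketch: quantifying agreement between two independent
# reviewers with Cohen's kappa. Ratings are invented placeholders
# (e.g., 2 = fully correct, 1 = partially correct, 0 = incorrect).
from sklearn.metrics import cohen_kappa_score

reviewer_a = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]
reviewer_b = [2, 1, 1, 0, 2, 1, 2, 1, 1, 2]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa between reviewers: {kappa:.2f}")
```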
Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison
by Giannuzzi, Federico; Baldascino, Antonio; Rizzo, Stanislao
in Aged, Breakthroughs in artificial intelligence for ophthalmology, Chatbots
2024
Purpose
The aim of this study was to define the capability of ChatGPT-4 and Google Gemini in analyzing detailed glaucoma case descriptions and suggesting an accurate surgical plan.
Methods
A retrospective analysis of 60 medical records of glaucoma surgery was performed, with cases divided into “ordinary” (n = 40) and “challenging” (n = 20) scenarios. Case descriptions were entered into ChatGPT and Bard’s interfaces with the question “What kind of surgery would you perform?” and repeated three times to analyze the answers’ consistency. After collecting the answers, we assessed the level of agreement with the unified opinion of three glaucoma surgeons. Moreover, we graded the quality of the responses with scores from 1 (poor quality) to 5 (excellent quality) according to the Global Quality Score (GQS), and compared the results.
Results
ChatGPT’s surgical choice was consistent with that of the glaucoma specialists in 35/60 cases (58%), compared to 19/60 (32%) for Gemini (p = 0.0001). Gemini was not able to complete the task in 16 cases (27%). Trabeculectomy was the most frequent choice for both chatbots (53% and 50% for ChatGPT and Gemini, respectively). In “challenging” cases, ChatGPT agreed with specialists in 9/20 choices (45%), outperforming Google Gemini (4/20, 20%). Overall, GQS scores were 3.5 ± 1.2 and 2.1 ± 1.5 for ChatGPT and Gemini (p = 0.002). This difference was even more marked when focusing only on “challenging” cases (1.5 ± 1.4 vs. 3.0 ± 1.5, p = 0.001).
Conclusion
ChatGPT-4 showed good analytical performance for glaucoma surgical cases, whether ordinary or challenging. In contrast, Google Gemini showed strong limitations in this setting, with high rates of imprecise or missing answers.
Journal Article
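The GQS results above are reported as mean ± standard deviation over 1-5 ratings. A minimal sketch of producing that summary, using invented placeholder scores rather than the study's data:

```python
# Summarizing 1-5 Global Quality Score (GQS) ratings as mean ± SD.
# Scores below are invented placeholders, not data from the study.
from statistics import mean, stdev

chatgpt_gqs = [4, 3, 5, 2, 4, 3, 4, 5, 3, 2]
gemini_gqs = [2, 1, 3, 1, 2, 4, 2, 1, 3, 2]

for name, scores in [("ChatGPT", chatgpt_gqs), ("Gemini", gemini_gqs)]:
    print(f"{name}: GQS = {mean(scores):.1f} ± {stdev(scores):.1f}")
```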
Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders
by Martin, Nicholas B.; Koga, Shunsuke; Dickson, Dennis W.
in Alzheimer's disease, Amyotrophic lateral sclerosis, Artificial Intelligence
2024
This study explores the utility of large language models (LLMs), specifically ChatGPT and Google Bard, in predicting neuropathologic diagnoses from clinical summaries. A total of 25 cases of neurodegenerative disorders presented at Mayo Clinic brain bank Clinico-Pathological Conferences were analyzed. The LLMs provided multiple pathologic diagnoses and their rationales, which were compared with the final clinical diagnoses made by physicians. ChatGPT-3.5, ChatGPT-4, and Google Bard made correct primary diagnoses in 32%, 52%, and 40% of cases, respectively, while the correct diagnosis was included in their differentials in 76%, 84%, and 76% of cases, respectively. These findings highlight the potential of artificial intelligence tools like ChatGPT in neuropathology, suggesting they may facilitate more comprehensive discussions in clinicopathological conferences.
Journal Article
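The study above reports two distinct metrics: whether the model's primary diagnosis was correct, and whether the correct diagnosis appeared anywhere in the model's differential. A minimal sketch of computing both, with invented placeholder cases rather than the study's data:

```python
# Invented placeholder cases: (final diagnosis, model's ranked differential).
cases = [
    ("Alzheimer's disease", ["Alzheimer's disease", "Lewy body dementia"]),
    ("Progressive supranuclear palsy",
     ["Corticobasal degeneration", "Progressive supranuclear palsy"]),
    ("Amyotrophic lateral sclerosis",
     ["Multiple sclerosis", "Myasthenia gravis"]),
]

# Primary diagnosis correct: truth matches the first-ranked suggestion.
primary = sum(truth == differential[0] for truth, differential in cases)
# Correct diagnosis included: truth appears anywhere in the differential.
included = sum(truth in differential for truth, differential in cases)

print(f"Primary diagnosis correct: {primary / len(cases):.0%}")
print(f"Correct diagnosis included: {included / len(cases):.0%}")
```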
Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study
by Kaklamanos, Eleftherios G; Aaqel Salim, Anas; Stamatopoulos, Vassilis
in Accuracy, Answers, Artificial Intelligence
2023
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy.
This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry.
The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs' answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers.
Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate.
This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.
Journal Article
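The dentistry study above compares 0-10 scores across four models with the Friedman test and pairwise Wilcoxon tests. A minimal sketch of that comparison using SciPy, with invented placeholder scores rather than the study's data:

```python
# Friedman test across four models, then a pairwise Wilcoxon signed-rank
# test, mirroring the analysis named in the abstract. All scores are
# invented placeholders (per-question 0-10 grades, evaluators averaged).
from scipy.stats import friedmanchisquare, wilcoxon

chatgpt4 = [9, 8, 10, 7, 9, 8, 9, 10, 8, 9]
chatgpt35 = [8, 7, 9, 6, 8, 7, 8, 9, 7, 8]
bing_chat = [7, 8, 8, 6, 7, 7, 8, 8, 7, 7]
bard = [7, 7, 8, 5, 8, 6, 7, 8, 6, 7]

stat, p = friedmanchisquare(chatgpt4, chatgpt35, bing_chat, bard)
print(f"Friedman: statistic={stat:.2f}, p={p:.4f}")

stat, p = wilcoxon(chatgpt4, chatgpt35)
print(f"Wilcoxon ChatGPT-4 vs ChatGPT-3.5: statistic={stat:.2f}, p={p:.4f}")
```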
Artificial intelligence chatbots as sources of patient education material for obstructive sleep apnoea: ChatGPT versus Google Bard
by Randhawa, Premjit; Cheong, Ryan Chin Taw; Joseph, Jonathan
in Artificial Intelligence, Head and Neck Surgery, Humans
2024
Purpose
To perform the first head-to-head comparative evaluation of patient education material for obstructive sleep apnoea generated by two artificial intelligence chatbots, ChatGPT and its primary rival Google Bard.
Methods
Fifty frequently asked questions on obstructive sleep apnoea in English were extracted from the patient information webpages of four major sleep organizations and categorized as input prompts. ChatGPT and Google Bard responses were selected and independently rated using the Patient Education Materials Assessment Tool–Printable (PEMAT-P) Auto-Scoring Form by two otolaryngologists, with a Fellowship of the Royal College of Surgeons (FRCS) and a special interest in sleep medicine and surgery. Responses were subjectively screened for any incorrect or dangerous information as a secondary outcome. The Flesch-Kincaid Calculator was used to evaluate the readability of responses for both ChatGPT and Google Bard.
Results
A total of 46 questions were curated and categorized into three domains: condition (n = 14), investigation (n = 9) and treatment (n = 23). Understandability scores for ChatGPT versus Google Bard on the various domains were as follows: condition 90.86% vs. 76.32% (p < 0.001); investigation 89.94% vs. 71.67% (p < 0.001); treatment 90.78% vs. 73.74% (p < 0.001). Actionability scores for ChatGPT versus Google Bard on the various domains were as follows: condition 77.14% vs. 51.43% (p < 0.001); investigation 72.22% vs. 54.44% (p = 0.05); treatment 73.04% vs. 54.78% (p = 0.002). The mean Flesch-Kincaid Grade Level for ChatGPT was 9.0 and for Google Bard was 5.9. No incorrect or dangerous information was identified in any of the generated responses from either ChatGPT or Google Bard.
Conclusion
Evaluation of ChatGPT and Google Bard patient education material for obstructive sleep apnoea indicates that the former offers superior information across several domains.
Journal Article
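The readability comparison above uses the Flesch-Kincaid Grade Level, computed from average sentence length and average syllables per word as 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. A minimal sketch of the formula; the syllable counter here is a crude vowel-group heuristic for illustration, whereas published calculators syllabify more carefully:

```python
# Approximate Flesch-Kincaid Grade Level. The syllable count is a rough
# vowel-group heuristic, good enough to illustrate the formula.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

sample = ("Obstructive sleep apnoea is a condition in which breathing "
          "repeatedly stops during sleep.")
print(f"Approximate grade level: {flesch_kincaid_grade(sample):.1f}")
```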
Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard
by Pang, Kenny Peter; Randhawa, Premjit; Cheong, Ryan Chin Taw
in Head and Neck Surgery, Medicine, Medicine & Public Health
2024
Purpose
To conduct a comparative performance evaluation of GPT-3.5, GPT-4 and Google Bard in self-assessment questions at the level of the American Sleep Medicine Certification Board Exam.
Methods
A total of 301 text-based single-best-answer multiple choice questions with four answer options each, across 10 categories, were included in the study and transcribed as inputs for GPT-3.5, GPT-4 and Google Bard. The first output responses generated were selected and matched for answer accuracy against the gold-standard answer provided by the American Academy of Sleep Medicine for each question. A global score of 80% and above is required by human sleep medicine specialists to pass each exam category.
Results
GPT-4 successfully achieved the pass mark of 80% or above in five of the 10 exam categories, including the Normal Sleep and Variants Self-Assessment Exam (2021), Circadian Rhythm Sleep–Wake Disorders Self-Assessment Exam (2021), Insomnia Self-Assessment Exam (2022), Parasomnias Self-Assessment Exam (2022) and the Sleep-Related Movements Self-Assessment Exam (2023). GPT-4 demonstrated superior performance in all exam categories and achieved a higher overall score of 68.1% when compared against both GPT-3.5 (46.8%) and Google Bard (45.5%), which was statistically significant (p < 0.001). There was no significant difference in the overall score performance between GPT-3.5 and Google Bard.
Conclusions
Otolaryngologists and sleep medicine physicians have a crucial role to play, through agile and robust research, in ensuring that next-generation AI chatbots are built safely and responsibly.
Journal Article
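The exam study above scores each model's first answer against the gold-standard key and applies an 80% pass mark per category. A minimal sketch of that scoring, with invented placeholder answers rather than the study's data:

```python
# Per-category scoring against a gold-standard key with an 80% pass mark,
# as described in the abstract. All answers below are invented placeholders.
from collections import defaultdict

# (exam category, model's first answer, gold-standard answer)
responses = [
    ("Insomnia", "B", "B"),
    ("Insomnia", "C", "C"),
    ("Insomnia", "A", "D"),
    ("Parasomnias", "D", "D"),
    ("Parasomnias", "A", "A"),
]

tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
for category, given, gold in responses:
    tally[category][0] += given == gold
    tally[category][1] += 1

for category, (correct, total) in tally.items():
    score = correct / total
    print(f"{category}: {score:.0%} {'PASS' if score >= 0.80 else 'FAIL'}")
```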
Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard
by Papazafeiropoulos, George; Plevris, Vagelis; Jiménez Rios, Alejandro
in Accuracy, Algorithms, Artificial intelligence
2023
In an age where artificial intelligence is reshaping the landscape of education and problem solving, our study unveils the secrets behind three digital wizards, ChatGPT-3.5, ChatGPT-4, and Google Bard, as they engage in a thrilling showdown of mathematical and logical prowess. We assess the ability of the chatbots to understand the given problem, employ appropriate algorithms or methods to solve it, and generate coherent responses with correct answers. We conducted our study using a set of 30 questions. These questions were carefully crafted to be clear, unambiguous, and fully described using plain text only. Each question has a unique and well-defined correct answer. The questions were divided into two sets of 15: Set A consists of “Original” problems that cannot be found online, while Set B includes “Published” problems that are readily available online, often with their solutions. Each question was presented to each chatbot three times in May 2023. We recorded and analyzed their responses, highlighting their strengths and weaknesses. Our findings indicate that chatbots can provide accurate solutions for straightforward arithmetic, algebraic expressions, and basic logic puzzles, although they may not be consistently accurate in every attempt. However, for more complex mathematical problems or advanced logic tasks, the chatbots’ answers, although they appear convincing, may not be reliable. Furthermore, consistency is a concern as chatbots often provide conflicting answers when presented with the same question multiple times. To evaluate and compare the performance of the three chatbots, we conducted a quantitative analysis by scoring their final answers based on correctness. Our results show that ChatGPT-4 performs better than ChatGPT-3.5 in both sets of questions. Bard ranks third in the original questions of Set A, trailing behind the other two chatbots. However, Bard achieves the best performance, taking first place in the published questions of Set B. This is likely due to Bard’s direct access to the internet, unlike the ChatGPT chatbots, which, due to their designs, do not have external communication capabilities.
Journal Article
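Because each question in the study above was posed three times, the evaluation involves both correctness per attempt and consistency across attempts. A minimal sketch of both measures, with invented placeholder answers rather than the study's data:

```python
# Scoring repeated attempts: per-question accuracy over three attempts and
# whether the chatbot answered consistently. Data are invented placeholders.
questions = [
    # (correct answer, [attempt 1, attempt 2, attempt 3])
    ("42", ["42", "42", "42"]),
    ("17", ["17", "19", "17"]),
    ("256", ["128", "128", "128"]),
]

for correct, attempts in questions:
    accuracy = sum(answer == correct for answer in attempts) / len(attempts)
    consistent = len(set(attempts)) == 1
    print(f"answer={correct!r}: accuracy={accuracy:.0%}, consistent={consistent}")
```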
A Promising Start and Not a Panacea: ChatGPT's Early Impact and Potential in Medical Science and Biomedical Engineering Research
by Sohail, Shahab Saquib
in Agents (artificial intelligence), Artificial intelligence, Biochemistry
2024
The advent of artificial intelligence (AI) has catalyzed a revolutionary transformation across various industries, including healthcare. Medical applications of ChatGPT, a powerful language model based on the generative pre-trained transformer (GPT) architecture, encompass the creation of conversational agents capable of accessing and generating medical information from multiple sources and formats. This study investigates the research trends of large language models such as ChatGPT, GPT-4, and Google Bard, comparing their publication trends with early COVID-19 research. The findings underscore the current prominence of AI research and its potential implications in biomedical engineering. A search of the Scopus database on July 23, 2023, yielded 1,096 articles related to ChatGPT, with approximately 26% being medical science-related. Keywords related to artificial intelligence, natural language processing (NLP), LLM, and generative AI dominate ChatGPT research, while a focused representation of medical science research emerges, with emphasis on biomedical research and engineering. This analysis serves as a call to action for researchers, healthcare professionals, and policymakers to recognize and harness AI's potential in healthcare, particularly in the realm of biomedical research.
Journal Article