Catalogue Search | MBRL

Large language models for generating medical examinations: systematic review

by Nadkarni, Girish , Klang, Eyal , Sorin, Vera in Analysis , Artificial intelligence , Authors

2024

Background Writing multiple choice questions (MCQs) for the purpose of medical exams is challenging. It requires extensive medical knowledge, time and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs. Methods The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as a search database. Risk of bias was evaluated using a tailored QUADAS-2 tool. Results Overall, eight studies published between April 2023 and October 2023 were included. Six studies used Chat-GPT 3.5, while two employed GPT 4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models. One other study compared LLM-generated questions with those written by humans. All studies presented faulty questions that were deemed inappropriate for medical exams. Some questions required additional modifications in order to qualify. Conclusions LLMs can be used to write MCQs for medical examinations. However, their limitations cannot be ignored. Further study in this field is essential and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations. Two studies were at high risk of bias. The study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Journal Article

Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4

by E, Klang , R, Kassif Lerner , V, Robinzon in Abdomen , Algorithms , Artificial Intelligence

2023

Background The task of writing multiple-choice question examinations for medical students is complex, time-consuming, and requires significant effort from clinical staff and faculty. Applying artificial intelligence algorithms in this field of medical education may be advisable. Methods From March to April 2023, we used GPT-4, an OpenAI application, to write a 210-question multiple-choice (MCQ) examination based on an existing exam template. The output was thoroughly investigated by specialist physicians who were blinded to the source of the questions. Algorithm mistakes and inaccuracies identified by the specialists were classified as stemming from age, gender, or geographical insensitivities. Results After inputting a detailed prompt, GPT-4 produced the test rapidly and effectively. Only 1 question (0.5%) was defined as false; 15% of questions necessitated revisions. Errors in the AI-generated questions included the use of outdated or inaccurate terminology, age-sensitive inaccuracies, gender-sensitive inaccuracies, and geographically sensitive inaccuracies. Questions that were disqualified due to a flawed methodological basis included elimination-based questions and questions that did not integrate knowledge with clinical reasoning. Conclusion GPT-4 can be used as an adjunctive tool in creating multiple-choice question medical examinations, yet rigorous inspection by specialist physicians remains pivotal.

Journal Article

AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

by So, Jerome , Lui, Chun Tat , Choi, Yu Fai in Adult , Artificial Intelligence , Blooms taxonomy

2025

Background The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) like ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams. Objective This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam. Methods A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs—one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom’s taxonomy (remember, understand, apply and analyse), and item writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated. Results Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12–0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours). Conclusion ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. 
Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.
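The KR-20 reliability mentioned in this abstract can be illustrated with a minimal sketch. It assumes a complete matrix of dichotomous (0/1) item scores and uses the population variance of total scores; some software uses the sample variance instead, so values may differ slightly. The function name `kr20` and the data layout are illustrative assumptions, not taken from the study.

```python
def kr20(score_matrix):
    """Kuder-Richardson 20 for 0/1 item scores.

    score_matrix: list of rows, one row per examinee, one 0/1 entry per item.
    Assumption: population variance (divide by n) of the total scores.
    """
    n = len(score_matrix)        # number of examinees
    k = len(score_matrix[0])     # number of items
    if k < 2:
        raise ValueError("KR-20 needs at least two items")

    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of p*q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / var_total)
```

Calling `kr20` on the response matrix of each 100-item set would give one reliability coefficient per set, which is how such a statistic is typically compared across AI-generated and human-generated question sets.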

Journal Article

ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam

by Kıyak, Yavuz Selim , Coşkun, Özlem , Uluoğlu, Canan in Artificial intelligence , Biomedical and Life Sciences , Biomedicine

2024

Purpose Artificial intelligence, specifically large language models such as ChatGPT, offers valuable potential benefits in question (item) writing. This study aimed to determine the feasibility of generating case-based multiple-choice questions using ChatGPT in terms of item difficulty and discrimination levels. Methods This study involved 99 fourth-year medical students who participated in a rational pharmacotherapy clerkship based on the WHO 6-Step Model. In response to a prompt that we provided, ChatGPT generated ten case-based multiple-choice questions on hypertension. Following review by an expert panel, two of these multiple-choice questions were incorporated into a medical school exam without any changes. Based on the administration of the test, we evaluated their psychometric properties, including item difficulty, item discrimination (point-biserial correlation), and functionality of the options. Results Both questions exhibited acceptable point-biserial correlations (0.41 and 0.39), above the threshold of 0.30. However, one question had three non-functional options (options chosen by fewer than 5% of the exam participants) while the other question had none. Conclusions The findings showed that the questions can effectively differentiate between students who perform at high and low levels, which also points to the potential of ChatGPT as an artificial intelligence tool in test development. Future studies may use the prompt to generate items and enhance the external validity of the results by gathering data from diverse institutions and settings.
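The point-biserial correlation used here as the discrimination measure can be sketched as follows. This is a minimal illustration, assuming 0/1 item scores, total test scores that include the item itself (an uncorrected correlation), and a population standard deviation; the function name and inputs are hypothetical and do not reproduce the authors' analysis.

```python
import math

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and total test scores.

    Assumption: total_scores include the item itself (uncorrected), and the
    standard deviation is the population form (divide by n).
    """
    n = len(item_scores)
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    wrong = [t for s, t in zip(item_scores, total_scores) if s == 0]
    if not correct or not wrong:
        raise ValueError("item must have both correct and incorrect responses")

    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if sd == 0:
        raise ValueError("total scores have zero variance")

    p = len(correct) / n                   # proportion answering correctly
    mean_correct = sum(correct) / len(correct)
    mean_wrong = sum(wrong) / len(wrong)
    return (mean_correct - mean_wrong) / sd * math.sqrt(p * (1 - p))
```

An item would then be compared against the 0.30 threshold cited in the abstract, for example `point_biserial(item, totals) >= 0.30`.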

Journal Article

Item analysis: the impact of distractor efficiency on the difficulty index and discrimination power of multiple-choice items

by Eleragi, Ali Mohammed Elhassan Seid Ahmed , Mohammed, Osama A. , Yahia, Amar Ibrahim Omer in Analysis , Answer Sheets , Cross-Sectional Studies

2024

Background Distractor efficiency (DE) of multiple-choice questions (MCQs) responses is a component of the psychometric analysis used by the examiners to evaluate the distractors’ credibility and functionality. This study was conducted to evaluate the impact of the DE on the difficulty and discrimination indices. Methods This cross-sectional study was conducted from April to June 2023. It used the final exam of the Principles of Diseases Course with 45 second-year students. The exam consisted of 60 type A MCQs. Item analysis (IA) was generated to evaluate KR20, difficulty index (DIF), discrimination index (DIS), and distractor efficiency (DE). DIF was calculated as the percentage of examinees who scored the item correctly. DIS is an item’s ability to discriminate between higher and lower 27% of examinees. For DE, any distractor selected by less than 5% is considered nonfunctional, and items were classified according to the non-functional distractors. The correlation and significance of variance between DIF, DIS, and DE were evaluated. Results The total number of examinees was 45. The KR-20 of the exam was 0.91. The mean (M) and standard deviation (SD) of the DIF of the exam was 37.5 (19.1), and the majority (69.5%) were of acceptable difficulty. The M (SD) of the DIS was 0.46 (0.22), which is excellent. Most items were excellent in discrimination (69.5%), only two were not discriminating (13.6%), and the rest were of acceptable power (16.9%). Items with excellent and good efficiency represent 37.3% each, while only 3.4% were of poor efficiency. The correlation between DE and DIF (P = 0.000, r = -0.548) indicates that items with efficient distractors (low number of NFD) are associated with those having a low difficulty index (difficult items) and vice versa. The correlation between DE and DIS is significantly negative (P = 0.0476, r = -0.259). In such a correlation, items with efficient distractors are associated with low-discriminating items. 
Conclusions There is a significant moderate negative correlation between DE and DIF ( P = 0.00, r = -0.548) and a significant weak negative correlation between DE and DIS ( P = 0.0476, r = -0.259). DIF has a non-significant negative correlation with DIS ( P = 0.7124, r = -0.0492). DE impacts both DIF and DIS. Items with efficient distractors (low number of NFD) are associated with those having a low difficulty index (difficult items) and discriminating items. Improving the quality of DE will decrease the number of NFDs and result in items with acceptable levels of difficulty index and discrimination power.
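The three item statistics defined in this abstract (DIF as percentage correct, DIS from the upper and lower 27% of examinees, and non-functional distractors chosen by fewer than 5%) can be illustrated with a minimal sketch. All function names, the rounding of the 27% group size, and the tie handling in the ranking are assumptions made for illustration; the study's own software and conventions are not described in the abstract.

```python
from collections import Counter

def difficulty_index(item_scores):
    """DIF: percentage of examinees who answered the item correctly (0-100)."""
    return 100.0 * sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """DIS: proportion correct in the upper group minus the lower group.

    Groups are the top and bottom `fraction` of examinees by total score.
    Assumption: group size is round(n * fraction), ties broken by input order.
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower

def nonfunctional_distractors(responses, key, options, threshold=0.05):
    """Distractors chosen by fewer than `threshold` of examinees (NFDs)."""
    counts = Counter(responses)
    n = len(responses)
    return [o for o in options if o != key and counts[o] / n < threshold]
```

An item's distractor efficiency would then follow from the number of NFDs returned: fewer non-functional distractors means more efficient distractors, in the sense used in the abstract.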

Journal Article

The use of large language models in generating multiple choice questions for health professions education: A systematic review and network meta-analysis

by Riehm, Lauren , Pfeifer, Wesla , Lakhani, Moiz in Analysis , Chatbots , Data collection

2026

Large language models (LLMs) have the potential to change medical education. Whether LLMs can generate multiple-choice questions (MCQs) that are of similar quality to those created by humans is unclear. This investigation assessed the quality of MCQs generated by LLMs compared to humans. This review was registered with PROSPERO (CRD42025608775). A systematic review and frequentist random-effects network meta-analysis (NMA) or pairwise meta-analysis was performed. Ovid MEDLINE, Ovid EMBASE, and Scopus were searched from inception to November 1, 2024. The quality of MCQs was assessed with seven pre-defined outcomes: question relevance, clarity, accuracy/correctness; distractor quality; item difficulty analysis; and item discrimination analysis (point biserial correlation and item discrimination index). Continuous data were transformed to a 10-point scale to facilitate statistical analysis and reported as mean differences (MD). The MERSQI and the Grade of Recommendations, Assessment, Development and Evaluation (GRADE) NMA guidelines were used to assess risk of bias and certainty of evidence assessments. Five LLMs were included. NMA demonstrated that ChatGPT 4 generated similar quality MCQs to humans with regards to question relevance (MD -0.13; 95% CI: -0.44,0.18; GRADE: VERY LOW), question clarity (MD -0.03; 95% CI: -0.15,0.10; GRADE: VERY LOW), and distractor quality (MD -0.10; 95% CI: -0.24,0.04; GRADE: VERY LOW); however, MCQs generated by Llama 2 performed worse than humans with regards to question clarity (MD -1.21; 95% CI: -1.60,-0.82; GRADE: VERY LOW) and distractor quality (MD -1.50; 95% CI: -2.03,-0.97; GRADE: VERY LOW). Exploratory post-hoc t-tests demonstrated that ChatGPT 3.5 performed worse than Llama 2 and ChatGPT 4 with regards to question clarity and distractor quality (p < 0.001). ChatGPT 4 may create similar quality MCQs to humans, whereas ChatGPT 3.5 and Llama 2 may be of worse quality. 
Further studies that directly compare these LLMs to human-generated questions and administer MCQs to students are required.

Journal Article

Exploring how complex multiple-choice questions could contribute to inequity in introductory physics

by Mills, Mark , Bell, Eric F. , Hayward, Caitlin in Critical thinking , Educational Measurement - methods , Female

2025

High-stakes exams significantly impact introductory physics students' final grades and have been shown to be inequitable, often to the detriment of students identifying with groups historically marginalized in physics. Certain types of exam questions may contribute more than other types to the observed equity gaps. The primary objective of this study was to determine whether complex multiple-choice (CMC) questions may be a potential cause of inequity. We used four years of data from Problem Roulette, an online, not-for-credit exam preparation program, to address our objective. This data set included 951 Physics II (Electricity and Magnetism) questions, each of which we categorized as CMC or non-CMC. We then compared student performance on each question type and created a multi-level logistic regression model to control for individual student and question differences. Students performed 7.9 percentage points worse on CMC questions than they did on non-CMC questions. We find minimal additional performance differences based on student performance in the course. The results from mixed-effects models suggest that CMC questions may be contributing to the observed equity gaps, especially for male and female students, though more evidence is needed. We found CMC questions are more difficult for everyone. Future research should examine the source of this difficulty and whether that source is functionally related to learning and assessment. Our data does not support using CMC questions instead of non-CMC questions as a way to differentiate top-performing students from everyone else.

Journal Article

Evaluating the multiple-choice questions quality at the College of Medicine, University of Bisha, Saudi Arabia: a three-year experience

by Alhalafi, Abdullah , Mohammed, Osama A. , Adam, Masoud I. E. in Answer Sheets , Correlation , Cross-Sectional Studies

2025

Background Assessment is a central tool that drives and shapes students' learning. Multiple choice questions (MCQs) are central in medical education assessment because they evaluate knowledge across large cohorts. Good quality items will help to achieve the learning objectives and provide trustworthy results. This study aims to evaluate the quality of MCQs utilized in the final exams of the Principal of Diseases (PRD) course over three academic years at the College of Medicine at The University of Bisha, Saudi Arabia. Method This cross-sectional institutional-based study used the final exams from the PRD course for the academic years 2016–2019. It was conducted at the College of Medicine, University of Bisha (UBCOM), Saudi Arabia (SA). The analysis process used item analysis (IA) of the PRD final theoretical examinations of the 2016–2017, 2017–2018, and 2018–2019 academic years. 80, 70, and 60 MCQ items were used per test in the above-mentioned years, respectively (210 total). The IA targets the reliability (KR20), difficulty index (DIF), discrimination index (DI), and distractor effectiveness (DE). The generated data were analyzed using SPSS (version 25.0), and statistical significance was set at P < 0.05. Results The exams included 210 items. The reliability (KR20) ranged from 0.804 to 0.906. The DI indicated that 56.7% of items were excellent, 20.9% were good, 13.8% were poor, and 8.6% were defective. The DIF showed that 50.5% of items had acceptable difficulty, 37.6% were easy, and 11.9% were difficult. DE analysis revealed that 70.2% of distractors were functional, with a significant correlation between DI, DIF, and DE (P < 0.05). Conclusion Most of the examined items exhibited excellent discrimination and acceptable difficulty, with 70.2% having functional distractors, categorizing them as high-quality and well-constructed items. 
The study accentuates the importance of continuous item analysis to maintain and improve the quality of assessment tools used in medical education.

Journal Article

Developing, Analyzing, and Using Distractors for Multiple-Choice Tests in Education: A Comprehensive Review

by Bulut, Okan , Zhang, Xinxin , Gierl, Mark J. in Accuracy , Achievement tests , Difficulty Level

2017

Multiple-choice testing is considered one of the most effective and enduring forms of educational assessment that remains in practice today. This study presents a comprehensive review of the literature on multiple-choice testing in education focused, specifically, on the development, analysis, and use of the incorrect options, which are also called the distractors. Despite a vast body of literature on multiple-choice testing, the task of creating distractors has received much less attention. In this study, we provide an overview of what is known about developing distractors for multiple-choice items and evaluating their quality. Next, we synthesize the existing guidelines on how to use distractors and summarize earlier research on the optimal number of distractors and the optimal ordering of distractors. Finally, we use this comprehensive review to provide the most up-to-date recommendations regarding distractor development, analysis, and use, and in the process, we highlight important areas where further research is needed.

Journal Article

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study

by Saunders, Ramotse , Reilly, Erin , Krystal, Andrew in Accuracy , Accuracy and precision , Analysis

2025

Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chat-bot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings. This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs). A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence. On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). 
MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001). To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.

Journal Article
