Asset Details
Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study
by Hastings, Janna; Strasser, Livia Maria; Anschuetz, Wilma; Dennstädt, Fabio
in
Artificial Intelligence
/ Artificial Intelligence (AI) in Medical Education
/ Chatbots and Conversational Agents
/ e-Learning and Digital Medical Education
/ Education, Medical - methods
/ Educational Measurement - methods
/ Educational Measurement - standards
/ eHealth Literacy / Digital Literacy
/ Generative Language Models Including ChatGPT
/ Germany
/ Humans
/ Language
/ Large Language Models
/ Multilingualism
/ Original Paper
/ Testing and Assessment in Medical Education
2026
Journal Article
Overview
Artificial intelligence continues to transform health care, offering promising applications in clinical practice and medical education. While large language models (LLMs), as a form of generative artificial intelligence, have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex influence and interdependency of factors such as language and model type on LLMs' accuracy, yet cross-language comparisons remain underexplored.
This study evaluates the performance of LLMs in answering medical multiple-choice questions quantitatively and qualitatively across 3 languages (German, French, and Italian), aiming to uncover model capabilities in a multilingual medical education context.
For this mixed methods study, 114 publicly accessible multiple-choice questions in German, French, and Italian from an online self-assessment tool were analyzed. A quantitative performance analysis of several LLMs developed by OpenAI, Meta AI, Anthropic, and DeepSeek evaluated their accuracy in answering the questions in text-only format. For the comparative analysis, both the input question language (German, French, or Italian) and the prompt language (English vs language-matched) were varied. The 2 best-performing LLMs were then prompted to provide explanations for the questions they answered incorrectly, and a qualitative analysis of these explanations was conducted to identify the reasons for the incorrect answers.
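The comparative design described above (each question posed in each language, with either an English or a language-matched prompt, and accuracy tallied per condition) can be sketched as a simple evaluation loop. Everything here is illustrative: the prompt templates, the `ask_llm` stub, and the question format are assumptions, not the authors' actual code or data.

```python
from itertools import product

# Hypothetical instruction templates; the study's actual prompt wording is not given here.
PROMPTS = {
    "en": "Answer this multiple-choice question with the letter of the single best option.",
    "de": "Beantworten Sie diese Multiple-Choice-Frage mit dem Buchstaben der besten Option.",
    "fr": "Répondez à cette question à choix multiples par la lettre de la meilleure option.",
    "it": "Rispondi a questa domanda a scelta multipla con la lettera dell'opzione migliore.",
}

def ask_llm(model: str, prompt: str, question: str) -> str:
    """Stub for a model call; a real run would call each provider's API."""
    return "A"  # placeholder answer for the sketch

def evaluate(models, questions):
    """questions: dicts with per-language 'text' and the correct option letter 'answer'.

    Returns accuracy keyed by (model, question_language, prompt_language).
    """
    results = {}
    for model, q_lang in product(models, ["de", "fr", "it"]):
        for prompt_lang in ("en", q_lang):  # English vs language-matched prompt
            correct = 0
            for q in questions:
                reply = ask_llm(model, PROMPTS[prompt_lang], q["text"][q_lang])
                correct += reply.strip().upper().startswith(q["answer"])
            results[(model, q_lang, prompt_lang)] = correct / len(questions)
    return results

# Tiny worked example with one dummy question
qs = [{"text": {"de": "Frage?", "fr": "Question?", "it": "Domanda?"}, "answer": "A"}]
acc = evaluate(["model-x"], qs)
```

With one model this yields six accuracy figures (3 question languages × 2 prompt languages), which is the grid the reported comparison is built on.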
The performance of LLMs in answering medical multiple-choice questions varied by model and language, with substantial differences in accuracy (between 64% and 87%). The effect of input question language was significant (P<.01), with models performing best on German questions. Across the analyzed LLMs, prompting in English generally led to better performance than language-matched prompts, although the top-performing models showed comparable results with language-matched prompts. Qualitative analysis revealed that the answer explanations of the analyzed models (GPT-4o and Claude Sonnet 3.7) contained different reasoning errors; several explanations were erroneous despite being factually accurate on the topic in question. This analysis also revealed 3 questions to be insufficiently precise.
Our results underline the potential of LLMs in answering medical examination questions and highlight the importance of careful model, prompt, and input-language choices, given the substantial performance variability across these factors. The analysis of answer explanations demonstrates a valuable use case of LLMs for improving examination question quality in medical education, where data security regulations permit their use. Human oversight of language-sensitive or clinically nuanced content remains essential to determine whether incorrect output stems from flaws in the questions themselves or from errors generated by the LLMs. Ongoing evaluation and transparent reporting are needed to ensure reliable integration of LLMs into medical education contexts.
Publisher
JMIR Publications Inc.