Asset Details
Automated Scoring of Narrative Recall Assessments Using Large Language Models Enables Exploration of Alternate Scoring Criteria
by Camacho, Simone; Kleiman, Michael J; Rader, Katana; Galvin, James E
Subjects: Criteria / Delayed / Delayed recall / Discrimination / Evaluation / Hallucinations / Humans / Immediate recall / Interrater reliability / Language modeling / Large language models / Narratives / Neuropsychological assessment / Questions / Reliability / Responses / Scores / Weighting
2025
Journal Article
Overview
Background: Clinical neuropsychological assessments, particularly narrative recall tests, present significant challenges for real-time scoring due to the need for rapid interpretation. While post-hoc scoring from recordings improves accuracy, time constraints in clinical settings often preclude this approach. Large Language Models (LLMs) have demonstrated promise in scoring verbal responses, potentially offering improved reliability and consistency compared to traditional human rating methods.
Method: Twenty-eight participants (14 with mild cognitive impairment (MCI) and 14 age-matched cognitively normal (CN) controls) completed the immediate and delayed recall portions of the Craft Story 21 assessment. ChatGPT-4o was provided human-corrected transcripts along with two sets of questions: a standard set of scoring criteria matching those used by human raters, and a novel set of questions; both used few-shot and chain-of-thought prompting strategies.
Result: LLM scoring demonstrated moderately high correlations with human raters for both immediate (r = 0.654, p < .001) and delayed recall (r = 0.655, p < .001). Interrater reliability was comparable between LLM (ICC2k = 0.580, p = .015) and human raters (ICC2k = 0.574, p = .008). LLM scoring showed superior discrimination of cognitive status for immediate recall (t(26) = 2.64, AUC = 0.765, p = .013), whereas human-scored responses did not distinguish the groups (t(26) = 1.26, AUC = 0.638, p = .218); for delayed recall, human-scored assessments discriminated significantly better (t(26) = 2.50, AUC = 0.747, p = .019) than the LLM (t(26) = 1.57, AUC = 0.662, p = .128). A novel set of questions, designed to separate general from specific answers and to weight each question individually, improved discrimination on the immediate recall assessment (AUC = 0.798, p = .010) compared to the standard questions, but not on the delayed component.
Conclusion: LLMs demonstrate potential as a reliable, cost-effective alternative to human scoring, offering consistent performance and the ability to retrospectively apply modified scoring criteria for research purposes. However, hallucinations and score variability between model runs suggest that this technology is not yet mature enough for clinical use. Nonetheless, rapid advances in the field may make this strategy a viable alternative in the near future, especially as more advanced reasoning models become more widely available and consistent.
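For readers who want to experiment with this kind of pipeline, the sketch below shows one way to issue a few-shot, chain-of-thought scoring call with the OpenAI Python client. The abstract only specifies ChatGPT-4o, human-corrected transcripts, and the two prompting strategies; the prompt wording, the example question, and the 0/1 credit format here are illustrative assumptions, not the authors' actual prompts.

```python
# Minimal sketch of LLM-based narrative recall scoring.
# Assumptions: the system prompt, few-shot example, and scoring format
# are hypothetical; the abstract specifies only ChatGPT-4o with few-shot
# and chain-of-thought prompting on human-corrected transcripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You score narrative recall transcripts. For each question, reason "
    "step by step, then answer with 1 (credit) or 0 (no credit)."
)

# One worked example to anchor the model (few-shot prompting).
FEW_SHOT = (
    "Question: Did the participant recall the character's age?\n"
    "Transcript: 'Someone, maybe eight years old, was playing.'\n"
    "Reasoning: 'eight years old' matches the story detail.\n"
    "Score: 1\n"
)

def score_question(transcript: str, question: str) -> str:
    """Ask the model to score one criterion for one transcript."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduce run-to-run score variability
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"{FEW_SHOT}\nQuestion: {question}\n"
                                        f"Transcript: {transcript}\n"
                                        "Reasoning:"},
        ],
    )
    return response.choices[0].message.content
```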
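The reported statistics (ICC(2,k) for interrater reliability, and t-tests with AUC for discrimination of cognitive status) are standard measures that can be reproduced on one's own ratings. Below is a minimal sketch on hypothetical data, assuming pingouin for the ICC and scikit-learn for the AUC; the numbers are placeholders, not the study's data.

```python
# Sketch of the reported reliability and discrimination metrics,
# computed on hypothetical data shaped like the study's outputs.
import pandas as pd
import pingouin as pg
from scipy import stats
from sklearn.metrics import roc_auc_score

# Long-format ratings: one row per (participant, rater) pair,
# where "raters" can be human raters or repeated LLM runs.
ratings = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3],
    "rater": ["run1", "run2"] * 3,
    "score": [18, 17, 22, 23, 9, 11],
})

# ICC(2,k): average-measures agreement, the ICC2k form quoted above.
icc = pg.intraclass_corr(data=ratings, targets="participant",
                         raters="rater", ratings="score")
print(icc.loc[icc["Type"] == "ICC2k", ["ICC", "pval"]])

# Discrimination of cognitive status: independent-samples t-test and AUC.
mci_scores = [9, 11, 13, 10]   # hypothetical MCI group recall scores
cn_scores = [18, 22, 17, 20]   # hypothetical cognitively normal scores
t, p = stats.ttest_ind(cn_scores, mci_scores)
labels = [1] * len(mci_scores) + [0] * len(cn_scores)  # 1 = MCI
# Lower recall should predict impairment, so negate the scores.
auc = roc_auc_score(labels, [-s for s in mci_scores + cn_scores])
print(f"t = {t:.2f}, p = {p:.3f}, AUC = {auc:.3f}")
```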
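The "variable weighting per question" in the novel criteria reduces to a weighted sum of per-question credits. A minimal sketch follows; the question labels and weights are hypothetical, as the abstract does not report the actual questions or values.

```python
# Hypothetical weighted scoring: general questions earn less credit
# than specific ones. Actual questions and weights are not published
# in the abstract.
question_weights = {
    "q1_general_setting": 0.5,   # general detail, lower weight
    "q2_specific_name": 1.5,     # specific detail, higher weight
    "q3_general_event": 0.5,
    "q4_specific_number": 1.5,
}

def weighted_total(credits: dict[str, int]) -> float:
    """Sum per-question 0/1 credits scaled by each question's weight."""
    return sum(question_weights[q] * c for q, c in credits.items())

print(weighted_total({"q1_general_setting": 1, "q2_specific_name": 0,
                      "q3_general_event": 1, "q4_specific_number": 1}))
# -> 2.5
```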
Publisher
John Wiley & Sons, Inc