Asset Details
Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study
by Dong, Pei; Li, Huali; Wang, Meiyun; Xue, Jon; Yu, Xuan; Shen, Dinggang; Bai, Yan; Li, Xiaodong; Wu, Qingxia; Wu, Yaping; Wang, Yan
in Accuracy / Chatbots / Cross-sectional studies / Large language models / Magnetic resonance imaging / Medical imaging / Multimedia / Original Paper / Ovaries / Performance evaluation / Radiology / Tomography
2024
Journal Article
Overview
Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and assess the impact of different prompting strategies.
This cross-sectional study compared the 3 chatbots on 30 radiology reports (10 per RADS criterion), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases, meticulously prepared by board-certified radiologists, were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses on patient-level RADS categorization and overall ratings. Agreement across repetitions was assessed using Fleiss κ.
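The interrun agreement statistic used here, Fleiss κ, can be computed directly from the raw category assignments. A minimal sketch, assuming each report's repeated run outputs are collected as a list of labels (the function name and data layout are illustrative, not taken from the study):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-subject rating lists.

    ratings[i] holds the category assigned to subject i by each of the
    n raters (here: the n repeated chatbot runs per report).
    """
    n = len(ratings[0])                 # raters (runs) per subject
    N = len(ratings)                    # subjects (reports)
    categories = sorted({c for row in ratings for c in row})
    # n_ij: number of raters assigning category j to subject i
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    # per-subject observed agreement P_i
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n)
           for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement across runs the statistic is 1; values near 0.66-0.69, as reported for Claude-2, fall in the conventional "substantial" band.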
Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
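The k-pass voting mentioned above is, in essence, majority voting over k independent runs per report. A minimal sketch under that assumption (function names and the tie-breaking rule are illustrative, not from the study):

```python
from collections import Counter

def majority_vote(answers):
    """Collapse k run outputs into one answer by majority vote.

    Ties are broken by whichever answer was seen first, a simplifying
    assumption; the study does not specify its tie-breaking rule.
    """
    return Counter(answers).most_common(1)[0][0]

def k_pass_accuracy(runs_per_report, gold):
    """Accuracy after voting: one collapsed answer per report,
    compared against the radiologist-assigned gold category."""
    voted = [majority_vote(runs) for runs in runs_per_report]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)
```

For example, runs of ["LR-5", "LR-4", "LR-5"] collapse to "LR-5"; averaging per-run accuracy and voted accuracy can differ, which is why the abstract reports both figures.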
When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
Publisher
JMIR Publications