Asset Details
Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study
by Dong, Pei; Li, Huali; Wang, Meiyun; Xue, Jon; Yu, Xuan; Shen, Dinggang; Bai, Yan; Li, Xiaodong; Wu, Qingxia; Wu, Yaping; Wang, Yan
in Accuracy / Chatbots / Cross-sectional studies / Large language models / Magnetic resonance imaging / Medical imaging / Multimedia / Original Paper / Ovaries / Performance evaluation / Radiology / Tomography
2024
Journal Article
Overview
Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and assess the impact of different prompting strategies.
This cross-sectional study compared the 3 chatbots on 30 radiology reports (10 per RADS criterion), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases, meticulously prepared by board-certified radiologists, were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses on patient-level RADS categorization and overall ratings. Agreement across repetitions was assessed using Fleiss κ.
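The interrun agreement statistic used here, Fleiss κ, can be computed directly from the raw category assignments. A minimal sketch, assuming each report's repeated run outputs are collected as a list of labels (the function name and data layout are illustrative, not taken from the study):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-subject rating lists.

    ratings[i] holds the category assigned to subject i by each of the
    n raters (here: the n repeated chatbot runs per report).
    """
    n = len(ratings[0])                 # raters (runs) per subject
    N = len(ratings)                    # subjects (reports)
    categories = sorted({c for row in ratings for c in row})
    # n_ij: number of raters assigning category j to subject i
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    # per-subject observed agreement P_i
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n)
           for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement across runs the statistic is 1; values near 0.66-0.69, as reported for Claude-2, fall in the conventional "substantial" band.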
Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
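The k-pass voting mentioned above is, in essence, majority voting over k independent runs per report. A minimal sketch under that assumption (function names and the tie-breaking rule are illustrative, not from the study):

```python
from collections import Counter

def majority_vote(answers):
    """Collapse k run outputs into one answer by majority vote.

    Ties are broken by whichever answer was seen first, a simplifying
    assumption; the study does not specify its tie-breaking rule.
    """
    return Counter(answers).most_common(1)[0][0]

def k_pass_accuracy(runs_per_report, gold):
    """Accuracy after voting: one collapsed answer per report,
    compared against the radiologist-assigned gold category."""
    voted = [majority_vote(runs) for runs in runs_per_report]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)
```

For example, runs of ["LR-5", "LR-4", "LR-5"] collapse to "LR-5"; averaging per-run accuracy and voted accuracy can differ, which is why the abstract reports both figures.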
When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
Publisher
JMIR Publications