Search Results

78 result(s) for "Dreyer, Keith"
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study
Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated. This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT's performance on clinical tasks. ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types. ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
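
The accuracy metric here is a simple proportion of correct responses with a normal-approximation confidence interval. A minimal Python sketch of that calculation (the counts passed in are hypothetical placeholders, not the study's raw data):

```python
# Accuracy as a proportion of correct responses, with a Wald 95% CI.
import math

def accuracy_with_ci(n_correct: int, n_total: int, z: float = 1.96):
    """Return (accuracy, ci_low, ci_high) via the normal approximation."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, p - half_width, p + half_width

# Hypothetical counts for illustration only.
acc, lo, hi = accuracy_with_ci(n_correct=932, n_total=1300)
print(f"accuracy {acc:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```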
Validation pipeline for machine learning algorithm assessment for multiple vendors
A standardized objective evaluation method is needed to compare machine learning (ML) algorithms as these tools become available for clinical use. Therefore, we designed, built, and tested an evaluation pipeline with the goal of normalizing performance measurement of independently developed algorithms, using a common test dataset of our clinical imaging. Three vendor applications for detecting solid, part-solid, and ground-glass lung nodules in chest CT examinations were assessed in this retrospective study using our data-preprocessing and algorithm assessment chain. The pipeline included tools for image cohort creation and de-identification; report and image annotation for ground-truth labeling; server partitioning to receive vendor “black box” algorithms and to enable model testing on our internal clinical data (100 chest CTs with 243 nodules) from within our security firewall; model validation and result visualization; and performance assessment calculating algorithm recall, precision, and receiver operating characteristic (ROC) curves. Algorithm true positives, false positives, false negatives, recall, and precision for detecting lung nodules were as follows: Vendor-1 (194, 23, 49, 0.80, 0.89); Vendor-2 (182, 270, 61, 0.75, 0.40); Vendor-3 (75, 120, 168, 0.32, 0.39). The areas under the curve (AUCs) for detection of solid (0.61–0.74), ground-glass (0.66–0.86), and part-solid (0.52–0.86) nodules varied between the three vendors. Our ML model validation pipeline enabled testing of multi-vendor algorithms within the institutional firewall. Wide variations in algorithm performance for detection as well as classification of lung nodules justify the premise for a standardized objective ML algorithm evaluation process.
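
The recall and precision figures can be reproduced directly from the reported true positive (TP), false positive (FP), and false negative (FN) counts. A short Python sketch using the counts from the abstract (last-digit rounding may differ slightly from the published values):

```python
# Recall and precision from the per-vendor detection counts reported above.
counts = {  # vendor: (TP, FP, FN), taken from the abstract
    "Vendor-1": (194, 23, 49),
    "Vendor-2": (182, 270, 61),
    "Vendor-3": (75, 120, 168),
}
for vendor, (tp, fp, fn) in counts.items():
    recall = tp / (tp + fn)     # share of all true nodules that were detected
    precision = tp / (tp + fp)  # share of detections that were true nodules
    print(f"{vendor}: recall={recall:.2f}, precision={precision:.2f}")
```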
Performance of threshold-based stone segmentation and radiomics for determining the composition of kidney stones from single-energy CT
Purpose: Knowledge of kidney stone composition can help in patient management; urine composition analysis and dual-energy CT are frequently used to assess stone type. We assessed whether threshold-based stone segmentation and radiomics can determine the composition of kidney stones from single-energy, non-contrast abdomen–pelvis CT. Methods: With IRB approval, we identified 218 consecutive patients (mean age 64 ± 13 years; male:female 138:80) with kidney stones on non-contrast abdomen–pelvis CT and surgical or biochemical proof of their stone composition. CT examinations were performed on one of seven multidetector-row scanners from four vendors (GE, Philips, Siemens, Toshiba). Deidentified CT images were processed with a radiomics prototype (Frontier, Siemens Healthineers) to segment the entire kidney volumes with an AI-based organ segmentation tool. We applied a threshold of 130 HU to isolate stones in the segmented kidneys and to estimate radiomics over the segmented stone volume. A coinvestigator verified kidney stone segmentation and adjusted the volume of interest to include the entire stone volume when necessary. We applied multiple logistic regression tests with precision-recall plots to obtain the area under the curve (AUC) using R statistical software. Results: The threshold-based stone segmentation successfully isolated kidney stones (uric acid: n = 102 patients, calcium oxalate/phosphate: n = 116 patients) in all patients. Radiomics differentiated between calcium and uric acid stones with an AUC of 0.78 (p < 0.01, 95% CI 0.73–0.83), 0.79 sensitivity, and 0.90 specificity regardless of CT vendor (GE CT: AUC = 0.82, p < 0.01, 95% CI 0.740–0.896; Siemens CT: AUC = 0.77, p < 0.01, 95% CI 0.700–0.846). Conclusion: Automated threshold-based stone segmentation and radiomics can differentiate between calcium oxalate/phosphate and urate stones on non-contrast, single-energy abdomen CT.
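
The core segmentation step is a fixed 130 HU threshold applied inside an existing kidney segmentation. A minimal NumPy sketch of that step; the array shapes and the rectangular stand-in mask are invented, not the prototype's actual output:

```python
# Isolate voxels above 130 HU inside an already-segmented kidney mask.
import numpy as np

ct_hu = np.random.randint(-1000, 1500, size=(40, 256, 256))  # stand-in CT volume (HU)
kidney_mask = np.zeros_like(ct_hu, dtype=bool)
kidney_mask[10:30, 80:180, 80:180] = True                    # stand-in kidney segmentation

stone_mask = kidney_mask & (ct_hu >= 130)   # 130 HU threshold within the kidneys
stone_voxels = ct_hu[stone_mask]            # voxel values available for radiomics
print(f"{stone_mask.sum()} candidate stone voxels, mean {stone_voxels.mean():.0f} HU")
```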
Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care
Polypharmacy remains an important challenge for patients with extensive medical complexity. Given the primary care shortage and the increasing aging population, effective polypharmacy management is crucial to manage the increasing burden of care. The capacity of large language model (LLM)-based artificial intelligence to aid in polypharmacy management has yet to be evaluated. Here, we evaluate ChatGPT’s performance in polypharmacy management via its deprescribing decisions in standardized clinical vignettes. We inputted several clinical vignettes originally from a study of general practitioners’ deprescribing decisions into ChatGPT 3.5, a publicly available LLM, and evaluated its capacity for yes/no binary deprescribing decisions as well as list-based prompts in which the model was prompted to choose which of several medications to deprescribe. We recorded ChatGPT responses to yes/no binary deprescribing prompts and the number and types of medications deprescribed. In yes/no binary deprescribing decisions, ChatGPT universally recommended deprescribing medications regardless of activities of daily living (ADL) status in patients with no overlying cardiovascular disease (CVD) history; in patients with CVD history, ChatGPT’s answers varied by technical replicate. The total number of medications deprescribed ranged from 2.67 to 3.67 (out of 7) and did not vary with CVD status, but increased linearly with severity of ADL impairment. Among medication types, ChatGPT preferentially deprescribed pain medications. ChatGPT’s deprescribing decisions vary along the axes of ADL status, CVD history, and medication type, indicating some concordance of internal logic between general practitioners and the model. These results indicate that specifically trained LLMs may provide useful clinical support in polypharmacy management for primary care physicians.
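
A small sketch of how such replicate responses can be tallied once the deprescribing answers have been collected; the medication classes and replicate answers below are invented placeholders, not the study's data:

```python
# Tally deprescribing choices across technical replicates of one vignette.
from collections import Counter

replicates = [  # medications the model chose to deprescribe, per replicate (hypothetical)
    ["pain medication", "statin", "PPI"],
    ["pain medication", "PPI"],
    ["pain medication", "statin", "antihypertensive", "PPI"],
]
class_counts = Counter(med for answer in replicates for med in answer)
mean_deprescribed = sum(len(a) for a in replicates) / len(replicates)
print(f"mean medications deprescribed per replicate: {mean_deprescribed:.2f}")
print("deprescribing frequency by class:", class_counts.most_common())
```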
Head CT deep learning model is highly accurate for early infarct estimation
Non-contrast head CT (NCCT) is extremely insensitive for early (< 3–6 h) acute infarct identification. We developed a deep learning model that detects and delineates suspected early acute infarcts on NCCT, using diffusion MRI as ground truth (3566 NCCT/MRI training patient pairs). The model substantially outperformed 3 expert neuroradiologists on a test set of 150 CT scans of patients who were potential candidates for thrombectomy (60 stroke-negative, 90 stroke-positive middle cerebral artery territory only infarcts), with sensitivity 96% (specificity 72%) for the model versus 61–66% (specificity 90–92%) for the experts; model infarct volume estimates also strongly correlated with those of diffusion MRI (r² > 0.98). When this 150 CT test set was expanded to include a total of 364 CT scans with a more heterogeneous distribution of infarct locations (94 stroke-negative, 270 stroke-positive mixed territory infarcts), model sensitivity was 97% and specificity 99% for detection of infarcts larger than the 70 mL volume threshold used for patient selection in several major randomized controlled trials of thrombectomy treatment.
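
For reference, a compact sketch of the two headline metrics: sensitivity/specificity from binary detection calls, and r² between model and diffusion-MRI infarct volume estimates. All arrays are invented stand-ins, not study data:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])   # 1 = infarct present (ground truth)
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 1])   # model detections
sens = (y_pred[y_true == 1] == 1).mean()      # true positive rate
spec = (y_pred[y_true == 0] == 0).mean()      # true negative rate

model_vol = np.array([12.0, 55.0, 80.0, 7.5])  # mL, model estimates (hypothetical)
mri_vol = np.array([11.0, 57.0, 78.0, 8.0])    # mL, diffusion-MRI ground truth
r = np.corrcoef(model_vol, mri_vol)[0, 1]
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, r^2={r**2:.3f}")
```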
Evaluation of an artificial intelligence model for opportunistic Agatston scoring on non-gated chest computed tomography
The Agatston score is a measure of cardiovascular disease traditionally calculated on cardiac gated computed tomography (CT) of the chest. Cardiac gated CT is resource-intensive, can be hard to access, and involves extra radiation exposure. Artificial intelligence (AI) can be used to opportunistically calculate Agatston scores on non-gated CTs performed for other indications. This retrospective standalone performance assessment compared the accuracy of an AI model (Riverain Technologies ClearRead CT CAC) at calculating Agatston scores on non-gated CTs to both consensus radiologist interpretations on the same CTs and Agatston scores from the original radiology reports of paired cardiac gated CTs. It involved 491 non-contrast CT chest cases acquired at five hospitals in the United States between January 2022 and December 2023; approximately two-thirds had a paired cardiac gated CT. It compared the agreement of Agatston categories (0, 1–99, 100–399 and ≥ 400) using the quadratic weighted Kappa coefficient and the correlation of Agatston scores using the Spearman coefficient. The agreement of Agatston categories between the AI model and ground truth radiologists was 0.959 (95% CI: 0.943 to 0.975); this result was broadly consistent across sex, age group, race, ethnicity and CT scanner manufacturer subgroups. The agreement between the AI model and paired cardiac gated CT was 0.906 (95% CI: 0.882 to 0.927). The correlations of Agatston scores for these two comparisons were 0.975 (95% CI: 0.962 to 0.987) and 0.942 (95% CI: 0.920 to 0.957) respectively. The assessed AI model accurately calculated Agatston scores on non-gated CTs and produced similar scores to paired cardiac gated CTs. Its use could broaden screening for atherosclerotic cardiovascular disease, enabling opportunistic screening on CTs captured for other indications.
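
Both agreement statistics named here, the quadratic weighted kappa on Agatston categories and the Spearman correlation on raw scores, are available in standard Python libraries. A sketch with invented scores; only the category bins (0, 1-99, 100-399, ≥400) follow the abstract:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def agatston_category(score: float) -> int:
    """Map a raw Agatston score to category 0, 1, 2, or 3."""
    return int(np.digitize(score, [1, 100, 400]))  # edges from the abstract

ai_scores = np.array([0, 12, 250, 800, 45, 0])   # hypothetical AI Agatston scores
gt_scores = np.array([0, 20, 180, 950, 60, 3])   # hypothetical ground truth

kappa = cohen_kappa_score([agatston_category(s) for s in ai_scores],
                          [agatston_category(s) for s in gt_scores],
                          weights="quadratic")
rho, _ = spearmanr(ai_scores, gt_scores)
print(f"quadratic weighted kappa={kappa:.3f}, Spearman rho={rho:.3f}")
```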
Association of Artificial Intelligence–Aided Chest Radiograph Interpretation With Reader Performance and Efficiency
The efficient and accurate interpretation of radiologic images is paramount. This study evaluated whether a deep learning-based artificial intelligence (AI) engine used concurrently can improve reader performance and efficiency in interpreting chest radiograph abnormalities. This multicenter cohort study was conducted from April to November 2021 and involved radiologists, including attending radiologists, thoracic radiology fellows, and residents, who independently participated in 2 observer performance test sessions. The sessions included a reading session with AI and a session without AI, in a randomized crossover manner with a 4-week washout period in between. The AI produced a heat map and the image-level probability of the presence of the referable lesion. The data used were collected at 2 quaternary academic hospitals in Boston, Massachusetts: Beth Israel Deaconess Medical Center (The Medical Information Mart for Intensive Care Chest X-Ray [MIMIC-CXR]) and Massachusetts General Hospital (MGH). The ground truths for the labels were created via consensus reading by 2 thoracic radiologists. Each reader documented their findings in a customized report template, in which the 4 target chest radiograph findings and the reader confidence of the presence of each finding were recorded. The time taken for reporting each chest radiograph was also recorded. Sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) were calculated for each target finding. A total of 6 radiologists (2 attending radiologists, 2 thoracic radiology fellows, and 2 residents) participated in the study. The study involved a total of 497 frontal chest radiographs (247 from the MIMIC-CXR data set, for which patient demographic data were not available, and 250 from MGH; mean [SD] age, 63 [16] years; 133 men [53.2%]) from adult patients with and without 4 target findings (pneumonia, nodule, pneumothorax, and pleural effusion). The target findings were found in 351 of 497 chest radiographs. The AI was associated with higher sensitivity for all findings except pleural effusion compared with the readers (nodule, 0.816 [95% CI, 0.732-0.882] vs 0.567 [95% CI, 0.524-0.611]; pneumonia, 0.887 [95% CI, 0.834-0.928] vs 0.673 [95% CI, 0.632-0.714]; pleural effusion, 0.872 [95% CI, 0.808-0.921] vs 0.889 [95% CI, 0.862-0.917]; pneumothorax, 0.988 [95% CI, 0.932-1.000] vs 0.792 [95% CI, 0.756-0.827]). AI-aided interpretation was associated with significantly improved reader sensitivities for all target findings, without negative impacts on the specificity. Overall, the AUROCs of readers improved for all 4 target findings, with significant improvements in detection of pneumothorax and nodule. The reporting time with AI was 10% lower than without AI (36.9 vs 40.8 seconds; difference, 3.9 seconds; 95% CI, 2.9-5.2 seconds; P < .001). These findings suggest that AI-aided interpretation was associated with improved reader performance and efficiency for identifying major thoracic findings on a chest radiograph.
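
Per-finding AUROC in a reader study of this kind is typically computed from graded reader confidence against the consensus ground truth. A minimal sketch with invented ratings; this is the generic computation, not the study's code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ground_truth = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # finding present per radiograph
confidence = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3])  # reader confidence

auroc = roc_auc_score(ground_truth, confidence)
print(f"AUROC for this finding: {auroc:.3f}")
```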
Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care
The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment.
Implementing the DICOM Standard for Digital Pathology
Background: Digital Imaging and Communications in Medicine (DICOM®) is the standard for the representation, storage, and communication of medical images and related information. A DICOM file format and communication protocol for pathology have been defined; however, adoption by vendors and in the field is pending. Here, we implemented the essential aspects of the standard and assessed its capabilities and limitations in a multisite, multivendor healthcare network. Methods: We selected relevant DICOM attributes, developed a program that extracts pixel data and pixel-related metadata, integrated patient and specimen-related metadata, populated and encoded DICOM attributes, and stored DICOM files. We generated the files using image data from four vendor-specific image file formats and clinical metadata from two departments with different laboratory information systems. We validated the generated DICOM files using recognized DICOM validation tools and measured encoding, storage, and access efficiency for three image compression methods. Finally, we evaluated storing, querying, and retrieving data over the web using existing DICOM archive software. Results: Whole slide image data can be encoded together with relevant patient and specimen-related metadata as DICOM objects. These objects can be accessed efficiently from files or through RESTful web services using existing software implementations. Performance measurements show that the choice of image compression method has a major impact on data access efficiency. For lossy compression, JPEG achieves the fastest compression/decompression rates. For lossless compression, JPEG-LS significantly outperforms JPEG 2000 with respect to data encoding and decoding speed. Conclusion: Implementation of DICOM allows efficient access to image data as well as associated metadata. By leveraging a wealth of existing infrastructure solutions, the use of DICOM facilitates enterprise integration and data exchange for digital pathology.
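
As a rough illustration of the encoding step described here, a heavily simplified pydicom sketch that writes pixel data together with patient- and specimen-related metadata into a DICOM file. A real whole slide image implementation would use the VL Whole Slide Microscopy IOD with tiled, compressed frames; the Secondary Capture SOP class, single-frame layout, and all attribute values below are simplifying assumptions:

```python
import numpy as np
from pydicom.dataset import Dataset, FileMetaDataset
from pydicom.uid import ExplicitVRLittleEndian, generate_uid

SECONDARY_CAPTURE = "1.2.840.10008.5.1.4.1.1.7"  # Secondary Capture SOP Class UID

frame = np.zeros((64, 64), dtype=np.uint8)       # stand-in for extracted pixel data

meta = FileMetaDataset()
meta.MediaStorageSOPClassUID = SECONDARY_CAPTURE
meta.MediaStorageSOPInstanceUID = generate_uid()
meta.TransferSyntaxUID = ExplicitVRLittleEndian

ds = Dataset()
ds.file_meta = meta
ds.SOPClassUID = SECONDARY_CAPTURE
ds.SOPInstanceUID = meta.MediaStorageSOPInstanceUID
ds.PatientName = "Doe^Jane"                      # patient-related metadata (hypothetical)
ds.ContainerIdentifier = "SLIDE-A-1"             # specimen-related metadata (hypothetical)
ds.Rows, ds.Columns = frame.shape                # pixel-related metadata
ds.SamplesPerPixel = 1
ds.PhotometricInterpretation = "MONOCHROME2"
ds.BitsAllocated = ds.BitsStored = 8
ds.HighBit = 7
ds.PixelRepresentation = 0
ds.PixelData = frame.tobytes()

ds.is_little_endian = True                       # required by older pydicom versions
ds.is_implicit_VR = False
ds.save_as("slide_frame.dcm", write_like_original=False)
```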