4 results for "Nourelahi, Mehdi"
Raising awareness of potential biases in medical machine learning: Experience from a Datathon
Objective: To challenge clinicians and informaticians to learn about potential sources of bias in medical machine learning models through investigation of data and predictions from an open-source severity-of-illness score. Methods: Over a two-day period (total elapsed time approximately 28 hours), we conducted a datathon that challenged interdisciplinary teams to investigate potential sources of bias in the Global Open Source Severity of Illness Score (GOSSIS-1). Teams were invited to develop hypotheses, to use tools of their choosing to identify potential sources of bias, and to provide a final report. Results: Five teams participated, three of which included both informaticians and clinicians. Most (4/5) used Python for analyses; the remaining team used R. Common analysis themes included the relationship of the GOSSIS-1 prediction score with demographic and care-related variables; relationships between demographics and outcomes; calibration and factors related to the context of care; and the impact of missingness. Representativeness of the population, differences in calibration and model performance among groups, and differences in performance across hospital settings were identified as possible sources of bias. Discussion: Datathons are a promising approach for challenging developers and users to explore questions relating to unrecognized biases in medical machine learning algorithms.
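One recurring analysis theme above, calibration among demographic groups, can be sketched as a per-group comparison of mean predicted risk against the observed event rate. The `calibration_by_group` helper and all data below are synthetic placeholders for illustration, not GOSSIS-1 predictions:

```python
# Sketch: per-group calibration-in-the-large for an illness-severity score.
# All data below are synthetic placeholders, not GOSSIS-1 output.

def calibration_by_group(preds, outcomes, groups):
    """Return {group: (mean predicted risk, observed event rate)}."""
    stats = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        mean_pred = sum(preds[i] for i in idx) / len(idx)
        obs_rate = sum(outcomes[i] for i in idx) / len(idx)
        stats[g] = (mean_pred, obs_rate)
    return stats

preds    = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25]   # model risk predictions
outcomes = [0,    0,    1,    1,    0,    1]      # observed events
groups   = ["A",  "A",  "A",  "B",  "B",  "B"]    # demographic group labels

for g, (p, o) in sorted(calibration_by_group(preds, outcomes, groups).items()):
    print(f"group {g}: mean predicted {p:.2f}, observed {o:.2f}")
```

A large gap between predicted and observed rates in one group but not another is the kind of calibration difference the teams flagged as a possible source of bias.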
An Analysis of Explainability of Predictions in Deep Networks: Methods and Applications
Convolutional Neural Networks (CNNs) are pivotal in computer vision tasks, with their evaluation often centered around test-set accuracy, out-of-distribution performance, and explainability via feature attribution methods. However, the interplay between these criteria remains unclear. This study bridges this gap by conducting a comprehensive analysis across 12 ImageNet-trained CNNs, encompassing three training algorithms and five architectures. We evaluate nine feature attribution methods to elucidate their relationships and implications for machine learning practitioners. Our findings reveal insights into CNN performance across the evaluated criteria. Firstly, adversarially robust CNNs exhibit higher explainability scores with gradient-based attribution methods, contrasting with CAM-based or perturbation-based methods. Secondly, despite their high accuracy, AdvProp models do not consistently excel in explainability, highlighting a decoupling of these metrics. Thirdly, among the attribution methods, Grad-CAM and RISE consistently emerge as superior choices, underscoring their reliability across diverse CNN architectures. Moreover, our analysis exposes biases in attribution methods. For instance, the Insertion and Deletion methods show preferences towards vanilla and robust models, respectively, reflecting their alignment with CNN confidence score distributions. Furthermore, we explore the impact of saliency-based data augmentation on CNN performance in both vanilla and adversarial training settings. Through meticulous evaluations in a single-sample augmentation framework, we contrast methods that preserve versus remove salient regions. Our results demonstrate that saliency-based augmentation consistently outperforms random methods, substantiating its efficacy in enhancing CNN training. In conclusion, this study contributes a dual perspective: elucidating the intricate relationships between test-set accuracy, out-of-distribution performance, and explainability in CNNs, while also shedding light on the influential role of saliency-based data augmentation in improving CNN training outcomes. These findings provide actionable insights for ML practitioners, advocating for thoughtful selection of attribution methods and augmentation strategies tailored to specific application requirements and CNN architectures.
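The Deletion metric mentioned above can be sketched in a few lines: features are zeroed out from most to least attributed importance while the model's confidence is tracked; a faithful attribution makes confidence fall quickly. The toy linear "model" and hand-picked saliency values below are illustrative assumptions, not any of the nine attribution methods evaluated in the paper:

```python
# Sketch of the Deletion metric: remove features in order of attributed
# importance and track how the model's confidence falls.

def deletion_curve(model, x, saliency, baseline=0.0):
    """Confidence after zeroing features from most to least salient."""
    order = sorted(range(len(x)), key=lambda i: saliency[i], reverse=True)
    x = list(x)                  # copy so the caller's input is untouched
    curve = [model(x)]
    for i in order:
        x[i] = baseline
        curve.append(model(x))
    return curve

# Toy linear "classifier": confidence is a weighted sum of the inputs.
weights = [0.5, 0.3, 0.2]
model = lambda v: sum(w * val for w, val in zip(weights, v))

x = [1.0, 1.0, 1.0]
saliency = weights               # a perfectly faithful attribution here
print(deletion_curve(model, x, saliency))
```

The abstract's point about bias follows from this construction: because the curve is built from raw confidence values, models with systematically different confidence distributions (vanilla vs adversarially robust) score differently even for equally faithful attributions.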
Machine Learning Predicts Bleeding Risk in Atrial Fibrillation Patients on Direct Oral Anticoagulant
Highlights:
• ML models outperformed conventional scores in predicting major bleeding in AF.
• Random forest achieved an AUC of 0.76 vs HAS-BLED's AUC of 0.57 (p < 0.001).
• SHAP analysis identified new bleeding risk factors like BMI and cholesterol profile.
• Study included 24,468 AF patients on DOACs with a 5-year follow-up for bleeding events.
• ML models offer more personalized bleeding risk assessment for AF patients on DOACs.

Predicting major bleeding in nonvalvular atrial fibrillation (AF) patients on direct oral anticoagulants (DOACs) is crucial for personalized care. Alternatives like left atrial appendage closure devices lower stroke risk with fewer nonprocedural bleeds. This study compares machine learning (ML) models with conventional bleeding risk scores (HAS-BLED, ORBIT, and ATRIA) for predicting bleeding events requiring hospitalization in AF patients on DOACs at their index cardiologist visit. This retrospective cohort study used electronic health records from 2010 to 2022 at the University of Pittsburgh Medical Center. It included 24,468 nonvalvular AF patients (age ≥18) on DOACs, excluding those with prior significant bleeding or warfarin use. The primary outcome was hospitalization for bleeding within one year, with follow-up at one, two, and five years. ML algorithms (logistic regression, classification trees, random forest, XGBoost, k-nearest neighbor, naïve Bayes) were compared for performance. Of 24,468 patients, 553 (2.3%) had bleeding within one year, 829 (3.5%) within two years, and 1,292 (5.8%) within five years. ML models outperformed HAS-BLED, ATRIA, and ORBIT in 1-year predictions. The random forest model achieved an AUC of 0.76 (0.70 to 0.81), G-Mean of 0.67, and net reclassification index of 0.14 compared to HAS-BLED's AUC of 0.57 (p < 0.001). ML models showed superior results across all timepoints and for hemorrhagic stroke. SHAP analysis identified new risk factors, including BMI, cholesterol profile, and insurance type.
In conclusion, ML models demonstrated improved performance over conventional bleeding risk scores and uncovered novel risk factors, offering potential for more personalized bleeding risk assessment in AF patients on DOACs.
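The headline comparison here (random forest AUC 0.76 vs HAS-BLED 0.57) rests on the rank-based definition of AUC: the probability that a randomly chosen bleeding case is scored above a randomly chosen non-case, with ties counted as half. A minimal sketch of that computation follows; the scores and labels are synthetic, not study data:

```python
# Rank-based (Mann-Whitney) AUC: P(random positive scores above random
# negative), counting ties as 0.5. Scores below are synthetic examples.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels   = [1, 1, 0, 0, 0, 1, 0, 0]                      # 1 = bleeding event
ml_score = [0.9, 0.6, 0.4, 0.7, 0.2, 0.8, 0.1, 0.5]      # fine-grained model output
hasbled  = [3, 2, 2, 1, 1, 2, 0, 3]                      # coarse integer risk score

print(f"ML AUC:       {auc(ml_score, labels):.2f}")
print(f"HAS-BLED AUC: {auc(hasbled, labels):.2f}")
```

The tie handling matters for this comparison: coarse integer scores like HAS-BLED produce many tied pairs, which pulls their AUC toward 0.5, while a continuous model output rarely ties.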
How explainable are adversarially-robust CNNs?
Three important criteria of existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While these criteria have been studied independently, their relationship is unknown. For example, do CNNs with stronger out-of-distribution performance also have stronger explainability? Furthermore, most prior feature-importance studies evaluate methods on only 2-3 common vanilla ImageNet-trained CNNs, leaving it unknown how these methods generalize to CNNs of other architectures and training algorithms. Here, we perform the first large-scale evaluation of the relationships among the three criteria using 9 feature-importance methods and 12 ImageNet-trained CNNs spanning 3 training algorithms and 5 CNN architectures. We find several important insights and recommendations for ML practitioners. First, adversarially robust CNNs have a higher explainability score on gradient-based attribution methods (but not CAM-based or perturbation-based methods). Second, AdvProp models, despite being more accurate than both vanilla and robust models, are not superior in explainability. Third, among the 9 feature attribution methods tested, GradCAM and RISE are consistently the best. Fourth, Insertion and Deletion are biased towards vanilla and robust models respectively, due to their strong correlation with the confidence score distributions of a CNN. Fifth, we did not find a single CNN to be the best in all three criteria, which interestingly suggests that CNNs become harder to interpret as they become more accurate.