Catalogue Search | MBRL

The Clinician and Dataset Shift in Artificial Intelligence

by Singh, Karandeep; Kupke, Annabel; Kohane, Isaac S in Accuracy, Algorithms, and Education

2021

This letter outlines how to identify, and potentially mitigate, common sources of “dataset shift” in machine-learning systems. Dataset shift occurs when a model’s training data differ from the data the model uses to provide diagnostic, prognostic, or treatment advice.

Journal Article

Scorecard for synthetic medical data evaluation

by Sizikova, Elena; Badano, Aldo; Zamzmi, Ghada in 639/166/985, 639/705/1042, 639/705/1046

2025

Although interest in synthetic medical data (SMD) for developing and testing artificial intelligence (AI) methods is growing, the absence of a comprehensive framework to evaluate the quality and applicability of SMD hinders its wider adoption. Here, we outline an evaluation framework designed to meet the unique requirements of medical applications. We also introduce the SMD scorecard, a comprehensive report accompanying artificially generated datasets. This scorecard provides a quantitative assessment of SMD across seven criteria (7 Cs), complemented by a descriptive section that contains all relevant information about the dataset. The SMD scorecard provides a practical framework for evaluating and reporting the quality of synthetic data, which can benefit SMD developers and users. The use of synthetic medical data (SMD) in AI development is on the rise, but its broader application is limited by the lack of a comprehensive evaluation framework. Here, Ghada Zamzmi and colleagues present a novel evaluation framework tailored for medical applications, along with an SMD scorecard that quantitatively assesses synthetic datasets across seven key criteria.

Journal Article

Scaling medical device regulatory science using large language models

by Vossler, Patrick; Singh, Karandeep; Feng, Jean in 631/114, 639/166, 639/705

2026

Advances in artificial intelligence (AI) and machine learning (ML) have led to a surge in AI/ML-enabled medical devices, posing new challenges for regulators because best practices for developing, testing, and monitoring these devices are still emerging. Consequently, there is a critical need for up-to-date data analyses of the regulatory landscape to inform policy-making. However, such analyses have historically relied upon manual annotation efforts because regulatory documents are unstructured, complex, multi-modal, and filled with jargon. Efforts to automate annotation using simple natural language processing methods have achieved limited success, as they lack the reasoning needed to interpret regulatory materials. Recent progress in large language models (LLMs) presents an unprecedented opportunity to unlock information embedded in regulatory documents. This work conducts the first wide-ranging validation study of LLMs for scaling data analyses in the field of medical device regulatory science. Evaluating LLM outputs using expert manual annotations and “LLM-as-a-judge,” we find that LLMs can accurately extract attributes spanning pre- and post-market settings, with accuracy rates often reaching 80% or higher. We then demonstrate how LLMs can scale up analyses in three applications: (1) monitoring device validation practices, (2) coding medical device reports, and (3) identifying potential risk factors for post-market adverse events.

Journal Article

A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform

by Pai, Vinay; Sahiner, Berkman; Diamond, Matthew C. in 692/499, 692/700/1538, Algorithms

2024

A fundamental goal of evaluating the performance of a clinical model is to ensure it performs well across a diverse intended patient population. A primary challenge is that the data used in model development and testing often consist of many overlapping, heterogeneous patient subgroups that may not be explicitly defined or labeled. While a model’s average performance on a dataset may be high, the model can have significantly lower performance for certain subgroups, which may be hard to detect. We describe an algorithmic framework for identifying subgroups with potential performance disparities (AFISP), which produces a set of interpretable phenotypes corresponding to subgroups for which the model’s performance may be relatively lower. This could allow model evaluators, including developers and users, to identify possible failure modes prior to wide-scale deployment. We illustrate the application of AFISP by applying it to a patient deterioration model to detect significant subgroup performance disparities, and show that AFISP is significantly more scalable than existing algorithmic approaches.

Journal Article

From Precision to Personalized: Catalyzing AI-Enabled Innovation in Drug Development

by Langevin, Brooke; Dunn, Allison; Gobburu, Jogarao V. S. in Artificial intelligence, Artificial Intelligence - trends, Biomarkers

2026

In 2015, President Obama announced the Precision Medicine Initiative in his nationally televised State of the Union address. The vision was bold: to enable a new era of medicine through research, technology, and policies that empower patients, researchers, and providers to work together toward the development of individualized care. Now, more than 10 years later, how have the treatments available to patients changed?

Journal Article

Detecting dataset bias in medical AI using a generalized and modality agnostic auditing approach

by Pavlak, Mitchell; Harrigian, Keith; Zirikly, Ayah

2026

Despite many success stories along the path of Artificial Intelligence's (AI) rise in healthcare, there are comparably many reports of significant shortcomings and unexpected behavior of AI in deployment. A major risk is AI's reliance on association-based learning, where non-representative machine learning datasets can amplify latent bias during training and hide it during testing. To unlock new tools capable of detecting such AI bias issues, we present Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT). G-AUDIT is a data modality-agnostic dataset auditing approach that automatically quantifies shortcut learning risks by examining the interplay between task-level annotations, sensor-level measurements, and patient, environmental, and acquisition characteristics. We demonstrate the broad applicability of this method by analyzing a range of medical datasets across three distinct modalities (images, text, and tabular data) and machine learning tasks, successfully identifying potential shortcuts commonly overlooked by traditional qualitative methods.

Journal Article

A Unifying Causal Framework for Analyzing Dataset Shift-stable Learning Algorithms

by Chen, Bryant; Subbaswamy, Adarsh; Saria, Suchi in Computer simulation, Datasets, Deletion

2022

Recent interest in the external validity of prediction models (i.e., the problem of different train and test distributions, known as dataset shift) has produced many methods for finding predictive distributions that are invariant to dataset shifts and can be used for prediction in new, unseen environments. However, these methods consider different types of shifts and have been developed under disparate frameworks, making it difficult to theoretically analyze how solutions differ with respect to stability and accuracy. Taking a causal graphical view, we use a flexible graphical representation to express various types of dataset shifts. Given a known graph of the data generating process, we show that all invariant distributions correspond to a causal hierarchy of graphical operators which disable the edges in the graph that are responsible for the shifts. The hierarchy provides a common theoretical underpinning for understanding when and how stability to shifts can be achieved, and in what ways stable distributions can differ. We use it to establish conditions for minimax optimal performance across environments, and derive new algorithms that find optimal stable distributions. Using this new perspective, we empirically demonstrate that there is a tradeoff between minimax and average performance.

Paper

LLMs Judging LLMs: A Simplex Perspective

by Vossler, Patrick; Fan, Xia; Feng, Jean in Bayesian analysis, Epistemology, Free form

2026

Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for \(M\)-level scoring systems, both LLM judges and candidates can be represented as points on an \((M-1)\)-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the “folk wisdom” that LLM judges are more effective for two-level scoring (\(M=2\)) than multi-level scoring (\(M>2\)). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.

Paper

LLMs Judging LLMs: A Simplex Perspective

by Vossler, Patrick; Fan, Xia; Feng, Jean in Bayesian analysis, Epistemology, Free form

2025

Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for \(M\)-level scoring systems, both LLM judges and candidates can be represented as points on an \((M-1)\)-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the “folk wisdom” that LLM judges are more effective for two-level scoring (\(M=2\)) than multi-level scoring (\(M>2\)). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.

Paper

I-SPEC: An End-to-End Framework for Learning Transportable, Shift-Stable Models

by Subbaswamy, Adarsh; Saria, Suchi in Algorithms, Environment models, Supervised learning

2020

Shifts in environment between development and deployment cause classical supervised learning to produce models that fail to generalize well to new target distributions. Recently, many solutions which find invariant predictive distributions have been developed. Among these, graph-based approaches do not require data from the target environment and can capture more stable information than alternative methods which find stable feature sets. However, these approaches assume that the data generating process is known in the form of a full causal graph, which is generally not the case. In this paper, we propose I-SPEC, an end-to-end framework that addresses this shortcoming by using data to learn a partial ancestral graph (PAG). Using the PAG we develop an algorithm that determines an interventional distribution that is stable to the declared shifts; this subsumes existing approaches which find stable feature sets that are less accurate. We apply I-SPEC to a mortality prediction problem to show it can learn a model that is robust to shifts without needing upfront knowledge of the full causal DAG.

Paper
