Catalogue Search | MBRL

Mitigating the impact of biased artificial intelligence in emergency decision-making

by Adam, Hammaad , Ghassemi, Marzyeh , Alsentzer, Emily in 692/700 , 692/700/228 , Artificial intelligence

2022

Background: Prior research has shown that artificial intelligence (AI) systems often encode biases against minority subgroups. However, little work has focused on ways to mitigate the harm discriminatory algorithms can cause in high-stakes settings such as medicine.

Methods: In this study, we experimentally evaluated the impact biased AI recommendations have on emergency decisions, where participants respond to mental health crises by calling for either medical or police assistance. We recruited 438 clinicians and 516 non-experts to participate in our web-based experiment. We evaluated participant decision-making with and without advice from biased and unbiased AI systems. We also varied the style of the AI advice, framing it either as prescriptive recommendations or descriptive flags.

Results: Participant decisions are unbiased without AI advice. However, both clinicians and non-experts are influenced by prescriptive recommendations from a biased algorithm, choosing police help more often in emergencies involving African-American or Muslim men. Crucially, using descriptive flags rather than prescriptive recommendations allows respondents to retain their original, unbiased decision-making.

Conclusions: Our work demonstrates the practical danger of using biased models in health contexts, and suggests that appropriately framing decision support can mitigate the effects of AI bias. These findings must be carefully considered in the many real-world clinical scenarios where inaccurate or biased models may be used to inform important decisions.

Plain language summary: Artificial intelligence (AI) systems that make decisions based on historical data are increasingly common in health care settings. However, many AI models exhibit problematic biases, as data often reflect human prejudices against minority groups. In this study, we used a web-based experiment to evaluate the impact biased models can have when used to inform human decisions. We found that though participants were not inherently biased, they were strongly influenced by advice from a biased model if it was offered prescriptively (i.e., “you should do X”). This adherence led their decisions to be biased against African-American and Muslim individuals. However, framing the same advice descriptively (i.e., without recommending a specific action) allowed participants to remain fair. These results demonstrate that though discriminatory AI can lead to poor outcomes for minority groups, appropriately framing advice can help mitigate its effects.

Adam et al. evaluate the impact of biased AI recommendations on emergency decisions made by respondents to mental health crises. They find that descriptive rather than prescriptive recommendations made by the AI decision support system are more likely to lead to unbiased decision-making.

Journal Article

Simulation of undiagnosed patients with novel genetic conditions

by Kohane, Isaac S. , Alsentzer, Emily , Kobren, Shilpa N. in 631/114/2401 , 631/114/2785 , 631/208/2489

2023

Rare Mendelian disorders pose a major diagnostic challenge and collectively affect 300–400 million patients worldwide. Many automated tools aim to uncover causal genes in patients with suspected genetic disorders, but evaluation of these tools is limited due to the lack of comprehensive benchmark datasets that include previously unpublished conditions. Here, we present a computational pipeline that simulates realistic clinical datasets to address this deficit. Our framework jointly simulates complex phenotypes and challenging candidate genes and produces patients with novel genetic conditions. We demonstrate the similarity of our simulated patients to real patients from the Undiagnosed Diseases Network and evaluate common gene prioritization methods on the simulated cohort. These prioritization methods recover known gene-disease associations but perform poorly on diagnosing patients with novel genetic disorders. Our publicly available dataset and codebase can be utilized by medical genetics researchers to evaluate, compare, and improve tools that aid in the diagnostic process. Rare Mendelian disorders pose a major diagnostic challenge, but evaluation of automated tools that aim to uncover causal genes is limited. Here, the authors present a computational pipeline that simulates realistic clinical datasets to address this deficit.

Journal Article

Zero-shot interpretable phenotyping of postpartum hemorrhage using large language models

by Rasmussen, Matthew J , Gray, Kathryn J , Cull, Alexis L in Artificial intelligence , Digital technology , Genotype & phenotype

2023

Many areas of medicine would benefit from deeper, more accurate phenotyping, but there are limited approaches for phenotyping using clinical notes without substantial annotated data. Large language models (LLMs) have demonstrated immense potential to adapt to novel tasks with no additional training by specifying task-specific instructions. Here we report the performance of a publicly available LLM, Flan-T5, in phenotyping patients with postpartum hemorrhage (PPH) using discharge notes from electronic health records (n = 271,081). The language model achieves strong performance in extracting 24 granular concepts associated with PPH. Identifying these granular concepts accurately allows the development of interpretable, complex phenotypes and subtypes. The Flan-T5 model achieves high fidelity in phenotyping PPH (positive predictive value of 0.95), identifying 47% more patients with this complication compared to the current standard of using claims codes. This LLM pipeline can be used reliably for subtyping PPH and outperforms a claims-based approach on the three most common PPH subtypes associated with uterine atony, abnormal placentation, and obstetric trauma. The advantage of this approach to subtyping is its interpretability, as each concept contributing to the subtype determination can be evaluated. Moreover, as definitions may change over time due to new guidelines, using granular concepts to create complex phenotypes enables prompt and efficient updating of the algorithm. Using this language modelling approach enables rapid phenotyping without the need for any manually annotated training data across multiple clinical use cases.

Journal Article

TIMER: temporal instruction modeling and evaluation for longitudinal clinical records

by Cui, Hejie , Fries, Jason Alan , Shah, Nigam H. in 639/705/117 , 692/700 , Biomedicine

2025

Electronic health records (EHRs) contain rich longitudinal information for clinical decision-making, yet LLMs struggle to reason across patient timelines. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a method to improve LLMs’ temporal reasoning over multi-visit EHRs through time-aware instruction tuning. TIMER grounds LLMs in patient-specific temporal contexts by linking each instruction-response pair to specific timestamps, ensuring temporal fidelity throughout the training process. Evaluations show that TIMER-tuned models outperform conventional medical instruction-tuned approaches by 6.6% in completeness on clinician-curated benchmarks, with distribution-matched training demonstrating advantages up to 6.5% in temporal reasoning. Qualitative analyses reveal that using TIMER enhances temporal boundary adherence, trend detection, and chronological precision, necessary for applications such as disease trajectory modeling and treatment response monitoring. Overall, TIMER provides a methodological basis for developing LLMs that can effectively engage with the inherently longitudinal nature of data for patient care. Code is available at TIMER.

Journal Article

Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases

by Noori, Ayush , Kohane, Isaac S. , Alsentzer, Emily in 631/114/1305 , 639/705/117 , 692/308

2025

There are over 7000 rare diseases, some affecting 3500 or fewer patients in the United States. Due to clinicians’ limited experience with such diseases and the heterogeneity of clinical presentations, ~70% of individuals seeking a diagnosis remain undiagnosed. Deep learning has demonstrated success in aiding the diagnosis of common diseases. However, existing approaches require labeled datasets with thousands of diagnosed patients per disease. We present SHEPHERD, a few-shot learning approach for multi-faceted rare disease diagnosis. SHEPHERD performs deep learning over a knowledge graph enriched with rare disease information and is trained on a dataset of simulated rare disease patients. We demonstrate SHEPHERD's effectiveness across diverse diagnostic tasks, performing causal gene discovery, retrieving “patients-like-me”, and characterizing novel disease presentations, using real-world cohorts from the Undiagnosed Diseases Network (N = 465), MyGene2 (N = 146), and the Deciphering Developmental Disorders study (N = 1431). SHEPHERD demonstrates the potential of knowledge-grounded deep learning to accelerate rare disease diagnosis.

Journal Article

Synthetic data distillation enables the extraction of clinical information at scale

by Beaulieu-Jones, Brett K. , Alsentzer, Emily , Woo, Elizabeth Geena in 631/114/1305 , 692/308/2779/109 , Ablation

2025

Large-language models (LLMs) show promise for clinical note information extraction, but deployment challenges include high computational costs and privacy concerns. We used synthetic data distillation to fine-tune smaller, open-source LLMs to achieve performance comparable to larger models while enabling local hardware deployment or reduced cloud costs. Using Llama-3.1-70B-Instruct, we generated synthetic question-answer training pairs to fine-tune smaller Llama models. We evaluated performance across three tasks: synthetic clinical trial criteria, the i2b2 2018 Clinical Trial Eligibility Challenge, and apixaban trial criteria questions. The 8B-parameter model achieved high accuracy across all tasks and sometimes outperformed the 70B-Instruct teacher model. Fine-tuning with only the most challenging questions still improved performance, demonstrating the value of targeted training. Results from 3B- and 1B-parameter models showed a clear size-performance tradeoff. This work demonstrates synthetic data distillation’s potential for enabling scalable clinical information extraction.

Journal Article

Assessing 3 Outbreak Detection Algorithms in an Electronic Syndromic Surveillance System in a Resource-Limited Setting

by Quispe, Jose , Loayza, Luis , Alsentzer, Emily in acute diarrheal disease , Algorithms

2020

We evaluated the performance of X-bar chart, exponentially weighted moving average, and C3 cumulative sums aberration detection algorithms for acute diarrheal disease syndromic surveillance at naval sites in Peru during 2007-2011. The 3 algorithms' detection sensitivity was 100%, specificity was 97%-99%, and positive predictive value was 27%-46%.

Journal Article

Understanding contraceptive switching rationales from real world clinical notes using large language models

by Miao, Brenda Y. , Chinedu-Eneh, Ebenezer , Zack, Travis in 692/308 , 692/499 , 692/700/478

2025

Understanding reasons for treatment switching is of significant medical interest, but these factors are often only found in unstructured clinical notes and can be difficult to extract. We evaluated the zero-shot abilities of GPT-4 and eight other open-source large language models (LLMs) to extract contraceptive switching information from 1964 clinical notes derived from the UCSF Information Commons dataset. GPT-4 extracted the contraceptives started and stopped at each switch with microF1 scores of 0.85 and 0.88, respectively, compared to 0.81 and 0.88 for the best open-source model. When evaluated by clinical experts, GPT-4 extracted reasons for switching with an accuracy of 91.4% (2.2% hallucination rate). Transformer-based topic modeling identified patient preference, adverse events, and insurance coverage as key reasons. These findings demonstrate the value of LLMs in identifying complex treatment factors and provide insights into reasons for contraceptive switching in real-world settings.

Journal Article

The effect of microbial colonization on the host proteome varies by gastrointestinal location

by Elias, Joshua E , Jaffe, Mia , Alsentzer, Emily in 631/326 , 631/337/475 , 692/698/2741/2135

2016

Endogenous intestinal microbiota have wide-ranging and largely uncharacterized effects on host physiology. Here, we used reverse-phase liquid chromatography-coupled tandem mass spectrometry to define the mouse intestinal proteome in the stomach, jejunum, ileum, cecum and proximal colon under three colonization states: germ-free (GF), monocolonized with Bacteroides thetaiotaomicron and conventionally raised (CR). Our analysis revealed distinct proteomic abundance profiles along the gastrointestinal (GI) tract. Unsupervised clustering showed that host protein abundance primarily depended on GI location rather than colonization state and specific proteins and functions that defined these locations were identified by random forest classifications. K-means clustering of protein abundance across locations revealed substantial differences in host protein production between CR mice relative to GF and monocolonized mice. Finally, comparison with fecal proteomic data sets suggested that the identities of stool proteins are not biased to any region of the GI tract, but are substantially impacted by the microbiota in the distal colon.

Journal Article

To do no harm — and the most good — with AI in health care

by Hoffman, Sara M. , Manrai, Arjun Kumar , Brennan, Patricia Flatley in 692/700 , 706/648/453 , 706/703/559

2024

Drawing from real-life scenarios and insights shared at the RAISE (Responsible AI for Social and Ethical Healthcare) conference, we highlight the critical need for AI in health care (AIH) to primarily benefit patients and address current shortcomings in health care systems such as medical errors and access disparities.

Journal Article
