Catalogue Search | MBRL

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

by Zhang, Sean , Subbiah, Melanie , Chilton, Lydia B. in Collaboration , Errors , Language

2024

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

Journal Article

Cord blood T cell subpopulations and associations with maternal cadmium and arsenic exposures

by Li, Zhigang , Sampath, Vanitha , Jackson, Brian in Adult , Analysis , Arsenic

2017

Arsenic and cadmium are environmental pollutants, and although the evidence for adverse immune effects after prenatal arsenic and cadmium exposures is increasing, little is known about the underlying immunological mechanisms. We investigated the relationship between prenatal arsenic and cadmium exposures and a variety of T cell subpopulations measured in cord blood for 63 participants in the New Hampshire Birth Cohort Study. Post-partum toenail concentrations of arsenic and cadmium were used as an estimate of maternal exposure during pregnancy. The characteristics of cord blood proportions of T lymphocytes and subpopulations (expression of markers for Th1, Th2, Th17, Th1Th17, induced and natural regulatory T cells and NKTs) are presented. In regression analyses, maternal arsenic exposure levels were inversely associated with cord blood T helper memory cells (-21%, 95% CI: -36%, -3%) and the association was found to be stronger in females. They were also inversely associated with activated T helper memory cells, particularly in males (-26%, 95% CI: -43%, -3%). Similarly, inverse associations were observed between cadmium exposure levels and activated T helper memory cells (-16%, 95% CI: -30%, -1%) and also for T helper memory cells in females (-20%, 95% CI: -35%, -3%). The results suggest that prenatal exposures to relatively low levels of arsenic and cadmium may contribute to altered distribution of T cell populations at birth. These changes could, in theory, have contributed to the previously reported immunosuppressive effects observed later in infancy/childhood.

Journal Article

Computational Representations of Character Significance in Novels

by Mian, Haaris , Subbiah, Melanie , Shaalan, Nora in Graph representations , Graphical representations , Structural models

2026

Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch's classic "the one vs the many" theory of character centrality and the gendered dynamics of character discussion.

Paper

Counterfactual Simulatability of LLM Explanations for Generation Tasks

by Chen, Yanda , Subbiah, Melanie , McKeown, Kathleen

2025

LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

Paper

Unsupervised Selective Rationalization with Noise Injection

by Subbiah, Melanie , McKeown, Kathleen , Storek, Adam in Benchmarks , Noise prediction

2023

A major issue with using deep learning models in sensitive applications is that they provide no explanation for their output. To address this problem, unsupervised selective rationalization produces rationales alongside predictions by chaining two jointly-trained components, a rationale generator and a predictor. Although this architecture guarantees that the prediction relies solely on the rationale, it does not ensure that the rationale contains a plausible explanation for the prediction. We introduce a novel training technique that effectively limits generation of implausible rationales by injecting noise between the generator and the predictor. Furthermore, we propose a new benchmark for evaluating unsupervised selective rationalization models using movie reviews from existing datasets. We achieve sizeable improvements in rationale plausibility and task accuracy over the state-of-the-art across a variety of tasks, including our new benchmark, while maintaining or improving model faithfulness.

Paper

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

by Subbiah, Melanie , Mian, Haaris , McAdams, Dan P in Human bias , Labels , Large language models

2026

Increasingly, studies are exploring using Large Language Models (LLMs) for accelerated or scaled qualitative analysis of text data. While we can compare LLM accuracy against human labels directly for deductive coding, or labeling text, it is more challenging to judge the ethics and effectiveness of using LLMs in abstractive methods such as inductive thematic analysis. We collaborate with psychologists to study the abstractive claims LLMs make about human life stories, asking, how does using an LLM as an interpreter of meaning affect the conclusions and perspectives of a study? We propose a summarization-based pipeline for surfacing biases in perspective-taking an LLM might employ in interpreting these life stories. We demonstrate that our pipeline can identify both race and gender bias with the potential for representational harm. Finally, we encourage the use of this analysis in future studies involving LLM-based interpretation of study participants' written text or transcribed speech to characterize a positionality portrait for the study.

Paper

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

by Zhang, Sean , Subbiah, Melanie , Chilton, Lydia B in Large language models , Qualitative analysis

2024

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

Paper

AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

by Subbiah, Melanie , McKeown, Kathleen , Varimalla, Nikhil Reddy in Bias , Data augmentation , Large language models

2025

Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model's robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization (specifically, name-nationality bias and political framing bias) without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.

Paper

Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding

by Tang, Liyan , Subbiah, Melanie , Kim, Grace in Ambiguity , Subjectivity

2025

Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.

Paper

Guiding LLM Decision-Making with Fairness Reward Models

by Subbiah, Melanie , Zemel, Richard , McKeown, Kathleen in Accuracy , Decision making , Large language models

2025

Large language models are increasingly used to support high-stakes decisions, potentially influencing who is granted bail or receives a loan. Naive chain-of-thought sampling can improve average decision accuracy, but has also been shown to amplify unfair bias. To address this challenge and enable the trustworthy use of reasoning models in high-stakes decision-making, we propose a framework for training a generalizable Fairness Reward Model (FRM). Our model assigns a fairness score to LLM reasoning, enabling the system to down-weight biased trajectories and favor equitable ones when aggregating decisions across reasoning chains. We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. Applied to real-world decision-making tasks including recidivism prediction and social media moderation, we show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.

Paper
