Catalogue Search | MBRL

Understanding the Limitations of Using Large Language Models for Text Generation

by Ippolito, Daphne in Artificial intelligence , Computer science , Linguistics

2023

State-of-the-art neural language models are capable of generating incredibly fluent English text. This success provides opportunities for novel forms of interaction, where human writers work collaboratively with a natural-language generation system toward a set of goals. However, it also poses several challenges. Evaluating and comparing the skill of different open-ended text generation systems is challenging, and generated text can have negative societal impact if it proliferates and is not detectable by humans. In this dissertation, I introduce a detection-based evaluation task that can be used to compare different language models and generative configuations. By both asking humans to complete this task and training automatic classifier to complete it, I investigate how the tradeoff between generating high-quality and generating diverse text impacts detectability. Through subsequent large-scale user studies, I show that factors such as the model size and the topic of the generation can have significant influence on human detection capabability. I show how large neural language models’ capability of memorizing large swaths of their training data complicates our ability to evaluate their skill at generating high-quality novel text. I also show how, despite these challenges, neural language models can be successfully employed to support creative writing tasks. In particular, I describe methods for performing style transfer into any user-provided style and for efficiently supporting fill-in-the-blank operations in addition to the more standard continuation operation. Finally, I introduce an interactive writing tool we built which allows creative writers to collaborate with a natural language generation system to craft stories. Users studies with both novice and professional writers provide insights into the strengths and limitations of applying natural language generation systems in real-world settings.

Dissertation

Share this book

Add to My Shelf

Generative AI and science communication in the physical sciences

by Biyela, Sibusiso , Schäfer, Mike S , Yokoyama, Hiromi M in African languages , Audiences , Chatbots

2024

Advances in generative AI could democratize science communication, by providing scientists with easy-to-use tools to help them communicate their work to different audiences. However, these tools are imperfect, and their output must be checked by experts. They can also be used maliciously to produce misinformation and disinformation. Seven researchers and science communicators weigh up the potential benefits of generative AI for science communication against its risks.Seven researchers and science communicators weigh up the potential benefits of generative AI for science communication against its risks.

Journal Article

Share this book

Add to My Shelf

Chasing Random: Instruction Selection Strategies Fail to Generalize

by Diddee, Harshita , Ippolito, Daphne in Data analysis , Datasets , Performance evaluation

2024

Prior work has shown that language models can be tuned to follow user instructions using only a small set of high-quality instructions. This has accelerated the development of methods that filter a large, noisy instruction-tuning datasets down to high-quality subset which works just as well. However, typically, the performance of these methods is not demonstrated across a uniform experimental setup and thus their generalization capabilities are not well established. In this work, we analyze popular selection strategies across different source datasets, selection budgets and evaluation benchmarks: Our results indicate that selection strategies generalize poorly, often failing to consistently outperform even random baselines. We also analyze the cost-performance trade-offs of using data selection. Our findings reveal that data selection can often exceed the cost of fine-tuning on the full dataset, yielding only marginal and sometimes no gains compared to tuning on the full dataset or a random subset.

Paper

Share this book

Add to My Shelf

Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?

by Sridhar, Shreya , Roytburg, Dani , Ippolito, Daphne in Legibility , Multiagent systems , Reasoning

2026

Language models are increasingly being trained to \"reason\" before answering users' queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models' ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM's reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM's ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.

Paper

Share this book

Add to My Shelf

Customizing Large Language Model Generation Style using Parameter-Efficient Finetuning

by Liu, Xinyue , Diddee, Harshita , Ippolito, Daphne in Customization , Large language models , Parameters

2024

One-size-fits-all large language models (LLMs) are increasingly being used to help people with their writing. However, the style these models are trained to write in may not suit all users or use cases. LLMs would be more useful as writing assistants if their idiolect could be customized to match each user. In this paper, we explore whether parameter-efficient finetuning (PEFT) with Low-Rank Adaptation can effectively guide the style of LLM generations. We use this method to customize LLaMA-2 to ten different authors and show that the generated text has lexical, syntactic, and surface alignment with the target author but struggles with content memorization. Our findings highlight the potential of PEFT to support efficient, user-level customization of LLMs.

Paper

Share this book

Add to My Shelf

BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity

by Yauney, Gregory , Swayamdipta, Swabha , Diddee, Harshita in Benchmarks

2026

Do language model benchmarks actually measure what practitioners intend them to ? High-level metadata is too coarse to convey the granular reality of benchmarks: a \"poetry\" benchmark may never test for haikus, while \"instruction-following\" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser, thus, helps quantify a critical gap between practitioner intent and what benchmarks actually test.

Paper

Share this book

Add to My Shelf

No Single Best Model for Diversity: Learning a Router for Sample Diversity

by Padmakumar, Vishakh , Xu, Fangyuan , Choi, Eunsol

2026

When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs $23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.

Paper

Share this book

Add to My Shelf

Effective Prompt Extraction from Language Models

by Zhang, Yiming , Carlini, Nicholas , Ippolito, Daphne in Large language models , Queries

2024

The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown adversarial users employing prompt extraction attacks to recover these prompts. In this paper, we present a framework for systematically measuring the effectiveness of these attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination. Prompt extraction from real systems such as Claude 3 and ChatGPT further suggest that system prompts can be revealed by an adversary despite existing defenses in place.

Paper

Share this book

Add to My Shelf

FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition

by Mongkolsupawan, Janny , Goldstein, Tom , Kirchenbauer, John in Datasets , Documents , Knowledge acquisition

2026

When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be useful for studying different forms of memorization. We also document some challenges in effectively building realistic, fictional synthetic data.

Paper

Share this book

Add to My Shelf

On Code-Induced Reasoning in LLMs

by Rosé, Carolyn , Wu, Zhen , Waheed, Abdul in Large language models , Natural language , Natural language processing

2025

Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.

Paper

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter