Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

by Kaplan, Frédéric; Karch, Tristan; Schwaller, Philippe; Engel, Luca

in Data acquisition / Datasets / Information sources / Large language models

2026

Paper

Overview

As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple-choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these two conditions serves as a proxy for the collection's information potential. We validate our approach using five strategically selected datasets: EPFL PhD manuscripts, a private collection of Venetian historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
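The scoring idea in the overview can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `MCQ` record, the `Answerer` callable signature, and the function names are all hypothetical, and the toy answerer stands in for a real LLM call. The score is the accuracy with the source passage available minus the accuracy without it.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class MCQ:
    """One generated multiple-choice question (hypothetical record layout)."""
    question: str
    options: Sequence[str]
    answer_index: int   # index of the correct option
    source_text: str    # passage the question was generated from


# Assumed interface: (question, options, optional context) -> chosen option index.
Answerer = Callable[[str, Sequence[str], Optional[str]], int]


def accuracy(questions: Sequence[MCQ], answer_fn: Answerer, with_source: bool) -> float:
    """Fraction of questions answered correctly, with or without the source passage."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        answer_fn(q.question, q.options, q.source_text if with_source else None)
        == q.answer_index
        for q in questions
    )
    return correct / len(questions)


def information_potential(questions: Sequence[MCQ], answer_fn: Answerer) -> float:
    """Accuracy gap between the with-source and without-source conditions."""
    return accuracy(questions, answer_fn, True) - accuracy(questions, answer_fn, False)


def toy_answerer(question: str, options: Sequence[str], context: Optional[str]) -> int:
    """Stand-in for an LLM: picks the first option found in the context, else option 0."""
    if context is not None:
        for i, option in enumerate(options):
            if option in context:
                return i
    return 0
```

Under this sketch, a gap near zero would suggest the model already knows what the collection contains, while a large positive gap would suggest the collection holds information the model lacks. In practice `toy_answerer` would be replaced by a function that prompts an actual model.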

Publisher

Cornell University Library, arXiv.org
