Asset Details
Autorubric: Unifying Rubric-based LLM Evaluation
by Callison-Burch, Chris; Rao, Delip
in Benchmarks / Bias / Calibration / Chatbots / Checkpointing / Correlation coefficients / Criteria / Large language models / Natural language processing / Psychometrics / Quality assessment
2026
Paper
Overview
Techniques for reliable rubric-based LLM evaluation (ensemble judging, bias mitigation, few-shot calibration) are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground-truth labels (87% binary accuracy, moderate-to-substantial κ). Beyond measurement, per-criterion scores and explanations serve as optimization signals. We demonstrate how Autorubric's rubric-evaluation explanations raise a peer-review agent's score from 0.47 to 0.85 (above the 0.82 expert-curated baseline), and how its scores serve as RL rewards to produce a statistically significant improvement on AdvancedIF (+0.039, Wilcoxon p = 0.032) with positive transfer to IFEval. In all of these cases, Autorubric let us rapidly operationalize rubric design choices and best practices with minimal effort.
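The abstract mentions ensemble judging and psychometric reliability metrics such as Cohen's κ for cross-judge agreement. The framework's actual API is not shown here; as a minimal illustrative sketch (all function names are hypothetical, not Autorubric's), majority-vote ensembling over per-criterion judge verdicts and Cohen's κ between two judges can be written as:

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: chance-corrected agreement between two judges
    who labeled the same items (e.g. binary rubric criteria)."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    # Observed agreement: fraction of items where the judges match.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement under independence, from each judge's marginals.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if p_e == 1.0:  # both judges always emit the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def majority_vote(verdicts):
    """Ensemble judging: the most common verdict across judges wins."""
    return Counter(verdicts).most_common(1)[0][0]
```

Values of κ around 0.4-0.6 are conventionally read as "moderate" and 0.6-0.8 as "substantial" agreement, which is the vocabulary the abstract uses for CHARM-100.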
Publisher
Cornell University Library, arXiv.org