Asset Details
Autorubric: Unifying Rubric-based LLM Evaluation
by Callison-Burch, Chris; Rao, Delip
in Benchmarks / Bias / Calibration / Chatbots / Checkpointing / Correlation coefficients / Criteria / Large language models / Natural language processing / Psychometrics / Quality assessment
2026
Paper
Overview
Techniques for reliable rubric-based LLM evaluation (ensemble judging, bias mitigation, few-shot calibration) are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground-truth labels (87% binary accuracy, moderate-to-substantial κ). Beyond measurement, per-criterion scores and explanations serve as optimization signals. We demonstrate how Autorubric's rubric-evaluation explanations raise a peer-review agent's score from 0.47 to 0.85 (above the 0.82 expert-curated baseline), and how its scores serve as RL rewards to produce a statistically significant improvement on AdvancedIF (+0.039, Wilcoxon p = 0.032) with positive transfer to IFEval. In all of these cases, Autorubric let us rapidly operationalize rubric design choices and best practices with minimal effort.
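The abstract mentions ensemble judging and psychometric reliability metrics such as Cohen's κ for cross-judge agreement. The framework's actual API is not shown here; as a minimal illustrative sketch (all function names are hypothetical, not Autorubric's), majority-vote ensembling over per-criterion judge verdicts and Cohen's κ between two judges can be written as:

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: chance-corrected agreement between two judges
    who labeled the same items (e.g. binary rubric criteria)."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    # Observed agreement: fraction of items where the judges match.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement under independence, from each judge's marginals.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if p_e == 1.0:  # both judges always emit the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def majority_vote(verdicts):
    """Ensemble judging: the most common verdict across judges wins."""
    return Counter(verdicts).most_common(1)[0][0]
```

Values of κ around 0.4-0.6 are conventionally read as "moderate" and 0.6-0.8 as "substantial" agreement, which is the vocabulary the abstract uses for CHARM-100.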
Publisher
Cornell University Library, arXiv.org