Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
by Mihaylov, Todor; Singh, Aaditya K.; Goswami, Vedanuj; Chatterji, Niladri S.; Bhargava, Prajjwal; Hupkes, Dieuwke; Koyejo, Sanmi; Schaeffer, Rylan; Tang, Binh; Subramanian, Ranjan; Narang, Sharan; Koura, Punit Singh; Madaan, Lovish; Edunov, Sergey
in Annotations / Benchmarks / Conversational artificial intelligence / Natural language processing / User satisfaction
2025
Paper
Overview
The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming, and noisy human evaluations, yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluation topics, including adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction, pointing to how NLP benchmarks can be leveraged to meet the evaluation needs of our new era of conversational AI.
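The two quantitative ideas in the abstract, correlating benchmark scores with human evaluations and fitting linear regressions to predict one from the other, can be illustrated with a minimal sketch. The paper's own setup uses overparameterized regressions over 160 benchmarks and four model scales; the single-predictor version below uses made-up, purely illustrative numbers, not figures from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical scores for four model scales (illustrative values only):
benchmark_acc = [0.45, 0.55, 0.62, 0.70]   # accuracy on one NLP benchmark
human_winrate = [0.30, 0.42, 0.50, 0.61]   # human preference win rate

r = pearson_r(benchmark_acc, human_winrate)       # close to +1: strong correlation
slope, intercept = fit_line(benchmark_acc, human_winrate)
predicted = slope * 0.75 + intercept              # predict for an unseen benchmark score
```

A negative `r` would correspond to the anticorrelated topics the abstract mentions (e.g., adversarial dishonesty and safety), and an `r` near zero to the uncorrelated ones.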
Publisher
Cornell University Library, arXiv.org