Asset Details
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Paper
by Yue, Xiang; Xue, Fuzhao; Shah, Mahir; Deng, Yuntian; Yang, You; Neubig, Graham; Ni, Jinjie; Jain, Kabir
in Benchmarks / Chatbots / Large language models / Queries
2024
Overview
Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks' advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions.
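The core step described above is matching web-mined user queries to similar queries from existing ground-truth benchmarks, so the mixed benchmark follows the real-world query distribution while keeping cheap, reproducible grading. The sketch below is only an illustration of that idea under assumptions, not the authors' released pipeline: the embedding model, the example data, and the similarity threshold are all placeholders.

```python
# Illustrative sketch (assumptions, not the MixEval authors' pipeline):
# match web-mined queries to semantically similar items drawn from
# off-the-shelf ground-truth benchmarks via embedding similarity.
from sentence_transformers import SentenceTransformer, util

# Hypothetical inputs: queries mined from the web, and a pool of
# benchmark questions tagged with their source benchmark.
web_queries = [
    "How do I compute the derivative of x^2 * sin(x)?",
    "What year did Apollo 11 land on the moon?",
]
benchmark_pool = [
    {"question": "Differentiate f(x) = x^2 sin(x).", "source": "MATH"},
    {"question": "In which year did Apollo 11 land on the Moon?", "source": "TriviaQA"},
    {"question": "Name the capital of Australia.", "source": "TriviaQA"},
]

# Any sentence-embedding model could stand in here; this choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

web_emb = model.encode(web_queries, convert_to_tensor=True, normalize_embeddings=True)
pool_emb = model.encode(
    [item["question"] for item in benchmark_pool],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# For each web query, keep the most similar benchmark item above a cutoff,
# so the resulting mixture mirrors the web query distribution while every
# retained item still has a ground-truth answer for impartial grading.
SIM_THRESHOLD = 0.6  # assumed cutoff, not taken from the paper
for query, scores in zip(web_queries, util.cos_sim(web_emb, pool_emb)):
    best_idx = int(scores.argmax())
    if float(scores[best_idx]) >= SIM_THRESHOLD:
        match = benchmark_pool[best_idx]
        print(f"{query!r} -> {match['question']!r} ({match['source']})")
```

In this framing, the web queries only steer which benchmark items are selected; grading still relies on the original benchmarks' ground-truth answers, which is what keeps the evaluation fast and reproducible.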
Publisher
Cornell University Library, arXiv.org
Subject
Benchmarks / Chatbots / Large language models / Queries