Asset Details
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
by Mentzer, Kaleigh; Gaza, Bogdan; Deng, Alvin; DatologyAI; Teh, Darren; Burstein, Paul; Wang, Zhengping; Fang, Alex; Blakeney, Cody; Telanoff, Jason; Joshi, Siddharth; Monti, Ricardo; Abbas, Amro; Loftin, Scott; Pan, Fan; Adiga, Rishabh; Lee, Jason; Yin, Haoli; Leavitt, Matthew; Wills, Josh; Maini, Pratyush; Das, Spandan; Morcos, Ari; Larsen, Brett; Schwab, David; Mongstad, Haakon; Dorna, Vineeth; Merrick, Luke; Jiang, Tony; Urbanek, Jack; Doshi, Parth; Carranza, Aldo
in Computing costs / Datasets / Failure modes / Filtration / Multiple choice / Questions
2026
Paper
Overview
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
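The curation approach the abstract describes can be illustrated with a minimal sketch: converting a multiple-choice item into a generative one by dropping its answer options, and filtering out "blindly solvable" items that a text-only model answers correctly without seeing the image. All names here (the `Item` record, the toy blind model) are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of two curation steps described in the abstract, assuming
# a simple item record; this is NOT the authors' implementation.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list  # multiple-choice options, e.g. ["cat", "dog"]
    answer: str    # gold answer text
    image_id: str


def to_generative(item: Item) -> Item:
    # Drop the answer options so the model must produce the answer itself,
    # rather than guessing among the choices.
    return Item(item.question, [], item.answer, item.image_id)


def filter_blind_solvable(items, blind_model):
    # Keep only items a text-only ("blind") model gets wrong: those are the
    # ones that genuinely require the image.
    return [it for it in items if blind_model(it.question) != it.answer]


# Toy blind model: always answers "cat" without looking at any image.
blind = lambda q: "cat"

items = [
    Item("What animal is shown?", ["cat", "dog"], "cat", "img0"),  # blindly solvable
    Item("What animal is shown?", ["cat", "dog"], "dog", "img1"),  # needs the image
]

curated = [to_generative(it) for it in filter_blind_solvable(items, blind)]
print(len(curated), curated[0].choices)  # 1 []
```

In this toy run, the first item is discarded because the blind model already answers it correctly, and the surviving item loses its answer options, becoming an open-ended generative task.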
Publisher
Cornell University Library, arXiv.org