AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

by Lu, Keming; Cheng, Qinyuan; Lin, Junyang; Qiu, Xipeng; Huang, Fei; Huang, Xuanjing; Bowen, Yu; Peng, Runyu; Zhu, Qin

in Benchmarks / Controllability / Datasets / Large language models / Reasoning

2025

Paper

Overview

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
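The idea of program-based verification and verifier-driven rejection sampling described above can be illustrated with a toy sketch. This is not the paper's implementation: the seating puzzle, its three constraints, and the names `PEOPLE`, `verify`, and `rejection_sample` are all hypothetical, chosen only to show how an open-ended answer can be checked by a program rather than matched against a fixed option.

```python
from itertools import permutations

# Hypothetical toy puzzle: seat four people in a row (seats 0..3, left to right).
PEOPLE = ["A", "B", "C", "D"]


def verify(order):
    """Return True if `order` satisfies every constraint of the toy puzzle."""
    # Format check: the answer must use each person exactly once.
    if sorted(order) != sorted(PEOPLE):
        return False
    pos = {name: seat for seat, name in enumerate(order)}
    return (
        pos["A"] < pos["B"]                  # A sits somewhere left of B
        and abs(pos["C"] - pos["D"]) != 1    # C and D are not adjacent
        and pos["B"] != 3                    # B is not in the last seat
    )


def rejection_sample(candidates):
    """Keep only the candidate answers that the program verifier accepts."""
    return [c for c in candidates if verify(c)]


# Stand-in for model outputs: here we simply enumerate every arrangement.
all_orders = [list(p) for p in permutations(PEOPLE)]
valid = rejection_sample(all_orders)
print(len(valid), "of", len(all_orders), "arrangements pass the verifier")
```

Because the verifier accepts any arrangement that meets the constraints, several different answers can be correct, which is what distinguishes this open-ended setup from a multiple-choice question with a single keyed option. In a training-data setting, `candidates` would be sampled model responses rather than an exhaustive enumeration.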

Publisher

Cornell University Library, arXiv.org

Subject

Benchmarks / Controllability / Datasets / Large language models / Reasoning