Bibliographic Details
Main Authors:	Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2603.04737

Table of Contents:

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.
