Bibliographic Details
Main Authors: Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.04737
Summary:
  • Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.