Bibliographic Details
Main Authors: Yue, Baoqing, Zhu, Zihan, Han, Yutong, Fan, Brian, Sun, Qian, Feng, Jichen, Yang, Hufei, Zhang, Yifan, Wang, Mengdi
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence, Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2603.04737
_version_ 1866911691466342400
author Yue, Baoqing
Zhu, Zihan
Han, Yutong
Fan, Brian
Sun, Qian
Feng, Jichen
Yang, Hufei
Zhang, Yifan
Wang, Mengdi
author_facet Yue, Baoqing
Zhu, Zihan
Han, Yutong
Fan, Brian
Sun, Qian
Feng, Jichen
Yang, Hufei
Zhang, Yifan
Wang, Mengdi
contents Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.
format Preprint
id arxiv_https___arxiv_org_abs_2603_04737
institution arXiv
publishDate 2026
record_format arxiv
title Interactive Benchmarks
topic Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2603.04737