Staff View: Interactive Benchmarks :: Library Catalog

Bibliographic Details
Main Authors:	Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2603.04737

_version_	1866911691466342400
author	Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi
author_facet	Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi
contents	Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_04737
institution	arXiv
publishDate	2026
record_format	arxiv
title	Interactive Benchmarks
topic	Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2603.04737

Similar Items