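| Main Authors: | Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi |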
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence; Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2603.04737 |
| _version_ | 1866911691466342400 |
|---|---|
| author | Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi |
| author_facet | Yue, Baoqing; Zhu, Zihan; Han, Yutong; Fan, Brian; Sun, Qian; Feng, Jichen; Yang, Hufei; Zhang, Yifan; Wang, Mengdi |
| contents | Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_04737 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Interactive Benchmarks |
| topic | Artificial Intelligence; Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2603.04737 |