Bibliographic Details
Main Authors: Tang, Xiangru, Xu, Wanghan, Wang, Yujie, Guo, Zijie, Shao, Daniel, Chen, Jiapeng, Zhang, Cixuan, Wang, Ziyi, Zhang, Lixin, Wan, Guancheng, Zhang, Wenlong, Bai, Lei, Yin, Zhenfei, Torr, Philip, Wang, Hanrui, Jin, Di
Format: Preprint
Published: 2025
Subjects: Computation and Language, Artificial Intelligence
Online Access: https://arxiv.org/abs/2509.21193
author Tang, Xiangru
Xu, Wanghan
Wang, Yujie
Guo, Zijie
Shao, Daniel
Chen, Jiapeng
Zhang, Cixuan
Wang, Ziyi
Zhang, Lixin
Wan, Guancheng
Zhang, Wenlong
Bai, Lei
Yin, Zhenfei
Torr, Philip
Wang, Hanrui
Jin, Di
contents Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy, the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
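note The abstract describes HSR as designating each candidate in turn as an anchor to be repaired by its peers, and QAIR as adapting refinement to solution quality. The Python sketch below illustrates only that control flow on toy data. It is not the authors' implementation: the Candidate type, the repair and evaluate callables, the threshold, and the round limit are all hypothetical stand-ins, and the real system would use LLM calls where this sketch merges word tokens.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    score: float  # hypothetical quality score in [0, 1]


def refine_round(candidates: List[Candidate],
                 repair: Callable[[str, List[str]], str],
                 evaluate: Callable[[str], float],
                 threshold: float) -> List[Candidate]:
    """One pass in which each low-scoring candidate is the anchor once."""
    refined = []
    for i, anchor in enumerate(candidates):
        # Quality-aware step (assumed): leave good anchors untouched.
        if anchor.score >= threshold:
            refined.append(anchor)
            continue
        # Every other candidate acts as a peer that may repair the anchor.
        peers = candidates[:i] + candidates[i + 1:]
        new_text = repair(anchor.text, [p.text for p in peers])
        refined.append(Candidate(new_text, evaluate(new_text)))
    return refined


def refine(candidates: List[Candidate],
           repair: Callable[[str, List[str]], str],
           evaluate: Callable[[str], float],
           threshold: float = 0.8,
           max_rounds: int = 3) -> Candidate:
    """Repeat rounds until all candidates pass or the round limit is hit."""
    for _ in range(max_rounds):
        if all(c.score >= threshold for c in candidates):
            break
        candidates = refine_round(candidates, repair, evaluate, threshold)
    return max(candidates, key=lambda c: c.score)


# Toy stand-ins: a "solution" is a set of facts written as words.
REQUIRED = {"a", "b", "c"}


def toy_evaluate(text: str) -> float:
    # Fraction of required facts present in the solution.
    return len(REQUIRED & set(text.split())) / len(REQUIRED)


def toy_repair(anchor: str, peers: List[str]) -> str:
    # Append any fact that a peer has and the anchor lacks.
    tokens = anchor.split()
    for peer in peers:
        for tok in peer.split():
            if tok not in tokens:
                tokens.append(tok)
    return " ".join(tokens)


start = [Candidate("a", toy_evaluate("a")),
         Candidate("b c", toy_evaluate("b c"))]
best = refine(start, toy_repair, toy_evaluate)
print(best)
```

In this toy run neither starting candidate passes the assumed threshold, so both are repaired in the first round and the loop stops once every candidate scores at or above it. Selecting a single best candidate at the end, rather than averaging, mirrors the abstract's point about not diluting strong solutions.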
format Preprint
id arxiv_https___arxiv_org_abs_2509_21193
institution arXiv
publishDate 2025
record_format arxiv
title Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2509.21193