| Main Authors: | Zheng, Zihan, Cui, Tianle, Wang, Taoran, Wang, Fengtao, Pan, Jiahui, He, Lewei, Chen, Qianglong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.01330 |
Similar Items
Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles
by: He, Lewei, et al.
Published: (2024)
GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
by: Wang, Shaokang, et al.
Published: (2026)
Verifiable Benchmarking of Long-Horizon Spatial Biology
by: Diks, Ian, et al.
Published: (2026)
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
by: Wu, Penghao, et al.
Published: (2025)
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
by: Lin, Jingjie, et al.
Published: (2026)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
by: Li, Ruiying, et al.
Published: (2026)
CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks
by: Hao, Ce, et al.
Published: (2025)
GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks
by: Chen, Cong, et al.
Published: (2025)
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent
by: Zhao, Kangjia, et al.
Published: (2024)
ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents
by: Guo, Shuhan, et al.
Published: (2026)
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
by: Sun, Liangtai, et al.
Published: (2022)
EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents
by: Mo, Ying, et al.
Published: (2026)
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
by: Deng, Zehao, et al.
Published: (2025)
UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
by: Luo, Haotian, et al.
Published: (2025)
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
by: Wang, Tianle, et al.
Published: (2026)
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
by: Li, Junlong, et al.
Published: (2025)
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
by: Lu, Hao, et al.
Published: (2025)
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
by: Chen, Sen, et al.
Published: (2025)
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
by: Tian, Shizuo, et al.
Published: (2025)
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
by: Wang, Tianyi, et al.
Published: (2026)
Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
by: Zhou, Bowen, et al.
Published: (2026)
When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
by: Zhu, Zilin, et al.
Published: (2026)
Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning
by: Lei, Mingcong, et al.
Published: (2025)
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
by: Zhai, Yuwen, et al.
Published: (2026)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
by: Lin, Kevin Qinghong, et al.
Published: (2024)
HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents
by: Zhang, Ningning, et al.
Published: (2026)
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
by: Chen, Dongping, et al.
Published: (2024)
GUI-PRA: Process Reward Agent for GUI Tasks
by: Xiong, Tao, et al.
Published: (2025)
GUI Knowledge Bench: Revealing the Knowledge Gap of VLMs in GUI Tasks
by: Shi, Chenrui, et al.
Published: (2025)
Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
by: Zhang, Danyang, et al.
Published: (2023)
AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
by: Jiang, Tanqiu, et al.
Published: (2026)
CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
by: Nong, Songqin, et al.
Published: (2025)
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
by: Yang, Rui, et al.
Published: (2026)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)
GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training
by: Cao, Yuan, et al.
Published: (2026)
HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model
by: Hu, Mengkang, et al.
Published: (2024)