Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Yan, Dawei; Zhang, Haokui; Huzhang, Guangda; Li, Yang; Wang, Yibo; Chen, Qing-Guo; Xu, Zhao; Luo, Weihua; Li, Ying; Dong, Wei; Shen, Chunhua
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.00503

_version_	1866908857137102848
author	Yan, Dawei; Zhang, Haokui; Huzhang, Guangda; Li, Yang; Wang, Yibo; Chen, Qing-Guo; Xu, Zhao; Luo, Weihua; Li, Ying; Dong, Wei; Shen, Chunhua
author_facet	Yan, Dawei; Zhang, Haokui; Huzhang, Guangda; Li, Yang; Wang, Yibo; Chen, Qing-Guo; Xu, Zhao; Luo, Weihua; Li, Ying; Dong, Wei; Shen, Chunhua
contents	Agents based on Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies rely heavily on extensive data collection and model training, yet still suffer from high computational costs and insufficient reasoning capabilities in complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to improve context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that combines Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations on WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and a 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains of up to 12.5% alongside significantly lower computational overhead.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_00503
institution	arXiv
publishDate	2026
record_format	arxiv
title	M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.00503

Similar Items