Bibliographic Details
Main Authors: Yan, Dawei, Zhang, Haokui, Huzhang, Guangda, Li, Yang, Wang, Yibo, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Li, Ying, Dong, Wei, Shen, Chunhua
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.00503
_version_ 1866908857137102848
author Yan, Dawei
Zhang, Haokui
Huzhang, Guangda
Li, Yang
Wang, Yibo
Chen, Qing-Guo
Xu, Zhao
Luo, Weihua
Li, Ying
Dong, Wei
Shen, Chunhua
contents Agents based on Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies rely heavily on extensive data collection and model training, yet still suffer from high computational costs and insufficient reasoning capabilities in complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to improve context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that combines Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations on WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and a 58.7% token reduction for Qwen3-VL-32B, while proprietary models such as Claude achieve accuracy gains of up to 12.5% alongside significantly lower computational overhead.
format Preprint
id arxiv_https___arxiv_org_abs_2603_00503
institution arXiv
publishDate 2026
record_format arxiv
title M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.00503