Bibliographic Details
Main Authors: Zhang, Weizhi; Wei, Xiaokai; Huang, Wei-Chieh; Hui, Zheng; Wang, Chen; Gong, Michelle; Yu, Philip S.
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.25973
author Zhang, Weizhi
Wei, Xiaokai
Huang, Wei-Chieh
Hui, Zheng
Wang, Chen
Gong, Michelle
Yu, Philip S.
contents Recent advances in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline covering 14 state-of-the-art LLM base models and 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains, evaluating an agent's ability to simulate real user behaviors in both single-domain and cross-domain settings. Our analysis reveals that existing memory methods fall well short of user satisfaction in various domains, and MemoryCD offers the first testbed for cross-domain lifelong personalization evaluation.
format Preprint
id arxiv_https___arxiv_org_abs_2603_25973
institution arXiv
publishDate 2026
record_format arxiv
title MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization
topic Computation and Language
url https://arxiv.org/abs/2603.25973