Bibliographic Details
Main Authors: Liu, Xiaoyu, Zhang, Fuwei, Wu, Yiqing, Jia, Xinyu, Xia, Zenghua, Zhuang, Fuzhen, Zhang, Zhao, Jiang, Fei, Lin, Wei
Format: Preprint
Published: 2025
Subjects: Information Retrieval
Online Access: https://arxiv.org/abs/2511.01461
_version_ 1866918184462843904
author Liu, Xiaoyu
Zhang, Fuwei
Wu, Yiqing
Jia, Xinyu
Xia, Zenghua
Zhuang, Fuzhen
Zhang, Zhao
Jiang, Fei
Lin, Wei
contents Generative retrieval (GR) has gained significant attention as an effective paradigm that integrates the capabilities of large language models (LLMs). It generally consists of two stages: constructing discrete semantic identifiers (IDs) for documents and retrieving documents by autoregressively generating ID tokens. The core challenge in GR is how to construct document IDs (DocIDs) with strong representational power. Good IDs should exhibit two key properties: similar documents should have more similar IDs, and each document should maintain a distinct and unique ID. However, most existing methods ignore native category information, which is common and critical in E-commerce. Therefore, we propose a novel ID learning method, CAtegory-Tree Integrated Document IDentifier (CAT-ID$^2$), incorporating prior category information into the semantic IDs. CAT-ID$^2$ includes three key modules: a Hierarchical Class Constraint Loss to integrate category information layer by layer during quantization, a Cluster Scale Constraint Loss for uniform ID token distribution, and a Dispersion Loss to improve the distinction of reconstructed documents. These components enable CAT-ID$^2$ to generate IDs that make similar documents more alike while preserving the uniqueness of different documents' representations. Extensive offline and online experiments confirm the effectiveness of our method, with online A/B tests showing a 0.33% increase in average orders per thousand users for ambiguous intent queries and 0.24% for long-tail queries.
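The two ID properties named in the abstract (similar documents get similar IDs, every document keeps a unique ID) can be illustrated with a toy layer-by-layer quantizer. This is a minimal sketch and not the CAT-ID$^2$ method: the abstract does not specify the quantizer, so residual nearest-codeword assignment is an assumption here, none of the three losses is implemented, and the function name, codebooks, and document vectors are all hypothetical.

```python
import math


def assign_semantic_id(embedding, codebooks):
    """Quantize one embedding layer by layer, emitting one ID token per layer.

    Assumption: residual quantization with nearest-codeword lookup. The
    catalog record does not state which quantizer the paper actually uses.
    """
    residual = list(embedding)
    tokens = []
    for codebook in codebooks:
        # Choose the codeword closest to the part not yet explained.
        token = min(
            range(len(codebook)),
            key=lambda i: math.dist(codebook[i], residual),
        )
        tokens.append(token)
        # Pass the remaining error on to the next layer.
        residual = [r - c for r, c in zip(residual, codebook[token])]
    return tuple(tokens)


# Hypothetical toy codebooks: layer 0 separates coarse groups,
# layer 1 refines within a group.
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]

# Hypothetical 2-D document embeddings.
docs = {
    "phone_case_a": [-0.9, 0.1],
    "phone_case_b": [1.1, -0.1],
    "garden_hose": [9.0, 10.2],
}

ids = {name: assign_semantic_id(vec, codebooks) for name, vec in docs.items()}
```

In this toy setup the two phone cases share their first token and differ in the second, while the garden hose gets a different first token, and no two documents receive the same full ID. In the setting the abstract describes, a category tree would supply the prior that the first-layer grouping is meant to respect; here the grouping comes only from the hand-picked codebooks.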
format Preprint
id arxiv_https___arxiv_org_abs_2511_01461
institution arXiv
publishDate 2025
record_format arxiv
title CAT-ID$^2$: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce
topic Information Retrieval
url https://arxiv.org/abs/2511.01461