Staff View: Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training :: Library Catalog

Bibliographic Details
Main Authors:	An, Sohyun; Yuan, Shuibenyang; Lee, Hayeon; Hsieh, Cho-Jui; Min, Alexander
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.12967

_version_	1866915937158955008
author	An, Sohyun; Yuan, Shuibenyang; Lee, Hayeon; Hsieh, Cho-Jui; Min, Alexander
author_facet	An, Sohyun; Yuan, Shuibenyang; Lee, Hayeon; Hsieh, Cho-Jui; Min, Alexander
contents	Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable paradigm for training search agents in settings where gold supervision is unavailable.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_12967
institution	arXiv
publishDate	2026
record_format	arxiv
title	Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.12967
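
The abstract in the contents field describes a reward built from question reconstructability under two information bottlenecks. The following is a minimal illustrative sketch of that idea, not the authors' implementation. Every name in it (`Step`, `mask_entities`, `bottleneck`, `token_f1`, `cycle_reward`, and the `reconstruct` callable) is hypothetical. The regex masker is a toy stand-in for a real NER model, and token-level F1 is an assumed similarity measure; the record does not say how the paper scores reconstruction.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    query: str        # search query issued by the agent
    observation: str  # text retrieved for that query


def mask_entities(query: str, mask: str = "[MASK]") -> str:
    """Toy stand-in for NER masking (assumption): hide capitalized words and numbers."""
    return re.sub(r"\b(?:[A-Z][\w-]*|\d+)\b", mask, query)


def bottleneck(steps: List[Step]) -> str:
    """Serialize a trajectory with masked queries.

    The agent's final response is never passed in, which mirrors the
    "exclusion of the final response" bottleneck from the abstract.
    """
    lines = []
    for i, step in enumerate(steps, start=1):
        lines.append(f"[search {i}] {mask_entities(step.query)}")
        lines.append(f"[result {i}] {step.observation}")
    return "\n".join(lines)


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1, used here as an assumed similarity score."""
    pred = re.findall(r"\w+", predicted.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(question: str, steps: List[Step],
                 reconstruct: Callable[[str], str]) -> float:
    """Reward = how well the question is rebuilt from the bottlenecked trajectory."""
    return token_f1(reconstruct(bottleneck(steps)), question)
```

In use, `reconstruct` would wrap whatever model maps the masked trajectory back to a question; the resulting score could then serve as a scalar reward in a policy-optimization loop, with no ground-truth answer required.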
