Staff View: ARM: Advantage Reward Modeling for Long-Horizon Manipulation :: Library Catalog

Bibliographic Details
Main Authors:	Mao, Yiming; Yu, Zixi; Mao, Weixin; Li, Yinhao; Hu, Qirui; Lan, Zihan; Zhu, Minzhao; Chen, Hua
Format:	Preprint
Published:	2026
Subjects:	Robotics; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.03037

_version_	1866913050305495040
author	Mao, Yiming; Yu, Zixi; Mao, Weixin; Li, Yinhao; Hu, Qirui; Lan, Zihan; Zhu, Minzhao; Chen, Hua
author_facet	Mao, Yiming; Yu, Zixi; Mao, Weixin; Li, Yinhao; Hu, Qirui; Lan, Zihan; Zhu, Minzhao; Chen, Hua
contents	Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03037
institution	arXiv
publishDate	2026
record_format	arxiv
title	ARM: Advantage Reward Modeling for Long-Horizon Manipulation
topic	Robotics; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.03037

Similar Items