Staff View: TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making :: Library Catalog

Bibliographic Details
Main Authors:	Jiao, Kechen; Fang, Zhirui; Liu, Jiahao; Li, Bei; Wang, Qifan; Liu, Xinyu; Ruan, Junhao; Qiao, Zhongjian; Zhu, Yifan; Xu, Yaxin; Wang, Jingang; Li, Xiu
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.08500

_version_	1866912580563369984
author	Jiao, Kechen; Fang, Zhirui; Liu, Jiahao; Li, Bei; Wang, Qifan; Liu, Xinyu; Ruan, Junhao; Qiao, Zhongjian; Zhu, Yifan; Xu, Yaxin; Wang, Jingang; Li, Xiu
author_facet	Jiao, Kechen; Fang, Zhirui; Liu, Jiahao; Li, Bei; Wang, Qifan; Liu, Xinyu; Ruan, Junhao; Qiao, Zhongjian; Zhu, Yifan; Xu, Yaxin; Wang, Jingang; Li, Xiu
contents	Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_08500
institution	arXiv
publishDate	2025
record_format	arxiv
title	TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making
topic	Artificial Intelligence
url	https://arxiv.org/abs/2509.08500

Similar Items