MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Jiao, Siwen; Lv, Tianxiong; Qian, Kangan; Zhao, Chenxu; Zhu, Xiuyuan; Li, Tianlun; Cheng, Xiaolong; Li, Jinyu; Liao, Zhihao; Cai, Yang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.07695

_version_	1866914255973908480
author	Jiao, Siwen; Lv, Tianxiong; Qian, Kangan; Zhao, Chenxu; Zhu, Xiuyuan; Li, Tianlun; Cheng, Xiaolong; Li, Jinyu; Liao, Zhihao; Cai, Yang
author_facet	Jiao, Siwen; Lv, Tianxiong; Qian, Kangan; Zhao, Chenxu; Zhu, Xiuyuan; Li, Tianlun; Cheng, Xiaolong; Li, Jinyu; Liao, Zhihao; Cai, Yang
contents	Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_07695
institution	arXiv
publishDate	2026
record_format	arxiv
title	Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.07695
