Bibliographic Details
Main Authors: Wan, Cong; Guo, Zeyu; Li, Jiangyang; Dong, SongLin; Bai, Yifan; Peng, Lin; Ma, Zhiheng; Gong, Yihong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.00461
author Wan, Cong
Guo, Zeyu
Li, Jiangyang
Dong, SongLin
Bai, Yifan
Peng, Lin
Ma, Zhiheng
Gong, Yihong
contents We present ReMoT, a unified training paradigm that addresses the shortcomings of vision-language models (VLMs) in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset (16.5K triplets) derived from video meta-annotations, surpassing costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning (SFT). We also construct the first benchmark for fine-grained motion contrast triplets, which measures a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, including a 25.1% performance gain on spatio-temporal reasoning tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2603_00461
institution arXiv
publishDate 2026
record_format arxiv
title ReMoT: Reinforcement Learning with Motion Contrast Triplets
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.00461