| Main Authors: | Wan, Cong; Guo, Zeyu; Li, Jiangyang; Dong, SongLin; Bai, Yifan; Peng, Lin; Ma, Zhiheng; Gong, Yihong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2603.00461 |
| _version_ | 1866917354544300032 |
|---|---|
| author | Wan, Cong; Guo, Zeyu; Li, Jiangyang; Dong, SongLin; Bai, Yifan; Peng, Lin; Ma, Zhiheng; Gong, Yihong |
| author_facet | Wan, Cong; Guo, Zeyu; Li, Jiangyang; Dong, SongLin; Bai, Yifan; Peng, Lin; Ma, Zhiheng; Gong, Yihong |
| contents | We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_00461 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | ReMoT: Reinforcement Learning with Motion Contrast Triplets |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2603.00461 |