| Main Authors: | Zhu, Tinghui, Zhang, Sheng, Huang, James Y., Song, Selena, Wen, Xiaofei, Li, Yuankai, Poon, Hoifung, Chen, Muhao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.15458 |
Similar Items
Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
by: Huang, Yixu, et al.
Published: (2026)
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
by: Wang, Fei, et al.
Published: (2024)
From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
by: Xu, Nan, et al.
Published: (2024)
OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning
by: Zhu, Boyu, et al.
Published: (2025)
Is Extending Modality The Right Path Towards Omni-Modality?
by: Zhu, Tinghui, et al.
Published: (2025)
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
by: Li, Yuankai, et al.
Published: (2026)
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection
by: Li, Bangzheng, et al.
Published: (2025)
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
by: Zhu, Tinghui, et al.
Published: (2024)
When Vision Speaks for Sound
by: Wen, Xiaofei, et al.
Published: (2026)
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
by: Cai, Rui, et al.
Published: (2025)
Foundation Models for Biomedical Image Segmentation: A Survey
by: Lee, Ho Hin, et al.
Published: (2024)
Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
by: Jiao, Siwen, et al.
Published: (2026)
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
by: Huang, James Y., et al.
Published: (2025)
Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
by: Zhao, Zecheng, et al.
Published: (2025)
Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
by: Wang, Zhaoqing, et al.
Published: (2025)
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
by: Yang, Yanting, et al.
Published: (2024)
Learning Plug-and-play Memory for Guiding Video Diffusion Models
by: Song, Selena, et al.
Published: (2025)
Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
by: Liu, Ming, et al.
Published: (2026)
Boltzmann Attention Sampling for Image Analysis with Small Objects
by: Zhao, Theodore, et al.
Published: (2025)
Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
by: Sinha, Sanchit, et al.
Published: (2025)
VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
by: Zhang, Zhihong, et al.
Published: (2025)
Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards
by: Koksal, Aybora, et al.
Published: (2025)
Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
by: Liu, Yuanxin, et al.
Published: (2025)
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
by: Wang, Yuan, et al.
Published: (2026)
What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning
by: Zhou, Yujin, et al.
Published: (2026)
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
by: Zhou, Sashuai, et al.
Published: (2026)
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
by: Tao, Sicheng, et al.
Published: (2025)
Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
by: Zhang, Jesen, et al.
Published: (2025)
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
by: Wang, Qunzhong, et al.
Published: (2025)
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
by: Ma, Yunchuan, et al.
Published: (2024)
Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos
by: Huang, Xiaoyu, et al.
Published: (2025)
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
by: Wang, Xiaokun, et al.
Published: (2025)
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
by: Kulkarni, Yogesh, et al.
Published: (2025)
Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents
by: Huang, Peng, et al.
Published: (2026)
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
by: Zhang, Xingjian, et al.
Published: (2025)
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
by: Lu, Yunhong, et al.
Published: (2025)
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
by: Wang, Zilan, et al.
Published: (2024)
Scaling medical imaging report generation with multimodal reinforcement learning
by: Liu, Qianchu, et al.
Published: (2026)
What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
by: Le, Minh-Quan, et al.
Published: (2025)