Library Catalog

Bibliographic Details
Main Authors:	Zhu, Tinghui; Zhang, Sheng; Huang, James Y.; Song, Selena; Wen, Xiaofei; Li, Yuankai; Poon, Hoifung; Chen, Muhao
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.15458

Similar Items

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
by: Huang, Yixu, et al.
Published: (2026)

mDPO: Conditional Preference Optimization for Multimodal Large Language Models
by: Wang, Fei, et al.
Published: (2024)

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
by: Xu, Nan, et al.
Published: (2024)

OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning
by: Zhu, Boyu, et al.
Published: (2025)

Is Extending Modality The Right Path Towards Omni-Modality?
by: Zhu, Tinghui, et al.
Published: (2025)

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
by: Li, Yuankai, et al.
Published: (2026)

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection
by: Li, Bangzheng, et al.
Published: (2025)

Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
by: Zhu, Tinghui, et al.
Published: (2024)

When Vision Speaks for Sound
by: Wen, Xiaofei, et al.
Published: (2026)

Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
by: Cai, Rui, et al.
Published: (2025)

Foundation Models for Biomedical Image Segmentation: A Survey
by: Lee, Ho Hin, et al.
Published: (2024)

Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
by: Jiao, Siwen, et al.
Published: (2026)

Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
by: Huang, James Y., et al.
Published: (2025)

Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
by: Zhao, Zecheng, et al.
Published: (2025)

Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
by: Wang, Zhaoqing, et al.
Published: (2025)

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
by: Yang, Yanting, et al.
Published: (2024)

Learning Plug-and-play Memory for Guiding Video Diffusion Models
by: Song, Selena, et al.
Published: (2025)

Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
by: Liu, Ming, et al.
Published: (2026)

Boltzmann Attention Sampling for Image Analysis with Small Objects
by: Zhao, Theodore, et al.
Published: (2025)

Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
by: Sinha, Sanchit, et al.
Published: (2025)

VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
by: Zhang, Zhihong, et al.
Published: (2025)

Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards
by: Koksal, Aybora, et al.
Published: (2025)

Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
by: Liu, Yuanxin, et al.
Published: (2025)

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
by: Wang, Yuan, et al.
Published: (2026)

What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning
by: Zhou, Yujin, et al.
Published: (2026)

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
by: Zhou, Sashuai, et al.
Published: (2026)

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
by: Tao, Sicheng, et al.
Published: (2025)

Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
by: Zhang, Jesen, et al.
Published: (2025)

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
by: Wang, Qunzhong, et al.
Published: (2025)

RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
by: Ma, Yunchuan, et al.
Published: (2024)

Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos
by: Huang, Xiaoyu, et al.
Published: (2025)

Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
by: Wang, Xiaokun, et al.
Published: (2025)

EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
by: Kulkarni, Yogesh, et al.
Published: (2025)

Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents
by: Huang, Peng, et al.
Published: (2026)

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
by: Zhang, Xingjian, et al.
Published: (2025)

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
by: Lu, Yunhong, et al.
Published: (2025)

SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
by: Wang, Zilan, et al.
Published: (2024)

Scaling medical imaging report generation with multimodal reinforcement learning
by: Liu, Qianchu, et al.
Published: (2026)

What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
by: Le, Minh-Quan, et al.
Published: (2025)