:: Library Catalog

Bibliographic Details
Main Authors:	Jiang, Songtao; Song, Sibo; Zhou, Chenyi; Wang, Yuan; Chen, Ruizhe; Guan, Tongkun; Luo, Ruilin; Zhang, Yan; Tang, Zhihang; Sun, Yuchong; Zhang, Hang; Yang, Zhibo; Bai, Shuai; Lin, Junyang; Liu, Zuozhu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.17693

Similar Items

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
by: Luo, Ruilin, et al.
Published: (2026)

CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making
by: Jiang, Songtao, et al.
Published: (2025)

CodePercept: Code-Grounded Visual STEM Perception for MLLMs
by: Guan, Tongkun, et al.
Published: (2026)

Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
by: Jiang, Songtao, et al.
Published: (2025)

Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning
by: Chen, Ruizhe, et al.
Published: (2025)

Towards Temporal Compositional Reasoning in Long-Form Sports Videos
by: Cao, Siyu, et al.
Published: (2026)

BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models
by: Fan, Zhiting, et al.
Published: (2025)

How Far Are Video Models from True Multimodal Reasoning?
by: Zhang, Xiaotian, et al.
Published: (2026)

UniVBench: Towards Unified Evaluation for Video Foundation Models
by: Wei, Jianhui, et al.
Published: (2026)

HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
by: Jiang, Songtao, et al.
Published: (2025)

Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning
by: Jiang, Songtao, et al.
Published: (2025)

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models
by: Jiang, Songtao, et al.
Published: (2024)

Temporal Reasoning Transfer from Text to Video
by: Li, Lei, et al.
Published: (2024)

OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
by: Jiang, Songtao, et al.
Published: (2025)

VideoPro: Adaptive Program Reasoning for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)

PAD: Personalized Alignment of LLMs at Decoding-Time
by: Chen, Ruizhe, et al.
Published: (2024)

FAIntbench: A Holistic and Precise Benchmark for Bias Evaluation in Text-to-Image Models
by: Luo, Hanjun, et al.
Published: (2024)

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
by: Li, Mingxin, et al.
Published: (2026)

Revisiting Multimodal Positional Encoding in Vision-Language Models
by: Huang, Jie, et al.
Published: (2025)

Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
by: Zhang, Xiaotian, et al.
Published: (2025)

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
by: Zhang, Pengze, et al.
Published: (2026)

Preparing Quantum Backflow States by Large Momentum Transfer
by: Chen, Yuchong, et al.
Published: (2026)

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
by: Guan, Tongkun, et al.
Published: (2023)

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
by: Zhang, Jinglei, et al.
Published: (2025)

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
by: Jiang, Songtao, et al.
Published: (2024)

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
by: Wang, Wentao, et al.
Published: (2025)

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
by: Li, Chenglin, et al.
Published: (2026)

Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
by: Zhao, Zecheng, et al.
Published: (2025)

Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level
by: Feng, Zhaopeng, et al.
Published: (2024)

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
by: Hua, Hang, et al.
Published: (2024)

Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
by: Zhang, Haoji, et al.
Published: (2025)

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)

V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis
by: Wang, Yuan, et al.
Published: (2025)

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
by: Yang, Zuhao, et al.
Published: (2025)

VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
by: Ding, Yang, et al.
Published: (2025)

Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
by: Chen, Ruizhe, et al.
Published: (2024)

FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs
by: Fan, Zhiting, et al.
Published: (2024)

BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
by: Fan, Zhiting, et al.
Published: (2024)