| Main Authors: | Jiang, Songtao, Song, Sibo, Zhou, Chenyi, Wang, Yuan, Chen, Ruizhe, Guan, Tongkun, Luo, Ruilin, Zhang, Yan, Tang, Zhihang, Sun, Yuchong, Zhang, Hang, Yang, Zhibo, Bai, Shuai, Lin, Junyang, Liu, Zuozhu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.17693 |
Similar Items
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
by: Luo, Ruilin, et al.
Published: (2026)
CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making
by: Jiang, Songtao, et al.
Published: (2025)
CodePercept: Code-Grounded Visual STEM Perception for MLLMs
by: Guan, Tongkun, et al.
Published: (2026)
Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
by: Jiang, Songtao, et al.
Published: (2025)
Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning
by: Chen, Ruizhe, et al.
Published: (2025)
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
by: Cao, Siyu, et al.
Published: (2026)
BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models
by: Fan, Zhiting, et al.
Published: (2025)
How Far Are Video Models from True Multimodal Reasoning?
by: Zhang, Xiaotian, et al.
Published: (2026)
UniVBench: Towards Unified Evaluation for Video Foundation Models
by: Wei, Jianhui, et al.
Published: (2026)
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
by: Jiang, Songtao, et al.
Published: (2025)
Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning
by: Jiang, Songtao, et al.
Published: (2025)
Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models
by: Jiang, Songtao, et al.
Published: (2024)
Temporal Reasoning Transfer from Text to Video
by: Li, Lei, et al.
Published: (2024)
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
by: Jiang, Songtao, et al.
Published: (2025)
VideoPro: Adaptive Program Reasoning for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)
PAD: Personalized Alignment of LLMs at Decoding-Time
by: Chen, Ruizhe, et al.
Published: (2024)
FAIntbench: A Holistic and Precise Benchmark for Bias Evaluation in Text-to-Image Models
by: Luo, Hanjun, et al.
Published: (2024)
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
by: Li, Mingxin, et al.
Published: (2026)
Revisiting Multimodal Positional Encoding in Vision-Language Models
by: Huang, Jie, et al.
Published: (2025)
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
by: Zhang, Xiaotian, et al.
Published: (2025)
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
by: Zhang, Pengze, et al.
Published: (2026)
Preparing Quantum Backflow States by Large Momentum Transfer
by: Chen, Yuchong, et al.
Published: (2026)
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
by: Guan, Tongkun, et al.
Published: (2023)
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
by: Zhang, Jinglei, et al.
Published: (2025)
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)
Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
by: Jiang, Songtao, et al.
Published: (2024)
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
by: Wang, Wentao, et al.
Published: (2025)
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
by: Li, Chenglin, et al.
Published: (2026)
Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
by: Zhao, Zecheng, et al.
Published: (2025)
Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level
by: Feng, Zhaopeng, et al.
Published: (2024)
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
by: Hua, Hang, et al.
Published: (2024)
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
by: Zhang, Haoji, et al.
Published: (2025)
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)
V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis
by: Wang, Yuan, et al.
Published: (2025)
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
by: Yang, Zuhao, et al.
Published: (2025)
VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
by: Ding, Yang, et al.
Published: (2025)
Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
by: Chen, Ruizhe, et al.
Published: (2024)
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs
by: Fan, Zhiting, et al.
Published: (2024)
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
by: Fan, Zhiting, et al.
Published: (2024)