| Main Authors: | Du, Henghui, Zhou, Chang, Chen, Xi, Hu, Di |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.23823 |
Similar Items
Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
by: Du, Henghui, et al.
Published: (2025)
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
by: He, Shuting, et al.
Published: (2024)
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)
SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
by: Shen, Ruiqi, et al.
Published: (2026)
MOVE: Motion-Guided Few-Shot Video Object Segmentation
by: Ying, Kaining, et al.
Published: (2025)
On-the-fly Modulation for Balanced Multimodal Learning
by: Wei, Yake, et al.
Published: (2024)
KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception
by: Qu, Yunpeng, et al.
Published: (2025)
Segment Anything Across Shots: A Method and Benchmark
by: Hu, Hengrui, et al.
Published: (2025)
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
by: Li, Yunxin, et al.
Published: (2025)
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
by: Ding, Henghui, et al.
Published: (2025)
CPPO: Contrastive Perception Policy Optimization for VLM Agents
by: Rezaei, Ahmad, et al.
Published: (2026)
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
by: Fu, Yang, et al.
Published: (2026)
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
by: Jeong, Boseung, et al.
Published: (2025)
Decoupling Perception from Reasoning for Hallucination-Resistant Video Understanding
by: Pu, Bowei, et al.
Published: (2025)
Artemis: Structured Visual Reasoning for Perception Policy Learning
by: Tang, Wei, et al.
Published: (2025)
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation
by: He, Shuting, et al.
Published: (2024)
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
by: Shuai, Xincheng, et al.
Published: (2026)
Transferable-guided Attention Is All You Need for Video Domain Adaptation
by: Sacilotti, André, et al.
Published: (2024)
Hybrid Latent Reasoning with Decoupled Policy Optimization
by: Cheng, Tao, et al.
Published: (2026)
Rate-Distortion Optimized Communication for Collaborative Perception
by: Liu, Genjia, et al.
Published: (2025)
LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention
by: Du, Zewen, et al.
Published: (2024)
Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising
by: Pei, Gan, et al.
Published: (2026)
Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video
by: Li, Chenxing, et al.
Published: (2026)
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
by: Huang, Zhenpeng, et al.
Published: (2026)
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
by: Ding, Henghui, et al.
Published: (2025)
SteerSeg: Attention Steering for Reasoning Video Segmentation
by: Cheraghian, Ali, et al.
Published: (2026)
Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection
by: Lyu, Jiahao, et al.
Published: (2024)
MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
by: Zhang, Zhicheng, et al.
Published: (2025)
SEAL: Semantic Attention Learning for Long Video Representation
by: Wang, Lan, et al.
Published: (2024)
Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild
by: Hu, Wanpeng, et al.
Published: (2025)
Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning
by: Shi, Yudi, et al.
Published: (2026)
GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
by: Brilli, Dionysia Danai, et al.
Published: (2025)
Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
by: Hao, Yutong, et al.
Published: (2025)
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
by: Chen, Liang, et al.
Published: (2025)
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
by: Shuai, Xincheng, et al.
Published: (2025)
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
by: Shi, Yudi, et al.
Published: (2024)