| Main Authors: | Dong, Yuhao, Tian, Shulin, Liu, Shuai, Ding, Shuangrui, Zang, Yuhang, Dong, Xiaoyi, Cao, Yuhang, Wang, Jiaqi, Liu, Ziwei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.08439 |
Similar Items
2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC
by: Zhang, Zhixiong, et al.
Published: (2025)
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
by: Qian, Rui, et al.
Published: (2025)
Streaming Long Video Understanding with Large Language Models
by: Qian, Rui, et al.
Published: (2024)
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
by: Ding, Shuangrui, et al.
Published: (2024)
Advancing Complex Video Object Segmentation via Progressive Concept Construction
by: Zhang, Zhixiong, et al.
Published: (2025)
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
by: Li, Jinsong, et al.
Published: (2026)
SPARK: Synergistic Policy And Reward Co-Evolving Framework
by: Liu, Ziyu, et al.
Published: (2025)
Visual-RFT: Visual Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
by: Li, Yifei, et al.
Published: (2025)
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
by: Bu, Jiazi, et al.
Published: (2024)
Long-CLIP: Unlocking the Long-Text Capability of CLIP
by: Zhang, Beichen, et al.
Published: (2024)
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
by: Liu, Yuhong, et al.
Published: (2025)
Unified Scene Representation and Reconstruction for 3D Large Language Models
by: Chu, Tao, et al.
Published: (2024)
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
by: Xing, Long, et al.
Published: (2025)
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
by: Wei, Xilin, et al.
Published: (2025)
Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)
Visual Agentic Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)
MM-IFEngine: Towards Multimodal Instruction Following
by: Ding, Shengyuan, et al.
Published: (2025)
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
by: Cao, Yuhang, et al.
Published: (2024)
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
by: Zhou, Yujie, et al.
Published: (2025)
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
by: Huang, Qidong, et al.
Published: (2024)
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
by: Liu, Ziyu, et al.
Published: (2024)
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
by: Bu, Jiazi, et al.
Published: (2025)
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
by: Zhang, Zhixiong, et al.
Published: (2026)
Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models
by: Dong, Yuhao, et al.
Published: (2026)
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
by: Ling, Pengyang, et al.
Published: (2024)
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
by: Sun, Zeyi, et al.
Published: (2025)
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
by: Ding, Shengyuan, et al.
Published: (2025)
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
by: Zang, Yuhang, et al.
Published: (2025)
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
by: Xing, Long, et al.
Published: (2024)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
by: Qian, Rui, et al.
Published: (2023)
WildAvatar: Learning In-the-wild 3D Avatars from the Web
by: Huang, Zihao, et al.
Published: (2024)
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
by: Liu, Shuai, et al.
Published: (2026)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
by: Qian, Rui, et al.
Published: (2024)
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
by: Tian, Shulin, et al.
Published: (2025)
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
by: Zou, Kai, et al.
Published: (2025)