:: Library Catalog

Bibliographic Details
Main Authors:	Dong, Yuhao; Tian, Shulin; Liu, Shuai; Ding, Shuangrui; Zang, Yuhang; Dong, Xiaoyi; Cao, Yuhang; Wang, Jiaqi; Liu, Ziwei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.08439

Similar Items

2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC
by: Zhang, Zhixiong, et al.
Published: (2025)

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
by: Qian, Rui, et al.
Published: (2025)

Streaming Long Video Understanding with Large Language Models
by: Qian, Rui, et al.
Published: (2024)

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
by: Ding, Shuangrui, et al.
Published: (2024)

Advancing Complex Video Object Segmentation via Progressive Concept Construction
by: Zhang, Zhixiong, et al.
Published: (2025)

Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
by: Li, Jinsong, et al.
Published: (2026)

SPARK: Synergistic Policy And Reward Co-Evolving Framework
by: Liu, Ziyu, et al.
Published: (2025)

Visual-RFT: Visual Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
by: Li, Yifei, et al.
Published: (2025)

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
by: Bu, Jiazi, et al.
Published: (2024)

Long-CLIP: Unlocking the Long-Text Capability of CLIP
by: Zhang, Beichen, et al.
Published: (2024)

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
by: Liu, Yuhong, et al.
Published: (2025)

Unified Scene Representation and Reconstruction for 3D Large Language Models
by: Chu, Tao, et al.
Published: (2024)

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
by: Xing, Long, et al.
Published: (2025)

VideoRoPE: What Makes for Good Video Rotary Position Embedding?
by: Wei, Xilin, et al.
Published: (2025)

Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)

Visual Agentic Reinforcement Fine-Tuning
by: Liu, Ziyu, et al.
Published: (2025)

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)

MM-IFEngine: Towards Multimodal Instruction Following
by: Ding, Shengyuan, et al.
Published: (2025)

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
by: Cao, Yuhang, et al.
Published: (2024)

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
by: Zhou, Yujie, et al.
Published: (2025)

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
by: Huang, Qidong, et al.
Published: (2024)

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
by: Liu, Ziyu, et al.
Published: (2024)

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
by: Bu, Jiazi, et al.
Published: (2025)

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
by: Zhang, Zhixiong, et al.
Published: (2026)

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models
by: Dong, Yuhao, et al.
Published: (2026)

MotionClone: Training-Free Motion Cloning for Controllable Video Generation
by: Ling, Pengyang, et al.
Published: (2024)

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
by: Sun, Zeyi, et al.
Published: (2025)

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
by: Ding, Shengyuan, et al.
Published: (2025)

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
by: Zang, Yuhang, et al.
Published: (2025)

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
by: Xing, Long, et al.
Published: (2024)

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
by: Qian, Rui, et al.
Published: (2023)

WildAvatar: Learning In-the-wild 3D Avatars from the Web
by: Huang, Zihao, et al.
Published: (2024)

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)

FileGram: Grounding Agent Personalization in File-System Behavioral Traces
by: Liu, Shuai, et al.
Published: (2026)

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)

Rethinking Image-to-Video Adaptation: An Object-centric Perspective
by: Qian, Rui, et al.
Published: (2024)

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
by: Tian, Shulin, et al.
Published: (2025)

Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
by: Zou, Kai, et al.
Published: (2025)