Library Catalog

Bibliographic Details
Main Authors:	Zhu, Yuhan; Zeng, Xiangyu; Wang, Chenting; Li, Xinhao; Liu, Chunxu; Xu, Yicheng; Yan, Ziang; Wang, Yi; Wang, Limin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.24621

Similar Items

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
by: Wang, Chenting, et al.
Published: (2025)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

Make Your Training Flexible: Towards Deployment-Efficient Video Models
by: Wang, Chenting, et al.
Published: (2025)

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
by: Wang, Yi, et al.
Published: (2025)

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
by: Yan, Ziang, et al.
Published: (2024)

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
by: Li, Xinhao, et al.
Published: (2023)

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
by: Li, Xinhao, et al.
Published: (2024)

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)

Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
by: Liu, Chunxu, et al.
Published: (2025)

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)

StreamForest: Efficient Online Video Understanding with Persistent Event Memory
by: Zeng, Xiangyu, et al.
Published: (2025)

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)

SORCE: Small Object Retrieval in Complex Environments
by: Liu, Chunxu, et al.
Published: (2025)

History-Aware Transformation of ReID Features for Multiple Object Tracking
by: Gao, Ruopeng, et al.
Published: (2025)

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
by: Jiang, Tianxiang, et al.
Published: (2025)

Training-Free Reasoning and Reflection in MLLMs
by: Wei, Hongchen, et al.
Published: (2025)

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)

Sparse Global Matching for Video Frame Interpolation with Large Motion
by: Liu, Chunxu, et al.
Published: (2024)

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Redirection
by: Sun, Yi, et al.
Published: (2026)

Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models
by: Zhang, Zhiqi, et al.
Published: (2026)

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations
by: Dong, Lu, et al.
Published: (2025)

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
by: Li, Xinhao, et al.
Published: (2024)

CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
by: Xu, Yifan, et al.
Published: (2024)

VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)

Training-Free Personalization via Retrieval and Reasoning on Fingerprints
by: Das, Deepayan, et al.
Published: (2025)

RIVER: A Real-Time Interaction Benchmark for Video LLMs
by: Shi, Yansong, et al.
Published: (2026)

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
by: Han, Su Ho, et al.
Published: (2025)

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
by: Liu, Yuanxin, et al.
Published: (2025)

ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
by: Xu, Yicheng, et al.
Published: (2025)

RetCompletion: High-Speed Inference Image Completion with Retentive Network
by: Cang, Yueyang, et al.
Published: (2024)

ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins
by: Liu, Xinhao, et al.
Published: (2026)

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
by: Wang, Jiahao, et al.
Published: (2025)

Training-Free Pretrained Model Merging
by: Xu, Zhengqi, et al.
Published: (2024)

TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction
by: Zheng, Zhijie, et al.
Published: (2026)

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
by: Shi, Fengyuan, et al.
Published: (2023)

Online Video Understanding: OVBench and VideoChat-Online
by: Huang, Zhenpeng, et al.
Published: (2024)

Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
by: Wang, Tong, et al.
Published: (2026)

M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
by: Liu, Che, et al.
Published: (2025)