Library Catalog

Bibliographic Details
Main Authors:	Xu, Yicheng; Wu, Yue; Yu, Jiashuo; Yan, Ziang; Jiang, Tianxiang; He, Yinan; Zhao, Qingsong; Chen, Kai; Qiao, Yu; Wang, Limin; Okumura, Manabu; Wang, Yi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.11606

Similar Items

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
by: Wang, Yi, et al.
Published: (2023)

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)

RIVER: A Real-Time Interaction Benchmark for Video LLMs
by: Shi, Yansong, et al.
Published: (2026)

VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
by: Yu, Jiashuo, et al.
Published: (2025)

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
by: Jiang, Tianxiang, et al.
Published: (2025)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
by: Wang, Yi, et al.
Published: (2025)

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
by: Li, Xinhao, et al.
Published: (2024)

Unmasked Teacher: Towards Training-Efficient Video Foundation Models
by: Li, Kunchang, et al.
Published: (2023)

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
by: Wang, Zikang, et al.
Published: (2025)

OmniVid: A Generative Framework for Universal Video Understanding
by: Wang, Junke, et al.
Published: (2024)

Harvest Video Foundation Models via Efficient Post-Pretraining
by: Li, Yizhuo, et al.
Published: (2023)

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
by: Huang, Ziqi, et al.
Published: (2024)

LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
by: Wang, Chenting, et al.
Published: (2025)

Oogiri-Master: Benchmarking Humor Understanding via Oogiri
by: Murakami, Soichiro, et al.
Published: (2025)

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)

Vid-SME: Membership Inference Attacks against Large Video Understanding Models
by: Li, Qi, et al.
Published: (2025)

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)

Make Your Training Flexible: Towards Deployment-Efficient Video Models
by: Wang, Chenting, et al.
Published: (2025)

VidText: Towards Comprehensive Evaluation for Video Text Understanding
by: Yang, Zhoufaran, et al.
Published: (2025)

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
by: Yan, Ziang, et al.
Published: (2024)

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
by: Liang, Feng, et al.
Published: (2023)

CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
by: Li, Jingyao, et al.
Published: (2025)

EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
by: Qiu, Zongyang, et al.
Published: (2025)

Rethinking the Zigzag Flattening for Image Reading
by: Zhao, Qingsong, et al.
Published: (2022)

VidTwin: Video VAE with Decoupled Structure and Dynamics
by: Wang, Yuchi, et al.
Published: (2024)

TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)

KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
by: Wang, Shuai, et al.
Published: (2026)

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models
by: Xu, Yicheng, et al.
Published: (2024)

SafeVid: Toward Safety Aligned Video Large Multimodal Models
by: Wang, Yixu, et al.
Published: (2025)

VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs
by: Yang, Yiming, et al.
Published: (2025)

Automatic Answerability Evaluation for Question Generation
by: Wang, Zifan, et al.
Published: (2023)

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
by: Pei, Baoqi, et al.
Published: (2024)

Taming Recommendation Bias with Causal Intervention on Evolving Personal Popularity
by: Tan, Shiyin, et al.
Published: (2025)