| Main Authors: | Xu, Yicheng, Wu, Yue, Yu, Jiashuo, Yan, Ziang, Jiang, Tianxiang, He, Yinan, Zhao, Qingsong, Chen, Kai, Qiao, Yu, Wang, Limin, Okumura, Manabu, Wang, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.11606 |
Similar Items
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
by: Wang, Yi, et al.
Published: (2023)
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)
RIVER: A Real-Time Interaction Benchmark for Video LLMs
by: Shi, Yansong, et al.
Published: (2026)
VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
by: Yu, Jiashuo, et al.
Published: (2025)
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
by: Jiang, Tianxiang, et al.
Published: (2025)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
by: Wang, Yi, et al.
Published: (2025)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
by: Li, Xinhao, et al.
Published: (2024)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
by: Li, Kunchang, et al.
Published: (2023)
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
by: Wang, Zikang, et al.
Published: (2025)
OmniVid: A Generative Framework for Universal Video Understanding
by: Wang, Junke, et al.
Published: (2024)
Harvest Video Foundation Models via Efficient Post-Pretraining
by: Li, Yizhuo, et al.
Published: (2023)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
by: Huang, Ziqi, et al.
Published: (2024)
LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
by: Wang, Chenting, et al.
Published: (2025)
Oogiri-Master: Benchmarking Humor Understanding via Oogiri
by: Murakami, Soichiro, et al.
Published: (2025)
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)
Vid-SME: Membership Inference Attacks against Large Video Understanding Models
by: Li, Qi, et al.
Published: (2025)
Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)
Make Your Training Flexible: Towards Deployment-Efficient Video Models
by: Wang, Chenting, et al.
Published: (2025)
VidText: Towards Comprehensive Evaluation for Video Text Understanding
by: Yang, Zhoufaran, et al.
Published: (2025)
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
by: Yan, Ziang, et al.
Published: (2024)
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
by: Liang, Feng, et al.
Published: (2023)
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
by: Li, Jingyao, et al.
Published: (2025)
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
by: Qiu, Zongyang, et al.
Published: (2025)
Rethinking the Zigzag Flattening for Image Reading
by: Zhao, Qingsong, et al.
Published: (2022)
VidTwin: Video VAE with Decoupled Structure and Dynamics
by: Wang, Yuchi, et al.
Published: (2024)
TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
by: Wang, Shuai, et al.
Published: (2026)
Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models
by: Xu, Yicheng, et al.
Published: (2024)
SafeVid: Toward Safety Aligned Video Large Multimodal Models
by: Wang, Yixu, et al.
Published: (2025)
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs
by: Yang, Yiming, et al.
Published: (2025)
Automatic Answerability Evaluation for Question Generation
by: Wang, Zifan, et al.
Published: (2023)
EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
by: Pei, Baoqi, et al.
Published: (2024)
Taming Recommendation Bias with Causal Intervention on Evolving Personal Popularity
by: Tan, Shiyin, et al.
Published: (2025)