:: Library Catalog

Bibliographic Details
Main Authors:	Shi, Yansong; Zhao, Qingsong; Jiang, Tianxiang; Zeng, Xiangyu; Wang, Yi; Wang, Limin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.03985

Similar Items

Make Your Training Flexible: Towards Deployment-Efficient Video Models
by: Wang, Chenting, et al.
Published: (2025)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
by: Jiang, Tianxiang, et al.
Published: (2025)

ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
by: Xu, Yicheng, et al.
Published: (2025)

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
by: Zeng, Xiangyu, et al.
Published: (2026)

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
by: Yan, Ziang, et al.
Published: (2025)

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
by: Zhang, Haoji, et al.
Published: (2025)

Rethinking the Zigzag Flattening for Image Reading
by: Zhao, Qingsong, et al.
Published: (2022)

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
by: Zhang, Haoji, et al.
Published: (2024)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)

StreamForest: Efficient Online Video Understanding with Persistent Event Memory
by: Zeng, Xiangyu, et al.
Published: (2025)

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
by: Li, Xinhao, et al.
Published: (2024)

RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
by: Xu, Youcan, et al.
Published: (2026)

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations
by: Dong, Lu, et al.
Published: (2025)

Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
by: Zhu, Tianrui, et al.
Published: (2025)

End-to-End Dense Video Grounding via Parallel Regression
by: Shi, Fengyuan, et al.
Published: (2021)

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
by: Zhou, Wenrui, et al.
Published: (2025)

LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
by: Wu, Tao, et al.
Published: (2024)

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
by: Yang, Jiange, et al.
Published: (2025)

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
by: Qian, Rui, et al.
Published: (2025)

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering
by: Xie, Ming, et al.
Published: (2026)

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
by: Wang, Zikang, et al.
Published: (2025)

HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
by: Shi, Zhonghao, et al.
Published: (2025)

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
by: Zhang, Jun, et al.
Published: (2025)

SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
by: Li, Zisu, et al.
Published: (2025)

ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
by: Wang, Yueqian, et al.
Published: (2025)

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
by: Wang, Yi, et al.
Published: (2025)

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
by: Li, Xinhao, et al.
Published: (2024)

Online Video Understanding: OVBench and VideoChat-Online
by: Huang, Zhenpeng, et al.
Published: (2024)

Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
by: Zeng, Tong, et al.
Published: (2025)

ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction
by: Islam, Md Zabirul, et al.
Published: (2025)

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)

VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)

Lost in Time: A New Temporal Benchmark for VideoLLMs
by: Cores, Daniel, et al.
Published: (2024)

FreeRet: MLLMs as Training-Free Retrievers
by: Zhu, Yuhan, et al.
Published: (2025)

Sparse Global Matching for Video Frame Interpolation with Large Motion
by: Liu, Chunxu, et al.
Published: (2024)

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
by: Wang, Yuxuan, et al.
Published: (2025)