:: Library Catalog

Bibliographic Details
Main Authors:	Weng, Yuetian; Han, Mingfei; He, Haoyu; Chang, Xiaojun; Zhuang, Bohan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.03384

Similar Items

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
by: Han, Mingfei, et al.
Published: (2023)

BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
by: Zhang, Zeyu, et al.
Published: (2025)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

Streaming Long Video Understanding with Large Language Models
by: Qian, Rui, et al.
Published: (2024)

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
by: Yu, Xueqing, et al.
Published: (2026)

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
by: Shen, Xiaoqian, et al.
Published: (2024)

FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
by: Chen, Zhuokun, et al.
Published: (2026)

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
by: Chen, Tao, et al.
Published: (2026)

Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)

Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos
by: Cao, Meng, et al.
Published: (2026)

Motion Mamba: Efficient and Long Sequence Motion Generation
by: Zhang, Zeyu, et al.
Published: (2024)

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
by: Chen, Yuxiao, et al.
Published: (2026)

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos
by: Fateh, Fawad Javed, et al.
Published: (2024)

Long Video Understanding with Learnable Retrieval in Video-Language Models
by: Xu, Jiaqi, et al.
Published: (2023)

Language Repository for Long Video Understanding
by: Kahatapitiya, Kumara, et al.
Published: (2024)

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
by: Mao, Weian, et al.
Published: (2026)

Mitigating Data Redundancy to Revitalize Transformer-based Long-Term Time Series Forecasting System
by: Li, Mingjie, et al.
Published: (2022)

Efficient Stitchable Task Adaptation
by: He, Haoyu, et al.
Published: (2023)

Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
by: Jin, Minghao, et al.
Published: (2026)

WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
by: Liu, Zhiheng, et al.
Published: (2025)

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
by: Zhang, Kecheng, et al.
Published: (2026)

Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding
by: Pereira, Joao, et al.
Published: (2025)

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)

STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
by: Chen, Tao, et al.
Published: (2025)

OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
by: Chen, Feng, et al.
Published: (2025)

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
by: He, Bo, et al.
Published: (2024)

LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
by: Wei, Hongchen, et al.
Published: (2025)

CogVLM2: Visual Language Models for Image and Video Understanding
by: Hong, Wenyi, et al.
Published: (2024)

Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
by: Han, Mingfei, et al.
Published: (2026)

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
by: Liu, Shuming, et al.
Published: (2025)

Towards Event-oriented Long Video Understanding
by: Du, Yifan, et al.
Published: (2024)

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
by: Liu, Akide, et al.
Published: (2026)

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
by: Gurukar, Saket, et al.
Published: (2025)

PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
by: Ataallah, Kirolos, et al.
Published: (2024)

Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection
by: Han, Mingfei, et al.
Published: (2025)

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
by: He, Yefei, et al.
Published: (2024)

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
by: Shu, Yan, et al.
Published: (2024)

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
by: Li, Xiaolong, et al.
Published: (2025)