| Main Authors: | Weng, Yuetian, Han, Mingfei, He, Haoyu, Chang, Xiaojun, Zhuang, Bohan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.03384 |
Similar Items
Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
by: Han, Mingfei, et al.
Published: (2023)
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
by: Zhang, Zeyu, et al.
Published: (2025)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
Streaming Long Video Understanding with Large Language Models
by: Qian, Rui, et al.
Published: (2024)
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
by: Yu, Xueqing, et al.
Published: (2026)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
by: Shen, Xiaoqian, et al.
Published: (2024)
FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
by: Chen, Zhuokun, et al.
Published: (2026)
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
by: Chen, Tao, et al.
Published: (2026)
Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)
Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos
by: Cao, Meng, et al.
Published: (2026)
Motion Mamba: Efficient and Long Sequence Motion Generation
by: Zhang, Zeyu, et al.
Published: (2024)
Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
by: Chen, Yuxiao, et al.
Published: (2026)
TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos
by: Fateh, Fawad Javed, et al.
Published: (2024)
Long Video Understanding with Learnable Retrieval in Video-Language Models
by: Xu, Jiaqi, et al.
Published: (2023)
Language Repository for Long Video Understanding
by: Kahatapitiya, Kumara, et al.
Published: (2024)
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
by: Mao, Weian, et al.
Published: (2026)
Mitigating Data Redundancy to Revitalize Transformer-based Long-Term Time Series Forecasting System
by: Li, Mingjie, et al.
Published: (2022)
Efficient Stitchable Task Adaptation
by: He, Haoyu, et al.
Published: (2023)
Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
by: Jin, Minghao, et al.
Published: (2026)
WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
by: Liu, Zhiheng, et al.
Published: (2025)
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
by: Zhang, Kecheng, et al.
Published: (2026)
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding
by: Pereira, Joao, et al.
Published: (2025)
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
by: Chen, Tao, et al.
Published: (2025)
OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
by: Chen, Feng, et al.
Published: (2025)
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
by: He, Bo, et al.
Published: (2024)
LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
by: Wei, Hongchen, et al.
Published: (2025)
CogVLM2: Visual Language Models for Image and Video Understanding
by: Hong, Wenyi, et al.
Published: (2024)
Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
by: Han, Mingfei, et al.
Published: (2026)
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
by: Liu, Shuming, et al.
Published: (2025)
Towards Event-oriented Long Video Understanding
by: Du, Yifan, et al.
Published: (2024)
ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
by: Liu, Akide, et al.
Published: (2026)
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
by: Gurukar, Saket, et al.
Published: (2025)
PersonaVLM: Long-Term Personalized Multimodal LLMs
by: Nie, Chang, et al.
Published: (2026)
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
by: Ataallah, Kirolos, et al.
Published: (2024)
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection
by: Han, Mingfei, et al.
Published: (2025)
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
by: He, Yefei, et al.
Published: (2024)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
by: Shu, Yan, et al.
Published: (2024)
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
by: Li, Xiaolong, et al.
Published: (2025)