| Main Authors: | Tang, Zitian, Krishnan, Rohan Myer, Yu, Zhiqiu, Sun, Chen |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Online Access: | https://arxiv.org/abs/2311.18773 |
Similar Items
How Can Objects Help Video-Language Understanding?
by: Tang, Zitian, et al.
Published: (2025)
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
by: Yin, Yufei, et al.
Published: (2026)
Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
by: Chen, Yuxiao, et al.
Published: (2026)
TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
by: Tang, Canhui, et al.
Published: (2025)
ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
by: Guo, Wenliang, et al.
Published: (2025)
Video Token Merging for Long-form Video Understanding
by: Lee, Seon-Ho, et al.
Published: (2024)
LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)
Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
by: Liu, Ruyang, et al.
Published: (2025)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
by: Gao, Hongcheng, et al.
Published: (2025)
Multimodal Language Models for Domain-Specific Procedural Video Summarization
by: Hussain, Nafisa
Published: (2024)
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
by: Heine, Lukas, et al.
Published: (2024)
Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)
LVBench: An Extreme Long Video Understanding Benchmark
by: Wang, Weihan, et al.
Published: (2024)
VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos
by: Liu, Pengyiang, et al.
Published: (2026)
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
by: Chen, Seng Nam, et al.
Published: (2026)
ALLVB: All-in-One Long Video Understanding Benchmark
by: Tan, Xichen, et al.
Published: (2025)
MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
by: Zhong, Ziqi, et al.
Published: (2025)
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
by: Chen, Dongping, et al.
Published: (2024)
HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding
by: Shi, Mengqi, et al.
Published: (2026)
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
by: Liu, Shuming, et al.
Published: (2025)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
by: Wu, Te-Lin, et al.
Published: (2021)
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
by: Ma, David, et al.
Published: (2025)
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
by: Lu, Hao, et al.
Published: (2025)
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
by: Fang, Xinyu, et al.
Published: (2024)
MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering
by: Song, Seokwon, et al.
Published: (2025)
Controllable Hybrid Captioner for Improved Long-form Video Understanding
by: Sasse, Kuleen, et al.
Published: (2025)
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
by: He, Haichen, et al.
Published: (2026)
Anticipating Object State Changes in Long Procedural Videos
by: Manousaki, Victoria, et al.
Published: (2024)
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)
ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
by: Wang, Xucheng, et al.
Published: (2026)
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)
MR. Video: "MapReduce" is the Principle for Long Video Understanding
by: Pang, Ziqi, et al.
Published: (2025)
VUDG: A Dataset for Video Understanding Domain Generalization
by: Wang, Ziyi, et al.
Published: (2025)
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
by: Wang, Shaoguang, et al.
Published: (2026)
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
by: Chen, Tao, et al.
Published: (2026)