:: Library Catalog

Bibliographic Details
Main Authors:	Tang, Zitian; Krishnan, Rohan Myer; Yu, Zhiqiu; Sun, Chen
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2311.18773

Similar Items

How Can Objects Help Video-Language Understanding?
by: Tang, Zitian, et al.
Published: (2025)

Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
by: Yin, Yufei, et al.
Published: (2026)

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
by: Chen, Yuxiao, et al.
Published: (2026)

TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
by: Tang, Canhui, et al.
Published: (2025)

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
by: Guo, Wenliang, et al.
Published: (2025)

Video Token Merging for Long-form Video Understanding
by: Lee, Seon-Ho, et al.
Published: (2024)

LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)

Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
by: Liu, Ruyang, et al.
Published: (2025)

Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
by: Gao, Hongcheng, et al.
Published: (2025)

Multimodal Language Models for Domain-Specific Procedural Video Summarization
by: Hussain, Nafisa
Published: (2024)

Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
by: Heine, Lukas, et al.
Published: (2024)

Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)

STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)

LVBench: An Extreme Long Video Understanding Benchmark
by: Wang, Weihan, et al.
Published: (2024)

VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos
by: Liu, Pengyiang, et al.
Published: (2026)

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
by: Chen, Seng Nam, et al.
Published: (2026)

ALLVB: All-in-One Long Video Understanding Benchmark
by: Tan, Xichen, et al.
Published: (2025)

MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
by: Zhong, Ziqi, et al.
Published: (2025)

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
by: Chen, Dongping, et al.
Published: (2024)

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding
by: Shi, Mengqi, et al.
Published: (2026)

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
by: Liu, Shuming, et al.
Published: (2025)

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)

Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
by: Wu, Te-Lin, et al.
Published: (2021)

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
by: Ma, David, et al.
Published: (2025)

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
by: Lu, Hao, et al.
Published: (2025)

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
by: Fang, Xinyu, et al.
Published: (2024)

MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering
by: Song, Seokwon, et al.
Published: (2025)

Controllable Hybrid Captioner for Improved Long-form Video Understanding
by: Sasse, Kuleen, et al.
Published: (2025)

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
by: He, Haichen, et al.
Published: (2026)

Anticipating Object State Changes in Long Procedural Videos
by: Manousaki, Victoria, et al.
Published: (2024)

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
by: Wang, Xucheng, et al.
Published: (2026)

Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)

MR. Video: "MapReduce" is the Principle for Long Video Understanding
by: Pang, Ziqi, et al.
Published: (2025)

VUDG: A Dataset for Video Understanding Domain Generalization
by: Wang, Ziyi, et al.
Published: (2025)

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
by: Wang, Shaoguang, et al.
Published: (2026)

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
by: Chen, Tao, et al.
Published: (2026)