:: Library Catalog

Bibliographic Details
Main authors:	Yu, Zhuang; Shen, Lei; Zhao, Jing; Sun, Shiliang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online access:	https://arxiv.org/abs/2601.20705

Similar Items

CountQA: How Well Do MLLMs Count in the Wild?
by: Tamarapalli, Jayant Sravan, et al.
Published: (2025)

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
by: Lin, Junming, et al.
Published: (2024)

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
by: Zhou, Ting, et al.
Published: (2024)

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
by: Liu, Guimeng, et al.
Published: (2026)

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
by: Huang, Zhe, et al.
Published: (2025)

Personalized Video Summarization by Multimodal Video Understanding
by: Chen, Brian, et al.
Published: (2024)

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)

VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
by: Alansari, Mohamad, et al.
Published: (2026)

PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
by: Ouyang, Kun, et al.
Published: (2024)

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
by: Wang, Hao, et al.
Published: (2025)

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
by: Cai, Mu, et al.
Published: (2024)

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
by: Luo, Bingjun, et al.
Published: (2026)

What to Do Next? Memorizing skills from Egocentric Instructional Video
by: Bi, Jing, et al.
Published: (2025)

From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
by: Jiao, Qirui, et al.
Published: (2024)

InstrAct: Towards Action-Centric Understanding in Instructional Videos
by: Yang, Zhuoyi, et al.
Published: (2026)

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
by: Wen, Zimo, et al.
Published: (2026)

LensWalk: Agentic Video Understanding by Planning How You See in Videos
by: Li, Keliang, et al.
Published: (2026)

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
by: Tang, Yolo Yunlong, et al.
Published: (2024)

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
by: Wei, Yuancheng, et al.
Published: (2026)

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
by: Tang, Yolo Y., et al.
Published: (2024)

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
by: Yuan, Yuqian, et al.
Published: (2024)

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
by: Li, Bin, et al.
Published: (2022)

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
by: Li, Shicheng, et al.
Published: (2023)

Dense Connector for MLLMs
by: Yao, Huanjin, et al.
Published: (2024)

VCA: Video Curious Agent for Long Video Understanding
by: Yang, Zeyuan, et al.
Published: (2024)

Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement
by: Yi, Zhihang, et al.
Published: (2026)

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
by: Li, Shicheng, et al.
Published: (2025)

Apollo: An Exploration of Video Understanding in Large Multimodal Models
by: Zohar, Orr, et al.
Published: (2024)

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs
by: Che, Chang, et al.
Published: (2026)

Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
by: Jiang, Kai, et al.
Published: (2025)

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
by: Jang, Lawrence, et al.
Published: (2024)

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
by: Chen, Zhen, et al.
Published: (2025)

One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
by: Sun, Yujing, et al.
Published: (2025)