:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Xianjie; Hu, Yiman; Wu, Liang; Hu, Ping; Zou, Yixiong; Xu, Jian; Zheng, Bo
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.08355

Similar Items

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
by: Liu, Xianjie, et al.
Published: (2025)

Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning
by: Zou, Yixiong, et al.
Published: (2024)

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)

Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
by: Tong, Jintao, et al.
Published: (2025)

MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
by: Nie, Zhanheng, et al.
Published: (2025)

MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
by: Wu, Junxian, et al.
Published: (2026)

MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
by: Zhang, Daoze, et al.
Published: (2025)

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
by: Zhang, Yuanhan, et al.
Published: (2025)

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
by: He, Yuping, et al.
Published: (2025)

PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
by: Zhang, Zixin, et al.
Published: (2025)

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
by: Tong, Jintao, et al.
Published: (2026)

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
by: Liu, Yexin, et al.
Published: (2024)

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)

D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
by: Liu, Jingyu, et al.
Published: (2024)

MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)

GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
by: Zhu, Xiaorong, et al.
Published: (2025)

STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset
by: Wang, Jinhong, et al.
Published: (2025)

Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
by: Fang, Bo, et al.
Published: (2025)

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
by: Tang, Yolo Y., et al.
Published: (2025)

Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
by: Zou, Xin, et al.
Published: (2025)

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
by: Lin, Jingli, et al.
Published: (2025)

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
by: Zhang, Gengyuan, et al.
Published: (2025)

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
by: Ouyang, Kun, et al.
Published: (2025)

Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
by: Lin, Junming, et al.
Published: (2024)

MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
by: Fu, Chenghan, et al.
Published: (2025)

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
by: Sun, Peiwen, et al.
Published: (2026)

Adapting Vision-Language Models for E-commerce Understanding at Scale
by: Nulli, Matteo, et al.
Published: (2026)

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
by: Jiang, Juntao, et al.
Published: (2026)

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
by: Huang, Zhe, et al.
Published: (2025)

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
by: Zhou, Ting, et al.
Published: (2024)

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
by: Yang, Zhenyu, et al.
Published: (2025)

Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)

Benchmarking Large and Small MLLMs
by: Feng, Xuelu, et al.
Published: (2025)

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
by: Wu, Xin, et al.
Published: (2026)

Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs
by: Zhang, Shan, et al.
Published: (2025)

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
by: Hu, Pengfei, et al.
Published: (2025)

Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
by: Liang, Tianming, et al.
Published: (2025)

DreamPainter: Image Background Inpainting for E-commerce Scenarios
by: Zhao, Sijie, et al.
Published: (2025)

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
by: Cheng, Zixu, et al.
Published: (2025)