| Main Authors: | Liu, Xianjie, Hu, Yiman, Wu, Liang, Hu, Ping, Zou, Yixiong, Xu, Jian, Zheng, Bo |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.08355 |
Similar Items
HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
by: Liu, Xianjie, et al.
Published: (2025)
Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning
by: Zou, Yixiong, et al.
Published: (2024)
HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)
Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
by: Tong, Jintao, et al.
Published: (2025)
MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
by: Nie, Zhanheng, et al.
Published: (2025)
MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
by: Wu, Junxian, et al.
Published: (2026)
MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
by: Zhang, Daoze, et al.
Published: (2025)
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
by: Zhang, Yuanhan, et al.
Published: (2025)
EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
by: He, Yuping, et al.
Published: (2025)
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
by: Zhang, Zixin, et al.
Published: (2025)
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
by: Tong, Jintao, et al.
Published: (2026)
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
by: Liu, Yexin, et al.
Published: (2024)
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
by: Liu, Jingyu, et al.
Published: (2024)
MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)
GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
by: Zhu, Xiaorong, et al.
Published: (2025)
STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset
by: Wang, Jinhong, et al.
Published: (2025)
Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
by: Fang, Bo, et al.
Published: (2025)
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
by: Tang, Yolo Y., et al.
Published: (2025)
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
by: Zou, Xin, et al.
Published: (2025)
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
by: Lin, Jingli, et al.
Published: (2025)
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
by: Zhang, Gengyuan, et al.
Published: (2025)
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
by: Ouyang, Kun, et al.
Published: (2025)
Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
by: Lin, Junming, et al.
Published: (2024)
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
by: Fu, Chenghan, et al.
Published: (2025)
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
by: Sun, Peiwen, et al.
Published: (2026)
Adapting Vision-Language Models for E-commerce Understanding at Scale
by: Nulli, Matteo, et al.
Published: (2026)
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
by: Jiang, Juntao, et al.
Published: (2026)
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
by: Huang, Zhe, et al.
Published: (2025)
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
by: Zhou, Ting, et al.
Published: (2024)
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
by: Yang, Zhenyu, et al.
Published: (2025)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
Benchmarking Large and Small MLLMs
by: Feng, Xuelu, et al.
Published: (2025)
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
by: Wu, Xin, et al.
Published: (2026)
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs
by: Zhang, Shan, et al.
Published: (2025)
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
by: Hu, Pengfei, et al.
Published: (2025)
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
by: Liang, Tianming, et al.
Published: (2025)
DreamPainter: Image Background Inpainting for E-commerce Scenarios
by: Zhao, Sijie, et al.
Published: (2025)
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
by: Cheng, Zixu, et al.
Published: (2025)