Library Catalog

Bibliographic Details
Main Authors:	Huang, Jincai; Zou, Shihao; Guo, Yuchen; Li, Jingjing; Ji, Wei; Wang, Kai; Wang, Shanshan; Si, Weixin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.13530

Similar Items

Toward Real-Time Surgical Scene Segmentation via a Spike-Driven Video Transformer with Spike-Informed Pretraining
by: Zou, Shihao, et al.
Published: (2025)

TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration
by: Liu, Zehua, et al.
Published: (2026)

3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
by: Huang, Xiaohu, et al.
Published: (2025)

Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer
by: Zou, Shihao, et al.
Published: (2023)

SAM3-I: Segment Anything with Instructions
by: Li, Jingjing, et al.
Published: (2025)

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
by: Xu, Zhengzhuo, et al.
Published: (2025)

Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs
by: Sun, Yanpeng, et al.
Published: (2025)

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)

Towards Holistic Surgical Scene Understanding
by: Valderrama, Natalia, et al.
Published: (2022)

M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
by: Inclusion AI, et al.
Published: (2025)

Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
by: Tong, Jintao, et al.
Published: (2025)

T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning
by: Si, Yuehang, et al.
Published: (2025)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
by: Yu, Shiyao, et al.
Published: (2025)

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
by: Hu, Jingjing, et al.
Published: (2024)

Universal Skeleton Understanding via Differentiable Rendering and MLLMs
by: Wang, Ziyi, et al.
Published: (2026)

Towards Unified Modeling in Federated Multi-Task Learning via Subspace Decoupling
by: Wei, Yipan, et al.
Published: (2025)

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
by: Wang, Jieyi, et al.
Published: (2026)

Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs
by: Zhang, Shan, et al.
Published: (2025)

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
by: Wang, Yonghui, et al.
Published: (2024)

Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning
by: Li, Han, et al.
Published: (2026)

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
by: Pan, Jun-Yu, et al.
Published: (2026)

Surgical Workflow Recognition and Blocking Effectiveness Detection in Laparoscopic Liver Resections with Pringle Maneuver
by: Guo, Diandian, et al.
Published: (2024)

Towards Faithful Reasoning in Comics for Small MLLMs
by: Feng, Chengcheng, et al.
Published: (2026)

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
by: Li, Caorui, et al.
Published: (2025)

DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs
by: Ru, Jinghan, et al.
Published: (2026)

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)

R2G: Reasoning to Ground in 3D Scenes
by: Li, Yixuan, et al.
Published: (2024)

Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
by: Zhang, Bob, et al.
Published: (2025)

Towards a Unified Textual Graph Framework for Spectral Reasoning via Physical and Chemical Information Fusion
by: Liang, Jiheng, et al.
Published: (2025)

DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
by: Jin, Yujie, et al.
Published: (2026)

Explicit Relational Reasoning Network for Scene Text Detection
by: Su, Yuchen, et al.
Published: (2024)

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
by: Lu, Shuo, et al.
Published: (2026)

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
by: Lin, Jingli, et al.
Published: (2025)

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding
by: Wang, Yandi, et al.
Published: (2026)

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)

Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
by: Ji, Shihao, et al.
Published: (2025)

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
by: Drago, Mauro Orazio, et al.
Published: (2025)

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
by: Lu, Lidong, et al.
Published: (2025)

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)