| Main Authors: | Huang, Jincai, Zou, Shihao, Guo, Yuchen, Li, Jingjing, Ji, Wei, Wang, Kai, Wang, Shanshan, Si, Weixin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.13530 |
Similar Items
Toward Real-Time Surgical Scene Segmentation via a Spike-Driven Video Transformer with Spike-Informed Pretraining
by: Zou, Shihao, et al.
Published: (2025)
TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration
by: Liu, Zehua, et al.
Published: (2026)
3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
by: Huang, Xiaohu, et al.
Published: (2025)
Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer
by: Zou, Shihao, et al.
Published: (2023)
SAM3-I: Segment Anything with Instructions
by: Li, Jingjing, et al.
Published: (2025)
ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
by: Xu, Zhengzhuo, et al.
Published: (2025)
Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs
by: Sun, Yanpeng, et al.
Published: (2025)
Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)
Towards Holistic Surgical Scene Understanding
by: Valderrama, Natalia, et al.
Published: (2022)
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
by: AI, Inclusion, et al.
Published: (2025)
Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
by: Tong, Jintao, et al.
Published: (2025)
T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning
by: Si, Yuehang, et al.
Published: (2025)
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)
Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
by: Yu, Shiyao, et al.
Published: (2025)
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
by: Hu, Jingjing, et al.
Published: (2024)
Universal Skeleton Understanding via Differentiable Rendering and MLLMs
by: Wang, Ziyi, et al.
Published: (2026)
Towards Unified Modeling in Federated Multi-Task Learning via Subspace Decoupling
by: Wei, Yipan, et al.
Published: (2025)
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
by: Wang, Jieyi, et al.
Published: (2026)
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs
by: Zhang, Shan, et al.
Published: (2025)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
by: Wang, Yonghui, et al.
Published: (2024)
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning
by: Li, Han, et al.
Published: (2026)
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
by: Pan, Jun-Yu, et al.
Published: (2026)
Surgical Workflow Recognition and Blocking Effectiveness Detection in Laparoscopic Liver Resections with Pringle Maneuver
by: Guo, Diandian, et al.
Published: (2024)
Towards Faithful Reasoning in Comics for Small MLLMs
by: Feng, Chengcheng, et al.
Published: (2026)
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
by: Li, Caorui, et al.
Published: (2025)
DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs
by: Ru, Jinghan, et al.
Published: (2026)
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)
R2G: Reasoning to Ground in 3D Scenes
by: Li, Yixuan, et al.
Published: (2024)
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
by: Zhang, Bob, et al.
Published: (2025)
Towards a Unified Textual Graph Framework for Spectral Reasoning via Physical and Chemical Information Fusion
by: Liang, Jiheng, et al.
Published: (2025)
DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
by: Jin, Yujie, et al.
Published: (2026)
Explicit Relational Reasoning Network for Scene Text Detection
by: Su, Yuchen, et al.
Published: (2024)
Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
by: Lu, Shuo, et al.
Published: (2026)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
by: Lin, Jingli, et al.
Published: (2025)
From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding
by: Wang, Yandi, et al.
Published: (2026)
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)
Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
by: Ji, Shihao, et al.
Published: (2025)
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
by: Drago, Mauro Orazio, et al.
Published: (2025)
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
by: Lu, Lidong, et al.
Published: (2025)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)