:: Library Catalog

Bibliographic Details
Main Authors:	Shi, Yanzhao; Zhang, Xiaodan; Ji, Junzhong; Jiang, Haoning; Zheng, Chengxin; Wang, Yinong; Qu, Liangqiong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.09634

Similar Items

See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning
by: Zheng, Chengxin, et al.
Published: (2024)

MEPNet: Medical Entity-balanced Prompting Network for Brain CT Report Generation
by: Zhang, Xiaodan, et al.
Published: (2025)

Residual Denoising Diffusion Models
by: Liu, Jiawei, et al.
Published: (2023)

FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models
by: Zheng, Weiying, et al.
Published: (2025)

ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales
by: Zhen, Yihao, et al.
Published: (2025)

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
by: Monon, Mashrafi, et al.
Published: (2026)

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)

Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
by: Cheng, Zhiheng, et al.
Published: (2024)

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
by: Liang, Huizhi, et al.
Published: (2026)

Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection
by: Jiang, Weibo, et al.
Published: (2024)

SoccerMaster: A Vision Foundation Model for Soccer Understanding
by: Yang, Haolin, et al.
Published: (2025)

Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning
by: Xing, Xiaodan, et al.
Published: (2024)

CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting
by: Sun, Wei, et al.
Published: (2025)

Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)

SpatialBot: Precise Spatial Understanding with Vision Language Models
by: Cai, Wenxiao, et al.
Published: (2024)

Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
by: Xie, Xing, et al.
Published: (2025)

TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)

Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
by: Zhou, Shengli, et al.
Published: (2026)

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
by: Song, Chan Hee, et al.
Published: (2024)

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding
by: Xia, Zhongyu, et al.
Published: (2026)

N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
by: Wang, Yuxin, et al.
Published: (2025)

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
by: Liu, Chengxin, et al.
Published: (2026)

EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models
by: Jiang, Feng, et al.
Published: (2025)

DSKC: Domain Style Modeling with Adaptive Knowledge Consolidation for Exemplar-free Lifelong Person Re-Identification
by: Liu, Shiben, et al.
Published: (2025)

Towards Universal Soccer Video Understanding
by: Rao, Jiayuan, et al.
Published: (2024)

Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
by: Harmanani, Mohamed, et al.
Published: (2026)

MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
by: Liu, Shengyuan, et al.
Published: (2026)

ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding
by: Shi, Liang, et al.
Published: (2024)

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation
by: Dao, Alan, et al.
Published: (2025)

Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
by: Wang, Yubo, et al.
Published: (2024)

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)

Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
by: Shi, Yiming, et al.
Published: (2024)

Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification
by: Liu, Shiben, et al.
Published: (2025)

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
by: Liu, Yifan, et al.
Published: (2025)

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)

Can 3D Vision-Language Models Truly Understand Natural Language?
by: Deng, Weipeng, et al.
Published: (2024)