| Main Authors: | Shi, Yanzhao, Zhang, Xiaodan, Ji, Junzhong, Jiang, Haoning, Zheng, Chengxin, Wang, Yinong, Qu, Liangqiong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.09634 |
Similar Items
See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning
by: Zheng, Chengxin, et al.
Published: (2024)
MEPNet: Medical Entity-balanced Prompting Network for Brain CT Report Generation
by: Zhang, Xiaodan, et al.
Published: (2025)
Residual Denoising Diffusion Models
by: Liu, Jiawei, et al.
Published: (2023)
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models
by: Zheng, Weiying, et al.
Published: (2025)
ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales
by: Zhen, Yihao, et al.
Published: (2025)
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
by: Monon, Mashrafi, et al.
Published: (2026)
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
by: Cheng, Zhiheng, et al.
Published: (2024)
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
by: Liang, Huizhi, et al.
Published: (2026)
Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection
by: Jiang, Weibo, et al.
Published: (2024)
SoccerMaster: A Vision Foundation Model for Soccer Understanding
by: Yang, Haolin, et al.
Published: (2025)
Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning
by: Xing, Xiaodan, et al.
Published: (2024)
CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting
by: Sun, Wei, et al.
Published: (2025)
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
SpatialBot: Precise Spatial Understanding with Vision Language Models
by: Cai, Wenxiao, et al.
Published: (2024)
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
by: Xie, Xing, et al.
Published: (2025)
TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
by: Zhou, Shengli, et al.
Published: (2026)
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
by: Song, Chan Hee, et al.
Published: (2024)
3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding
by: Xia, Zhongyu, et al.
Published: (2026)
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
by: Wang, Yuxin, et al.
Published: (2025)
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
by: Liu, Chengxin, et al.
Published: (2026)
EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models
by: Jiang, Feng, et al.
Published: (2025)
DSKC: Domain Style Modeling with Adaptive Knowledge Consolidation for Exemplar-free Lifelong Person Re-Identification
by: Liu, Shiben, et al.
Published: (2025)
Towards Universal Soccer Video Understanding
by: Rao, Jiayuan, et al.
Published: (2024)
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
by: Harmanani, Mohamed, et al.
Published: (2026)
MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
by: Liu, Shengyuan, et al.
Published: (2026)
ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding
by: Shi, Liang, et al.
Published: (2024)
Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)
VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation
by: Dao, Alan, et al.
Published: (2025)
Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
by: Wang, Yubo, et al.
Published: (2024)
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)
Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
by: Shi, Yiming, et al.
Published: (2024)
Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification
by: Liu, Shiben, et al.
Published: (2025)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
by: Liu, Yifan, et al.
Published: (2025)
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)
Can 3D Vision-Language Models Truly Understand Natural Language?
by: Deng, Weipeng, et al.
Published: (2024)