| Main Authors: | Shi, Xinrui, Liu, Kai, Zhang, Ziqing, Li, Jianze, Li, Anqi, Zhang, Yulun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.26038 |
Similar Items
Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment
by: Liu, Kai, et al.
Published: (2024)
VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
by: Qin, Guangshuo, et al.
Published: (2026)
SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
by: Zhang, Hang, et al.
Published: (2024)
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
by: Ke, Jingyang, et al.
Published: (2026)
CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers
by: Liu, Kai, et al.
Published: (2025)
UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
by: Li, Anqi, et al.
Published: (2025)
The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study
by: Zhang, Weichen, et al.
Published: (2025)
TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks
by: Hu, Yuanze, et al.
Published: (2025)
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
by: Cao, Yue, et al.
Published: (2024)
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
by: Asfour, Alaa, et al.
Published: (2026)
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
by: Zhang, Yuelin, et al.
Published: (2026)
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)
DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation
by: Wang, Zirui, et al.
Published: (2025)
SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation
by: Zhang, Jiwen, et al.
Published: (2026)
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
by: Li, Nanxi, et al.
Published: (2026)
HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation
by: Chen, Zini, et al.
Published: (2026)
MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
by: Liu, Yi, et al.
Published: (2025)
Evolving Prompt Adaptation for Vision-Language Models
by: Zhang, Enming, et al.
Published: (2026)
Diffusion Models in Low-Level Vision: A Survey
by: He, Chunming, et al.
Published: (2024)
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)
ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
by: Zhang, Congzhi, et al.
Published: (2025)
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
by: Tong, Jintao, et al.
Published: (2025)
A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
by: Liang, Yuhan, et al.
Published: (2024)
Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening
by: Liu, Kai, et al.
Published: (2025)
Improve Vision Language Model Chain-of-thought Reasoning
by: Zhang, Ruohong, et al.
Published: (2024)
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
by: Liu, Hanqing, et al.
Published: (2026)
Asymmetric VAE for One-Step Video Super-Resolution Acceleration
by: Li, Jianze, et al.
Published: (2025)
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
by: Zhang, Yunzhi, et al.
Published: (2024)
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
by: Ogezi, Michael, et al.
Published: (2025)
VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
by: Wang, Chaoyang, et al.
Published: (2026)
Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
by: Li, Zhiyuan, et al.
Published: (2024)
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
by: Gu, Zheyuan, et al.
Published: (2026)
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
by: Lin, Baijiong, et al.
Published: (2024)
Tuning Vision-Language Models with Candidate Labels by Prompt Alignment
by: Zhang, Zhifang, et al.
Published: (2024)
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
by: Shi, Yukai, et al.
Published: (2025)
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
by: Tan, Huajie, et al.
Published: (2025)
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
by: Zhang, Yizhen, et al.
Published: (2025)
P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
by: Sheng, Kai, et al.
Published: (2026)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)