Library Catalog

Bibliographic Details
Main Authors:	Shi, Xinrui; Liu, Kai; Zhang, Ziqing; Li, Jianze; Li, Anqi; Zhang, Yulun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.26038

Similar Items

Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment
by: Liu, Kai, et al.
Published: (2024)

VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
by: Qin, Guangshuo, et al.
Published: (2026)

SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
by: Zhang, Hang, et al.
Published: (2024)

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
by: Ke, Jingyang, et al.
Published: (2026)

CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers
by: Liu, Kai, et al.
Published: (2025)

UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
by: Li, Anqi, et al.
Published: (2025)

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study
by: Zhang, Weichen, et al.
Published: (2025)

TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks
by: Hu, Yuanze, et al.
Published: (2025)

SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
by: Cao, Yue, et al.
Published: (2024)

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
by: Asfour, Alaa, et al.
Published: (2026)

Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
by: Zhang, Yuelin, et al.
Published: (2026)

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)

DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation
by: Wang, Zirui, et al.
Published: (2025)

SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation
by: Zhang, Jiwen, et al.
Published: (2026)

Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
by: Li, Nanxi, et al.
Published: (2026)

HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation
by: Chen, Zini, et al.
Published: (2026)

MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
by: Liu, Yi, et al.
Published: (2025)

Evolving Prompt Adaptation for Vision-Language Models
by: Zhang, Enming, et al.
Published: (2026)

Diffusion Models in Low-Level Vision: A Survey
by: He, Chunming, et al.
Published: (2024)

LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
by: Zhang, Congzhi, et al.
Published: (2025)

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
by: Tong, Jintao, et al.
Published: (2025)

A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
by: Liang, Yuhan, et al.
Published: (2024)

Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening
by: Liu, Kai, et al.
Published: (2025)

Improve Vision Language Model Chain-of-thought Reasoning
by: Zhang, Ruohong, et al.
Published: (2024)

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
by: Liu, Hanqing, et al.
Published: (2026)

Asymmetric VAE for One-Step Video Super-Resolution Acceleration
by: Li, Jianze, et al.
Published: (2025)

The Scene Language: Representing Scenes with Programs, Words, and Embeddings
by: Zhang, Yunzhi, et al.
Published: (2024)

SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
by: Ogezi, Michael, et al.
Published: (2025)

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
by: Wang, Chaoyang, et al.
Published: (2026)

Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
by: Li, Zhiyuan, et al.
Published: (2024)

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
by: Gu, Zheyuan, et al.
Published: (2026)

MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
by: Lin, Baijiong, et al.
Published: (2024)

Tuning Vision-Language Models with Candidate Labels by Prompt Alignment
by: Zhang, Zhifang, et al.
Published: (2024)

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
by: Shi, Yukai, et al.
Published: (2025)

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
by: Tan, Huajie, et al.
Published: (2025)

PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
by: Zhang, Yizhen, et al.
Published: (2025)

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
by: Sheng, Kai, et al.
Published: (2026)

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)