:: Library Catalog

Bibliographic Details
Main Authors:	Bai, Weimin, Li, Yubo, Luo, Weijian, Lai, Zeqiang, Wang, Yequan, Chen, Wenzheng, Sun, He
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.14271

Similar Items

Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation
by: Bai, Weimin, et al.
Published: (2025)

Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching
by: Bai, Weimin, et al.
Published: (2025)

Integrating Amortized Inference with Diffusion Models for Learning Clean Distribution from Corrupted Images
by: Wang, Yifei, et al.
Published: (2024)

An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations
by: Bai, Weimin, et al.
Published: (2024)

Blind Inversion using Latent Diffusion Priors
by: Bai, Weimin, et al.
Published: (2024)

Unbiased Diffusion Variational Inversion via Principled Posterior Matching
by: Bai, Weimin, et al.
Published: (2026)

Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
by: Wang, Yifei, et al.
Published: (2025)

Learning Diffusion Model from Noisy Measurement using Principled Expectation-Maximization Method
by: Bai, Weimin, et al.
Published: (2024)

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
by: Ma, Weijian, et al.
Published: (2026)

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation
by: Chen, Zhixuan, et al.
Published: (2024)

InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
by: Bai, Weimin, et al.
Published: (2025)

Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
by: He, Chiyuan, et al.
Published: (2026)

SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
by: Zhao, Ruosen, et al.
Published: (2025)

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
by: Gholami, Mohsen, et al.
Published: (2025)

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)

3D-GPT: Procedural 3D Modeling with Large Language Models
by: Sun, Chunyi, et al.
Published: (2023)

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
by: Qi, Jianing, et al.
Published: (2025)

Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
by: Monon, Mashrafi, et al.
Published: (2026)

Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance
by: Zhang, Yingkai, et al.
Published: (2025)

Large Language Model with Region-guided Referring and Grounding for CT Report Generation
by: Chen, Zhixuan, et al.
Published: (2024)

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
by: Liang, Huizhi, et al.
Published: (2026)

Spatial-aware Vision Language Model for Autonomous Driving
by: Wei, Weijie, et al.
Published: (2025)

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
by: Liu, Yifan, et al.
Published: (2025)

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
by: Xie, Peng, et al.
Published: (2024)

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
by: Jiang, Jerry, et al.
Published: (2026)

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)

Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation
by: Zhong, Zhiyuan, et al.
Published: (2025)

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
by: Zhang, Jian, et al.
Published: (2026)

CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
by: Li, Jiahao, et al.
Published: (2025)

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
by: Shen, Qirui, et al.
Published: (2026)

CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios
by: Xu, Jialei, et al.
Published: (2025)

Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences
by: Luo, Weijian
Published: (2024)

G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
by: Hu, Wenbo, et al.
Published: (2025)

Backdooring Vision-Language Models with Out-Of-Distribution Data
by: Lyu, Weimin, et al.
Published: (2024)

LATTICE: Democratize High-Fidelity 3D Generation at Scale
by: Lai, Zeqiang, et al.
Published: (2025)

TrojVLM: Backdoor Attack Against Vision Language Models
by: Lyu, Weimin, et al.
Published: (2024)

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
by: Wu, Xianfeng, et al.
Published: (2025)

An Empirical Study Into What Matters for Calibrating Vision-Language Models
by: Tu, Weijie, et al.
Published: (2024)