| Main Authors: | Bai, Weimin, Li, Yubo, Luo, Weijian, Lai, Zeqiang, Wang, Yequan, Chen, Wenzheng, Sun, He |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.14271 |
Similar Items
Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation
by: Bai, Weimin, et al.
Published: (2025)
Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching
by: Bai, Weimin, et al.
Published: (2025)
Integrating Amortized Inference with Diffusion Models for Learning Clean Distribution from Corrupted Images
by: Wang, Yifei, et al.
Published: (2024)
An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations
by: Bai, Weimin, et al.
Published: (2024)
Blind Inversion using Latent Diffusion Priors
by: Bai, Weimin, et al.
Published: (2024)
Unbiased Diffusion Variational Inversion via Principled Posterior Matching
by: Bai, Weimin, et al.
Published: (2026)
Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
by: Wang, Yifei, et al.
Published: (2025)
Learning Diffusion Model from Noisy Measurement using Principled Expectation-Maximization Method
by: Bai, Weimin, et al.
Published: (2024)
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
by: Ma, Weijian, et al.
Published: (2026)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)
Dia-LLaMA: Towards Large Language Model-driven CT Report Generation
by: Chen, Zhixuan, et al.
Published: (2024)
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
by: Bai, Weimin, et al.
Published: (2025)
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
by: He, Chiyuan, et al.
Published: (2026)
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
by: Zhao, Ruosen, et al.
Published: (2025)
Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
by: Gholami, Mohsen, et al.
Published: (2025)
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)
3D-GPT: Procedural 3D Modeling with Large Language Models
by: Sun, Chunyi, et al.
Published: (2023)
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
by: Qi, Jianing, et al.
Published: (2025)
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
by: Monon, Mashrafi, et al.
Published: (2026)
Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance
by: Zhang, Yingkai, et al.
Published: (2025)
Large Language Model with Region-guided Referring and Grounding for CT Report Generation
by: Chen, Zhixuan, et al.
Published: (2024)
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
by: Liang, Huizhi, et al.
Published: (2026)
Spatial-aware Vision Language Model for Autonomous Driving
by: Wei, Weijie, et al.
Published: (2025)
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
by: Liu, Yifan, et al.
Published: (2025)
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
by: Xie, Peng, et al.
Published: (2024)
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
by: Jiang, Jerry, et al.
Published: (2026)
Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)
Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation
by: Zhong, Zhiyuan, et al.
Published: (2025)
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
by: Zhang, Jian, et al.
Published: (2026)
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
by: Li, Jiahao, et al.
Published: (2025)
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
by: Shen, Qirui, et al.
Published: (2026)
CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios
by: Xu, Jialei, et al.
Published: (2025)
Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences
by: Luo, Weijian
Published: (2024)
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
by: Hu, Wenbo, et al.
Published: (2025)
Backdooring Vision-Language Models with Out-Of-Distribution Data
by: Lyu, Weimin, et al.
Published: (2024)
LATTICE: Democratize High-Fidelity 3D Generation at Scale
by: Lai, Zeqiang, et al.
Published: (2025)
TrojVLM: Backdoor Attack Against Vision Language Models
by: Lyu, Weimin, et al.
Published: (2024)
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
by: Wu, Xianfeng, et al.
Published: (2025)
An Empirical Study Into What Matters for Calibrating Vision-Language Models
by: Tu, Weijie, et al.
Published: (2024)