| Main Authors: | Ma, Wufei; Wang, Chloe; Chen, Siyi; Peng, Jiawei; Li, Patrick; Yuille, Alan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.12449 |
Similar Items
DINeMo: Learning Neural Mesh Models with no 3D Annotations
by: Guo, Weijie, et al.
Published: (2025)
Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models
by: Lee, Jonathan, et al.
Published: (2025)
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
by: Ma, Wufei, et al.
Published: (2025)
4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
by: Zhong, Shanshan, et al.
Published: (2025)
NOVUM: Neural Object Volumes for Robust Object Classification
by: Jesslen, Artur, et al.
Published: (2023)
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
by: Wang, Xingrui, et al.
Published: (2024)
PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
by: Shi, Jiaxin, et al.
Published: (2026)
TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing
by: Sheung, Eddie Pokming, et al.
Published: (2025)
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
by: Wang, Xingrui, et al.
Published: (2025)
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
by: Ma, Wufei, et al.
Published: (2024)
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
by: Ma, Wufei, et al.
Published: (2025)
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
by: Ma, Wufei, et al.
Published: (2024)
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
by: Ma, Wufei, et al.
Published: (2024)
Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory
by: Yuille, Alan, et al.
Published: (2026)
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
by: Wang, Feng, et al.
Published: (2023)
SPFormer: Enhancing Vision Transformer with Superpixel Representation
by: Mei, Jieru, et al.
Published: (2024)
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
by: Chen, Jieneng, et al.
Published: (2024)
CRAVES: Controlling Robotic Arm with a Vision-based Economic System
by: Zuo, Yiming, et al.
Published: (2018)
SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation
by: Watanabe, Akihisa, et al.
Published: (2025)
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
by: Zhang, Tiezheng, et al.
Published: (2025)
Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
by: Chen, Yixiong, et al.
Published: (2026)
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
by: Yuan, Xiaoding, et al.
Published: (2024)
Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets
by: Chen, Yixiong, et al.
Published: (2024)
Dictionary-based Framework for Interpretable and Consistent Object Parsing
by: Zhang, Tiezheng, et al.
Published: (2025)
ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
by: Xiao, Junfei, et al.
Published: (2024)
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
by: Wang, Yuxuan, et al.
Published: (2024)
EgoSim: Egocentric World Simulator for Embodied Interaction Generation
by: Hao, Jinkun, et al.
Published: (2026)
ViT-5: Vision Transformers for The Mid-2020s
by: Wang, Feng, et al.
Published: (2026)
Generating Images with 3D Annotations Using Diffusion Models
by: Ma, Wufei, et al.
Published: (2023)
Autoregressive Pretraining with Mamba in Vision
by: Ren, Sucheng, et al.
Published: (2024)
VoGE: A Differentiable Volume Renderer using Gaussian Ellipsoids for Analysis-by-Synthesis
by: Wang, Angtian, et al.
Published: (2022)
Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
by: Paul, Soumava, et al.
Published: (2024)
Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
by: Paul, Soumava, et al.
Published: (2026)
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
by: Guan, Yaohan, et al.
Published: (2026)
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
by: Ren, Sucheng, et al.
Published: (2024)
Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?
by: Zhang, Tiezheng, et al.
Published: (2024)
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
by: Xie, Ziyang, et al.
Published: (2025)
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
by: Liu, Qihao, et al.
Published: (2024)
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
by: Yang, Timing, et al.
Published: (2025)