Library Catalog

Bibliographic Details
Main Authors:	Ma, Wufei; Wang, Chloe; Chen, Siyi; Peng, Jiawei; Li, Patrick; Yuille, Alan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.12449

Similar Items

DINeMo: Learning Neural Mesh Models with no 3D Annotations
by: Guo, Weijie, et al.
Published: (2025)

Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models
by: Lee, Jonathan, et al.
Published: (2025)

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
by: Ma, Wufei, et al.
Published: (2025)

4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
by: Zhong, Shanshan, et al.
Published: (2025)

NOVUM: Neural Object Volumes for Robust Object Classification
by: Jesslen, Artur, et al.
Published: (2023)

Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
by: Wang, Xingrui, et al.
Published: (2024)

PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
by: Shi, Jiaxin, et al.
Published: (2026)

TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing
by: Sheung, Eddie Pokming, et al.
Published: (2025)

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
by: Wang, Xingrui, et al.
Published: (2025)

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
by: Ma, Wufei, et al.
Published: (2024)

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
by: Ma, Wufei, et al.
Published: (2025)

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
by: Ma, Wufei, et al.
Published: (2024)

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
by: Ma, Wufei, et al.
Published: (2024)

Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory
by: Yuille, Alan, et al.
Published: (2026)

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
by: Wang, Feng, et al.
Published: (2023)

SPFormer: Enhancing Vision Transformer with Superpixel Representation
by: Mei, Jieru, et al.
Published: (2024)

ViTamin: Designing Scalable Vision Models in the Vision-Language Era
by: Chen, Jieneng, et al.
Published: (2024)

CRAVES: Controlling Robotic Arm with a Vision-based Economic System
by: Zuo, Yiming, et al.
Published: (2018)

SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation
by: Watanabe, Akihisa, et al.
Published: (2025)

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
by: Zhang, Tiezheng, et al.
Published: (2025)

Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
by: Chen, Yixiong, et al.
Published: (2026)

CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
by: Yuan, Xiaoding, et al.
Published: (2024)

Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets
by: Chen, Yixiong, et al.
Published: (2024)

Dictionary-based Framework for Interpretable and Consistent Object Parsing
by: Zhang, Tiezheng, et al.
Published: (2025)

ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
by: Xiao, Junfei, et al.
Published: (2024)

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
by: Wang, Yuxuan, et al.
Published: (2024)

EgoSim: Egocentric World Simulator for Embodied Interaction Generation
by: Hao, Jinkun, et al.
Published: (2026)

ViT-5: Vision Transformers for The Mid-2020s
by: Wang, Feng, et al.
Published: (2026)

Generating Images with 3D Annotations Using Diffusion Models
by: Ma, Wufei, et al.
Published: (2023)

Autoregressive Pretraining with Mamba in Vision
by: Ren, Sucheng, et al.
Published: (2024)

VoGE: A Differentiable Volume Renderer using Gaussian Ellipsoids for Analysis-by-Synthesis
by: Wang, Angtian, et al.
Published: (2022)

Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
by: Paul, Soumava, et al.
Published: (2024)

Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
by: Paul, Soumava, et al.
Published: (2026)

GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
by: Guan, Yaohan, et al.
Published: (2026)

Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
by: Ren, Sucheng, et al.
Published: (2024)

Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?
by: Zhang, Tiezheng, et al.
Published: (2024)

Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
by: Xie, Ziyang, et al.
Published: (2025)

Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
by: Liu, Qihao, et al.
Published: (2024)

RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
by: Yang, Timing, et al.
Published: (2025)