| Main Authors: | Yuille, Alan, Kersten, Daniel |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.00289 |
Similar Items
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
by: Wang, Feng, et al.
Published: (2023)
SPFormer: Enhancing Vision Transformer with Superpixel Representation
by: Mei, Jieru, et al.
Published: (2024)
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
by: Chen, Jieneng, et al.
Published: (2024)
LychSim: A Controllable and Interactive Simulation Framework for Vision Research
by: Ma, Wufei, et al.
Published: (2026)
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
by: Xiao, Junfei, et al.
Published: (2024)
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
by: Zhang, Tiezheng, et al.
Published: (2025)
ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)
Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
by: Paul, Soumava, et al.
Published: (2026)
Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
by: Paul, Soumava, et al.
Published: (2024)
Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets
by: Chen, Yixiong, et al.
Published: (2024)
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
by: Guan, Yaohan, et al.
Published: (2026)
ViT-5: Vision Transformers for The Mid-2020s
by: Wang, Feng, et al.
Published: (2026)
CRAVES: Controlling Robotic Arm with a Vision-based Economic System
by: Zuo, Yiming, et al.
Published: (2018)
Beyond Masks: The Case for Medical Image Parsing
by: Gupta, Siddharth, et al.
Published: (2026)
A Bayesian Approach to OOD Robustness in Image Classification
by: Kaushik, Prakhar, et al.
Published: (2024)
Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
by: Chen, Yixiong, et al.
Published: (2026)
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
by: Yang, Timing, et al.
Published: (2025)
From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation
by: Xie, Yunfei, et al.
Published: (2024)
ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access
by: Yang, Timing, et al.
Published: (2025)
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
by: Liu, Qihao, et al.
Published: (2024)
DINeMo: Learning Neural Mesh Models with no 3D Annotations
by: Guo, Weijie, et al.
Published: (2025)
Dictionary-based Framework for Interpretable and Consistent Object Parsing
by: Zhang, Tiezheng, et al.
Published: (2025)
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
by: Wang, Yuxuan, et al.
Published: (2024)
From Pixel to Cancer: Cellular Automata in Computed Tomography
by: Lai, Yuxiang, et al.
Published: (2024)
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
by: Ren, Sucheng, et al.
Published: (2024)
Generative World Explorer
by: Lu, Taiming, et al.
Published: (2024)
CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding
by: Chen, Yixiong, et al.
Published: (2025)
Mamba-R: Vision Mamba ALSO Needs Registers
by: Wang, Feng, et al.
Published: (2024)
How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?
by: Li, Wenxuan, et al.
Published: (2025)
Fuzzy Theory in Computer Vision: A Review
by: Yerkin, Adilet, et al.
Published: (2025)
Efficient Large Multi-modal Models via Visual Context Compression
by: Chen, Jieneng, et al.
Published: (2024)
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
by: Wang, Feng, et al.
Published: (2024)
Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning
by: Duan, Ruxiao, et al.
Published: (2023)
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
by: Yuan, Xiaoding, et al.
Published: (2024)
Name That Part: 3D Part Segmentation and Naming
by: Paul, Soumava, et al.
Published: (2025)
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
by: Liu, Qihao, et al.
Published: (2024)
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
by: Li, Zhuowan, et al.
Published: (2022)
VoGE: A Differentiable Volume Renderer using Gaussian Ellipsoids for Analysis-by-Synthesis
by: Wang, Angtian, et al.
Published: (2022)
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
by: Chen, Yixiong, et al.
Published: (2025)
Autoregressive Pretraining with Mamba in Vision
by: Ren, Sucheng, et al.
Published: (2024)