:: Library Catalog

Bibliographic Details
Main Authors:	Gizdov, Andrey; Procopio, Andrea; Li, Yichen; Harari, Daniel; Ullman, Tomer
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.12486

Similar Items

Towards aligned body representations in vision models
by: Gizdov, Andrey, et al.
Published: (2025)

Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
by: Procopio, Andrea, et al.
Published: (2025)

Chain of Time: In-Context Physical Simulation with Image Generation Models
by: Wang, YingQiao, et al.
Published: (2025)

The Illusion-Illusion: Vision Language Models See Illusions Where There are None
by: Ullman, Tomer
Published: (2024)

Category Query Learning for Human-Object Interaction Classification
by: Xie, Chi, et al.
Published: (2023)

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
by: Kang, Donggoo, et al.
Published: (2024)

Object-Centric Vision Token Pruning for Vision Language Models
by: Li, Guangyuan, et al.
Published: (2025)

Like Humans to Few-Shot Learning through Knowledge Permeation of Vision and Text
by: Jia, Yuyu, et al.
Published: (2024)

Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
by: Li, Wenhao, et al.
Published: (2024)

MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
by: Liu, Shanhui, et al.
Published: (2025)

Two-stage Vision Transformers and Hard Masking offer Robust Object Representations
by: Aniraj, Ananthu, et al.
Published: (2025)

SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
by: Zhi, Wang, et al.
Published: (2025)

CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
by: Du, Chengyi, et al.
Published: (2026)

Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
by: Takahashi, Soh, et al.
Published: (2025)

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
by: Pramanick, Shraman, et al.
Published: (2023)

DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
by: Shi, Zhiyi, et al.
Published: (2025)

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
by: Luo, Junwei, et al.
Published: (2025)

MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
by: Yang, Garry, et al.
Published: (2025)

Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
by: Li, Siyuan, et al.
Published: (2023)

The Geometry of Representational Failures in Vision Language Models
by: Savietto, Daniele, et al.
Published: (2026)

Human-Object Interaction from Human-Level Instructions
by: Wu, Zhen, et al.
Published: (2024)

MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
by: Li, Chenxi, et al.
Published: (2025)

MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
by: Cai, Huanqia, et al.
Published: (2025)

ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes
by: Malik, Hashmat Shadab, et al.
Published: (2024)

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
by: Du, Fan, et al.
Published: (2026)

CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
by: Cao, Zongsheng, et al.
Published: (2025)

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models
by: He, Yuting, et al.
Published: (2026)

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
by: Wang, JiYang, et al.
Published: (2026)

Controllable Video Object Insertion via Multiview Priors
by: Qi, Xia, et al.
Published: (2026)

FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection
by: Deng, Ming, et al.
Published: (2025)

Deep Extrinsic Manifold Representation for Vision Tasks
by: Zhang, Tongtong, et al.
Published: (2024)

Untrained neural networks can demonstrate memorization-independent abstract reasoning
by: Barak, Tomer, et al.
Published: (2024)

Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection
by: Yao, Huizai, et al.
Published: (2025)

Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers
by: Ding, Rui, et al.
Published: (2024)

CoMa: Contextual Massing Generation with Vision-Language Models
by: Maslov, Evgenii, et al.
Published: (2026)

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation
by: Zhu, Chunzheng, et al.
Published: (2026)

Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)

HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
by: Zeng, Haoxi, et al.
Published: (2025)

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
by: Zhang, Miaosen, et al.
Published: (2024)