| Main Authors: | Gizdov, Andrey, Procopio, Andrea, Li, Yichen, Harari, Daniel, Ullman, Tomer |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.12486 |
Similar Items
Towards aligned body representations in vision models
by: Gizdov, Andrey, et al.
Published: (2025)
Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
by: Procopio, Andrea, et al.
Published: (2025)
Chain of Time: In-Context Physical Simulation with Image Generation Models
by: Wang, YingQiao, et al.
Published: (2025)
The Illusion-Illusion: Vision Language Models See Illusions Where There are None
by: Ullman, Tomer
Published: (2024)
Category Query Learning for Human-Object Interaction Classification
by: Xie, Chi, et al.
Published: (2023)
MVP-Bench: Can Large Vision--Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
by: Kang, Donggoo, et al.
Published: (2024)
Object-Centric Vision Token Pruning for Vision Language Models
by: Li, Guangyuan, et al.
Published: (2025)
Like Humans to Few-Shot Learning through Knowledge Permeation of Vision and Text
by: Jia, Yuyu, et al.
Published: (2024)
Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
by: Li, Wenhao, et al.
Published: (2024)
MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
by: Liu, Shanhui, et al.
Published: (2025)
Two-stage Vision Transformers and Hard Masking offer Robust Object Representations
by: Aniraj, Ananthu, et al.
Published: (2025)
SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
by: zhi, Wang, et al.
Published: (2025)
CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
by: Du, Chengyi, et al.
Published: (2026)
Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
by: Takahashi, Soh, et al.
Published: (2025)
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
by: Pramanick, Shraman, et al.
Published: (2023)
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
by: Shi, Zhiyi, et al.
Published: (2025)
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
by: Luo, Junwei, et al.
Published: (2025)
MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
by: Yang, Garry, et al.
Published: (2025)
Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
by: Li, Siyuan, et al.
Published: (2023)
The Geometry of Representational Failures in Vision Language Models
by: Savietto, Daniele, et al.
Published: (2026)
Human-Object Interaction from Human-Level Instructions
by: Wu, Zhen, et al.
Published: (2024)
MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
by: Li, Chenxi, et al.
Published: (2025)
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
by: Cai, Huanqia, et al.
Published: (2025)
ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes
by: Malik, Hashmat Shadab, et al.
Published: (2024)
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
by: Du, Fan, et al.
Published: (2026)
CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
by: Cao, Zongsheng, et al.
Published: (2025)
Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models
by: He, Yuting, et al.
Published: (2026)
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
by: Wang, JiYang, et al.
Published: (2026)
Controllable Video Object Insertion via Multiview Priors
by: Qi, Xia, et al.
Published: (2026)
FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection
by: Deng, Ming, et al.
Published: (2025)
Deep Extrinsic Manifold Representation for Vision Tasks
by: Zhang, Tongtong, et al.
Published: (2024)
Untrained neural networks can demonstrate memorization-independent abstract reasoning
by: Barak, Tomer, et al.
Published: (2024)
Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection
by: Yao, Huizai, et al.
Published: (2025)
Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers
by: Ding, Rui, et al.
Published: (2024)
CoMa: Contextual Massing Generation with Vision-Language Models
by: Maslov, Evgenii, et al.
Published: (2026)
Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation
by: Zhu, Chunzheng, et al.
Published: (2026)
Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)
HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
by: Zeng, Haoxi, et al.
Published: (2025)
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
by: Zhang, Miaosen, et al.
Published: (2024)