| Main Authors: | Wu, Hongyu, Yang, Pengwan, Asano, Yuki M., Snoek, Cees G. M. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.19331 |
Similar Items
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
by: Dorkenwald, Michael, et al.
Published: (2024)
Elastic ViTs from Pretrained Models without Retraining
by: Simoncini, Walter, et al.
Published: (2025)
Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection
by: Salehi, Mohammadreza, et al.
Published: (2024)
Lost in Time: A New Temporal Benchmark for VideoLLMs
by: Cores, Daniel, et al.
Published: (2024)
MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
by: Salehi, Mohammadreza, et al.
Published: (2025)
GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
by: Sträter, Luc P. J., et al.
Published: (2024)
Any-Shift Prompting for Generalization over Distributions
by: Xiao, Zehao, et al.
Published: (2024)
SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
by: Rastegar, Sarah, et al.
Published: (2024)
SIGMA: Sinkhorn-Guided Masked Video Modeling
by: Salehi, Mohammadreza, et al.
Published: (2024)
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
by: Bhowmik, Aritra, et al.
Published: (2024)
TULIP: Token-length Upgraded CLIP
by: Najdenkoska, Ivona, et al.
Published: (2024)
QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
by: Sun, Wenfang, et al.
Published: (2024)
SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation
by: Nguyen, Duy-Kien, et al.
Published: (2023)
Prompt Diffusion Robustifies Any-Modality Prompt Learning
by: Du, Yingjun, et al.
Published: (2024)
Training-Free Semantic Segmentation via LLM-Supervision
by: Sun, Wenfang, et al.
Published: (2024)
Low-Resource Vision Challenges for Foundation Models
by: Zhang, Yunhua, et al.
Published: (2024)
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
by: Liu, Huabin, et al.
Published: (2025)
SAMPart3D: Segment Any Part in 3D Objects
by: Yang, Yunhan, et al.
Published: (2024)
IPO: Interpretable Prompt Optimization for Vision-Language Models
by: Du, Yingjun, et al.
Published: (2024)
SAI3D: Segment Any Instance in 3D Scenes
by: Yin, Yingda, et al.
Published: (2023)
Dual Guidance Semi-Supervised Action Detection
by: Singh, Ankit, et al.
Published: (2025)
SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail
by: Du, Yingjun, et al.
Published: (2023)
LocoMotion: Learning Motion-Focused Video-Language Representations
by: Doughty, Hazel, et al.
Published: (2024)
SAS: Segment Any 3D Scene with Integrated 2D Priors
by: Li, Zhuoyuan, et al.
Published: (2025)
Beyond Coarse-Grained Matching in Video-Text Retrieval
by: Chen, Aozhu, et al.
Published: (2024)
MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
by: Bhowmik, Aritra, et al.
Published: (2025)
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
by: Sun, Wenfang, et al.
Published: (2026)
Union-over-Intersections: Object Detection beyond Winner-Takes-All
by: Bhowmik, Aritra, et al.
Published: (2023)
NeoBabel: A Multilingual Open Tower for Visual Generation
by: Derakhshani, Mohammad Mahdi, et al.
Published: (2025)
The Sound of Water: Inferring Physical Properties from Pouring Liquids
by: Bagad, Piyush, et al.
Published: (2024)
Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection
by: Salehi, Alireza, et al.
Published: (2025)
FVO: Fast Visual Odometry with Transformers
by: Yugay, Vladimir, et al.
Published: (2025)
Segment Any 3D Gaussians
by: Cen, Jiazhong, et al.
Published: (2023)
SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything
by: van Dalen, Joost, et al.
Published: (2025)
Find Any Part in 3D
by: Ma, Ziqi, et al.
Published: (2024)
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
by: Wang, Ziqi, et al.
Published: (2025)
TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models
by: Veenboer, Tim, et al.
Published: (2025)
Segment Any 4D Gaussians
by: Ji, Shengxiang, et al.
Published: (2024)
Auto-Vocabulary Semantic Segmentation
by: Ülger, Osman, et al.
Published: (2023)
Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis
by: Lilova, Valentina, et al.
Published: (2025)