| Main Authors: | Hu, Junyi; Bai, Tian; Wu, Fengyi; Li, Wenyan; Peng, Zhenming; Zhang, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2601.22666 |
| _version_ | 1866911410763595776 |
|---|---|
| author | Hu, Junyi; Bai, Tian; Wu, Fengyi; Li, Wenyan; Peng, Zhenming; Zhang, Yi |
| author_facet | Hu, Junyi; Bai, Tian; Wu, Fengyi; Li, Wenyan; Peng, Zhenming; Zhang, Yi |
| contents | Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_22666 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2601.22666 |
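
The abstract describes attention-based soft MIL pooling over token-region similarities. As a rough illustration only, the idea can be sketched as a softmax-weighted expectation over a similarity matrix. This is a generic soft MIL pooling sketch, not the paper's Expectation Alignment Head: the function names `softmax` and `soft_mil_pool`, the `temperature` parameter, and the choice to treat every (token, region) pair as one instance are all assumptions made for this example.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mil_pool(similarities, temperature=1.0):
    """Pool a token-by-region similarity matrix into one bag-level score.

    Assumption for this sketch: each (token, region) pair is an instance,
    and the attention weights are a softmax over the similarities
    themselves. The pooled score is the expected similarity under those
    weights.
    """
    flat = [s for row in similarities for s in row]
    weights = softmax(flat, temperature)
    return sum(w * s for w, s in zip(weights, flat))


if __name__ == "__main__":
    # Hypothetical similarities: 2 text tokens (rows) x 2 image regions (columns).
    sims = [[0.1, 0.9], [0.2, 0.3]]
    print(soft_mil_pool(sims, temperature=0.1))
    print(soft_mil_pool(sims, temperature=10.0))
```

With a small temperature the pooled score moves toward the maximum similarity (hard instance selection); with a large temperature it moves toward the plain mean. This is one way to read "implicit token and instance selection without additional annotations", though the paper may parameterize the attention differently.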