Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.10048 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866913792407896064 |
|---|---|
| author | Du, Chengyi Jin, Keyan |
| author_facet | Du, Chengyi Jin, Keyan |
| contents | Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model' s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2504_10048 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers Du, Chengyi Jin, Keyan Computer Vision and Pattern Recognition Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model' s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks. |
| title | Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2504.10048 |