Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Tian, Pinzhuo, Yang, Shengjie, Yu, Hang, Kot, Alex C.
Formato:	Preprint
Publicado:	2025
Materias:	Computer Vision and Pattern Recognition
Acceso en línea:	https://arxiv.org/abs/2509.02032
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866912566397108224
author	Tian, Pinzhuo Yang, Shengjie Yu, Hang Kot, Alex C.
author_facet	Tian, Pinzhuo Yang, Shengjie Yu, Hang Kot, Alex C.
contents	A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) a lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_02032
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning Tian, Pinzhuo Yang, Shengjie Yu, Hang Kot, Alex C. Computer Vision and Pattern Recognition A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) a lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.
title	ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.02032

Ejemplares similares