Guardado en:
Detalles Bibliográficos
Autores principales: Tian, Pinzhuo, Yang, Shengjie, Yu, Hang, Kot, Alex C.
Formato: Preprint
Publicado: 2025
Materias:
Acceso en línea:https://arxiv.org/abs/2509.02032
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
_version_ 1866912566397108224
author Tian, Pinzhuo
Yang, Shengjie
Yu, Hang
Kot, Alex C.
author_facet Tian, Pinzhuo
Yang, Shengjie
Yu, Hang
Kot, Alex C.
contents A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) a lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.
format Preprint
id arxiv_https___arxiv_org_abs_2509_02032
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning
Tian, Pinzhuo
Yang, Shengjie
Yu, Hang
Kot, Alex C.
Computer Vision and Pattern Recognition
A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) a lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.
title ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.02032