Bibliographic details
Main authors: Huang, Yucheng; Ji, Luping; Jiang, Xiangwei; Li, Wen; Ye, Mao
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online access: https://arxiv.org/abs/2603.28178
_version_ 1866913046952148992
author Huang, Yucheng
Ji, Luping
Jiang, Xiangwei
Li, Wen
Ye, Mao
contents 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and affordance perception. To mitigate generalization issues caused by data scarcity, joint-embedding and generative proxy tasks have been proposed to pre-train 3DSG representations on datasets without predicate labels. Generative pre-training usually bypasses the semantic corruption caused by the geometric augmentations used in joint-embedding, but it cannot avoid a problem called the "Geometric Shortcut": exposing dense object spatial and scale priors induces models to trivially reconstruct scenes by interpolating object positions, rather than learning the underlying topological constraints provided by edges. To address this issue, we propose Topological Layout Learning (ToLL), a pretraining framework for 3DSG generation. Specifically, we design an Anchor-Conditioned Topological Geometry Reasoning. It adopts a recurrent GNN to recover the global layout of zero-centered subgraphs (the non-visible spatial features) from a single anchor with a sparse spatial prior. The absence of spatial layout information within the objects creates an information bottleneck, compelling the model to recover the full scene layout by leveraging predicate representation learning. Moreover, we construct a Structural Multi-view Augmentation to avoid semantic corruption, enhancing 3DSG representations via self-distillation. Extensive experiments on a special dataset demonstrate that ToLL often improves 3DSG pretraining quality, outperforming state-of-the-art baselines.
format Preprint
id arxiv_https___arxiv_org_abs_2603_28178
institution arXiv
publishDate 2026
record_format arxiv
title ToLL: Topological Layout Learning with Asymmetric Cross-View Structural Distillation for 3D Scene Graph Generation Pretraining
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.28178