Staff View: GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Chuang; Yao, Zelin; Ma, Xueqi; Wang, Luzhi; Chen, Mukun; Xu, Pinghua; Hu, Wenbin
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.01310

_version_	1866911640394399744
author	Liu, Chuang; Yao, Zelin; Ma, Xueqi; Wang, Luzhi; Chen, Mukun; Xu, Pinghua; Hu, Wenbin
author_facet	Liu, Chuang; Yao, Zelin; Ma, Xueqi; Wang, Luzhi; Chen, Mukun; Xu, Pinghua; Hu, Wenbin
contents	Graph self-supervised learning typically relies on large-scale unlabeled datasets, which heavily inflates computational costs. However, empirical evidence suggests that these datasets contain substantial redundancy: our analysis reveals that uniformly subsampling 50% of graphs retains over 96% of downstream performance. To exploit this redundancy, we introduce GraphSculptor for pre-training coreset construction. Unlike methods that depend on additional training-time signals or are limited solely to topological statistics, GraphSculptor provides a label-free solution that constructs coresets from two complementary perspectives: intrinsic structure and contextual semantics. Concretely, structural diversity is quantified using intrinsic graph statistics, yielding a structural feature vector for each graph, while semantic diversity is captured by using a pre-trained language model to encode descriptions generated via graph-to-text. GraphSculptor integrates these signals into a unified metric space and performs cluster-aware selection to preserve joint structural-semantic diversity. We further derive a theoretical bound on the loss gap between coreset and full-data pre-training, offering theoretical motivation for our selection formulation. Extensive experiments demonstrate that GraphSculptor effectively sculpts the dataset: a 10% coreset achieves 99.6% of full-data performance while reducing pre-training time by nearly 90%, offering a scalable solution for data-efficient graph pre-training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_01310
institution	arXiv
publishDate	2026
record_format	arxiv
title	GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2605.01310
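
The abstract in this record describes a pipeline of per-graph structural statistics, language-model embeddings of graph-to-text descriptions, a unified metric space, and cluster-aware selection. The sketch below is one way such a pipeline could look and is not the authors' implementation. Everything named here is an assumption made for illustration: the five structural statistics, the weighting parameter `alpha`, the plain k-means clustering, and the proportional nearest-to-centroid quota rule. The semantic embeddings are taken as a precomputed array, since the record does not say which language model or graph-to-text procedure is used.

```python
import numpy as np

def structural_features(num_nodes, edges):
    """Illustrative intrinsic statistics for one undirected graph:
    node count, edge count, mean degree, max degree, density."""
    n, m = num_nodes, len(edges)
    deg = np.zeros(n)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    mean_deg = deg.mean() if n else 0.0
    max_deg = deg.max() if n else 0.0
    return np.array([n, m, mean_deg, max_deg, density], dtype=float)

def zscore(x):
    """Standardize columns; constant columns are left at zero."""
    std = x.std(axis=0)
    return (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's k-means (a stand-in for whatever clustering is used)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    # Recompute labels so they match the final centers.
    labels = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return labels, centers

def select_coreset(struct, sem, ratio, k, alpha=0.5):
    """Pick roughly ratio * N graphs, spreading the budget over clusters.

    struct: (N, d_s) structural feature vectors
    sem:    (N, d_t) semantic embeddings (assumed precomputed)
    alpha:  hypothetical weight balancing the two views
    """
    n = len(struct)
    # Unified space: standardized views, weighted and concatenated.
    z = np.hstack([alpha * zscore(struct), (1.0 - alpha) * zscore(sem)])
    labels, centers = kmeans(z, k)
    budget = max(1, round(ratio * n))
    selected = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        # Quota proportional to cluster size, at least one per cluster.
        quota = max(1, round(budget * len(members) / n))
        dist = ((z[members] - centers[j]) ** 2).sum(1)
        selected.extend(members[np.argsort(dist)[:quota]].tolist())
    return sorted(selected)
```

Because each non-empty cluster contributes at least one graph, the returned coreset can exceed the nominal budget by up to `k` items; a stricter budget would need an extra trimming step, which is omitted here.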
