Staff View: Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering :: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Mingjie; Zhang, Yunfan; Zhang, Yiqun; Cheung, Yiu-ming
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.10865

_version_	1866911586834186240
author	Zhao, Mingjie; Zhang, Yunfan; Zhang, Yiqun; Cheung, Yiu-ming
author_facet	Zhao, Mingjie; Zhang, Yunfan; Zhang, Yiqun; Cheung, Yiu-ming
contents	Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_10865
institution	arXiv
publishDate	2026
record_format	arxiv
title	Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.10865
