Bibliographic Details
Main Authors: Zhao, Mingjie, Zhang, Yunfan, Zhang, Yiqun, Cheung, Yiu-ming
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2604.10865
author Zhao, Mingjie
Zhang, Yunfan
Zhang, Yiqun
Cheung, Yiu-ming
contents Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
format Preprint
id arxiv_https___arxiv_org_abs_2604_10865
institution arXiv
publishDate 2026
record_format arxiv
title Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
topic Artificial Intelligence
url https://arxiv.org/abs/2604.10865