Bibliographic Details
Main Authors: Okano, Ryo, Imaizumi, Masaaki
Format: Preprint
Published: 2024
Subjects: Methodology
Online Access: https://arxiv.org/abs/2407.08228
author Okano, Ryo
Imaizumi, Masaaki
contents We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes modes of variation of the data, because the space of probability distributions lacks a vector space structure, which prevents the application of existing methods devised for functional data. Our clustering method for distributional data accounts for differences in both the means and the modes of variation of clusters, in the spirit of the $k$-centers clustering approach proposed for functional data. Specifically, we consider the space of distributions equipped with the Wasserstein metric and define geodesic modes of variation of distributional data using the notion of geodesic principal component analysis. Then, we utilize the geodesic modes of clusters to predict the cluster membership of each distribution. We theoretically show the validity of the proposed clustering criterion by studying the probability of correct membership. Through a simulation study and a real-data application, we demonstrate that the proposed distributional clustering method can improve clustering quality compared with conventional clustering algorithms.
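For orientation, the Wasserstein geometry mentioned in the abstract can be illustrated with a minimal Python sketch. It relies on a standard fact for distributions on the real line: the 2-Wasserstein distance equals the L2 distance between quantile functions, and the Wasserstein mean (barycenter) is the pointwise average of quantile functions. The function names (`quantile_function`, `wasserstein2`, `wasserstein_barycenter`, `nearest_center`), the grid size, and the sample data are hypothetical choices made for illustration. The assignment step shown here uses cluster means only; it is not the criterion proposed in the preprint, which additionally uses geodesic modes of variation.

```python
import numpy as np

def quantile_function(samples, grid):
    """Empirical quantile function of `samples`, evaluated at levels `grid` in (0, 1)."""
    return np.quantile(np.asarray(samples, dtype=float), grid)

def wasserstein2(q1, q2):
    """2-Wasserstein distance between two distributions on the real line,
    given their quantile values on a common uniform midpoint grid.
    The mean of squared differences approximates the integral over (0, 1)."""
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

def wasserstein_barycenter(quantiles):
    """Wasserstein mean in one dimension: pointwise average of quantile functions."""
    return np.mean(np.stack(quantiles), axis=0)

def nearest_center(q, centers):
    """Index of the closest center in Wasserstein distance.
    Mean-only assignment (an assumption for illustration), not the paper's criterion."""
    return int(np.argmin([wasserstein2(q, c) for c in centers]))

# Hypothetical data: one sample and shifted copies of it.
grid = (np.arange(100) + 0.5) / 100          # midpoint levels in (0, 1)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
q_a = quantile_function(x, grid)             # distribution A
q_b = quantile_function(x + 3.0, grid)       # A shifted by 3
dist_ab = wasserstein2(q_a, q_b)             # a pure shift by 3 has distance 3
bary = wasserstein_barycenter([q_a, q_b])    # midpoint: A shifted by 1.5
label = nearest_center(quantile_function(x + 2.9, grid), [q_a, q_b])
```

Because a pure location shift moves every quantile by the same amount, `dist_ab` equals the shift size, which makes the example easy to check by hand.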
format Preprint
id arxiv_https___arxiv_org_abs_2407_08228
institution arXiv
publishDate 2024
record_format arxiv
title Wasserstein $k$-Centers Clustering for Distributional Data
topic Methodology
url https://arxiv.org/abs/2407.08228