Bibliographic Details
Main Authors: Petnehazi, Gabor; Aradi, Bernadett
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2506.19992
_version_ 1866918134697426944
author Petnehazi, Gabor
Aradi, Bernadett
author_facet Petnehazi, Gabor
Aradi, Bernadett
contents The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main representation modes: 'direct' mode, which clusters based on original data embeddings or scaled numeric features, and 'description' mode, which clusters based on embeddings derived from LLM-generated summaries. Users can provide a 'topic_seed' to guide LLM-generated summaries towards specific themes. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results. We demonstrate HERCULES's capabilities and discuss its potential for extracting meaningful, hierarchical knowledge from complex datasets.
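The recursive k-means idea in the abstract can be illustrated with a minimal Python sketch. This is not the HERCULES package or its API: the names `kmeans` and `build_hierarchy`, the `ks` parameter, and the choice to cluster each level's centroids to form the next level are assumptions made for illustration only. The LLM titling step and the 'description' mode are omitted.

```python
import random
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the input points
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment so the labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def build_hierarchy(points, ks, seed=0):
    """Run k-means once per entry in ks, each time on the previous centroids.

    Assumption of this sketch: upper levels cluster lower-level centroids.
    Level i holds (centroids, labels); labels map each item of the level
    below (raw points for the first level) to its parent cluster.
    """
    levels = []
    current = list(points)  # level 0: the individual data points
    for k in ks:
        centroids, labels = kmeans(current, k, seed=seed)
        levels.append((centroids, labels))
        current = centroids
    return levels


# Toy 2-D data; 4 clusters at the first level, 2 at the second.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0),
          (20.0, 0.0), (21.0, 0.0), (30.0, 10.0), (30.0, 11.0)]
levels = build_hierarchy(points, ks=[4, 2])
for depth, (centroids, labels) in enumerate(levels, start=1):
    print(f"level {depth}: {len(centroids)} clusters, labels={labels}")
```

In a full pipeline, a titling step would presumably run on the members of each cluster at every level; here only the hierarchy construction is shown, and the cluster assignments depend on the random initialisation.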
format Preprint
id arxiv_https___arxiv_org_abs_2506_19992
institution arXiv
publishDate 2025
record_format arxiv
title HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2506.19992