Enregistré dans:
Détails bibliographiques
Auteurs principaux: Liu, Kaiwen, Villalobos, Seba Daniela, Zhang, Qin
Format: Preprint
Publié: 2026
Sujets:
Accès en ligne:https://arxiv.org/abs/2605.07091
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866918489129746432
author Liu, Kaiwen
Villalobos, Seba Daniela
Zhang, Qin
author_facet Liu, Kaiwen
Villalobos, Seba Daniela
Zhang, Qin
contents We study the correlation clustering problem in the node-arrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets), and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
format Preprint
id arxiv_https___arxiv_org_abs_2605_07091
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Estimating Correlation Clustering Cost in Node-Arrival Stream
Liu, Kaiwen
Villalobos, Seba Daniela
Zhang, Qin
Data Structures and Algorithms
We study the correlation clustering problem in the node-arrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets), and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
title Estimating Correlation Clustering Cost in Node-Arrival Stream
topic Data Structures and Algorithms
url https://arxiv.org/abs/2605.07091