Bibliographic Details
Main Authors: Desai, Radhika Amar; Narendra, Modigari
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Machine Learning
Online Access: https://arxiv.org/abs/2605.09697
author Desai, Radhika Amar
Narendra, Modigari
contents In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
format Preprint
id arxiv_https___arxiv_org_abs_2605_09697
institution arXiv
publishDate 2026
record_format arxiv
title Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction
topic Computer Vision and Pattern Recognition
Machine Learning
url https://arxiv.org/abs/2605.09697
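
The relative projection error described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `relative_projection_error`, the choice of difference vectors (here, each row is assumed to be a synthetic-positive embedding minus the embedding of its source negative), and the way the classifier weight vector `w` is obtained are all assumptions made for illustration. The sketch solves a least-squares problem to project `w` onto the span of the difference vectors and reports the residual norm relative to the norm of `w`.

```python
import numpy as np


def relative_projection_error(w, diffs, rcond=1e-10):
    """Relative error of projecting w onto the span of the rows of diffs.

    w     : (d,) weight vector of a linear classifier (assumed given).
    diffs : (m, d) difference vectors in embedding space (assumed pairing:
            synthetic positive minus its source negative).
    Returns ||w - P w|| / ||w||, where P projects onto span(diffs).
    """
    w = np.asarray(w, dtype=float)
    D = np.asarray(diffs, dtype=float)
    w_norm = np.linalg.norm(w)
    if w_norm == 0.0:
        raise ValueError("w must be a non-zero vector")
    # Coefficients c minimizing ||D.T @ c - w||; D.T @ c is the projection of w.
    coeffs, *_ = np.linalg.lstsq(D.T, w, rcond=rcond)
    residual = w - D.T @ coeffs
    return float(np.linalg.norm(residual) / w_norm)


# Toy 3-D illustration (hypothetical values, not results from the paper).
w = np.array([1.0, 0.0, 0.0])
good_diffs = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # span contains w
bad_diffs = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])   # span orthogonal to w

err_good = relative_projection_error(w, good_diffs)
err_bad = relative_projection_error(w, bad_diffs)
print(err_good, err_bad)
```

In this toy case the first set of variations contains the classifier direction, so the error is essentially zero, while the second set is orthogonal to it, so the error is one. In practice the embeddings would come from a pre-trained foundation model, and the number of difference vectors relative to the embedding dimension affects how easily `w` can be spanned; the sketch does not address that.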