Bibliographic Details
Main Authors: Ye, Jiaojiao, Zhong, Jiaxing, Xie, Qian, Zhou, Yuzhou, Trigoni, Niki, Markham, Andrew
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2510.05722
_version_ 1866915535865774080
author Ye, Jiaojiao
Zhong, Jiaxing
Xie, Qian
Zhou, Yuzhou
Trigoni, Niki
Markham, Andrew
contents Generating sufficient and diverse data through augmentation offers an efficient solution to the time-consuming, labour-intensive process of collecting images and annotating them at the pixel level. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. Diffusion models, in contrast, offer a robust alternative through text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves downstream task performance. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i are promising and outperform concurrent work on one-shot semantic segmentation.
format Preprint
id arxiv_https___arxiv_org_abs_2510_05722
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Data Factory with Minimal Human Effort Using VLMs
Ye, Jiaojiao
Zhong, Jiaxing
Xie, Qian
Zhou, Yuzhou
Trigoni, Niki
Markham, Andrew
Computer Vision and Pattern Recognition
title Data Factory with Minimal Human Effort Using VLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2510.05722