Staff View: Data Factory with Minimal Human Effort Using VLMs :: Library Catalog

Bibliographic Details
Main Authors:	Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.05722

_version_	1866915535865774080
author	Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew
author_facet	Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew
contents	Generating sufficient and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting images and annotating them pixel by pixel. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative through text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves downstream task performance. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i are promising and outperform concurrent work on one-shot semantic segmentation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_05722
institution	arXiv
publishDate	2025
record_format	arxiv
title	Data Factory with Minimal Human Effort Using VLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2510.05722

Similar Items