| Main Authors: | Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2510.05722 |
| _version_ | 1866915535865774080 |
|---|---|
| author | Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew |
| author_facet | Ye, Jiaojiao; Zhong, Jiaxing; Xie, Qian; Zhou, Yuzhou; Trigoni, Niki; Markham, Andrew |
| contents | Generating sufficient and diverse data through augmentation offers an efficient alternative to the time-consuming and labour-intensive process of collecting images and annotating them at the pixel level. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes such as materials and textures. In contrast, diffusion models offer a robust alternative through text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves performance on downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i show promising performance and outperform concurrent work on one-shot semantic segmentation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_05722 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Data Factory with Minimal Human Effort Using VLMs |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2510.05722 |