Staff View: Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality :: Library Catalog

Bibliographic Details
Main Authors:	Shao, Jie; Zhu, Ke; Fu, Minghao; Wang, Guo-hua; Wu, Jianxin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2508.09598

_version_	1866908487739506688
author	Shao, Jie; Zhu, Ke; Fu, Minghao; Wang, Guo-hua; Wu, Jianxin
author_facet	Shao, Jie; Zhu, Ke; Fu, Minghao; Wang, Guo-hua; Wu, Jianxin
contents	Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive Fréchet Inception Distance (FID) scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment while ignoring the perceptual quality of individual samples. We further examine the role of classifier-free guidance (CFG), a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_09598
institution	arXiv
publishDate	2025
record_format	arxiv
title	Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2508.09598
