Bibliographic Details
Main Authors: Shao, Jie; Zhu, Ke; Fu, Minghao; Wang, Guo-hua; Wu, Jianxin
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2508.09598
_version_ 1866908487739506688
author Shao, Jie
Zhu, Ke
Fu, Minghao
Wang, Guo-hua
Wu, Jianxin
contents Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive Fréchet Inception Distance (FID) scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment while ignoring the perceptual quality of individual samples. We further examine the role of classifier-free guidance (CFG), a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.
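
The abstract describes three steps: score generations with a quality model, store the trajectories of the low-scoring ones, and use them as negative guidance. The toy sketch below illustrates that flow on plain Python lists. Only `cfg_combine` is the standard classifier-free guidance formula; the failure bank layout, the quality threshold, and the "move away from the nearest stored failure" rule in `escape_step` are assumptions made for illustration and are not taken from the paper. All function and parameter names are hypothetical.

```python
import math


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def record_if_failure(trajectory, quality_score, threshold, bank):
    """Store a sampling trajectory when its quality score is below the
    threshold. ASSUMPTION: `trajectory` maps timestep -> latent vector and
    `bank` maps timestep -> list of stored failure latents."""
    if quality_score >= threshold:
        return False
    for t, latent in trajectory.items():
        bank.setdefault(t, []).append(list(latent))
    return True


def escape_step(x, failures, strength):
    """HYPOTHETICAL negative-guidance rule: push x directly away from the
    nearest stored failure latent. Illustrative only."""
    if not failures:
        return list(x)
    nearest = min(failures, key=lambda f: math.dist(x, f))
    d = math.dist(x, nearest)
    if d == 0.0:
        return list(x)  # direction undefined; leave x unchanged
    return [xi + strength * (xi - fi) / d for xi, fi in zip(x, nearest)]


if __name__ == "__main__":
    bank = {}
    # A made-up two-step trajectory with a made-up quality score.
    record_if_failure({0: [0.0, 0.0], 1: [0.2, 0.1]}, 0.3, 0.5, bank)
    x = [1.0, 0.0]
    print(escape_step(x, bank.get(0, []), 0.5))
```

In a real sampler the latents would be tensors and the repulsion would more plausibly act on the model's noise prediction than on the latent itself; the sketch only shows where a stored failure would enter the update.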
format Preprint
id arxiv_https___arxiv_org_abs_2508_09598
institution arXiv
publishDate 2025
record_format arxiv
title Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2508.09598