Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Lin, Han, Pan, Xichen, Wang, Zun, Zhang, Yue, Wang, Chu, Cho, Jaemin, Bansal, Mohit
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.16792
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915870075256832
author	Lin, Han Pan, Xichen Wang, Zun Zhang, Yue Wang, Chu Cho, Jaemin Bansal, Mohit
author_facet	Lin, Han Pan, Xichen Wang, Zun Zhang, Yue Wang, Chu Cho, Jaemin Bansal, Mohit
contents	Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_16792
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising Lin, Han Pan, Xichen Wang, Zun Zhang, Yue Wang, Chu Cho, Jaemin Bansal, Mohit Computer Vision and Pattern Recognition Artificial Intelligence Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
title	V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2603.16792

Similar Items