Staff View: Exploring Representation-Aligned Latent Space for Better Generation :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Wanghan; Yue, Xiaoyu; Wang, Zidong; Teng, Yao; Zhang, Wenlong; Liu, Xihui; Zhou, Luping; Ouyang, Wanli; Bai, Lei
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.00359

_version_	1866910808325226496
author	Xu, Wanghan; Yue, Xiaoyu; Wang, Zidong; Teng, Yao; Zhang, Wenlong; Liu, Xihui; Zhou, Luping; Ouyang, Wanli; Bai, Lei
author_facet	Xu, Wanghan; Yue, Xiaoyu; Wang, Zidong; Teng, Yao; Zhang, Wenlong; Liu, Xihui; Zhou, Luping; Ouyang, Wanli; Bai, Lei
contents	Generative models serve as powerful tools for modeling the real world, with mainstream diffusion models, particularly those based on the latent diffusion model paradigm, achieving remarkable progress across various tasks, such as image and video synthesis. Latent diffusion models are typically trained using Variational Autoencoders (VAEs), interacting with VAE latents rather than the real samples. While this generative paradigm speeds up training and inference, the quality of the generated outputs is limited by the latents' quality. Traditional VAE latents are often seen as spatial compression in pixel space and lack explicit semantic representations, which are essential for modeling the real world. In this paper, we introduce ReaLS (Representation-Aligned Latent Space), which integrates semantic priors to improve generation performance. Extensive experiments show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. Furthermore, the enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_00359
institution	arXiv
publishDate	2025
record_format	arxiv
title	Exploring Representation-Aligned Latent Space for Better Generation
topic	Machine Learning
url	https://arxiv.org/abs/2502.00359

Similar Items