Bibliographic Details
Main Authors: Xu, Wanghan; Yue, Xiaoyu; Wang, Zidong; Teng, Yao; Zhang, Wenlong; Liu, Xihui; Zhou, Luping; Ouyang, Wanli; Bai, Lei
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2502.00359
_version_ 1866910808325226496
author Xu, Wanghan
Yue, Xiaoyu
Wang, Zidong
Teng, Yao
Zhang, Wenlong
Liu, Xihui
Zhou, Luping
Ouyang, Wanli
Bai, Lei
contents Generative models are powerful tools for modeling the real world. Mainstream diffusion models, particularly those based on the latent diffusion paradigm, have made remarkable progress on tasks such as image and video synthesis. Latent diffusion models are typically trained with Variational Autoencoders (VAEs), operating on VAE latents rather than on real samples. While this paradigm speeds up training and inference, the quality of the generated outputs is limited by the quality of the latents. Traditional VAE latents are often seen as a spatial compression of pixel space and lack explicit semantic representations, which are essential for modeling the real world. In this paper, we introduce ReaLS (Representation-Aligned Latent Space), which integrates semantic priors to improve generation performance. Extensive experiments show that fundamental DiT and SiT models trained on ReaLS can achieve a 15% improvement in the FID metric. Furthermore, the semantically enhanced latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
format Preprint
id arxiv_https___arxiv_org_abs_2502_00359
institution arXiv
publishDate 2025
record_format arxiv
title Exploring Representation-Aligned Latent Space for Better Generation
topic Machine Learning
url https://arxiv.org/abs/2502.00359