Bibliographic Details
Main Authors: Gong, Yue, Li, Hongyu, Liu, Shanyuan, Cheng, Bo, Ma, Yuhang, Wu, Liebucha, Wu, Xiaoyu, Zhang, Manyuan, Leng, Dawei, Yin, Yuhui, Zhang, Lijun
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.19206
_version_ 1866917353679224832
author Gong, Yue
Li, Hongyu
Liu, Shanyuan
Cheng, Bo
Ma, Yuhang
Wu, Liebucha
Wu, Xiaoyu
Zhang, Manyuan
Leng, Dawei
Yin, Yuhui
Zhang, Lijun
contents Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, and from overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled, stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.
format Preprint
id arxiv_https___arxiv_org_abs_2603_19206
institution arXiv
publishDate 2026
record_format arxiv
title RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.19206