Bibliographic Details
Main Authors: Xu, Chao; Li, Maohua; Li, Qirui; Xu, Yixuan; Zhou, Yanke; Li, Yunhe; Shen, Cuifeng; Tang, Hanlin; Liu, Kan; Lan, Tao; Qu, Lin; Zhang, Shao-Qun
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.20708
_version_ 1866913147542044672
author Xu, Chao
Li, Maohua
Li, Qirui
Xu, Yixuan
Zhou, Yanke
Li, Yunhe
Shen, Cuifeng
Tang, Hanlin
Liu, Kan
Lan, Tao
Qu, Lin
Zhang, Shao-Qun
contents Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
format Preprint
id arxiv_https___arxiv_org_abs_2605_20708
institution arXiv
publishDate 2026
record_format arxiv
title Rethinking Cross-Layer Information Routing in Diffusion Transformers
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2605.20708
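Illustrative note on the abstract: the record describes DAR only as a "learnable, timestep-adaptive, and non-incremental" aggregation over the history of sublayer outputs and gives no formula. The NumPy sketch below is therefore not the authors' method. It only contrasts a plain residual sum with one possible timestep-dependent weighted mix. The function names, the softmax weighting, and the parameters base_logits and time_logits are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def residual_input(history):
    # Plain residual stream: the next sublayer sees the unweighted
    # sum of the embedding and every earlier sublayer output.
    return np.sum(history, axis=0)

def routed_input(history, base_logits, time_logits, t):
    # Hypothetical routing (an assumption, not taken from the paper):
    # weights over the history depend on the timestep t, so the mix
    # is a convex combination rather than an ever-growing sum.
    weights = softmax(base_logits + t * time_logits)
    mixed = np.tensordot(weights, np.stack(history), axes=1)
    return mixed, weights

rng = np.random.default_rng(0)
# Toy history: an embedding plus two sublayer outputs, 4 tokens x 8 channels.
history = [rng.standard_normal((4, 8)) for _ in range(3)]
base_logits = np.zeros(3)                  # hypothetical learnable parameters
time_logits = np.array([1.0, 0.0, -1.0])   # hypothetical learnable parameters

mixed_early, w_early = routed_input(history, base_logits, time_logits, t=0.0)
mixed_late, w_late = routed_input(history, base_logits, time_logits, t=1.0)
```

With these toy parameters the weights are uniform at t = 0 and shift toward the earliest entry at t = 1, which is one way a mix over the history could vary with the denoising timestep while keeping the magnitude bounded.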