Bibliographic Details
Main Authors: Yang, Panqi; Jing, Haodong; Chao, Jiahao; Xiang, Tingyan; Lin, Li; Hu, Yao; Luo, Yang; Ma, Yongqiang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.05646
_version_ 1866915987062784000
author Yang, Panqi
Jing, Haodong
Chao, Jiahao
Xiang, Tingyan
Lin, Li
Hu, Yao
Luo, Yang
Ma, Yongqiang
contents Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
format Preprint
id arxiv_https___arxiv_org_abs_2605_05646
institution arXiv
publishDate 2026
record_format arxiv
title MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.05646