| Main Authors: | Yang, Panqi; Jing, Haodong; Chao, Jiahao; Xiang, Tingyan; Lin, Li; Hu, Yao; Luo, Yang; Ma, Yongqiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2605.05646 |
| _version_ | 1866915987062784000 |
|---|---|
| author | Yang, Panqi; Jing, Haodong; Chao, Jiahao; Xiang, Tingyan; Lin, Li; Hu, Yao; Luo, Yang; Ma, Yongqiang |
| author_facet | Yang, Panqi; Jing, Haodong; Chao, Jiahao; Xiang, Tingyan; Lin, Li; Hu, Yao; Luo, Yang; Ma, Yongqiang |
| contents | Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_05646 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2605.05646 |