| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.18667548 |
Table of Contents:
- We extend our previous finding of length-dependent processing mode transitions in Transformer attention (Ichikawa, 2025) to three additional architectural families: Grouped Query Attention (GQA), Mixture of Experts (MoE), and scaled dense models. Across eight models spanning four architecture types, we demonstrate that input-length-dependent processing transitions are a universal phenomenon, but their implementation varies by architecture. In standard Multi-Head Attention (MHA) models under ~1B parameters, attention heads transition from independent to cooperative processing at approximately 4 tokens. In GQA models, the transition depends on the key-value sharing ratio: high sharing (8:1) preserves the transition, while lower sharing (4:1) obscures it. In MoE models, attention heads show consistently distributed processing, but the transition re-emerges at the Expert routing level: short inputs activate 2 of 64 Experts while long inputs recruit 18. Garden-path experiments (N=105, 7 sentence types) reveal that MoE models exhibit Expert “fixation”: garden-path sentences show significantly lower Expert Usage Entropy than controls (Layer 0: p < 0.0001; Layer 15: p = 0.002) and significantly shorter Expert persistence (Layer 1: p = 0.045; Layer 15: p = 0.010). This produces larger reanalysis costs (+4.42 bits, p < 0.0001, d = 0.94) than observed in MHA models (+3.03 bits). These findings establish the redundancy score as an architecture-diagnostic tool and reveal how different Transformer designs implement the same computational challenge of input-length adaptation.
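
The abstract does not define Expert Usage Entropy. As a minimal sketch, assuming it is the Shannon entropy (in bits) of the empirical distribution of routed Expert ids, the contrast between 2 and 18 active Experts can be illustrated as follows. The function name `expert_usage_entropy` and the evenly spread routing lists are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def expert_usage_entropy(expert_ids):
    """Shannon entropy in bits of the empirical distribution of expert ids.

    Assumption: this is one plausible reading of "Expert Usage Entropy";
    the source record does not give the authors' exact definition.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical routing traces, spread evenly over the active experts.
short_routing = [0, 1] * 4           # 8 routing decisions over 2 experts
long_routing = list(range(18)) * 4   # 72 routing decisions over 18 experts

short_entropy = expert_usage_entropy(short_routing)
long_entropy = expert_usage_entropy(long_routing)

print(f"short input: {short_entropy:.2f} bits")
print(f"long input:  {long_entropy:.2f} bits")
```

Under this assumed definition, routing that is spread evenly over k Experts has entropy log2(k), so lower entropy on garden-path sentences would correspond to routing concentrated on fewer Experts, which is consistent with the "fixation" reading in the abstract.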