Bibliographic Details
Main Author: Ichikawa, Yuki
Format: Digital resource
Language:
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.18667548
Table of Contents:
  • We extend our previous finding of length-dependent processing mode transitions in Transformer attention (Ichikawa, 2025) to three additional architectural families: Grouped Query Attention (GQA), Mixture of Experts (MoE), and scaled dense models. Across eight models spanning four architecture types, we demonstrate that input-length-dependent processing transitions are a universal phenomenon, but their implementation varies by architecture. In standard Multi-Head Attention (MHA) models under ~1B parameters, attention heads transition from independent to cooperative processing at approximately 4 tokens. In GQA models, the transition depends on the key-value sharing ratio: high sharing (8:1) preserves the transition, while lower sharing (4:1) obscures it. In MoE models, attention heads show consistently distributed processing, but the transition re-emerges at the Expert routing level: short inputs activate 2 of 64 Experts while long inputs recruit 18. Garden-path experiments (N=105, 7 sentence types) reveal that MoE models exhibit Expert “fixation”: garden-path sentences show significantly lower Expert Usage Entropy than controls (Layer 0: p < 0.0001; Layer 15: p = 0.002) and significantly shorter Expert persistence (Layer 1: p = 0.045; Layer 15: p = 0.010). This produces larger reanalysis costs (+4.42 bits, p < 0.0001, d = 0.94) than observed in MHA models (+3.03 bits). These findings establish the redundancy score as an architecture-diagnostic tool and reveal how different Transformer designs implement the same computational challenge of input-length adaptation.
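
The abstract reports Expert Usage Entropy but this record does not define it. As a rough illustration only, the sketch below assumes it is the Shannon entropy (in bits) of how routed tokens are distributed over Experts. The function name `expert_usage_entropy` and its inputs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def expert_usage_entropy(expert_ids, num_experts):
    """Shannon entropy (bits) of the distribution of routed tokens over experts.

    Assumption: `expert_ids` holds one selected expert index per routed token.
    This is an illustrative definition, not necessarily the one used in the paper.
    """
    if not expert_ids:
        raise ValueError("expert_ids must not be empty")
    if any(not 0 <= e < num_experts for e in expert_ids):
        raise ValueError("expert index out of range")

    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # fraction of tokens routed to this expert
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical routing traces for a 64-expert layer.
short_input = [3, 17, 3, 17]      # tokens split evenly over 2 experts
long_input = list(range(18))      # tokens spread evenly over 18 experts

print(expert_usage_entropy(short_input, 64))
print(expert_usage_entropy(long_input, 64))
```

Under this assumed definition, routing concentrated on a few Experts gives low entropy and routing spread over many Experts gives high entropy, which is the sense in which lower entropy would indicate Expert "fixation".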