Bibliographic Details
Main Author:	Ichikawa, Yuki
Format:	Digital resource
Language:
Published:	Zenodo 2026
Online Access:	https://doi.org/10.5281/zenodo.18667548

Table of Contents:

We extend our previous finding of length-dependent processing mode transitions in Transformer attention (Ichikawa, 2025) to three additional architectural families: Grouped Query Attention (GQA), Mixture of Experts (MoE), and scaled dense models. Across eight models spanning four architecture types, we demonstrate that input-length-dependent processing transitions are a universal phenomenon, but their implementation varies by architecture. In standard Multi-Head Attention (MHA) models under ~1B parameters, attention heads transition from independent to cooperative processing at approximately 4 tokens. In GQA models, the transition depends on the key-value sharing ratio: high sharing (8:1) preserves the transition, while lower sharing (4:1) obscures it. In MoE models, attention heads show consistently distributed processing, but the transition re-emerges at the Expert routing level: short inputs activate 2 of 64 Experts, while long inputs recruit 18. Garden-path experiments (N=105, 7 sentence types) reveal that MoE models exhibit Expert “fixation”: garden-path sentences show significantly lower Expert Usage Entropy than controls (Layer 0: p < 0.0001; Layer 15: p = 0.002) and significantly shorter Expert persistence (Layer 1: p = 0.045; Layer 15: p = 0.010). This produces larger reanalysis costs (+4.42 bits, p < 0.0001, d = 0.94) than observed in MHA models (+3.03 bits). These findings establish the redundancy score as an architecture-diagnostic tool and reveal how different Transformer designs implement the same computational challenge of input-length adaptation.
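
The abstract does not define Expert Usage Entropy. One plausible reading, assumed here, is the Shannon entropy in bits of how often each Expert is selected at a given layer. The sketch below illustrates that reading only; the function name `expert_usage_entropy` and the toy routing traces are hypothetical and are not taken from the record or the underlying work.

```python
import math
from collections import Counter


def expert_usage_entropy(expert_ids):
    """Shannon entropy (bits) of the empirical Expert-selection distribution.

    Assumption: expert_ids is a flat list of the Expert indices chosen by
    the router at one layer, one entry per routed token.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no routing decisions, so no uncertainty to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical traces: routing concentrated on two Experts vs. spread over eight.
concentrated = [3, 3, 3, 3, 7, 7, 7, 7]
spread = list(range(8))

print(expert_usage_entropy(concentrated))  # lower entropy
print(expert_usage_entropy(spread))        # higher entropy
```

Under this assumed definition, routing that keeps returning to a few Experts gives a lower value than routing spread across many, which is the direction of the garden-path versus control difference reported above.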
