Abstract: :: Library Catalog

Bibliographic Details
Main Author:	Wasserman, Adam Zachary
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	scaling laws morphology cross-linguistic emergence training dynamics
Online Access:	https://doi.org/10.5281/zenodo.19423151

Abstract:

The scaling hypothesis holds that large language model performance improves predictably with increased compute, data, and parameters, following power-law relationships assumed to be universal [Kaplan et al., 2020, Hoffmann et al., 2022]. We test this assumption via a pre-registered controlled ablation (Pre-registration: OSF 10.17605/OSF.IO/SJ48B; Project: OSF 10.17605/OSF.IO/2PG8S), training identical 125M-parameter transformers on matched English and French corpora from C4, holding all hyperparameters constant. Confirming our pre-registered prediction, we observe dramatically divergent learning trajectories: French achieves grammatical competence (100% on agreement probes) at 197M tokens and maintains it through experiment completion at 181K steps (~3B tokens), while English remains at chance level (40%) throughout, a >15x difference in emergence threshold. Perplexity trajectories show French approaching near-final values (PPL ~27) while English remains elevated (PPL ~1340), a 50x ratio at matched training steps. Cross-study comparison with Pythia 125M [Biderman et al., 2023], which required ~300B tokens to reach comparable perplexity, serves two functions: it validates that our English model performs as expected (consistent with established scaling behavior), and it suggests French may be 50–100x more training-efficient than English. These results support our hypothesis that morphologically rich languages provide redundant grammatical signals that accelerate structural learning. Critically, we show that perplexity and grammatical accuracy are orthogonal dimensions governed by different determinants: distributional coherence and morphological explicitness, respectively. This explains why English models can improve perplexity indefinitely while never acquiring grammar: standard evaluation metrics miss structural learning deficits entirely. The scaling hypothesis is language-contingent, not universal.
Note: Pre-registered 350M experiments are complete but inconclusive due to batch size constraints. French 350M reached only 70% accuracy after 819M tokens (4× the tokens French 125M needed to emerge), suggesting scale may be counterproductive for morphologically rich languages. English 350M remained at 40% accuracy. We are re-running 350M experiments to 3.3B tokens to match the 125M token budget and will publish updated results. Training logs: https://github.com/adamzwasserman/fractal-language
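
The ratios quoted in the abstract can be re-derived from the figures it reports. The following is only an illustrative arithmetic check, not code from the study: the constant names are hypothetical, and the inputs are the rounded values given above (197M, ~3B, and 819M tokens; PPL ~27 and ~1340).

```python
# Rounded figures as quoted in the abstract; names are illustrative only.
FRENCH_125M_EMERGENCE_TOKENS = 197e6  # French 125M reaches 100% on agreement probes
TOTAL_125M_TOKENS = 3e9               # approximate token count at 181K steps
FRENCH_PPL = 27.0                     # approximate French perplexity
ENGLISH_PPL = 1340.0                  # approximate English perplexity
FRENCH_350M_TOKENS = 819e6            # tokens seen by French 350M at 70% accuracy

# English never reached competence within the budget, so the ratio of the
# full budget to the French emergence point is only a lower bound.
emergence_ratio_lower_bound = TOTAL_125M_TOKENS / FRENCH_125M_EMERGENCE_TOKENS

# Perplexity gap between the two languages at matched training steps.
perplexity_ratio = ENGLISH_PPL / FRENCH_PPL

# Tokens seen by French 350M relative to the French 125M emergence point.
tokens_350m_vs_125m = FRENCH_350M_TOKENS / FRENCH_125M_EMERGENCE_TOKENS

print(f"Emergence threshold ratio: > {emergence_ratio_lower_bound:.1f}x")
print(f"Perplexity ratio (English / French): {perplexity_ratio:.1f}x")
print(f"French 350M tokens / French 125M emergence tokens: {tokens_350m_vs_125m:.1f}x")
```

Because the inputs are rounded, the outputs only approximately match the ">15x", "50x", and "4×" figures stated in the abstract.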
