| Main Authors: | Sakamoto, Keitaro, Sato, Issei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.20829 |
Similar Items
End-to-End Training Induces Information Bottleneck through Layer-Role Differentiation: A Comparative Analysis with Layer-wise Training
by: Sakamoto, Keitaro, et al.
Published: (2024)
Benign Overfitting in Token Selection of Attention Mechanism
by: Sakamoto, Keitaro, et al.
Published: (2024)
Multiplicative Logit Adjustment Approximates Neural-Collapse-Aware Decision Boundary Adjustment
by: Hasegawa, Naoya, et al.
Published: (2024)
Can Test-time Computation Mitigate Reproduction Bias in Neural Symbolic Regression?
by: Sato, Shun, et al.
Published: (2025)
Explaining Grokking in Transformers through the Lens of Inductive Bias
by: Singh, Jaisidh, et al.
Published: (2026)
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
by: Han, Ting, et al.
Published: (2025)
Understanding Generalization in Physics Informed Models through Affine Variety Dimensions
by: Koshizuka, Takeshi, et al.
Published: (2025)
Grokking Explained: A Statistical Phenomenon
by: Carvalho, Breno W., et al.
Published: (2025)
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
by: Nam, Yoonsoo, et al.
Published: (2025)
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
by: Tomihari, Akiyoshi, et al.
Published: (2026)
Exploring Weight Balancing on Long-Tailed Recognition Problem
by: Hasegawa, Naoya, et al.
Published: (2023)
On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
by: Xu, Kevin, et al.
Published: (2024)
Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective
by: Tomihari, Akiyoshi, et al.
Published: (2024)
Top-Down Bayesian Posterior Sampling for Sum-Product Networks
by: Yokoi, Soma, et al.
Published: (2024)
Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher
by: Prakash, Hari K, et al.
Published: (2026)
Understanding the Expressivity and Trainability of Fourier Neural Operator: A Mean-Field Perspective
by: Koshizuka, Takeshi, et al.
Published: (2023)
To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers
by: Xu, Kevin, et al.
Published: (2025)
Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection
by: Fujikawa, Shota, et al.
Published: (2026)
Rethinking Associative Memory Mechanism in Induction Head
by: Wang, Shuo, et al.
Published: (2024)
On the Optimal Memorization Capacity of Transformers
by: Kajitsuka, Tokio, et al.
Published: (2024)
Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction
by: Tanaka, Yuto, et al.
Published: (2026)
Grokking and Generalization Collapse: Insights from HTSR theory
by: Prakash, Hari K., et al.
Published: (2025)
A Formal Comparison Between Chain of Thought and Latent Thought
by: Xu, Kevin, et al.
Published: (2025)
Understanding Transformer Optimization via Gradient Heterogeneity
by: Tomihari, Akiyoshi, et al.
Published: (2025)
Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?
by: Kajitsuka, Tokio, et al.
Published: (2023)
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
by: Sakamoto, Keitaro, et al.
Published: (2026)
To Grok Grokking: Provable Grokking in Ridge Regression
by: Xu, Mingyue, et al.
Published: (2026)
Mutual Information Collapse Explains Disentanglement Failure in β-VAEs
by: Vu, Minh, et al.
Published: (2026)
Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
by: Tian, Yuandong
Published: (2025)
NeuralGrok: Accelerate Grokking by Neural Gradient Transformation
by: Zhou, Xinyu, et al.
Published: (2025)
Explaining and Preventing Alignment Collapse in Iterative RLHF
by: Gauthier, Etienne, et al.
Published: (2026)
The Complexity Dynamics of Grokking
by: DeMoss, Branton, et al.
Published: (2024)
Measuring Sharpness in Grokking
by: Miller, Jack, et al.
Published: (2024)
Deep Grokking: Would Deep Neural Networks Generalize Better?
by: Fan, Simin, et al.
Published: (2024)
Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity
by: Miller, Jack, et al.
Published: (2023)
Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks
by: Minegishi, Gouki, et al.
Published: (2023)
Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning
by: Luthra, Achleshwar, et al.
Published: (2026)
Can Kernel Methods Explain How the Data Affects Neural Collapse?
by: Kothapalli, Vignesh, et al.
Published: (2024)
Aligning Multimodal Representations through an Information Bottleneck
by: Almudévar, Antonio, et al.
Published: (2025)
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
by: Song, Yiding, et al.
Published: (2026)