| Main Authors: | Zhao, Yize, Thrampoulidis, Christos |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.12011 |
Similar Items
Supervised Contrastive Representation Learning: Landscape Analysis with Unconstrained Features
by: Behnia, Tina, et al.
Published: (2024)
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
by: Zhao, Yize, et al.
Published: (2024)
How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
by: Vasudeva, Bhavya, et al.
Published: (2025)
Thumb on the Scale: Optimal Loss Weighting in Last Layer Retraining
by: Stromberg, Nathan, et al.
Published: (2025)
Implicit Optimization Bias of Next-Token Prediction in Linear Models
by: Thrampoulidis, Christos
Published: (2024)
DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
by: Deng, Wenlong, et al.
Published: (2024)
Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations
by: Zhao, Yize, et al.
Published: (2025)
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
by: Garrod, Connall, et al.
Published: (2025)
Memorization Capacity of Multi-Head Attention in Transformers
by: Mahdavi, Sadegh, et al.
Published: (2023)
Memory capacity of two layer neural networks with smooth activations
by: Madden, Liam, et al.
Published: (2023)
Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
by: Thrampoulidis, Christos, et al.
Published: (2025)
Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
by: Behnia, Tina, et al.
Published: (2025)
Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods
by: Taheri, Hossein, et al.
Published: (2024)
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data
by: Fan, Chen, et al.
Published: (2025)
Implicit Bias and Fast Convergence Rates for Self-attention
by: Vasudeva, Bhavya, et al.
Published: (2024)
Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning
by: Deng, Wenlong, et al.
Published: (2023)
You Only Train Once
by: Sakaridis, Christos
Published: (2025)
Next-token prediction capacity: general upper bounds and a lower bound for transformers
by: Madden, Liam, et al.
Published: (2024)
On the Optimization and Generalization of Multi-head Attention
by: Deora, Puneesh, et al.
Published: (2023)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
by: Mahdavi, Sadegh, et al.
Published: (2025)
In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly
by: Deora, Puneesh, et al.
Published: (2025)
Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model
by: Dang, Hien, et al.
Published: (2024)
Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think
by: Sernau, Luke
Published: (2024)
Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge
by: Vasudeva, Bhavya, et al.
Published: (2026)
Class-attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective
by: Zhang, Xuechen, et al.
Published: (2024)
Transformers as Support Vector Machines
by: Tarzanagh, Davoud Ataee, et al.
Published: (2023)
Geometric Analysis of Unconstrained Feature Models with $d=K$
by: Shen, Yi, et al.
Published: (2024)
Instance-dependent Early Stopping
by: Yuan, Suqin, et al.
Published: (2025)
LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
by: Deng, Wenlong, et al.
Published: (2024)
Neural Collapse Beyond the Unconstrained Features Model: Landscape, Dynamics, and Generalization in the Mean-Field Regime
by: Wu, Diyuan, et al.
Published: (2025)
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
by: Gandhi, Swapnil, et al.
Published: (2024)
Early Stopping Tabular In-Context Learning
by: Küken, Jaris, et al.
Published: (2025)
Noisy Early Stopping for Noisy Labels
by: Toner, William, et al.
Published: (2024)
FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness
by: Amer, Hossam, et al.
Published: (2026)
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
by: Deng, Wenlong, et al.
Published: (2025)
Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model
by: Andriopoulos, George, et al.
Published: (2025)
Parameter-Free Dynamic Regret for Unconstrained Linear Bandits
by: Rumi, Alberto, et al.
Published: (2026)
Early Stopping Based on Repeated Significance
by: Bax, Eric, et al.
Published: (2024)
Early Stopping for Large Reasoning Models via Confidence Dynamics
by: Hosseini, Parsa, et al.
Published: (2026)
Gradient-Variation Regret Bounds for Unconstrained Online Learning
by: Zhao, Yuheng, et al.
Published: (2026)