| Main Authors: | Nishikawa, Naoki, Higuchi, Rei, Suzuki, Taiji |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.03340 |
Similar Items
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
by: Higuchi, Rei, et al.
Published: (2025)
State Space Models are Provably Comparable to Transformers in Dynamic Token Selection
by: Nishikawa, Naoki, et al.
Published: (2024)
Inference-Aware Meta-Alignment of LLMs via Non-Linear GRPO
by: Takakura, Shokichi, et al.
Published: (2026)
Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape
by: Kim, Juno, et al.
Published: (2024)
How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis
by: Higuchi, Rei, et al.
Published: (2026)
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
by: Higuchi, Rei, et al.
Published: (2025)
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
by: Zuo, Yifei, et al.
Published: (2025)
Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)
A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning
by: Wachi, Akifumi, et al.
Published: (2026)
Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning
by: Kawata, Ryotaro, et al.
Published: (2025)
MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
by: Chou, Yuhong, et al.
Published: (2024)
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
by: Zhang, Michael, et al.
Published: (2024)
Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks
by: Kinoshita, Yuri, et al.
Published: (2026)
Softmax Linear Attention: Reclaiming Global Competition
by: Xu, Mingwei, et al.
Published: (2026)
From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers
by: Kawata, Ryotaro, et al.
Published: (2025)
AutoLL: Automatic Linear Layout of Graphs based on Deep Neural Network
by: Watanabe, Chihiro, et al.
Published: (2021)
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
by: Awano, Ryoya, et al.
Published: (2026)
Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
by: Xie, Zixuan, et al.
Published: (2026)
Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
by: Kawata, Ryotaro, et al.
Published: (2026)
Universal Approximation with Softmax Attention
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)
On the Invariants of Softmax Attention
by: Lee, Wonsuk
Published: (2026)
Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression
by: Kim, Juno, et al.
Published: (2025)
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
by: Boursier, Etienne, et al.
Published: (2025)
Scalable-Softmax Is Superior for Attention
by: Nakanishi, Ken M.
Published: (2025)
Transformers are Minimax Optimal Nonparametric In-Context Learners
by: Kim, Juno, et al.
Published: (2024)
Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
by: Uesaka, Toshimitsu, et al.
Published: (2024)
Softmax Attention with Constant Cost per Token
by: Heinsen, Franz A.
Published: (2024)
Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
by: Oh, Junsoo, et al.
Published: (2025)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention
by: He, Jianliang, et al.
Published: (2025)
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
by: Zuhri, Zayd M. K., et al.
Published: (2025)
Rethinking Attention: Polynomial Alternatives to Softmax in Transformers
by: Saratchandran, Hemanth, et al.
Published: (2024)
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
by: Mongaras, Gabriel, et al.
Published: (2025)
Customizing the Inductive Biases of Softmax Attention using Structured Matrices
by: Kuang, Yilun, et al.
Published: (2025)
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
by: Kim, Juno, et al.
Published: (2025)
Test time training enhances in-context learning of nonlinear functions
by: Kuwataka, Kento, et al.
Published: (2025)
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
by: Wakayama, Tomoya, et al.
Published: (2025)
Deep Two-Way Matrix Reordering for Relational Data Analysis
by: Watanabe, Chihiro, et al.
Published: (2021)
Transformers Provably Solve Parity Efficiently with Chain of Thought
by: Kim, Juno, et al.
Published: (2024)
Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective
by: Takakura, Shokichi, et al.
Published: (2024)
Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input
by: Takakura, Shokichi, et al.
Published: (2023)