:: Library Catalog

Bibliographic Details
Main Authors:	Nishikawa, Naoki; Higuchi, Rei; Suzuki, Taiji
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.03340

Similar Items

Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
by: Higuchi, Rei, et al.
Published: (2025)

State Space Models are Provably Comparable to Transformers in Dynamic Token Selection
by: Nishikawa, Naoki, et al.
Published: (2024)

Inference-Aware Meta-Alignment of LLMs via Non-Linear GRPO
by: Takakura, Shokichi, et al.
Published: (2026)

Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape
by: Kim, Juno, et al.
Published: (2024)

How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis
by: Higuchi, Rei, et al.
Published: (2026)

When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
by: Higuchi, Rei, et al.
Published: (2025)

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
by: Zuo, Yifei, et al.
Published: (2025)

Why Softmax Attention Outperforms Linear Attention
by: Deng, Yichuan, et al.
Published: (2023)

A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning
by: Wachi, Akifumi, et al.
Published: (2026)

Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning
by: Kawata, Ryotaro, et al.
Published: (2025)

MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map
by: Chou, Yuhong, et al.
Published: (2024)

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
by: Zhang, Michael, et al.
Published: (2024)

Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks
by: Kinoshita, Yuri, et al.
Published: (2026)

Softmax Linear Attention: Reclaiming Global Competition
by: Xu, Mingwei, et al.
Published: (2026)

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers
by: Kawata, Ryotaro, et al.
Published: (2025)

AutoLL: Automatic Linear Layout of Graphs based on Deep Neural Network
by: Watanabe, Chihiro, et al.
Published: (2021)

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
by: Awano, Ryoya, et al.
Published: (2026)

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
by: Xie, Zixuan, et al.
Published: (2026)

Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
by: Kawata, Ryotaro, et al.
Published: (2026)

Universal Approximation with Softmax Attention
by: Hu, Jerry Yao-Chieh, et al.
Published: (2025)

On the Invariants of Softmax Attention
by: Lee, Wonsuk
Published: (2026)

Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression
by: Kim, Juno, et al.
Published: (2025)

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
by: Boursier, Etienne, et al.
Published: (2025)

Scalable-Softmax Is Superior for Attention
by: Nakanishi, Ken M.
Published: (2025)

Transformers are Minimax Optimal Nonparametric In-Context Learners
by: Kim, Juno, et al.
Published: (2024)

Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
by: Uesaka, Toshimitsu, et al.
Published: (2024)

Softmax Attention with Constant Cost per Token
by: Heinsen, Franz A.
Published: (2024)

Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
by: Oh, Junsoo, et al.
Published: (2025)

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention
by: He, Jianliang, et al.
Published: (2025)

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
by: Zuhri, Zayd M. K., et al.
Published: (2025)

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers
by: Saratchandran, Hemanth, et al.
Published: (2024)

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
by: Mongaras, Gabriel, et al.
Published: (2025)

Customizing the Inductive Biases of Softmax Attention using Structured Matrices
by: Kuang, Yilun, et al.
Published: (2025)

Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
by: Kim, Juno, et al.
Published: (2025)

Test time training enhances in-context learning of nonlinear functions
by: Kuwataka, Kento, et al.
Published: (2025)

In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
by: Wakayama, Tomoya, et al.
Published: (2025)

Deep Two-Way Matrix Reordering for Relational Data Analysis
by: Watanabe, Chihiro, et al.
Published: (2021)

Transformers Provably Solve Parity Efficiently with Chain of Thought
by: Kim, Juno, et al.
Published: (2024)

Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective
by: Takakura, Shokichi, et al.
Published: (2024)

Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input
by: Takakura, Shokichi, et al.
Published: (2023)