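Staff View: Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency :: Library Catalog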

Bibliographic Details
Main Authors:	Nishikawa, Naoki; Higuchi, Rei; Suzuki, Taiji
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.03340

_version_	1866912465452793856
author	Nishikawa, Naoki; Higuchi, Rei; Suzuki, Taiji
author_facet	Nishikawa, Naoki; Higuchi, Rei; Suzuki, Taiji
contents	Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking the layers' diverse roles and complexities. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore, we introduce an efficient layerwise training strategy to learn nonlinear features tailored to each layer. Experiments on multiple pre-trained Transformers demonstrate that our method improves the performance of distilled models compared to baselines without increasing the inference cost. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_03340
institution	arXiv
publishDate	2025
record_format	arxiv
title	Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency
topic	Machine Learning
url	https://arxiv.org/abs/2507.03340
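
Note on the abstract: the "statistical degrees of freedom" it mentions can be illustrated with the standard effective-dimension quantity from kernel ridge regression, tr(K_n (K_n + λI)^{-1}) with K_n = K / n. The sketch below is only an illustration under that assumption; the paper's exact definition, kernel, and allocation rule are not given in this record, and the names `degrees_of_freedom`, `exp_kernel`, and the regularization values are hypothetical.

```python
import numpy as np


def degrees_of_freedom(K, lam):
    """Effective dimension tr(K_n (K_n + lam*I)^{-1}) with K_n = K / n.

    Assumption: this is the textbook kernel-ridge definition, not
    necessarily the exact quantity used in the paper.
    """
    n = K.shape[0]
    # K is symmetric PSD, so eigvalsh applies; clip tiny negative round-off.
    eigvals = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)
    return float(np.sum(eigvals / (eigvals + lam)))


def exp_kernel(X):
    """Exponential (softmax-style) kernel exp(<x, y>) on unit-normalized rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.exp(Xn @ Xn.T)


# Hypothetical inputs standing in for one layer's queries/keys.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
K = exp_kernel(X)

# A smaller regularization lam counts more eigen-directions as "effective".
for lam in (1e-1, 1e-3):
    print(f"lam={lam:g}: degrees of freedom = {degrees_of_freedom(K, lam):.2f}")
```

A layer whose kernel spectrum decays quickly yields a small value, which is the intuition behind giving such a layer fewer features than a layer with a slowly decaying spectrum.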
