Library Catalog

Bibliographic Details
Main Author:	Ferrari, Alan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.28384

Similar Items

TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
by: Jaber, Jaber, et al.
Published: (2026)

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
by: Jo, Dongwon, et al.
Published: (2026)

CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers
by: van Engelenhoven, Adjorn, et al.
Published: (2024)

Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction
by: Ajirak, Marzieh, et al.
Published: (2025)

CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
by: Zheng, Wenhao, et al.
Published: (2025)

VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
by: Zhou, Jingbo, et al.
Published: (2026)

Transformers Can Do Bayesian Inference
by: Müller, Samuel, et al.
Published: (2021)

The Bayesian Geometry of Transformer Attention
by: Agarwal, Naman, et al.
Published: (2025)

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
by: Norgren, Victor
Published: (2026)

Adaptive Computation Depth via Learned Token Routing in Transformers
by: Mohammed, Ahmed Abdelmuniem Abdalla
Published: (2026)

Token-Efficient RL for LLM Reasoning
by: Lee, Alan, et al.
Published: (2025)

Flow: Per-Instance Personalized Federated Learning Through Dynamic Routing
by: Panchal, Kunjal, et al.
Published: (2022)

Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation
by: Kinfu, Kaleab A., et al.
Published: (2025)

Universal Model Routing for Efficient LLM Inference
by: Jitkrittum, Wittawat, et al.
Published: (2025)

ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
by: Zhang, Haoyue, et al.
Published: (2025)

DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers
by: Sharma, Aman, et al.
Published: (2025)

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
by: Zhang, Tianyi, et al.
Published: (2024)

Can Transformers Learn Full Bayesian Inference in Context?
by: Reuter, Arik, et al.
Published: (2025)

xPerT: Extended Persistence Transformer
by: Kim, Sehun
Published: (2024)

Bayesian Inverse Problems Meet Flow Matching: Efficient and Flexible Inference via Transformers
by: Sherki, Daniil, et al.
Published: (2025)

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
by: Xi, Haocheng, et al.
Published: (2024)

Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)

STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
by: Guo, Yichen, et al.
Published: (2025)

STS: Efficient Sparse Attention with Speculative Token Sparsity
by: Xu, Ceyu, et al.
Published: (2026)

Adaptive Semantic Token Communication for Transformer-based Edge Inference
by: Devoto, Alessio, et al.
Published: (2025)

Neural Bayesian Sequential Routing
by: Huang, Yongchao
Published: (2026)

In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
by: Wakayama, Tomoya, et al.
Published: (2025)

Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
by: Wu, Ziyang, et al.
Published: (2024)

Semiparametric Efficient Inference in Adaptive Experiments
by: Cook, Thomas, et al.
Published: (2023)

IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
by: Zhong, Wanli, et al.
Published: (2025)

SparQ Attention: Bandwidth-Efficient LLM Inference
by: Ribar, Luka, et al.
Published: (2023)

Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
by: Mihaila, George
Published: (2026)

Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
by: Bu, Rui, et al.
Published: (2025)

Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
by: Wang, Hanyang, et al.
Published: (2025)

Route Experts by Sequence, not by Token
by: Wen, Tiansheng, et al.
Published: (2025)

Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers
by: Li, Albus Yizhuo, et al.
Published: (2026)

Token Sample Complexity of Attention
by: Bohbot, Léa, et al.
Published: (2025)

Assessing Per-Sample Membership Inference Vulnerability without Retraining
by: Dorseuil, Valentin, et al.
Published: (2026)

Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation
by: Whittle, George, et al.
Published: (2025)

Improving Routing in Sparse Mixture of Experts with Graph of Tokens
by: Nguyen, Tam, et al.
Published: (2025)