| Main Author: | Ferrari, Alan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.28384 |
Similar Items
TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
by: Jaber, Jaber, et al.
Published: (2026)
Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
by: Jo, Dongwon, et al.
Published: (2026)
CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers
by: van Engelenhoven, Adjorn, et al.
Published: (2024)
Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction
by: Ajirak, Marzieh, et al.
Published: (2025)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
by: Zheng, Wenhao, et al.
Published: (2025)
VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
by: Zhou, Jingbo, et al.
Published: (2026)
Transformers Can Do Bayesian Inference
by: Müller, Samuel, et al.
Published: (2021)
The Bayesian Geometry of Transformer Attention
by: Agarwal, Naman, et al.
Published: (2025)
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
by: Norgren, Victor
Published: (2026)
Adaptive Computation Depth via Learned Token Routing in Transformers
by: Mohammed, Ahmed Abdelmuniem Abdalla
Published: (2026)
Token-Efficient RL for LLM Reasoning
by: Lee, Alan, et al.
Published: (2025)
Flow: Per-Instance Personalized Federated Learning Through Dynamic Routing
by: Panchal, Kunjal, et al.
Published: (2022)
Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation
by: Kinfu, Kaleab A., et al.
Published: (2025)
Universal Model Routing for Efficient LLM Inference
by: Jitkrittum, Wittawat, et al.
Published: (2025)
ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
by: Zhang, Haoyue, et al.
Published: (2025)
DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers
by: Sharma, Aman, et al.
Published: (2025)
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
by: Zhang, Tianyi, et al.
Published: (2024)
Can Transformers Learn Full Bayesian Inference in Context?
by: Reuter, Arik, et al.
Published: (2025)
xPerT: Extended Persistence Transformer
by: Kim, Sehun
Published: (2024)
Bayesian Inverse Problems Meet Flow Matching: Efficient and Flexible Inference via Transformers
by: Sherki, Daniil, et al.
Published: (2025)
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
by: Xi, Haocheng, et al.
Published: (2024)
Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
by: Guo, Yichen, et al.
Published: (2025)
STS: Efficient Sparse Attention with Speculative Token Sparsity
by: Xu, Ceyu, et al.
Published: (2026)
Adaptive Semantic Token Communication for Transformer-based Edge Inference
by: Devoto, Alessio, et al.
Published: (2025)
Neural Bayesian Sequential Routing
by: Huang, Yongchao
Published: (2026)
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
by: Wakayama, Tomoya, et al.
Published: (2025)
Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
by: Wu, Ziyang, et al.
Published: (2024)
Semiparametric Efficient Inference in Adaptive Experiments
by: Cook, Thomas, et al.
Published: (2023)
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
by: Zhong, Wanli, et al.
Published: (2025)
SparQ Attention: Bandwidth-Efficient LLM Inference
by: Ribar, Luka, et al.
Published: (2023)
Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
by: Mihaila, George
Published: (2026)
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
by: Bu, Rui, et al.
Published: (2025)
Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
by: Wang, Hanyang, et al.
Published: (2025)
Route Experts by Sequence, not by Token
by: Wen, Tiansheng, et al.
Published: (2025)
Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers
by: Li, Albus Yizhuo, et al.
Published: (2026)
Token Sample Complexity of Attention
by: Bohbot, Léa, et al.
Published: (2025)
Assessing Per-Sample Membership Inference Vulnerability without Retraining
by: Dorseuil, Valentin, et al.
Published: (2026)
Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation
by: Whittle, George, et al.
Published: (2025)
Improving Routing in Sparse Mixture of Experts with Graph of Tokens
by: Nguyen, Tam, et al.
Published: (2025)