Library Catalog

Bibliographic Details
Main Authors:	Ternovtsii, Ivan; Bilak, Yurii
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.09516

Similar Items

Geometric Routing Enables Causal Expert Control in Mixture of Experts
by: Ternovtsii, Ivan, et al.
Published: (2026)

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
by: Ternovtsii, Ivan, et al.
Published: (2026)

Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models
by: Ternovtsii, Ivan, et al.
Published: (2025)

Mixture of Attentions For Speculative Decoding
by: Zimmer, Matthieu, et al.
Published: (2024)

MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
by: Novikov, Ivan
Published: (2025)

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
by: Zhou, Ruijie, et al.
Published: (2026)

Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
by: Knupp, Jonas, et al.
Published: (2026)

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
by: Oldfield, James, et al.
Published: (2025)

Gradient Boosting within a Single Attention Layer
by: Sargolzaei, Saleh
Published: (2026)

Data-Free Pruning of Self-Attention Layers in LLMs
by: Saikumar, Dhananjay, et al.
Published: (2025)

Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)

Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
by: Yan, Fanqi, et al.
Published: (2025)

On the Spatial Structure of Mixture-of-Experts in Transformers
by: Bershatsky, Daniel, et al.
Published: (2025)

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
by: Joshi, Sahil, et al.
Published: (2025)

Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate
by: Fang, Zhiyuan, et al.
Published: (2025)

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
by: Wei, Jia, et al.
Published: (2026)

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
by: Li, Junzhuo, et al.
Published: (2026)

Multi-Layer Attention-Based Explainability via Transformers for Tabular Data
by: Gavito, Andrea Treviño, et al.
Published: (2023)

MoH: Multi-Head Attention as Mixture-of-Head Attention
by: Jin, Peng, et al.
Published: (2024)

MoBA: Mixture of Block Attention for Long-Context LLMs
by: Lu, Enzhe, et al.
Published: (2025)

InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
by: Wang, Youjin, et al.
Published: (2026)

Hybrid Focal and Full-Range Attention Based Graph Transformers
by: Zhu, Minhong, et al.
Published: (2023)

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
by: Xia, Xiaojie, et al.
Published: (2026)

LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention
by: Khosravi, Hamed, et al.
Published: (2025)

Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
by: Fartale, Harshwardhan, et al.
Published: (2025)

Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
by: Zixian, Wang
Published: (2025)

Mixture of Experts in a Mixture of RL settings
by: Willi, Timon, et al.
Published: (2024)

MC#: Mixture Compressor for Mixture-of-Experts Large Models
by: Huang, Wei, et al.
Published: (2025)

SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition
by: Liu, Yunbo, et al.
Published: (2025)

ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
by: Riera, Carlos Boned, et al.
Published: (2025)

DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks
by: Gülmez, Gökdeniz
Published: (2026)

Native Hybrid Attention for Efficient Sequence Modeling
by: Du, Jusen, et al.
Published: (2025)

Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
by: Sheen, Heejune, et al.
Published: (2024)

MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of Accelerators
by: Wan, Cheng, et al.
Published: (2025)

Mixture of Raytraced Experts
by: Perin, Andrea, et al.
Published: (2025)

Stock Market Price Prediction: A Hybrid LSTM and Sequential Self-Attention based Approach
by: Pardeshi, Karan, et al.
Published: (2023)

Residual GRU+MHSA: A Lightweight Hybrid Recurrent Attention Model for Cardiovascular Disease Detection
by: Dash, Tejaswani, et al.
Published: (2025)

MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
by: Chamma, Ahmad, et al.
Published: (2025)

From Molecules to Mixtures: Learning Representations of Olfactory Mixture Similarity using Inductive Biases
by: Tom, Gary, et al.
Published: (2025)

SpatialMAGIC: A Hybrid Framework Integrating Graph Diffusion and Spatial Attention for Spatial Transcriptomics Imputation
by: Zaman, Sayeem Bin, et al.
Published: (2026)