| Main Authors: | Ternovtsii, Ivan, Bilak, Yurii |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.09516 |
Similar Items
Geometric Routing Enables Causal Expert Control in Mixture of Experts
by: Ternovtsii, Ivan, et al.
Published: (2026)
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
by: Ternovtsii, Ivan, et al.
Published: (2026)
Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models
by: Ternovtsii, Ivan, et al.
Published: (2025)
Mixture of Attentions For Speculative Decoding
by: Zimmer, Matthieu, et al.
Published: (2024)
MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
by: Novikov, Ivan
Published: (2025)
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
by: Zhou, Ruijie, et al.
Published: (2026)
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
by: Knupp, Jonas, et al.
Published: (2026)
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
by: Oldfield, James, et al.
Published: (2025)
Gradient Boosting within a Single Attention Layer
by: Sargolzaei, Saleh
Published: (2026)
Data-Free Pruning of Self-Attention Layers in LLMs
by: Saikumar, Dhananjay, et al.
Published: (2025)
Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
by: Li, Cheng, et al.
Published: (2025)
Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
by: Yan, Fanqi, et al.
Published: (2025)
On the Spatial Structure of Mixture-of-Experts in Transformers
by: Bershatsky, Daniel, et al.
Published: (2025)
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
by: Joshi, Sahil, et al.
Published: (2025)
Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate
by: Fang, Zhiyuan, et al.
Published: (2025)
HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
by: Wei, Jia, et al.
Published: (2026)
Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
by: Li, Junzhuo, et al.
Published: (2026)
Multi-Layer Attention-Based Explainability via Transformers for Tabular Data
by: Gavito, Andrea Treviño, et al.
Published: (2023)
MoH: Multi-Head Attention as Mixture-of-Head Attention
by: Jin, Peng, et al.
Published: (2024)
MoBA: Mixture of Block Attention for Long-Context LLMs
by: Lu, Enzhe, et al.
Published: (2025)
InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
by: Wang, Youjin, et al.
Published: (2026)
Hybrid Focal and Full-Range Attention Based Graph Transformers
by: Zhu, Minhong, et al.
Published: (2023)
Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
by: Xia, Xiaojie, et al.
Published: (2026)
LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention
by: Khosravi, Hamed, et al.
Published: (2025)
Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
by: Fartale, Harshwardhan, et al.
Published: (2025)
Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
by: Zixian, Wang
Published: (2025)
Mixture of Experts in a Mixture of RL settings
by: Willi, Timon, et al.
Published: (2024)
MC#: Mixture Compressor for Mixture-of-Experts Large Models
by: Huang, Wei, et al.
Published: (2025)
SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition
by: Liu, Yunbo, et al.
Published: (2025)
ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
by: Riera, Carlos Boned, et al.
Published: (2025)
DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks
by: Gülmez, Gökdeniz
Published: (2026)
Native Hybrid Attention for Efficient Sequence Modeling
by: Du, Jusen, et al.
Published: (2025)
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
by: Sheen, Heejune, et al.
Published: (2024)
MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of Accelerators
by: Wan, Cheng, et al.
Published: (2025)
Mixture of Raytraced Experts
by: Perin, Andrea, et al.
Published: (2025)
Stock Market Price Prediction: A Hybrid LSTM and Sequential Self-Attention based Approach
by: Pardeshi, Karan, et al.
Published: (2023)
Residual GRU+MHSA: A Lightweight Hybrid Recurrent Attention Model for Cardiovascular Disease Detection
by: Dash, Tejaswani, et al.
Published: (2025)
MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
by: Chamma, Ahmad, et al.
Published: (2025)
From Molecules to Mixtures: Learning Representations of Olfactory Mixture Similarity using Inductive Biases
by: Tom, Gary, et al.
Published: (2025)
SpatialMAGIC: A Hybrid Framework Integrating Graph Diffusion and Spatial Attention for Spatial Transcriptomics Imputation
by: Zaman, Sayeem Bin, et al.
Published: (2026)