| Main Author: | Liu, Ziyang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.18128 |
Similar Items
Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation
by: Jantsch, Lasse Marten, et al.
Published: (2026)
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
by: Lau, Tim Tsz-Kit, et al.
Published: (2026)
AudioMAE++: learning better masked audio representations with SwiGLU FFNs
by: Yadav, Sarthak, et al.
Published: (2025)
GLU Attention Improve Transformer
by: Wang, Zehao
Published: (2025)
Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models
by: Guo, Zhiyu, et al.
Published: (2024)
SwiLTra-Bench: The Swiss Legal Translation Benchmark
by: Niklaus, Joel, et al.
Published: (2025)
Reverse-Engineering the Reader
by: Kiegeland, Samuel, et al.
Published: (2024)
Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer
by: Choi, Euntae, et al.
Published: (2025)
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)
A Decomposition Perspective to Long-context Reasoning for LLMs
by: Xiao, Yanling, et al.
Published: (2026)
Bayesian WeakS-to-Strong from Text Classification to Generation
by: Cui, Ziyun, et al.
Published: (2024)
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
by: Kapl, Ferdinand, et al.
Published: (2025)
Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization
by: Chen, Hung-Hsuan
Published: (2026)
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
by: Kohli, Harsh, et al.
Published: (2026)
DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models
by: Sheppert, Alexander
Published: (2026)
The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
by: Lyu, Xingyu, et al.
Published: (2026)
Multi-Token Prediction Needs Registers
by: Gerontopoulos, Anastasios, et al.
Published: (2025)
Unlocking Continual Learning Abilities in Language Models
by: Du, Wenyu, et al.
Published: (2024)
A Sea of Words: An In-Depth Analysis of Anchors for Text Data
by: Lopardo, Gianluigi, et al.
Published: (2022)
Distill and Align Decomposition for Enhanced Claim Verification
by: Magomere, Jabez, et al.
Published: (2026)
CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding
by: Liu, Yang, et al.
Published: (2024)
Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer
by: Lu, Wenquan, et al.
Published: (2025)
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
by: Gringras, David
Published: (2026)
ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
by: Zhao, Ziyu, et al.
Published: (2026)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training
by: Liu, Mingjie, et al.
Published: (2025)
Mode-Conditioning Unlocks Superior Test-Time Scaling
by: Wu, Chen Henry, et al.
Published: (2025)
Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs
by: Yang, Hongming, et al.
Published: (2025)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
by: Liu, Akide, et al.
Published: (2024)
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
by: Wang, Enguang, et al.
Published: (2024)
Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority
by: Shen, Zhanming, et al.
Published: (2026)
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
by: Xiaomi, LLM-Core, et al.
Published: (2025)
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
by: Deng, Wenhao, et al.
Published: (2025)
Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
by: Bratulić, Jelena, et al.
Published: (2025)
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
by: Luo, Ruilin, et al.
Published: (2025)
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
by: Li, Chengpeng, et al.
Published: (2024)
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
by: Li, Ziniu, et al.
Published: (2025)
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
by: Gu, Naibin, et al.
Published: (2025)
Sparse Attention Decomposition Applied to Circuit Tracing
by: Franco, Gabriel, et al.
Published: (2024)
Information-Theoretic Reward Decomposition for Generalizable RLHF
by: Mao, Liyuan, et al.
Published: (2025)
PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching
by: Chen, Ruishuo, et al.
Published: (2026)