| Main Authors: | Luo, Shuqing; Guan, Yilin; Li, Pingzhi; Wang, Hanrui; Chen, Tianlong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.07486 |
Similar Items
Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
by: Luo, Shuqing, et al.
Published: (2024)
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
by: Hu, Yuxuan, et al.
Published: (2026)
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)
ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)
AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR
by: Nguyen, Tuan, et al.
Published: (2025)
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
by: Xiong, Jing, et al.
Published: (2025)
Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
by: Sun, Shengyin, et al.
Published: (2025)
Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)
QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts
by: Li, Pingzhi, et al.
Published: (2024)
Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent
by: Li, Pingzhi, et al.
Published: (2024)
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
by: Wang, Hanrui, et al.
Published: (2020)
Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design
by: Zhang, Mohan, et al.
Published: (2025)
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
by: Li, Pingzhi, et al.
Published: (2023)
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
by: Liu, Jie, et al.
Published: (2026)
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
by: Li, Pingzhi, et al.
Published: (2025)
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
by: Wang, Yiming, et al.
Published: (2025)
Parallel Loop Transformer for Efficient Test-Time Computation Scaling
by: Wu, Bohong, et al.
Published: (2025)
Glider: Global and Local Instruction-Driven Expert Router
by: Li, Pingzhi, et al.
Published: (2024)
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
by: Chen, Zigeng, et al.
Published: (2024)
AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
by: Cui, Chengming, et al.
Published: (2026)
Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures
by: Luo, Shuqing, et al.
Published: (2026)
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
by: Luo, Hanrui, et al.
Published: (2026)
TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
by: Wei, Linye, et al.
Published: (2026)
Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention
by: Xiao, Emily, et al.
Published: (2025)
Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
by: Zhao, Shuqing
Published: (2026)
PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2024)
Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
by: Kuang, Peng, et al.
Published: (2025)
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
by: Li, Pengxiang, et al.
Published: (2026)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
by: Zhang, Yihua, et al.
Published: (2024)
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild
by: Zhao, Xinyu, et al.
Published: (2024)
MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts
by: Chen, Guanjie, et al.
Published: (2024)
Iterative Deepening Sampling as Efficient Test-Time Scaling
by: Chen, Weizhe, et al.
Published: (2025)
MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
by: Chen, Hongwei, et al.
Published: (2025)
Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
by: Luo, Xiangzhong, et al.
Published: (2026)
AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory
by: Ma, Lianjie, et al.
Published: (2026)
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
by: MiniMax, et al.
Published: (2025)
DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
by: Tan, Zhen, et al.
Published: (2024)
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
by: Ghasemabadi, Amirhosein, et al.
Published: (2025)
Accelerate Speculative Decoding with Sparse Computation in Verification
by: Wang, Jikai, et al.
Published: (2025)