:: Library Catalog

Bibliographic Details
Main Authors:	Luo, Shuqing; Guan, Yilin; Li, Pingzhi; Wang, Hanrui; Chen, Tianlong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.07486

Similar Items

Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
by: Luo, Shuqing, et al.
Published: (2024)

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
by: Hu, Yuxuan, et al.
Published: (2026)

Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)

ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR
by: Nguyen, Tuan, et al.
Published: (2025)

ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
by: Xiong, Jing, et al.
Published: (2025)

Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
by: Sun, Shengyin, et al.
Published: (2025)

Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)

QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts
by: Li, Pingzhi, et al.
Published: (2024)

Q-Newton: Hybrid Quantum-Classical Scheduling for Accelerating Neural Network Training with Newton's Gradient Descent
by: Li, Pingzhi, et al.
Published: (2024)

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
by: Wang, Hanrui, et al.
Published: (2020)

Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design
by: Zhang, Mohan, et al.
Published: (2025)

Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
by: Li, Pingzhi, et al.
Published: (2023)

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
by: Liu, Jie, et al.
Published: (2026)

DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
by: Li, Pingzhi, et al.
Published: (2025)

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
by: Wang, Yiming, et al.
Published: (2025)

Parallel Loop Transformer for Efficient Test-Time Computation Scaling
by: Wu, Bohong, et al.
Published: (2025)

Glider: Global and Local Instruction-Driven Expert Router
by: Li, Pingzhi, et al.
Published: (2024)

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
by: Chen, Zigeng, et al.
Published: (2024)

AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
by: Cui, Chengming, et al.
Published: (2026)

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures
by: Luo, Shuqing, et al.
Published: (2026)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
by: Luo, Hanrui, et al.
Published: (2026)

TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
by: Wei, Linye, et al.
Published: (2026)

Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention
by: Xiao, Emily, et al.
Published: (2025)

Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
by: Zhao, Shuqing
Published: (2026)

PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2024)

Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
by: Kuang, Peng, et al.
Published: (2025)

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
by: Li, Pengxiang, et al.
Published: (2026)

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
by: Zhang, Yihua, et al.
Published: (2024)

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild
by: Zhao, Xinyu, et al.
Published: (2024)

MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts
by: Chen, Guanjie, et al.
Published: (2024)

Iterative Deepening Sampling as Efficient Test-Time Scaling
by: Chen, Weizhe, et al.
Published: (2025)

MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
by: Chen, Hongwei, et al.
Published: (2025)

Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
by: Luo, Xiangzhong, et al.
Published: (2026)

AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory
by: Ma, Lianjie, et al.
Published: (2026)

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
by: MiniMax, et al.
Published: (2025)

DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
by: Tan, Zhen, et al.
Published: (2024)

Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
by: Ghasemabadi, Amirhosein, et al.
Published: (2025)

Accelerate Speculative Decoding with Sparse Computation in Verification
by: Wang, Jikai, et al.
Published: (2025)