:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.19342

Similar Items

Reimagining Parameter Space Exploration with Diffusion Models
by: Zhang, Lijun, et al.
Published: (2025)

Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
by: Liu, Xiao, et al.
Published: (2024)

WildFit: Autonomous In-situ Model Adaptation for Resource-Constrained IoT Systems
by: Rastikerdar, Mohammad Mehdi, et al.
Published: (2024)

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
by: Guo, Xiaoke, et al.
Published: (2026)

Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
by: Ramachandran, Akshat, et al.
Published: (2025)

FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
by: Liu, Junkang, et al.
Published: (2026)

Communication-Efficient Federated Learning with Accelerated Client Gradient
by: Kim, Geeho, et al.
Published: (2022)

Accelerating Transformer Inference for Translation via Parallel Decoding
by: Santilli, Andrea, et al.
Published: (2023)

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
by: Gautam, Arpit Singh, et al.
Published: (2026)

FedSI: Federated Subnetwork Inference for Efficient Uncertainty Quantification
by: Chen, Hui, et al.
Published: (2024)

PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)

Accelerate Model Parallel Training by Using Efficient Graph Traversal Order in Device Placement
by: Wang, Tianze, et al.
Published: (2022)

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
by: Cai, Yuxuan, et al.
Published: (2025)

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
by: Zhao, Zhixiong, et al.
Published: (2025)

Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
by: Haziza, Daniel, et al.
Published: (2025)

Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
by: Jaradat, Ghadeer, et al.
Published: (2024)

Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
by: Han, Xueyuan, et al.
Published: (2024)

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
by: Skliar, Andrii, et al.
Published: (2024)

Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)

MAnchors: Memorization-Based Acceleration of Anchors via Rule Reuse and Transformation
by: Yu, Haonan, et al.
Published: (2025)

On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices
by: Huang, Lianming, et al.
Published: (2025)

MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving
by: Luo, Yichen, et al.
Published: (2025)

Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference
by: Xu, Xiangrui, et al.
Published: (2024)

Multi-task GINN-LP for Multi-target Symbolic Regression
by: Rajabu, Hussein, et al.
Published: (2025)

M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
by: Bhendawade, Nikhil, et al.
Published: (2025)

Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
by: Yan, Jiaming, et al.
Published: (2025)

GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search
by: Zhou, Ao, et al.
Published: (2025)

Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding
by: Zhang, Jiaru, et al.
Published: (2025)

BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
by: Jin, Zewen, et al.
Published: (2025)

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
by: Wu, Xiaoyou, et al.
Published: (2026)

Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
by: Dev, Arundhathi, et al.
Published: (2026)

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
by: Ji, Yuhao, et al.
Published: (2024)

Scaling On-Device GPU Inference for Large Generative Models
by: Tang, Jiuqiang, et al.
Published: (2025)

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost
by: Tan, Qitao, et al.
Published: (2025)

Inference Optimization of Foundation Models on AI Accelerators
by: Park, Youngsuk, et al.
Published: (2024)

Speculating Experts Accelerates Inference for Mixture-of-Experts
by: Madan, Vivan, et al.
Published: (2026)

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
by: Zhao, Runsong, et al.
Published: (2026)

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
by: Zhang, Yu, et al.
Published: (2024)

Bayesian Inverse Problems Meet Flow Matching: Efficient and Flexible Inference via Transformers
by: Sherki, Daniil, et al.
Published: (2025)