| Main Authors: | Liu, Xiao, Zhang, Lijun, Ganesan, Deepak, Guan, Hui |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.19342 |
Similar Items
Reimagining Parameter Space Exploration with Diffusion Models
by: Zhang, Lijun, et al.
Published: (2025)
Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
by: Liu, Xiao, et al.
Published: (2024)
WildFit: Autonomous In-situ Model Adaptation for Resource-Constrained IoT Systems
by: Rastikerdar, Mohammad Mehdi, et al.
Published: (2024)
ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
by: Guo, Xiaoke, et al.
Published: (2026)
Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
by: Ramachandran, Akshat, et al.
Published: (2025)
FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
by: Liu, Junkang, et al.
Published: (2026)
Communication-Efficient Federated Learning with Accelerated Client Gradient
by: Kim, Geeho, et al.
Published: (2022)
Accelerating Transformer Inference for Translation via Parallel Decoding
by: Santilli, Andrea, et al.
Published: (2023)
RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
by: Gautam, Arpit Singh, et al.
Published: (2026)
FedSI: Federated Subnetwork Inference for Efficient Uncertainty Quantification
by: Chen, Hui, et al.
Published: (2024)
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)
Accelerate Model Parallel Training by Using Efficient Graph Traversal Order in Device Placement
by: Wang, Tianze, et al.
Published: (2022)
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
by: Cai, Yuxuan, et al.
Published: (2025)
QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
by: Zhao, Zhixiong, et al.
Published: (2025)
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
by: Haziza, Daniel, et al.
Published: (2025)
Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
by: Jaradat, Ghadeer, et al.
Published: (2024)
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
by: Han, Xueyuan, et al.
Published: (2024)
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
by: Skliar, Andrii, et al.
Published: (2024)
Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)
MAnchors: Memorization-Based Acceleration of Anchors via Rule Reuse and Transformation
by: Yu, Haonan, et al.
Published: (2025)
On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices
by: Huang, Lianming, et al.
Published: (2025)
MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving
by: Luo, Yichen, et al.
Published: (2025)
Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference
by: Xu, Xiangrui, et al.
Published: (2024)
Multi-task GINN-LP for Multi-target Symbolic Regression
by: Rajabu, Hussein, et al.
Published: (2025)
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
by: Bhendawade, Nikhil, et al.
Published: (2025)
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
by: Yan, Jiaming, et al.
Published: (2025)
GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search
by: Zhou, Ao, et al.
Published: (2025)
Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding
by: Zhang, Jiaru, et al.
Published: (2025)
BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
by: Jin, Zewen, et al.
Published: (2025)
BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
by: Wu, Xiaoyou, et al.
Published: (2026)
Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
by: Dev, Arundhathi, et al.
Published: (2026)
Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
by: Ji, Yuhao, et al.
Published: (2024)
Scaling On-Device GPU Inference for Large Generative Models
by: Tang, Jiuqiang, et al.
Published: (2025)
End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost
by: Tan, Qitao, et al.
Published: (2025)
Inference Optimization of Foundation Models on AI Accelerators
by: Park, Youngsuk, et al.
Published: (2024)
Speculating Experts Accelerates Inference for Mixture-of-Experts
by: Madan, Vivan, et al.
Published: (2026)
CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
by: Zhao, Runsong, et al.
Published: (2026)
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
by: Zhang, Yu, et al.
Published: (2024)
Bayesian Inverse Problems Meet Flow Matching: Efficient and Flexible Inference via Transformers
by: Sherki, Daniil, et al.
Published: (2025)