| Main Authors: | Dong, Ximing, Wang, Shaowei, Lin, Dayi, Chen, Boyuan, Hassan, Ahmed E. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.03708 |
Similar Items
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
by: Dong, Ximing, et al.
Published: (2025)
A Framework for Real-time Safeguarding the Text Generation of Large Language Model
by: Dong, Ximing, et al.
Published: (2024)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
by: Zheng, Wenhao, et al.
Published: (2025)
PromptExp: Multi-granularity Prompt Explanation of Large Language Models
by: Dong, Ximing, et al.
Published: (2024)
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
by: Lee, Younjoo, et al.
Published: (2026)
Iterative Layer Pruning for Efficient Translation Inference
by: Moslem, Yasmin, et al.
Published: (2025)
Model Compression and Efficient Inference for Large Language Models: A Survey
by: Wang, Wenxiao, et al.
Published: (2024)
SimLens for Early Exit in Large Language Models: Eliciting Accurate Latent Predictions with One More Token
by: Ma, Ming, et al.
Published: (2025)
Optimizing Agentic Language Model Inference via Speculative Tool Calls
by: Nichols, Daniel, et al.
Published: (2025)
ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
by: Sunesh, Aman, et al.
Published: (2026)
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
by: Moslem, Yasmin, et al.
Published: (2026)
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
by: Dong, Juechu, et al.
Published: (2024)
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
by: Liu, Chengbo, et al.
Published: (2024)
Performance Characterization of Expert Router for Scalable LLM Inference
by: Pichlmeier, Josef, et al.
Published: (2024)
AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
by: Chen, Feiyang, et al.
Published: (2025)
Speculative Decoding for Multi-Sample Inference
by: Li, Yiwei, et al.
Published: (2025)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding
by: Israel, Daniel, et al.
Published: (2025)
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
by: Fu, Tianyu, et al.
Published: (2025)
Systematic Evaluation of Optimization Techniques for Long-Context Language Models
by: Ahmed, Ammar, et al.
Published: (2025)
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
by: Hendria, Willy Fitra
Published: (2026)
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)
Bench360: Benchmarking Local LLM Inference from 360 Degrees
by: Stuhlmann, Linus, et al.
Published: (2025)
EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)
L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning
by: Singh, Raul, et al.
Published: (2025)
KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
by: Gul, Haji, et al.
Published: (2025)
LFED: A Literary Fiction Evaluation Dataset for Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
Green AI: Exploring Carbon Footprints, Mitigation Strategies, and Trade Offs in Large Language Model Training
by: Liu, Vivian, et al.
Published: (2024)
Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity
by: Song, Zichen, et al.
Published: (2024)
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)
Steering Pretrained Drafters during Speculative Decoding
by: Berdoz, Frédéric, et al.
Published: (2025)
Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations
by: Tyukin, Georgy
Published: (2024)
Investigating Execution-Aware Language Models for Code Optimization
by: Di Menna, Federico, et al.
Published: (2025)
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
by: Purohit, Kiran, et al.
Published: (2026)
TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees
by: Liu, Tianyu, et al.
Published: (2026)
ISO: Overlap of Computation and Communication within Seqenence For LLM Inference
by: Xiao, Bin, et al.
Published: (2024)
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)
A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
by: Agullo, Ferran, et al.
Published: (2025)
Energy-Aware LLMs: A step towards sustainable AI for downstream applications
by: Tran, Nguyen Phuc, et al.
Published: (2025)
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
by: Ji, Yicheng, et al.
Published: (2026)