Library Catalog

Bibliographic Details
Main Authors:	Dong, Ximing, Wang, Shaowei, Lin, Dayi, Chen, Boyuan, Hassan, Ahmed E.
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Performance
Online Access:	https://arxiv.org/abs/2602.03708

Similar Items

Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
by: Dong, Ximing, et al.
Published: (2025)

A Framework for Real-time Safeguarding the Text Generation of Large Language Model
by: Dong, Ximing, et al.
Published: (2024)

CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
by: Zheng, Wenhao, et al.
Published: (2025)

PromptExp: Multi-granularity Prompt Explanation of Large Language Models
by: Dong, Ximing, et al.
Published: (2024)

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
by: Lee, Younjoo, et al.
Published: (2026)

Iterative Layer Pruning for Efficient Translation Inference
by: Moslem, Yasmin, et al.
Published: (2025)

Model Compression and Efficient Inference for Large Language Models: A Survey
by: Wang, Wenxiao, et al.
Published: (2024)

SimLens for Early Exit in Large Language Models: Eliciting Accurate Latent Predictions with One More Token
by: Ma, Ming, et al.
Published: (2025)

Optimizing Agentic Language Model Inference via Speculative Tool Calls
by: Nichols, Daniel, et al.
Published: (2025)

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
by: Sunesh, Aman, et al.
Published: (2026)

Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
by: Moslem, Yasmin, et al.
Published: (2026)

Flex Attention: A Programming Model for Generating Optimized Attention Kernels
by: Dong, Juechu, et al.
Published: (2024)

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
by: Liu, Chengbo, et al.
Published: (2024)

Performance Characterization of Expert Router for Scalable LLM Inference
by: Pichlmeier, Josef, et al.
Published: (2024)

AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
by: Chen, Feiyang, et al.
Published: (2025)

Speculative Decoding for Multi-Sample Inference
by: Li, Yiwei, et al.
Published: (2025)

Accelerating Diffusion LLMs via Adaptive Parallel Decoding
by: Israel, Daniel, et al.
Published: (2025)

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
by: Fu, Tianyu, et al.
Published: (2025)

Systematic Evaluation of Optimization Techniques for Long-Context Language Models
by: Ahmed, Ammar, et al.
Published: (2025)

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
by: Hendria, Willy Fitra
Published: (2026)

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

Bench360: Benchmarking Local LLM Inference from 360 Degrees
by: Stuhlmann, Linus, et al.
Published: (2025)

EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)

L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning
by: Singh, Raul, et al.
Published: (2025)

KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
by: Gul, Haji, et al.
Published: (2025)

LFED: A Literary Fiction Evaluation Dataset for Large Language Models
by: Yu, Linhao, et al.
Published: (2024)

Green AI: Exploring Carbon Footprints, Mitigation Strategies, and Trade Offs in Large Language Model Training
by: Liu, Vivian, et al.
Published: (2024)

Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity
by: Song, Zichen, et al.
Published: (2024)

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)

Steering Pretrained Drafters during Speculative Decoding
by: Berdoz, Frédéric, et al.
Published: (2025)

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations
by: Tyukin, Georgy
Published: (2024)

Investigating Execution-Aware Language Models for Code Optimization
by: Di Menna, Federico, et al.
Published: (2025)

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
by: Purohit, Kiran, et al.
Published: (2026)

TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees
by: Liu, Tianyu, et al.
Published: (2026)

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference
by: Xiao, Bin, et al.
Published: (2024)

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)

A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
by: Agullo, Ferran, et al.
Published: (2025)

Energy-Aware LLMs: A step towards sustainable AI for downstream applications
by: Tran, Nguyen Phuc, et al.
Published: (2025)

See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
by: Ji, Yicheng, et al.
Published: (2026)