| Main Authors: | Chen, Huamin, Liu, Xunzhuo, Jiang, Junchen, He, Bowei, Liu, Xue |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.13426 |
Similar Items
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
by: Chen, Huamin, et al.
Published: (2026)
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
by: Chen, Huamin, et al.
Published: (2026)
Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL
by: Liu, Xunzhuo, et al.
Published: (2026)
When to Reason: Semantic Router for vLLM
by: Wang, Chen, et al.
Published: (2025)
Category-Aware Semantic Caching for Heterogeneous LLM Workloads
by: Wang, Chen, et al.
Published: (2025)
From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification
by: Chen, Huamin, et al.
Published: (2026)
98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
by: Liu, Xunzhuo, et al.
Published: (2026)
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
by: Chen, Huamin, et al.
Published: (2026)
inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
by: Chen, Huamin, et al.
Published: (2026)
The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency
by: Chen, Huamin, et al.
Published: (2026)
Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks
by: Bao, Rui, et al.
Published: (2025)
Lodestar: An Online-Learning LLM Inference Router
by: Lim, Gangmuk, et al.
Published: (2026)
TensorOpera Router: A Multi-Model Router for Efficient LLM Inference
by: Stripelis, Dimitris, et al.
Published: (2024)
ICL-Router: In-Context Learned Model Representations for LLM Routing
by: Wang, Chenxu, et al.
Published: (2025)
OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning
by: Bao, Zhenghua, et al.
Published: (2026)
RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models
by: Chen, Shuhao, et al.
Published: (2024)
xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
by: Qian, Cheng, et al.
Published: (2025)
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
by: Liu, Junnan, et al.
Published: (2026)
RouterBench: A Benchmark for Multi-LLM Routing System
by: Hu, Qitian Jason, et al.
Published: (2024)
Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
by: Sung, Mingyu, et al.
Published: (2025)
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
by: Lu, Guanxi, et al.
Published: (2025)
Beyond One-Way Pruning: Bidirectional Pruning-Regrowth for Extreme Accuracy-Sparsity Tradeoff
by: Liu, Junchen, et al.
Published: (2025)
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
by: Ma, Wenhan, et al.
Published: (2025)
LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
by: Huang, Jin, et al.
Published: (2025)
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
by: He, Junhui, et al.
Published: (2024)
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
by: Chen, Hongxu, et al.
Published: (2024)
Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM
by: Liu, Yibai, et al.
Published: (2025)
ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
by: Zhu, Xiongwei, et al.
Published: (2026)
Learning to Select the Best Forecasting Tasks for Clinical Outcome Prediction
by: Xue, Yuan, et al.
Published: (2024)
Treatment-Aware Hyperbolic Representation Learning for Causal Effect Estimation with Social Networks
by: Cui, Ziqiang, et al.
Published: (2024)
ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers
by: Patel, Shivam, et al.
Published: (2025)
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
by: Jeon, Wonseok, et al.
Published: (2024)
Semantic Scheduling for LLM Inference
by: Hua, Wenyue, et al.
Published: (2025)
Repeat-Aware Neighbor Sampling for Dynamic Graph Learning
by: Zou, Tao, et al.
Published: (2024)
Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
by: Li, Yibo, et al.
Published: (2026)
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
by: Jin, Hongye, et al.
Published: (2024)
ALADIN: Accuracy-Latency-Aware Design-space Inference Analysis for Embedded AI Accelerators
by: Baldi, T., et al.
Published: (2026)
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
by: Huang, Hanxian, et al.
Published: (2026)
A Middle Path for On-Premises LLM Deployment: Preserving Privacy Without Sacrificing Model Confidentiality
by: Huang, Hanbo, et al.
Published: (2024)