| Main authors: | Tian, Jian; Li, Shuailong; Cao, Yang; Cui, Wenbo; Zhu, Minghan; Wu, Wenkang; Zhang, Jianming; Wang, Yanpeng; Xiao, Zhiwen; Hou, Zhenyu; Shen, Dou |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2512.16134 |
Related records
Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching
by: Pang, Bowen, et al.
Published: (2025)
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
by: Chen, Qiaoling, et al.
Published: (2026)
Joint Optimization of Offloading, Batching and DVFS for Multiuser Co-Inference
by: Xu, Yaodan, et al.
Published: (2025)
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
by: Zheng, Zhen, et al.
Published: (2024)
HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
by: Chen, Jiabin, et al.
Published: (2024)
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)
Batch-Schedule-Execute: On Optimizing Concurrent Deterministic Scheduling for Blockchains (Extended Version)
by: Hay, Yaron, et al.
Published: (2024)
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
by: Xu, Tairan, et al.
Published: (2025)
ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments
by: Zhou, Ao, et al.
Published: (2025)
Multi-Bin Batching for Increasing LLM Inference Throughput
by: Guldogan, Ozgur, et al.
Published: (2024)
On the Efficiency of Dynamic Transaction Scheduling in Blockchain Sharding
by: Adhikari, Ramesh, et al.
Published: (2025)
FairBatching: Fairness-Aware Batch Formation for LLM Inference
by: Lyu, Hongtao, et al.
Published: (2025)
AntBatchInfer: Elastic Batch Inference in the Kubernetes Cluster
by: Li, Siyuan, et al.
Published: (2024)
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
by: Li, Yan, et al.
Published: (2025)
TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
by: Zhang, Hongbin, et al.
Published: (2025)
SLO-Aware Scheduling for Large Language Model Inferences
by: Huang, Jinqi, et al.
Published: (2025)
Towards Energy Efficient Co-Scheduling in HPC
by: Zheng, Zhong, et al.
Published: (2026)
A HPC Co-Scheduler with Reinforcement Learning
by: Souza, Abel, et al.
Published: (2024)
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
by: Sakip, Akhmed, et al.
Published: (2026)
Round-optimal $n$-Block Broadcast Schedules in Logarithmic Time
by: Träff, Jesper Larsson
Published: (2023)
Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)
Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)
Towards Fast Setup and High Throughput of GPU Serverless Computing
by: Zhao, Han, et al.
Published: (2024)
Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling
by: Chen, Lequn, et al.
Published: (2023)
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
by: Wang, Qipeng
Published: (2026)
SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices
by: Chow, Will
Published: (2025)
Research on fault diagnosis and root cause analysis based on full stack observability
by: Hou, Jian
Published: (2025)
Scaling Up Throughput-oriented LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management
by: Phung, Thanh Son, et al.
Published: (2025)
Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size
by: Huertas, Jorge A., et al.
Published: (2025)
AdaOper: Energy-efficient and Responsive Concurrent DNN Inference on Mobile Devices
by: Lin, Zheng, et al.
Published: (2024)
Adaptive Heuristics for Scheduling DNN Inferencing on Edge and Cloud for Personalized UAV Fleets
by: Raj, Suman, et al.
Published: (2024)
SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge
by: Wolfrath, Joel, et al.
Published: (2025)
DARIS: An Oversubscribed Spatio-Temporal Scheduler for Real-Time DNN Inference on GPUs
by: Babaei, Amir Fakhim, et al.
Published: (2025)
Orchestrating Joint Offloading and Scheduling for Low-Latency Edge SLAM
by: Zhang, Yao, et al.
Published: (2025)
CoCoI: Distributed Coded Inference System for Straggler Mitigation
by: Liu, Xing, et al.
Published: (2025)
Designing Co-operation in Systems of Hierarchical, Multi-objective Schedulers for Stream Processing
by: Dangwal, Animesh, et al.
Published: (2025)
AdaBridge: Dynamic Data and Computation Reuse for Efficient Multi-task DNN Co-evolution in Edge Systems
by: Wang, Lehao, et al.
Published: (2024)
KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)
Night-Window Batching versus Carbon-Aware Scheduling for Clinical AI GPU Workloads
by: Doshi, Nishi, et al.
Published: (2026)