| Main Authors: | Akewar, Mayur, Madireddy, Sandeep, Luo, Dongsheng, Bhimani, Janki |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.10246 |
Similar Items
ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
by: Luo, Xinhao, et al.
Published: (2025)
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
by: Da, Wei, et al.
Published: (2025)
Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling
by: Jadhav, Prachi, et al.
Published: (2025)
Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
by: Liao, Mengqi, et al.
Published: (2026)
B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
by: Song, Yanfei
Published: (2026)
Idiosyncrasies of Programmable Caching Engines
by: Peixoto, José, et al.
Published: (2026)
Accelerated Digital Twin Learning for Edge AI: A Comparison of FPGA and Mobile GPU
by: Xu, Bin, et al.
Published: (2025)
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)
Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM
by: Xiang, Yong, et al.
Published: (2025)
EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs
by: Kubwimana, Benjamin, et al.
Published: (2025)
SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction
by: Zhang, Wuyang, et al.
Published: (2025)
Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
by: Chen, Aodong, et al.
Published: (2023)
Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration
by: Li, Zhonggen, et al.
Published: (2025)
Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware
by: Khalil, Alex, et al.
Published: (2025)
Transforming Future Data Center Operations and Management via Physical AI
by: Cao, Zhiwei, et al.
Published: (2025)
xLLM Technical Report
by: Liu, Tongxuan, et al.
Published: (2025)
Elastic On-Device LLM Service
by: Yin, Wangsong, et al.
Published: (2024)
SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
by: Zhang, Ziyang, et al.
Published: (2025)
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
by: Xi, Shaoke, et al.
Published: (2026)
AI Benchmarks and Datasets for LLM Evaluation
by: Ivanov, Todor, et al.
Published: (2024)
Learning Provably Correct Distributed Protocols Without Human Knowledge
by: Hui, Yujie, et al.
Published: (2026)
PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants
by: Yu, Mingkun, et al.
Published: (2025)
MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
by: Liaw, Yong-Cheng, et al.
Published: (2025)
A Planet Scale Spatial-Temporal Knowledge Graph Based On OpenStreetMap And H3 Grid
by: Böckling, Martin, et al.
Published: (2024)
Revisiting Parameter Server in LLM Post-Training
by: Wan, Xinyi, et al.
Published: (2026)
Accelerating LLM Inference with Precomputed Query Storage
by: Park, Jay H., et al.
Published: (2025)
High-Throughput LLM inference on Heterogeneous Clusters
by: Xiong, Yi, et al.
Published: (2025)
Byzantine-Robust Decentralized Coordination of LLM Agents
by: Jo, Yongrae, et al.
Published: (2025)
Tutoring LLM into a Better CUDA Optimizer
by: Brabec, Matyáš, et al.
Published: (2025)
A Parallel CPU-GPU Framework for Batching Heuristic Operations in Depth-First Heuristic Search
by: Futuhi, Ehsan, et al.
Published: (2025)
A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning
by: Ogunsina, Kolawole E., et al.
Published: (2025)
LLM Inference Serving: Survey of Recent Advances and Opportunities
by: Li, Baolin, et al.
Published: (2024)
Decentralized AI: Permissionless LLM Inference on POKT Network
by: Olshansky, Daniel, et al.
Published: (2024)
ECCENTRIC: Edge-Cloud Collaboration Framework for Distributed Inference Using Knowledge Adaptation
by: Kamani, Mohammad Mahdi, et al.
Published: (2025)
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
by: Lei, Kelun, et al.
Published: (2025)
Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures
by: Gupta, Divya
Published: (2026)
LAPS: A Length-Aware-Prefill LLM Serving System
by: She, Jianshu, et al.
Published: (2026)
Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines
by: Wagenländer, Marcel, et al.
Published: (2026)
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
by: Hankendi, Can, et al.
Published: (2026)
FairBatching: Fairness-Aware Batch Formation for LLM Inference
by: Lyu, Hongtao, et al.
Published: (2025)