| Main Authors: | Zhao, Youpeng, LV, Jinpeng, Wu, Di, Wang, Jun, Gooley, Christopher |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.19645 |
Similar Items
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
by: Zhao, Youpeng, et al.
Published: (2024)
ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
by: Zhao, Youpeng, et al.
Published: (2024)
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
by: Jayakody, Shakya, et al.
Published: (2026)
The Race to Efficiency: A New Perspective on AI Scaling Laws
by: Lu, Chien-Ping
Published: (2025)
Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs
by: Liu, Yi
Published: (2026)
AI-driven Java Performance Testing: Balancing Result Quality with Testing Time
by: Traini, Luca, et al.
Published: (2024)
Generalizing Scaling Laws for Dense and Sparse Large Language Models
by: Hossain, Md Arafat, et al.
Published: (2025)
Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations
by: Jha, Mayank
Published: (2026)
Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience
by: Biran, Yahav, et al.
Published: (2025)
DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads
by: Zhao, Qidong, et al.
Published: (2024)
Offloading and Quality Control for AI Generated Content Services in 6G Mobile Edge Computing Networks
by: Wang, Yitong, et al.
Published: (2023)
Comment on paper: Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems
by: Min, Yimeng
Published: (2024)
Can We Make Code Green? Understanding Trade-Offs in LLMs vs. Human Code Optimizations
by: Rani, Pooja, et al.
Published: (2025)
A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search
by: Ellis-Mohr, Austin R., et al.
Published: (2025)
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
by: Liu, Jiashuo, et al.
Published: (2025)
Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments
by: Zhu, Yuhan, et al.
Published: (2024)
Improving LLM Performance Through Black-Box Online Tuning: A Case for Adding System Specs to Factsheets for Trusted AI
by: Atinafu, Yonas, et al.
Published: (2026)
Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge
by: Dan, Jonathan, et al.
Published: (2025)
TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
by: Liu, Xiaoxuan, et al.
Published: (2024)
Rapid Augmentations for Time Series (RATS): A High-Performance Library for Time Series Augmentation
by: Skaf, Wadie, et al.
Published: (2026)
A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
by: Singh, Siddharth, et al.
Published: (2023)
DDSA: Dual-Domain Strategic Attack for Spatial-Temporal Efficiency in Adversarial Robustness Testing
by: Hu, Jinwei, et al.
Published: (2026)
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
by: Liao, Gang, et al.
Published: (2025)
Research on Low-Latency Inference and Training Efficiency Optimization for Graph Neural Network and Large Language Model-Based Recommendation Systems
by: Zhao, Yushang, et al.
Published: (2025)
Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles
by: Gallici, Matteo, et al.
Published: (2025)
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
by: Tithi, Jesmin Jahan, et al.
Published: (2025)
STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training with Spatio-Temporal Planning
by: Huang, Zixiao, et al.
Published: (2025)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
by: Qiao, Liang, et al.
Published: (2025)
XTC, A Research Platform for Optimizing AI Workload Operators
by: Hugo, Pompougnac, et al.
Published: (2025)
On the Compression of Language Models for Code: An Empirical Study on CodeBERT
by: d'Aloisio, Giordano, et al.
Published: (2024)
LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs
by: Cai, Yanan, et al.
Published: (2025)
Energy-Efficient Transformer Inference: Optimization Strategies for Time Series Classification
by: Kermani, Arshia, et al.
Published: (2025)
Energy Concerns with HPC Systems and Applications
by: Nana, Roblex, et al.
Published: (2023)
This Is Taking Too Long -- Investigating Time as a Proxy for Energy Consumption of LLMs
by: Krupp, Lars, et al.
Published: (2026)
Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices
by: Yan, Xiao, et al.
Published: (2025)
Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
by: Huang, Zixiao, et al.
Published: (2025)
Revolutionizing System Reliability: The Role of AI in Predictive Maintenance Strategies
by: Bidollahkhani, Michael, et al.
Published: (2024)
Performance Modeling of Data Storage Systems using Generative Models
by: Al-Maeeni, Abdalaziz Rashid, et al.
Published: (2023)
Tanz der Dinge/Things that dance
Published: (2021)