| Main Authors: | Shahout, Rana, Liang, Cong, Xin, Shiji, Lao, Qianru, Cui, Yong, Yu, Minlan, Mitzenmacher, Michael |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.18248 |
Similar Items
From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing
by: Shahout, Rana, et al.
Published: (2025)
Orla: A Library for Serving LLM-Based Multi-Agent Systems
by: Shahout, Rana, et al.
Published: (2026)
Intra-request branch orchestration for efficient LLM reasoning
by: Jiang, Weifan, et al.
Published: (2025)
Queueing, Predictions, and LLMs: Challenges and Open Problems
by: Mitzenmacher, Michael, et al.
Published: (2025)
Learning-Based Heavy Hitters and Flow Frequency Estimation in Streams
by: Shahout, Rana, et al.
Published: (2024)
Learning-Augmented Frequency Estimation in Sliding Windows
by: Shahout, Rana, et al.
Published: (2024)
SkipPredict: When to Invest in Predictions for Scheduling
by: Shahout, Rana, et al.
Published: (2024)
Don't Stop Me Now: Embedding Based Scheduling for LLMs
by: Shahout, Rana, et al.
Published: (2024)
THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
by: Li, Minghao, et al.
Published: (2023)
Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models
by: Brown, Katrina, et al.
Published: (2026)
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
by: Hankendi, Can, et al.
Published: (2026)
EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge
by: Lao, ChonLam, et al.
Published: (2024)
Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision
by: Cui, Jiali, et al.
Published: (2026)
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
by: Yue, Yang, et al.
Published: (2024)
Conda: Column-Normalized Adam for Training Large Language Models Faster
by: Wang, Junjie, et al.
Published: (2025)
Federated Learning Clients Clustering with Adaptation to Data Drifts
by: Li, Minghao, et al.
Published: (2024)
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)
An LLM-based Agentic Framework for Accessible Network Control
by: Lin, Samuel, et al.
Published: (2025)
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
by: Bajpai, Divya Jyoti, et al.
Published: (2025)
SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads
by: Lao, Jiale, et al.
Published: (2025)
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
by: Lin, Huawei, et al.
Published: (2025)
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
by: Chen, Sihan, et al.
Published: (2025)
Model-Distributed Inference for Large Language Models at the Edge
by: Macario, Davide, et al.
Published: (2025)
Fast Large Language Model Collaborative Decoding via Speculation
by: Fu, Jiale, et al.
Published: (2025)
HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference
by: Zhang, Zeyu, et al.
Published: (2025)
Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
by: Chen, Sijia, et al.
Published: (2024)
Over-Searching in Search-Augmented Large Language Models
by: Xie, Roy, et al.
Published: (2026)
The Shape of Wisdom: Decision Trajectories in Language Models
by: Rana, Shailesh
Published: (2026)
LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
by: Li, Junsong, et al.
Published: (2025)
Block Transformer: Global-to-Local Language Modeling for Fast Inference
by: Ho, Namgyu, et al.
Published: (2024)
A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks
by: Schmied, Thomas, et al.
Published: (2024)
Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs
by: Feng, Yushi, et al.
Published: (2025)
Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models
by: Cui, Andrea Yaoyun, et al.
Published: (2025)
Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate
by: Fang, Zhiyuan, et al.
Published: (2025)
Efficient Large Language Model Inference with Neural Block Linearization
by: Erdogan, Mete, et al.
Published: (2025)
Automatic Calibration for Membership Inference Attack on Large Language Models
by: Zade, Saleh Zare, et al.
Published: (2025)
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
by: Zhao, Yao, et al.
Published: (2023)
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
by: Cai, Yuxuan, et al.
Published: (2025)
Theoretical Modeling of Large Language Model Self-Improvement Training Dynamics Through Solver-Verifier Gap
by: Sun, Yifan, et al.
Published: (2025)
Consistency Models for Scalable and Fast Simulation-Based Inference
by: Schmitt, Marvin, et al.
Published: (2023)