| Main Authors: | Guha, Etash, Jiang, Tianxiao, Deng, Andrew, Zhang, Jian, Annamalai, Muthu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.01872 |
Similar Items
TileLoom: Automatic Dataflow Planning for Tile-Based Languages on Spatial Dataflow Accelerators
by: Li, Wei, et al.
Published: (2025)
Suki: Choreographed Distributed Dataflow in Rust
by: Laddad, Shadaj, et al.
Published: (2024)
Failure Transparency in Stateful Dataflow Systems (Technical Report)
by: Veresov, Aleksey, et al.
Published: (2024)
Scaling Deep Learning Training with MPMD Pipeline Parallelism
by: Xhebraj, Anxhelo, et al.
Published: (2024)
LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
by: Merouani, Massinissa, et al.
Published: (2024)
PartIR: Composing SPMD Partitioning Strategies for Machine Learning
by: Alabed, Sami, et al.
Published: (2024)
Data-efficient Performance Modeling via Pre-training
by: Liu, Chunting, et al.
Published: (2025)
Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs
by: Cheng, Xinhao, et al.
Published: (2025)
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
by: Jin, Hongyi, et al.
Published: (2026)
COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments
by: Heinrich, Roman, et al.
Published: (2024)
Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration
by: Metinov, Adilet, et al.
Published: (2025)
VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
by: Hu, Muyan, et al.
Published: (2026)
Morphling: Fast, Fused, and Flexible GNN Training at Scale
by: Anubhab, et al.
Published: (2025)
GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching
by: Bathie, Gabriel, et al.
Published: (2026)
veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD
by: Li, Youjie, et al.
Published: (2025)
DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
by: Yao, Xinyu, et al.
Published: (2025)
Integrated Hardware Architecture and Device Placement Search
by: Wang, Irene, et al.
Published: (2024)
Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization
by: Merouani, Massinissa, et al.
Published: (2025)
Accelerating Recommender Model ETL with a Streaming FPGA-GPU Dataflow
by: Zhu, Yu, et al.
Published: (2025)
Reward Augmentation in Reinforcement Learning for Testing Distributed Systems
by: Borgarelli, Andrea, et al.
Published: (2024)
Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers
by: Hou, Bohan, et al.
Published: (2026)
Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks
by: Li, Zuguang, et al.
Published: (2024)
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
by: Yang, Mengtian, et al.
Published: (2026)
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)
NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning
by: Wang, Irene, et al.
Published: (2026)
Graph Neural Networks and Reinforcement Learning for Proactive Application Image Placement
by: Makris, Antonios, et al.
Published: (2024)
Scalable Training of Mixture-of-Experts Models with Megatron Core
by: Yan, Zijie, et al.
Published: (2026)
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
by: Zhang, Zijian, et al.
Published: (2025)
Optimizing Cross-Client Domain Coverage for Federated Instruction Tuning of Large Language Models
by: Wang, Zezhou, et al.
Published: (2024)
Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
by: Che, Tianshi, et al.
Published: (2023)
CAFL-L: Constraint-Aware Federated Learning with Lagrangian Dual Optimization for On-Device Language Models
by: Zheng, Dongqi, et al.
Published: (2025)
FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
by: Wu, Feijie, et al.
Published: (2024)
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
by: Li, Wenxuan, et al.
Published: (2025)
InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
by: Abhyankar, Reyna, et al.
Published: (2024)
Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming
by: Biberstein, Paul, et al.
Published: (2025)
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
by: Jin, Chao, et al.
Published: (2024)
Unlocking Full Efficiency of Token Filtering in Large Language Model Training
by: Chai, Di, et al.
Published: (2025)
Towards Resiliency in Large Language Model Serving with KevlarFlow
by: Qian, Shangshu, et al.
Published: (2026)
Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
by: Qin, Zhen, et al.
Published: (2023)
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
by: Elbtity, Mohammed, et al.
Published: (2024)