Library Catalog

Bibliographic Details
Main Authors:	Guha, Etash; Jiang, Tianxiao; Deng, Andrew; Zhang, Jian; Annamalai, Muthu
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning; Programming Languages
Online Access:	https://arxiv.org/abs/2511.01872

Similar Items

TileLoom: Automatic Dataflow Planning for Tile-Based Languages on Spatial Dataflow Accelerators
by: Li, Wei, et al.
Published: (2025)

Suki: Choreographed Distributed Dataflow in Rust
by: Laddad, Shadaj, et al.
Published: (2024)

Failure Transparency in Stateful Dataflow Systems (Technical Report)
by: Veresov, Aleksey, et al.
Published: (2024)

Scaling Deep Learning Training with MPMD Pipeline Parallelism
by: Xhebraj, Anxhelo, et al.
Published: (2024)

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
by: Merouani, Massinissa, et al.
Published: (2024)

PartIR: Composing SPMD Partitioning Strategies for Machine Learning
by: Alabed, Sami, et al.
Published: (2024)

Data-efficient Performance Modeling via Pre-training
by: Liu, Chunting, et al.
Published: (2025)

Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs
by: Cheng, Xinhao, et al.
Published: (2025)

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
by: Jin, Hongyi, et al.
Published: (2026)

COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments
by: Heinrich, Roman, et al.
Published: (2024)

Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration
by: Metinov, Adilet, et al.
Published: (2025)

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
by: Hu, Muyan, et al.
Published: (2026)

Morphling: Fast, Fused, and Flexible GNN Training at Scale
by: Anubhab, et al.
Published: (2025)

GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching
by: Bathie, Gabriel, et al.
Published: (2026)

veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD
by: Li, Youjie, et al.
Published: (2025)

DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
by: Yao, Xinyu, et al.
Published: (2025)

Integrated Hardware Architecture and Device Placement Search
by: Wang, Irene, et al.
Published: (2024)

Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization
by: Merouani, Massinissa, et al.
Published: (2025)

Accelerating Recommender Model ETL with a Streaming FPGA-GPU Dataflow
by: Zhu, Yu, et al.
Published: (2025)

Reward Augmentation in Reinforcement Learning for Testing Distributed Systems
by: Borgarelli, Andrea, et al.
Published: (2024)

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers
by: Hou, Bohan, et al.
Published: (2026)

Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks
by: Li, Zuguang, et al.
Published: (2024)

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
by: Yang, Mengtian, et al.
Published: (2026)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning
by: Wang, Irene, et al.
Published: (2026)

Graph Neural Networks and Reinforcement Learning for Proactive Application Image Placement
by: Makris, Antonios, et al.
Published: (2024)

Scalable Training of Mixture-of-Experts Models with Megatron Core
by: Yan, Zijie, et al.
Published: (2026)

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
by: Zhang, Zijian, et al.
Published: (2025)

Optimizing Cross-Client Domain Coverage for Federated Instruction Tuning of Large Language Models
by: Wang, Zezhou, et al.
Published: (2024)

Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
by: Che, Tianshi, et al.
Published: (2023)

CAFL-L: Constraint-Aware Federated Learning with Lagrangian Dual Optimization for On-Device Language Models
by: Zheng, Dongqi, et al.
Published: (2025)

FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
by: Wu, Feijie, et al.
Published: (2024)

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
by: Li, Wenxuan, et al.
Published: (2025)

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
by: Abhyankar, Reyna, et al.
Published: (2024)

Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming
by: Biberstein, Paul, et al.
Published: (2025)

RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
by: Jin, Chao, et al.
Published: (2024)

Unlocking Full Efficiency of Token Filtering in Large Language Model Training
by: Chai, Di, et al.
Published: (2025)

Towards Resiliency in Large Language Model Serving with KevlarFlow
by: Qian, Shangshu, et al.
Published: (2026)

Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
by: Qin, Zhen, et al.
Published: (2023)

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
by: Elbtity, Mohammed, et al.
Published: (2024)