Staff View: The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths :: Library Catalog

Bibliographic Details
Main Author:	Graziano, Marco
Format:	Preprint
Published:	2026
Subjects:	Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2603.10030

_version_	1866917362056298496
author	Graziano, Marco
author_facet	Graziano, Marco
contents	AI transport libraries move bytes efficiently, but they commonly assume that buffers are already correctly allocated, placed, shared, registered, and safe under completion and teardown pressure. This paper presents dmaplane, a Linux kernel module that makes this missing layer explicit as buffer orchestration. dmaplane exposes a stable kernel UAPI via /dev/dmaplane and composes ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel-space RDMA engine, NUMA-aware allocation and verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning. We evaluate orchestration sensitivity with measurements of NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy. We also demonstrate end-to-end disaggregated inference by transferring KV-cache chunks between two machines using RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE; we distinguish measured results from provider-independent properties by construction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_10030
institution	arXiv
publishDate	2026
record_format	arxiv
title	The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths
topic	Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2603.10030
