Saved in:
Bibliographic Details
Main Authors: Chakraborty, Jayjeet, Dorier, Matthieu, Carns, Philip, Ross, Robert, Maltzahn, Carlos, Litz, Heiner
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2412.02192
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866913595439185920
author Chakraborty, Jayjeet
Dorier, Matthieu
Carns, Philip
Ross, Robert
Maltzahn, Carlos
Litz, Heiner
author_facet Chakraborty, Jayjeet
Dorier, Matthieu
Carns, Philip
Ross, Robert
Maltzahn, Carlos
Litz, Heiner
contents The volume of data generated and stored in contemporary global data centers is experiencing exponential growth. This rapid data growth necessitates efficient processing and analysis to extract valuable business insights. In distributed data processing systems, data undergoes exchanges between the compute servers that contribute significantly to the total data processing duration in adequately large clusters, necessitating efficient data transport protocols. Traditionally, data transport frameworks such as JDBC and ODBC have used TCP/IP-over-Ethernet as their underlying network protocol. Such frameworks require serializing the data into a single contiguous buffer before handing it off to the network card, primarily due to the requirement of contiguous data in TCP/IP. In OLAP use cases, this serialization process is costly for columnar data batches as it involves numerous memory copies that hurt data transport duration and overall data processing performance. We study the serialization overhead in the context of a widely-used columnar data format, Apache Arrow, and propose leveraging RDMA to transport Arrow data over Infiniband in a zero-copy manner. We design and implement Thallus, an RDMA-based columnar data transport protocol for Apache Arrow based on the Thallium framework from the Mochi ecosystem, compare it with a purely Thallium RPC-based implementation, and show substantial performance improvements can be achieved by using RDMA for columnar data transport.
format Preprint
id arxiv_https___arxiv_org_abs_2412_02192
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Thallus: An RDMA-based Columnar Data Transport Protocol
Chakraborty, Jayjeet
Dorier, Matthieu
Carns, Philip
Ross, Robert
Maltzahn, Carlos
Litz, Heiner
Distributed, Parallel, and Cluster Computing
Databases
Operating Systems
The volume of data generated and stored in contemporary global data centers is experiencing exponential growth. This rapid data growth necessitates efficient processing and analysis to extract valuable business insights. In distributed data processing systems, data undergoes exchanges between the compute servers that contribute significantly to the total data processing duration in adequately large clusters, necessitating efficient data transport protocols. Traditionally, data transport frameworks such as JDBC and ODBC have used TCP/IP-over-Ethernet as their underlying network protocol. Such frameworks require serializing the data into a single contiguous buffer before handing it off to the network card, primarily due to the requirement of contiguous data in TCP/IP. In OLAP use cases, this serialization process is costly for columnar data batches as it involves numerous memory copies that hurt data transport duration and overall data processing performance. We study the serialization overhead in the context of a widely-used columnar data format, Apache Arrow, and propose leveraging RDMA to transport Arrow data over Infiniband in a zero-copy manner. We design and implement Thallus, an RDMA-based columnar data transport protocol for Apache Arrow based on the Thallium framework from the Mochi ecosystem, compare it with a purely Thallium RPC-based implementation, and show substantial performance improvements can be achieved by using RDMA for columnar data transport.
title Thallus: An RDMA-based Columnar Data Transport Protocol
topic Distributed, Parallel, and Cluster Computing
Databases
Operating Systems
url https://arxiv.org/abs/2412.02192