Bibliographic Details
Main Authors: Pace, Weston; She, Chang; Xu, Lei; Jones, Will; Lockett, Albert; Wang, Jun; Shah, Raunak
Format: Preprint
Published: 2025
Subjects: Databases; H.3.2
Online Access: https://arxiv.org/abs/2504.15247
author Pace, Weston
She, Chang
Xu, Lei
Jones, Will
Lockett, Albert
Wang, Jun
Shah, Raunak
contents The growing interest in artificial intelligence has created workloads that require both sequential and random access. At the same time, NVMe-backed storage solutions have emerged, providing caching capability for large columnar datasets in cloud storage. Current columnar storage libraries fall short of effectively utilizing an NVMe device's capabilities, especially when it comes to random access. Historically, this has been assumed an implicit weakness in columnar storage formats, but this has not been sufficiently explored. In this paper, we examine the effectiveness of popular columnar formats such as Apache Arrow, Apache Parquet, and Lance in both random access and full scan tasks against NVMe storage. We argue that effective encoding of a column's structure, such as the repetition and validity information, is the key to unlocking the disk's performance. We show that Parquet, when configured correctly, can achieve over 60x better random access performance than default settings. We also show that this high random access performance requires making minor trade-offs in scan performance and RAM utilization. We then describe the Lance structural encoding scheme, which alternates between two different structural encodings based on data width, and achieves better random access performance without making trade-offs in scan performance or RAM utilization.
format Preprint
id arxiv_https___arxiv_org_abs_2504_15247
institution arXiv
publishDate 2025
record_format arxiv
title Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings
topic Databases
H.3.2
url https://arxiv.org/abs/2504.15247