Staff View: Pruning in Snowflake: Working Smarter, Not Harder :: Library Catalog

Bibliographic Details
Main Authors:	Zimmerer, Andreas; Dam, Damien; Kossmann, Jan; Waack, Juliane; Oukid, Ismail; Kipf, Andreas
Format:	Preprint
Published:	2025
Subjects:	Databases
Online Access:	https://arxiv.org/abs/2504.11540

_version_	1866913903077752832
author	Zimmerer, Andreas; Dam, Damien; Kossmann, Jan; Waack, Juliane; Oukid, Ismail; Kipf, Andreas
author_facet	Zimmerer, Andreas; Dam, Damien; Kossmann, Jan; Waack, Juliane; Oukid, Ismail; Kipf, Andreas
contents	Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning - skipping chunks of data that do not contain relevant information for computing query results. While partition pruning based on query predicates is a well-established technique, we present new pruning techniques that extend the scope of partition pruning to LIMIT, top-k, and JOIN operations, significantly expanding the opportunities for pruning across diverse query types. We detail the implementation of each method and examine their impact on real-world workloads. Our analysis of Snowflake's production workloads reveals that real-world analytical queries exhibit much higher selectivity than commonly assumed, yielding effective partition pruning and highlighting the need for more realistic benchmarks. We show that we can harness high selectivity by utilizing min/max metadata available in modern data analytics systems and data lake formats like Apache Iceberg, reducing the number of processed micro-partitions by 99.4% across the Snowflake data platform.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_11540
institution	arXiv
publishDate	2025
record_format	arxiv
title	Pruning in Snowflake: Working Smarter, Not Harder
topic	Databases
url	https://arxiv.org/abs/2504.11540
