Saved in:
Bibliographic Details
Main Authors: Prabhu, Ritvik, Vatai, Emil, Moussad, Bernard, Jeannot, Emmanuel, Anandakrishnan, Ramu, Feng, Wu-chun, Wahib, Mohamed
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2603.16721
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866912971178901504
author Prabhu, Ritvik
Vatai, Emil
Moussad, Bernard
Jeannot, Emmanuel
Anandakrishnan, Ramu
Feng, Wu-chun
Wahib, Mohamed
author_facet Prabhu, Ritvik
Vatai, Emil
Moussad, Bernard
Jeannot, Emmanuel
Anandakrishnan, Ramu
Feng, Wu-chun
Wahib, Mohamed
contents Cancer typically arises not from a single genetic mutation (i.e., hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive computationally as the number of candidate hit gene combinations grow, i.e. on the order of 20,000 choose h, where 20,000 is the number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework, called Pruned Depth-First Search (P-DFS) that leverages the high sparsity in tumor mutation data to prune large portions of the search space. Specifically, P-DFS (the main contribution of this paper) - a pruning technique that exploits sparsity to drastically reduce the otherwise exponential h-hit search space for candidate combinations used by Weighted Set Cover - which is grounded in a depth-first search backtracking technique, prunes infeasible gene subsets early, while a weighted set cover formulation systematically scores and selects the most discriminative combinations. By intertwining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm can achieve approximately 90 - 98% reduction in visited combinations for 4-hits, and roughly a 183x speedup over the exhaustive set cover approach(which is algorithmically NP-complete) measured on 147,456 ranks. In doing so, our method can feasibly handle four-hit and even higher-order gene hits, achieving both speed and resource efficiency.
format Preprint
id arxiv_https___arxiv_org_abs_2603_16721
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer
Prabhu, Ritvik
Vatai, Emil
Moussad, Bernard
Jeannot, Emmanuel
Anandakrishnan, Ramu
Feng, Wu-chun
Wahib, Mohamed
Distributed, Parallel, and Cluster Computing
Cancer typically arises not from a single genetic mutation (i.e., hit) but from multi-hit combinations that accumulate within cells. However, enumerating multi-hit combinations becomes exponentially more expensive computationally as the number of candidate hit gene combinations grow, i.e. on the order of 20,000 choose h, where 20,000 is the number of genes in the human genome and h is the number of hits. To address this challenge, we present an algorithmic framework, called Pruned Depth-First Search (P-DFS) that leverages the high sparsity in tumor mutation data to prune large portions of the search space. Specifically, P-DFS (the main contribution of this paper) - a pruning technique that exploits sparsity to drastically reduce the otherwise exponential h-hit search space for candidate combinations used by Weighted Set Cover - which is grounded in a depth-first search backtracking technique, prunes infeasible gene subsets early, while a weighted set cover formulation systematically scores and selects the most discriminative combinations. By intertwining these ideas with optimized bitwise operations and a scalable distributed algorithm on high-performance computing clusters, our algorithm can achieve approximately 90 - 98% reduction in visited combinations for 4-hits, and roughly a 183x speedup over the exhaustive set cover approach(which is algorithmically NP-complete) measured on 147,456 ranks. In doing so, our method can feasibly handle four-hit and even higher-order gene hits, achieving both speed and resource efficiency.
title Looking for (Genomic) Needles in a Haystack: Sparsity-Driven Search for Identifying Correlated Genetic Mutations in Cancer
topic Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2603.16721