Bibliographic Details
Main Authors: Zhang, Zhonghai; Li, Yewen; Meng, Ke; Zhang, Chunming; Tan, Guangming
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.06127
Abstract:
  • Duplicate marking is a critical preprocessing step in gene sequence analysis that flags redundant reads arising from polymerase chain reaction (PCR) amplification and sequencing artifacts. Although Picard MarkDuplicates is widely recognized as the gold-standard tool, its single-threaded implementation and reliance on global sorting incur significant computational and resource overhead, limiting its efficiency on large-scale datasets. Here, we introduce FastDup, a high-performance, scalable solution that follows a speculation-and-test mechanism. FastDup achieves up to 20x higher throughput than Picard MarkDuplicates while guaranteeing 100% identical output. FastDup is a C++ program available from GitHub (https://github.com/zzhofict/FastDup.git) under the MIT license.
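
For readers unfamiliar with the task, the core idea of duplicate marking can be sketched as follows. This is a minimal single-end illustration only: it is neither Picard's nor FastDup's implementation, and it does not attempt the speculation-and-test mechanism, which the abstract does not describe in detail. The `Read` type, its fields (`pos5`, `qual_sum`), and the `mark_duplicates` function are hypothetical names introduced for this example. Reads sharing the same reference sequence, 5' position, and strand are treated as one duplicate set, and all but the highest-scoring read in each set are flagged.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical, simplified read record (not a real SAM/BAM API).
    name: str
    chrom: str       # reference sequence name
    pos5: int        # 5' alignment coordinate, assumed precomputed
    reverse: bool    # True if aligned to the reverse strand
    qual_sum: int    # quality score used to pick the representative read
    duplicate: bool = False


def mark_duplicates(reads):
    """Flag every read except the best-scoring one in each duplicate set."""
    # Group reads by the key that defines "same original fragment" in this sketch.
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos5, read.reverse)].append(read)

    for group in groups.values():
        # Keep the read with the highest score; ties go to the first one seen.
        best = max(group, key=lambda r: r.qual_sum)
        for read in group:
            read.duplicate = read is not best
    return reads


if __name__ == "__main__":
    demo = [
        Read("a", "chr1", 100, False, 30),
        Read("b", "chr1", 100, False, 40),
        Read("c", "chr1", 100, True, 10),
    ]
    for read in mark_duplicates(demo):
        print(read.name, "duplicate" if read.duplicate else "kept")
```

Grouping with a hash map avoids a global sort in this toy version; production tools must additionally handle paired-end reads, soft clipping, optical duplicates, and deterministic tie-breaking, none of which are modeled here.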