Bibliographic Details
Main Authors: Liu, Yuanjian; Luo, Huihao; Han, Zhijun; Hu, Yao; Yang, Yehui; Chard, Kyle; Di, Sheng; Foster, Ian; Wu, Jiesheng
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2404.02163
contents Storing and archiving data produced by next-generation sequencing (NGS) is a heavy burden for research institutions. Reference-based compression algorithms are effective for such data. Our work focuses on compressing FASTQ files with an improved reference-based compression algorithm that achieves a higher compression ratio than other state-of-the-art (SOTA) algorithms. We propose FastqZip, which uses a new method for mapping sequences to a reference, allows read reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm as the final lossless compression stage, giving a higher compression ratio at relatively fast speed. The method ensures that the sequences can be reconstructed losslessly, while the quality scores can be compressed either losslessly or lossily. Reads are reordered to obtain a higher compression ratio. We evaluate FastqZip on five datasets and show that it can outperform the SOTA algorithm Genozip by around 10% in compression ratio, with an acceptable slowdown.
institution arXiv
title FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework
topic Information Theory
url https://arxiv.org/abs/2404.02163
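The abstract in this record names three ideas: storing reads relative to a reference, reordering reads, and optionally compressing quality scores lossily. The Python sketch below is only a minimal illustration of those general ideas, not FastqZip's actual method, which this record does not describe. The function names (encode_read, decode_read, bin_quality), the toy reference, the pre-supplied alignment positions, and the quality bin width of 8 are all assumptions made for the example. A final BSC or ZPAQ pass would follow in a real pipeline and is omitted here.

```python
from typing import List, Tuple

# Hypothetical types for this sketch (not taken from FastqZip).
Mismatch = Tuple[int, str]                  # (offset within read, base in the read)
Encoded = Tuple[int, int, List[Mismatch]]   # (reference position, read length, mismatches)


def encode_read(read: str, reference: str, pos: int) -> Encoded:
    """Store a read as its alignment position plus the bases that differ."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return pos, len(read), mismatches


def decode_read(encoded: Encoded, reference: str) -> str:
    """Rebuild the exact read from the reference and the stored mismatches."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def bin_quality(qual: str, step: int = 8, offset: int = 33) -> str:
    """Lossy step: round each Phred score down to a multiple of `step`."""
    return "".join(chr((ord(c) - offset) // step * step + offset) for c in qual)


# Toy data; the positions are assumed to come from some aligner.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = [("GCCGTTAG", 10), ("CGTACGTT", 1), ("TAGGCTA", 15)]

encoded = [encode_read(seq, reference, pos) for seq, pos in reads]

# Reordering by position makes neighbouring records similar, which tends to
# help a general-purpose compressor applied afterwards.
reordered = sorted(encoded, key=lambda e: e[0])

decoded = [decode_read(e, reference) for e in reordered]
binned = bin_quality("II#5")
```

Sequences are recovered exactly because every differing base is stored, whereas the binned quality string cannot be mapped back to the original scores. Note that after reordering, the reads come back in position order rather than input order.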